tokenize Package

indic_tokenize Module

Tokenizer for Indian languages. Currently, only simple punctuation-based tokenizers are supported (see trivial_tokenize). Punctuation marks of the major Indian languages are handled.

indicnlp.tokenize.indic_tokenize.trivial_tokenize(text, lang='hi')[source]

trivial tokenizer for Indian languages using Brahmi-derived or Arabic scripts

A trivial tokenizer which tokenizes only on punctuation boundaries. Major punctuation marks specific to Indian languages are handled. These punctuation characters were identified from the Unicode database.

Parameters:
  • text (str) – text to tokenize
  • lang (str) – ISO 639-2 language code
Returns:

list of tokens

Return type:

list

indicnlp.tokenize.indic_tokenize.trivial_tokenize_indic(text)[source]

tokenize string for Indian language scripts using Brahmi-derived scripts

A trivial tokenizer which tokenizes only on punctuation boundaries. The punctuation set also includes marks used in Indian language scripts (the purna virama and the deergha virama). This is a language-independent tokenizer.

Parameters: text (str) – text to tokenize
Returns: list of tokens
Return type: list

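
The punctuation-boundary approach described above can be illustrated with a minimal sketch. This is not the library's implementation: the function name sketch_tokenize_indic and the exact punctuation set (ASCII punctuation plus the purna virama U+0964 and the deergha virama U+0965) are assumptions made for illustration.

```python
import re
import string

# Assumed punctuation set: ASCII punctuation plus the purna virama (U+0964)
# and the deergha virama (U+0965). The library's real set may be larger.
PUNCT_PATTERN = re.compile(
    "([" + re.escape(string.punctuation) + "\u0964\u0965])"
)


def sketch_tokenize_indic(text):
    """Pad every punctuation character with spaces, then split on whitespace."""
    spaced = PUNCT_PATTERN.sub(r" \1 ", text)
    return spaced.split()


tokens = sketch_tokenize_indic("राम घर गया। क्या तुम आओगे?")
print(tokens)
```

Because the split happens only at punctuation and whitespace, the sketch needs no language-specific resources, which is what makes this style of tokenizer language independent.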
indicnlp.tokenize.indic_tokenize.trivial_tokenize_urdu(text)[source]

tokenize Urdu string

A trivial tokenizer which tokenizes only on punctuation boundaries. The punctuation set also includes marks used in the Urdu script. These punctuation characters were identified by searching the Unicode database for punctuation symbols in the Arabic script.

Parameters: text (str) – text to tokenize
Returns: list of tokens
Return type: list
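
One way to look up punctuation symbols for the Arabic script in the Unicode database is sketched below with Python's standard unicodedata module. The code-point range (the basic Arabic block, U+0600 to U+06FF) and the variable name arabic_punctuation are assumptions; the set the library actually uses may differ.

```python
import unicodedata

# Scan the basic Arabic block (U+0600 to U+06FF, an assumed range) and keep
# every character whose Unicode general category starts with "P"
# (punctuation).
arabic_punctuation = [
    chr(cp)
    for cp in range(0x0600, 0x0700)
    if unicodedata.category(chr(cp)).startswith("P")
]

for ch in arabic_punctuation:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}")
```

The resulting list includes marks such as the Arabic comma (U+060C), the Arabic question mark (U+061F) and the Arabic full stop (U+06D4).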

indic_detokenize Module

De-tokenizer for Indian languages.

indicnlp.tokenize.indic_detokenize.trivial_detokenize(text, lang='hi')[source]

detokenize string for languages of the Indian subcontinent

A trivial detokenizer which:

  • decides whether punctuation attaches to the left, to the right, or to both sides
  • handles number sequences
  • handles quotes smartly (deciding left or right attachment)
Parameters:
  • text (str) – tokenized text to process
  • lang (str) – ISO 639-2 language code
Returns: detokenized string
Return type: str
Raises: IndicNlpException – If language is not supported

indicnlp.tokenize.indic_detokenize.trivial_detokenize_indic(text)[source]

detokenize string for Indian language scripts using Brahmi-derived scripts

A trivial detokenizer which:

  • decides whether punctuation attaches to the left, to the right, or to both sides
  • handles number sequences
  • handles quotes smartly (deciding left or right attachment)
Parameters: text (str) – tokenized text to process
Returns: detokenized string
Return type: str
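
The attachment rules listed above can be sketched as follows. This is a simplified illustration, not the library's code: the function name sketch_detokenize and the three character sets are assumptions, and the handling of number sequences is left out.

```python
# Assumed character classes, chosen for illustration only.
LEFT_ATTACH = set(",.;:!?)]}\u0964\u0965")   # joins the token on its left
RIGHT_ATTACH = set("([{")                    # joins the token on its right
QUOTES = {'"', "'"}                          # side depends on open/close state


def sketch_detokenize(text):
    """Rejoin whitespace-separated tokens, gluing punctuation to neighbours."""
    pieces = []
    glue_next = False      # True when the next token must follow with no space
    open_quotes = set()    # quote characters that are currently open
    for tok in text.split():
        if tok in QUOTES:
            if tok in open_quotes:      # closing quote attaches to the left
                open_quotes.discard(tok)
                glue_prev, glue_after = True, False
            else:                       # opening quote attaches to the right
                open_quotes.add(tok)
                glue_prev, glue_after = False, True
        elif tok in LEFT_ATTACH:
            glue_prev, glue_after = True, False
        elif tok in RIGHT_ATTACH:
            glue_prev, glue_after = False, True
        else:
            glue_prev, glue_after = False, False
        if pieces and not (glue_prev or glue_next):
            pieces.append(" ")
        pieces.append(tok)
        glue_next = glue_after
    return "".join(pieces)


print(sketch_detokenize('वह बोला , " चलो ( अभी ) । "'))
```

Tracking which quote characters are open is what lets the same symbol attach to the right when it opens a quotation and to the left when it closes one.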

sentence_tokenize Module

Sentence splitter for Indian languages. Contains a rule-based sentence splitter that can understand common non-breaking phrases in many Indian languages.

indicnlp.tokenize.sentence_tokenize.is_acronym_abbvr(text, lang)[source]

Checks whether the text is a non-breaking phrase

Parameters:
  • text (str) – text to check for non-breaking phrase
  • lang (str) – ISO 639-2 language code
Returns:

True if text is a non-breaking phrase, False otherwise

Return type:

bool

indicnlp.tokenize.sentence_tokenize.is_latin_or_numeric(character)[source]

Check if a character is a Latin character (uppercase or lowercase) or a number.

Parameters: character (str) – The character to be checked.
Returns: True if the character is a Latin character or a number, False otherwise.
Return type: bool

indicnlp.tokenize.sentence_tokenize.sentence_split(text, lang, delim_pat='auto')[source]

split the text into sentences

A rule-based sentence splitter for Indian languages written in Brahmi-derived scripts. The text is split at sentence delimiter boundaries. The delimiters can be configured by passing appropriate parameters.

The sentence splitter can identify non-breaking phrases, such as single letters and common abbreviations or honorifics, for some Indian languages.

Parameters:
  • text (str) – text to split into sentences
  • lang (str) – ISO 639-2 language code
  • delim_pat (str) – regular expression to identify sentence delimiter characters. If set to ‘auto’, the delimiter pattern is chosen automatically based on the language and text.
Returns:

list of sentences identified from the input text

Return type:

list
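
A rule-based splitter of this kind can be sketched in a few lines. The sketch below is an illustration under stated assumptions, not the library's implementation: the name sketch_sentence_split, the delimiter pattern, and the small NON_BREAKING set are all hypothetical, and the library's per-language phrase lists are not reproduced here.

```python
import re

# Hypothetical non-breaking phrases, for illustration only.
NON_BREAKING = {"Dr", "Mr", "डॉ", "श्री"}

# Assumed delimiters: full stop, question mark, exclamation mark,
# purna virama (U+0964) and deergha virama (U+0965).
DELIM_PATTERN = re.compile("[.?!\u0964\u0965]")


def sketch_sentence_split(text, non_breaking=NON_BREAKING):
    """Split at delimiters, except after a non-breaking phrase or single letter."""
    sentences = []
    start = 0
    for match in DELIM_PATTERN.finditer(text):
        candidate = text[start:match.end()]
        words = candidate[:-1].split()
        last_word = words[-1] if words else ""
        # A full stop after an abbreviation or a single letter does not end
        # the sentence, so keep scanning.
        if match.group() == "." and (
            last_word in non_breaking or len(last_word) == 1
        ):
            continue
        sentences.append(candidate.strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


print(sketch_sentence_split("Dr. Rao came. He left!"))
```

Only the full stop is checked against the non-breaking list in this sketch, since marks such as the purna virama are not used to terminate abbreviations.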