tokenize Package¶

`indic_tokenize` Module¶

Tokenizers for Indian languages. Currently, only simple punctuation-based tokenizers are supported (see trivial_tokenize). The major punctuation marks of Indian languages are handled.

indicnlp.tokenize.indic_tokenize.trivial_tokenize(text, lang='hi')[source]¶

trivial tokenizer for Indian languages written in Brahmi-derived or Arabic scripts

A trivial tokenizer which just tokenizes on punctuation boundaries. The major punctuation marks specific to Indian languages are handled. These punctuation characters were identified from the Unicode database.

Parameters:	text (str) – text to tokenize
	lang (str) – ISO 639-2 language code
Returns:	list of tokens
Return type:	list

indicnlp.tokenize.indic_tokenize.trivial_tokenize_indic(text)[source]¶

tokenize a string in an Indian language written in a Brahmi-derived script

A trivial tokenizer which just tokenizes on punctuation boundaries. In addition to common punctuation, it handles the punctuation marks of the Indian language scripts (the purna virama and the deergha virama). This is a language-independent tokenizer.

Parameters:	text (str) – text to tokenize
Returns:	list of tokens
Return type:	list
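The punctuation-boundary approach described above can be illustrated with a short, self-contained sketch. It is not the library's implementation: the name `sketch_tokenize_indic` and the punctuation set (ASCII punctuation plus the danda U+0964 and double danda U+0965, i.e. the purna virama and deergha virama) are assumptions made for illustration, and the real function may handle more characters.

```python
import re
import string

# Assumed punctuation set: ASCII punctuation plus the danda (U+0964, purna virama)
# and the double danda (U+0965, deergha virama). The library's own set may differ.
PUNCT = string.punctuation + "\u0964\u0965"
PUNCT_PAT = re.compile("([" + re.escape(PUNCT) + "])")


def sketch_tokenize_indic(text):
    """Hypothetical sketch: split text on whitespace and punctuation boundaries."""
    # Surround every punctuation character with spaces, then split on whitespace.
    spaced = PUNCT_PAT.sub(r" \1 ", text)
    return spaced.split()
```

Because each punctuation character is padded with spaces before splitting, every punctuation mark becomes its own token regardless of the language of the surrounding words.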

indicnlp.tokenize.indic_tokenize.trivial_tokenize_urdu(text)[source]¶

tokenize an Urdu string

A trivial tokenizer which just tokenizes on punctuation boundaries. In addition to common punctuation, it handles the punctuation marks of the Urdu script. These punctuation characters were identified by searching the Unicode database for punctuation symbols in the Arabic script.

Parameters:	text (str) – text to tokenize
Returns:	list of tokens
Return type:	list
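One way to identify Arabic-script punctuation from the Unicode database, as the description mentions, is sketched below. This is an illustration under stated assumptions rather than the library's code: the names `ARABIC_PUNCT` and `sketch_tokenize_urdu` are hypothetical, and only the basic Arabic block (U+0600 to U+06FF) is scanned.

```python
import re
import string
import unicodedata

# Collect punctuation from the basic Arabic block (U+0600..U+06FF) by asking the
# Unicode database for characters whose general category starts with "P".
# Assumption: restricting the scan to this one block is enough for illustration.
ARABIC_PUNCT = "".join(
    chr(cp)
    for cp in range(0x0600, 0x0700)
    if unicodedata.category(chr(cp)).startswith("P")
)

URDU_PUNCT_PAT = re.compile(
    "([" + re.escape(string.punctuation + ARABIC_PUNCT) + "])"
)


def sketch_tokenize_urdu(text):
    """Hypothetical sketch: split Urdu text on punctuation boundaries."""
    # Pad each punctuation character with spaces, then split on whitespace.
    return URDU_PUNCT_PAT.sub(r" \1 ", text).split()
```

The collected set includes, for example, the Arabic comma (U+060C), the Arabic question mark (U+061F) and the Arabic full stop (U+06D4).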

`indic_detokenize` Module¶

De-tokenizer for Indian languages.

indicnlp.tokenize.indic_detokenize.trivial_detokenize(text, lang='hi')[source]¶

detokenize string for languages of the Indian subcontinent

A trivial detokenizer which:

decides whether punctuation attaches to left/right or both

handles number sequences

handles quotes smartly (deciding left or right attachment)

Parameters:	text (str) – tokenized text to process
	lang (str) – ISO 639-2 language code
Returns:	detokenized string
Return type:	str
Raises:	`IndicNlpException` – If language is not supported

indicnlp.tokenize.indic_detokenize.trivial_detokenize_indic(text)[source]¶

detokenize a string in an Indian language written in a Brahmi-derived script

A trivial detokenizer which:

decides whether punctuation attaches to left/right or both

handles number sequences

handles quotes smartly (deciding left or right attachment)

Parameters:	text (str) – tokenized text to process
Returns:	detokenized string
Return type:	str
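The attachment rules listed above can be sketched roughly as follows. This is a simplified illustration, not the library's algorithm: the function name `sketch_detokenize` and the character sets are assumptions, quotes are handled by simple open/close alternation, and number sequences are not handled at all.

```python
# Assumed character classes for illustration; the library's sets may differ.
LEFT_ATTACH = set(".,;:!?)]}%\u0964\u0965")  # glue to the preceding token
RIGHT_ATTACH = set("([{")                     # glue to the following token
QUOTES = set("\"'")                           # alternate between opening and closing


def sketch_detokenize(text):
    """Hypothetical sketch: join space-separated tokens back into running text."""
    out = []            # pieces of the output string
    glue_next = False   # True when the next token must follow without a space
    quote_open = {}     # quote character -> whether it is currently open
    for tok in text.split():
        if tok in QUOTES and quote_open.get(tok, False):
            # Closing quote: attaches to the token on its left.
            out.append(tok)
            quote_open[tok] = False
            glue_next = False
        elif tok in LEFT_ATTACH:
            # Left-attaching punctuation: no space before it.
            out.append(tok)
            glue_next = False
        else:
            # Ordinary token, opening bracket, or opening quote.
            if out and not glue_next:
                out.append(" ")
            out.append(tok)
            if tok in QUOTES:
                quote_open[tok] = True
                glue_next = True
            else:
                glue_next = tok in RIGHT_ATTACH
    return "".join(out)
```

Tracking the open/closed state per quote character is what lets the same symbol attach to the right when it opens a quotation and to the left when it closes one.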

`sentence_tokenize` Module¶

Sentence splitter for Indian languages. Contains a rule-based sentence splitter that can understand common non-breaking phrases in many Indian languages.

indicnlp.tokenize.sentence_tokenize.is_acronym_abbvr(text, lang)[source]¶

Checks whether the text is a non-breaking phrase (an acronym or abbreviation).

Parameters:	text (str) – text to check for non-breaking phrase
	lang (str) – ISO 639-2 language code
Returns:	True if text is a non-breaking phrase, False otherwise
Return type:	bool

indicnlp.tokenize.sentence_tokenize.is_latin_or_numeric(character)[source]¶

Check if a character is a Latin character (uppercase or lowercase) or a number.

Parameters:	character (str) – The character to be checked.
Returns:	True if the character is a Latin character or a number, False otherwise.
Return type:	bool
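A minimal sketch of such a check is shown below. It assumes that "Latin character" means the unaccented ASCII letters and that "number" means the ASCII digits 0 to 9; the library's actual ranges are not stated here and may differ. The name `sketch_is_latin_or_numeric` is hypothetical.

```python
def sketch_is_latin_or_numeric(character):
    """Hypothetical sketch: True for a single ASCII letter or ASCII digit."""
    # Assumption: only A-Z, a-z and 0-9 count; accented Latin letters and
    # Indic digits are deliberately excluded in this sketch.
    return (
        len(character) == 1
        and ("a" <= character <= "z"
             or "A" <= character <= "Z"
             or "0" <= character <= "9")
    )
```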

indicnlp.tokenize.sentence_tokenize.sentence_split(text, lang, delim_pat='auto')[source]¶

split the text into sentences

A rule-based sentence splitter for Indian languages written in Brahmi-derived scripts. The text is split at sentence delimiter boundaries. The delimiters can be configured by passing appropriate parameters.

The sentence splitter can identify non-breaking phrases such as single letters and common abbreviations/honorifics for some Indian languages.

Parameters:	text (str) – text to split into sentences
	lang (str) – ISO 639-2 language code
	delim_pat (str) – regular expression to identify sentence delimiter characters. If set to 'auto', the delimiter pattern is chosen automatically based on the language and text.
Returns:	list of sentences identified from the input text
Return type:	list
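The rule-based idea described above (split at delimiters, but not after a non-breaking phrase) can be sketched as follows. Everything here is an assumption made for illustration: the delimiter pattern, the tiny `NON_BREAKING` table, and the function names are hypothetical and do not reproduce the library's rules or data. In this sketch only the full stop is treated as ambiguous; the danda, double danda, `?` and `!` always end a sentence.

```python
import re

# Assumed delimiter set: full stop, question mark, exclamation mark,
# danda (U+0964) and double danda (U+0965).
DELIM_PAT = re.compile(r"[.?!\u0964\u0965]")

# Hypothetical, illustrative table of non-breaking words per language code.
# "\u0921\u0949" is the Hindi abbreviation for "Dr".
NON_BREAKING = {"hi": {"\u0921\u0949"}}


def sketch_is_non_breaking(word, lang):
    """Single letters and listed abbreviations do not end a sentence."""
    return len(word) == 1 or word in NON_BREAKING.get(lang, set())


def sketch_sentence_split(text, lang, delim_pat=DELIM_PAT):
    """Hypothetical sketch: split text at delimiters, skipping abbreviations."""
    sentences = []
    start = 0  # index where the current sentence begins
    for match in delim_pat.finditer(text):
        words = text[start:match.start()].split()
        last_word = words[-1] if words else ""
        # A full stop after a non-breaking word is not a sentence boundary.
        if match.group() == "." and sketch_is_non_breaking(last_word, lang):
            continue
        sentence = text[start:match.end()].strip()
        if sentence:
            sentences.append(sentence)
        start = match.end()
    # Keep any trailing text that has no closing delimiter.
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences
```

Passing a different compiled pattern as `delim_pat` changes which characters count as sentence delimiters, which mirrors the configurable delimiter described in the parameter list.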
