indicnlp Package

common Module

exception indicnlp.common.IndicNlpException(msg)[source]

Bases: Exception

Exceptions thrown by Indic NLP Library components are instances of this class. The 'msg' attribute contains exception details.

indicnlp.common.get_resources_path()[source]

Get the path to the Indic NLP Resources directory

indicnlp.common.init()[source]

Initialize the module. The following actions are performed:

  • Checks whether the INDIC_RESOURCES_PATH variable is set. If not, tries to initialize it from the
    INDIC_RESOURCES_PATH environment variable. If that also fails, an exception is raised.

indicnlp.common.set_resources_path(resources_path)[source]

Set the path to the Indic NLP Resources directory

langinfo Module

indicnlp.langinfo.get_offset(c, lang)[source]

Applicable to Brahmi derived Indic scripts

indicnlp.langinfo.in_coordinated_range(c_offset)[source]

Applicable to Brahmi derived Indic scripts
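The offset functions above exploit the fact that Brahmi-derived script blocks are laid out in parallel in Unicode. A minimal sketch of the idea (the script base codepoints and the 0x00-0x6F coordinated range are assumptions based on the Unicode block layout, not code copied from the library):

```python
# Unicode block bases for a few Brahmi-derived scripts (assumed values)
SCRIPT_BASE = {'hi': 0x0900, 'bn': 0x0980, 'ta': 0x0B80}

def get_offset(c, lang):
    # offset of the character within its script's Unicode block
    return ord(c) - SCRIPT_BASE[lang]

def in_coordinated_range(c_offset):
    # offsets 0x00-0x6F are aligned across Brahmi-derived scripts
    return 0 <= c_offset <= 0x6F

def offset_to_char(off, lang):
    return chr(SCRIPT_BASE[lang] + off)
```

Because the blocks are aligned, Devanagari 'क' and Bengali 'ক' share the same offset, which is what makes cross-script operations possible.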

indicnlp.langinfo.is_approximant(c, lang)[source]

Is the character an approximant consonant

indicnlp.langinfo.is_approximant_offset(c_offset)[source]

Is the offset an approximant consonant

indicnlp.langinfo.is_aspirated(c, lang)[source]

Is the character an aspirated consonant

indicnlp.langinfo.is_aspirated_offset(c_offset)[source]

Is the offset an aspirated consonant

indicnlp.langinfo.is_aum(c, lang)[source]

Is the character the 'aum' character

indicnlp.langinfo.is_aum_offset(c_offset)[source]

Is the offset the 'aum' character

indicnlp.langinfo.is_consonant(c, lang)[source]

Is the character a consonant

indicnlp.langinfo.is_consonant_offset(c_offset)[source]

Is the offset a consonant

indicnlp.langinfo.is_danda_delim(lang)[source]

Returns True if danda/double danda is a possible delimiter for the language

indicnlp.langinfo.is_dental(c, lang)[source]

Is the character a dental

indicnlp.langinfo.is_dental_offset(c_offset)[source]

Is the offset a dental

indicnlp.langinfo.is_fricative(c, lang)[source]

Is the character a fricative consonant

indicnlp.langinfo.is_fricative_offset(c_offset)[source]

Is the offset a fricative consonant

indicnlp.langinfo.is_halanta(c, lang)[source]

Is the character the halanta character

indicnlp.langinfo.is_halanta_offset(c_offset)[source]

Is the offset the halanta offset

indicnlp.langinfo.is_indiclang_char(c, lang)[source]

Applicable to Brahmi derived Indic scripts

indicnlp.langinfo.is_labial(c, lang)[source]

Is the character a labial

indicnlp.langinfo.is_labial_offset(c_offset)[source]

Is the offset a labial

indicnlp.langinfo.is_nasal(c, lang)[source]

Is the character a nasal consonant

indicnlp.langinfo.is_nasal_offset(c_offset)[source]

Is the offset a nasal consonant

indicnlp.langinfo.is_nukta(c, lang)[source]

Is the character the nukta character

indicnlp.langinfo.is_nukta_offset(c_offset)[source]

Is the offset the nukta offset

indicnlp.langinfo.is_number(c, lang)[source]

Is the character a number

indicnlp.langinfo.is_number_offset(c_offset)[source]

Is the offset a number

indicnlp.langinfo.is_palatal(c, lang)[source]

Is the character a palatal

indicnlp.langinfo.is_palatal_offset(c_offset)[source]

Is the offset a palatal

indicnlp.langinfo.is_retroflex(c, lang)[source]

Is the character a retroflex

indicnlp.langinfo.is_retroflex_offset(c_offset)[source]

Is the offset a retroflex

indicnlp.langinfo.is_unaspirated(c, lang)[source]

Is the character an unaspirated consonant

indicnlp.langinfo.is_unaspirated_offset(c_offset)[source]

Is the offset an unaspirated consonant

indicnlp.langinfo.is_unvoiced(c, lang)[source]

Is the character an unvoiced consonant

indicnlp.langinfo.is_unvoiced_offset(c_offset)[source]

Is the offset an unvoiced consonant

indicnlp.langinfo.is_velar(c, lang)[source]

Is the character a velar

indicnlp.langinfo.is_velar_offset(c_offset)[source]

Is the offset a velar

indicnlp.langinfo.is_voiced(c, lang)[source]

Is the character a voiced consonant

indicnlp.langinfo.is_voiced_offset(c_offset)[source]

Is the offset a voiced consonant

indicnlp.langinfo.is_vowel(c, lang)[source]

Is the character a vowel

indicnlp.langinfo.is_vowel_offset(c_offset)[source]

Is the offset a vowel

indicnlp.langinfo.is_vowel_sign(c, lang)[source]

Is the character a vowel sign (maatraa)

indicnlp.langinfo.is_vowel_sign_offset(c_offset)[source]

Is the offset a vowel sign (maatraa)

indicnlp.langinfo.offset_to_char(c, lang)[source]

Applicable to Brahmi derived Indic scripts

loader Module

indicnlp.loader.load()[source]

Initializes the Indic NLP library. Clients should call this method before using the library.

Any module requiring initialization should define an init() method, which is called from this method.

Subpackages

cli Package

cliparser Module

indicnlp.cli.cliparser.add_common_bilingual_args(task_parser)[source]
indicnlp.cli.cliparser.add_common_monolingual_args(task_parser)[source]
indicnlp.cli.cliparser.add_detokenize_parser(subparsers)[source]
indicnlp.cli.cliparser.add_indic2roman_parser(subparsers)[source]
indicnlp.cli.cliparser.add_morph_parser(subparsers)[source]
indicnlp.cli.cliparser.add_normalize_parser(subparsers)[source]
indicnlp.cli.cliparser.add_roman2indic_parser(subparsers)[source]
indicnlp.cli.cliparser.add_script_convert_parser(subparsers)[source]
indicnlp.cli.cliparser.add_script_unify_parser(subparsers)[source]
indicnlp.cli.cliparser.add_sentence_split_parser(subparsers)[source]
indicnlp.cli.cliparser.add_syllabify_parser(subparsers)[source]
indicnlp.cli.cliparser.add_tokenize_parser(subparsers)[source]
indicnlp.cli.cliparser.add_wc_parser(subparsers)[source]
indicnlp.cli.cliparser.get_parser()[source]
indicnlp.cli.cliparser.main()[source]
indicnlp.cli.cliparser.run_detokenize(args)[source]
indicnlp.cli.cliparser.run_indic2roman(args)[source]
indicnlp.cli.cliparser.run_morph(args)[source]
indicnlp.cli.cliparser.run_normalize(args)[source]
indicnlp.cli.cliparser.run_roman2indic(args)[source]
indicnlp.cli.cliparser.run_script_convert(args)[source]
indicnlp.cli.cliparser.run_script_unify(args)[source]
indicnlp.cli.cliparser.run_sentence_split(args)[source]
indicnlp.cli.cliparser.run_syllabify(args)[source]
indicnlp.cli.cliparser.run_tokenize(args)[source]
indicnlp.cli.cliparser.run_wc(args)[source]

morph Package

unsupervised_morph Module

class indicnlp.morph.unsupervised_morph.MorphAnalyzerI[source]

Bases: object

Interface for Morph Analyzer

morph_analyze()[source]
morph_analyze_document()[source]
class indicnlp.morph.unsupervised_morph.UnsupervisedMorphAnalyzer(lang, add_marker=False)[source]

Bases: indicnlp.morph.unsupervised_morph.MorphAnalyzerI

Unsupervised Morphological analyser built using Morfessor 2.0

morph_analyze[source]

Morph-analyzes a single word and returns a list of component morphemes.

@param word: string input word

morph_analyze_document(tokens)[source]

Morph-analyzes a document, represented as a list of tokens. Each word is morph-analyzed, and the result is the list of morphemes constituting the document.

@param tokens: sequence of word strings

@return list of segments in the document after morph analysis
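The interface can be mimicked with a toy analyzer to show how document-level analysis composes with word-level analysis (the suffix split below is a placeholder for the actual Morfessor model):

```python
from itertools import chain

def morph_analyze(word):
    # placeholder for the Morfessor-based model: naive two-way suffix split
    return [word[:-2], word[-2:]] if len(word) > 4 else [word]

def morph_analyze_document(tokens):
    # analyze each token and concatenate the per-word morpheme lists
    return list(chain.from_iterable(morph_analyze(t) for t in tokens))
```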

normalize Package

indic_normalize Module

class indicnlp.normalize.indic_normalize.BaseNormalizer(lang, remove_nuktas=False, nasals_mode='do_nothing', do_normalize_chandras=False, do_normalize_vowel_ending=False)[source]

Bases: indicnlp.normalize.indic_normalize.NormalizerI

correct_visarga(text, visarga_char, char_range)[source]
get_char_stats(text)[source]
normalize(text)[source]

Method to be implemented for normalization for each script

class indicnlp.normalize.indic_normalize.BengaliNormalizer(lang='bn', remove_nuktas=False, nasals_mode='do_nothing', do_normalize_chandras=False, do_normalize_vowel_ending=False, do_remap_assamese_chars=False)[source]

Bases: indicnlp.normalize.indic_normalize.BaseNormalizer

Normalizer for the Bengali script. In addition to the basic normalization by the superclass:

  • Replaces composite characters containing nuktas by their decomposed form
  • Replaces the reserved character for poorna virama (if used) with the recommended generic Indic scripts poorna virama
  • Canonicalizes two-part dependent vowels
  • Replaces the pipe character '|' by the poorna virama character
  • Replaces colon ':' by visarga if the colon follows a character in this script

NUKTA = '়'
normalize(text)[source]

Method to be implemented for normalization for each script

class indicnlp.normalize.indic_normalize.DevanagariNormalizer(lang='hi', remove_nuktas=False, nasals_mode='do_nothing', do_normalize_chandras=False, do_normalize_vowel_ending=False)[source]

Bases: indicnlp.normalize.indic_normalize.BaseNormalizer

Normalizer for the Devanagari script. In addition to the basic normalization by the superclass:

  • Replaces composite characters containing nuktas by their decomposed form
  • Replaces the pipe character '|' by the poorna virama character
  • Replaces colon ':' by visarga if the colon follows a character in this script

NUKTA = '़'
get_char_stats(text)[source]
normalize(text)[source]

Method to be implemented for normalization for each script

class indicnlp.normalize.indic_normalize.GujaratiNormalizer(lang='gu', remove_nuktas=False, nasals_mode='do_nothing', do_normalize_chandras=False, do_normalize_vowel_ending=False)[source]

Bases: indicnlp.normalize.indic_normalize.BaseNormalizer

Normalizer for the Gujarati script. In addition to the basic normalization by the superclass:

  • Replaces the reserved character for poorna virama (if used) with the recommended generic Indic scripts poorna virama
  • Replaces colon ':' by visarga if the colon follows a character in this script

NUKTA = '઼'
normalize(text)[source]

Method to be implemented for normalization for each script

class indicnlp.normalize.indic_normalize.GurmukhiNormalizer(lang='pa', remove_nuktas=False, nasals_mode='do_nothing', do_normalize_chandras=False, do_normalize_vowel_ending=False, do_canonicalize_addak=False, do_canonicalize_tippi=False, do_replace_vowel_bases=False)[source]

Bases: indicnlp.normalize.indic_normalize.BaseNormalizer

Normalizer for the Gurmukhi script. In addition to the basic normalization by the superclass:

  • Replaces composite characters containing nuktas by their decomposed form
  • Replaces the reserved character for poorna virama (if used) with the recommended generic Indic scripts poorna virama
  • Replaces the pipe character '|' by the poorna virama character
  • Replaces colon ':' by visarga if the colon follows a character in this script

NUKTA = '਼'
VOWEL_NORM_MAPS = {'ਅਾ': 'ਆ', 'ਅੈ': 'ਐ', 'ਅੌ': 'ਔ', 'ੲਿ': 'ਇ', 'ੲੀ': 'ਈ', 'ੲੇ': 'ਏ', 'ੳੁ': 'ਉ', 'ੳੂ': 'ਊ', 'ੳੋ': 'ਓ'}
normalize(text)[source]

Method to be implemented for normalization for each script

class indicnlp.normalize.indic_normalize.IndicNormalizerFactory[source]

Bases: object

Factory class to create language specific normalizers.

get_normalizer(language, **kwargs)[source]

Call the get_normalizer function to get the language-specific normalizer.

Parameters:
  • language: language code
  • remove_nuktas: boolean, should the normalizer remove nukta characters

is_language_supported(language)[source]

Is the language supported?
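A sketch of how the factory dispatch is typically structured, with stand-in normalizer classes so the example is self-contained (the real classes are the ones documented in this module; the dispatch logic here is illustrative, not the library's code):

```python
# Stand-in classes; the real normalizers are the ones documented in this module
class BaseNormalizer:
    def __init__(self, lang, remove_nuktas=False):
        self.lang = lang
        self.remove_nuktas = remove_nuktas

    def normalize(self, text):
        # tiny subset of the common normalization
        return text.replace('\u200b', ' ')

class DevanagariNormalizer(BaseNormalizer):
    NUKTA = '\u093c'

    def normalize(self, text):
        text = super().normalize(text)
        if self.remove_nuktas:
            text = text.replace(self.NUKTA, '')
        return text

class IndicNormalizerFactory:
    _normalizer_map = {'hi': DevanagariNormalizer}

    def is_language_supported(self, language):
        return language in self._normalizer_map

    def get_normalizer(self, language, **kwargs):
        if not self.is_language_supported(language):
            raise ValueError(f'normalizer not available for {language}')
        return self._normalizer_map[language](language, **kwargs)
```

In the real library, the loader's load() should be called before constructing normalizers.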

class indicnlp.normalize.indic_normalize.KannadaNormalizer(lang='kn', remove_nuktas=False, nasals_mode='do_nothing', do_normalize_chandras=False, do_normalize_vowel_ending=False)[source]

Bases: indicnlp.normalize.indic_normalize.BaseNormalizer

Normalizer for the Kannada script. In addition to the basic normalization by the superclass:

  • Replaces the reserved character for poorna virama (if used) with the recommended generic Indic scripts poorna virama
  • Canonicalizes two-part dependent vowel signs
  • Replaces colon ':' by visarga if the colon follows a character in this script

normalize(text)[source]

Method to be implemented for normalization for each script

class indicnlp.normalize.indic_normalize.MalayalamNormalizer(lang='ml', remove_nuktas=False, nasals_mode='do_nothing', do_normalize_chandras=False, do_normalize_vowel_ending=False, do_canonicalize_chillus=False, do_correct_geminated_T=False)[source]

Bases: indicnlp.normalize.indic_normalize.BaseNormalizer

Normalizer for the Malayalam script. In addition to the basic normalization by the superclass:

  • Replaces the reserved character for poorna virama (if used) with the recommended generic Indic scripts poorna virama
  • Canonicalizes two-part dependent vowel signs
  • Changes from the old encoding of chillus (till Unicode 5.0) to the new encoding
  • Replaces colon ':' by visarga if the colon follows a character in this script

CHILLU_CHAR_MAP = {'ൺ': 'ണ', 'ൻ': 'ന', 'ർ': 'ര', 'ൽ': 'ല', 'ൾ': 'ള', 'ൿ': 'ക'}
normalize(text)[source]

Method to be implemented for normalization for each script
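The chillu canonicalization can be sketched using CHILLU_CHAR_MAP: the old (pre-Unicode 5.0) encoding wrote a chillu as base consonant + virama + ZWJ, which is replaced with the atomic chillu codepoint (the replacement direction is an assumption about the library's behavior):

```python
CHILLU_CHAR_MAP = {'ൺ': 'ണ', 'ൻ': 'ന', 'ർ': 'ര', 'ൽ': 'ല', 'ൾ': 'ള', 'ൿ': 'ക'}
VIRAMA, ZWJ = '\u0d4d', '\u200d'  # Malayalam virama, zero-width joiner

def canonicalize_chillus(text):
    # old encoding: base consonant + virama + ZWJ -> atomic chillu codepoint
    for chillu, consonant in CHILLU_CHAR_MAP.items():
        text = text.replace(consonant + VIRAMA + ZWJ, chillu)
    return text
```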

class indicnlp.normalize.indic_normalize.NormalizerI[source]

Bases: object

The normalizer classes do the following:

  • Some characters have multiple Unicode codepoints; the normalizer chooses a single standard representation
  • Some control characters are deleted
  • Certain typical mistakes made while typing with a Latin keyboard are corrected

Base class for normalizers. Performs some common normalization, which includes:

  • Byte order mark, word joiner, etc. removal
  • ZERO_WIDTH_NON_JOINER and ZERO_WIDTH_JOINER removal
  • ZERO_WIDTH_SPACE and NO_BREAK_SPACE replaced by spaces

Script-specific normalizers should derive from this class and override the normalize() method. They can call the superclass normalize() method for the common normalization.

BYTE_ORDER_MARK = '\ufeff'
BYTE_ORDER_MARK_2 = '\ufffe'
NO_BREAK_SPACE = '\xa0'
SOFT_HYPHEN = '\xad'
WORD_JOINER = '\u2060'
ZERO_WIDTH_JOINER = '\u200d'
ZERO_WIDTH_NON_JOINER = '\u200c'
ZERO_WIDTH_SPACE = '\u200b'
normalize(text)[source]
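The common normalization described above can be sketched directly from the listed constants (a simplified stand-in, not the library's implementation):

```python
import re

# Constants as listed above
CLEAN_RE = re.compile('[\ufeff\ufffe\u2060\xad]')  # BOMs, WORD_JOINER, SOFT_HYPHEN: deleted
ZW_JOINERS = ('\u200c', '\u200d')                  # ZWNJ, ZWJ: deleted
SPACE_RE = re.compile('[\u200b\xa0]')              # ZWSP, NO_BREAK_SPACE: become spaces

def common_normalize(text):
    text = CLEAN_RE.sub('', text)
    for j in ZW_JOINERS:
        text = text.replace(j, '')
    return SPACE_RE.sub(' ', text)
```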
class indicnlp.normalize.indic_normalize.OriyaNormalizer(lang='or', remove_nuktas=False, nasals_mode='do_nothing', do_normalize_chandras=False, do_normalize_vowel_ending=False, do_remap_wa=False)[source]

Bases: indicnlp.normalize.indic_normalize.BaseNormalizer

Normalizer for the Oriya script. In addition to the basic normalization by the superclass:

  • Replaces composite characters containing nuktas by their decomposed form
  • Replaces the reserved character for poorna virama (if used) with the recommended generic Indic scripts poorna virama
  • Canonicalizes two-part dependent vowels
  • Replaces 'va' with 'ba'
  • Replaces the pipe character '|' by the poorna virama character
  • Replaces colon ':' by visarga if the colon follows a character in this script

NUKTA = '଼'
VOWEL_NORM_MAPS = {'ଅା': 'ଆ', 'ଏୗ': 'ଐ', 'ଓୗ': 'ଔ'}
normalize(text)[source]

Method to be implemented for normalization for each script

class indicnlp.normalize.indic_normalize.TamilNormalizer(lang='ta', remove_nuktas=False, nasals_mode='do_nothing', do_normalize_chandras=False, do_normalize_vowel_ending=False)[source]

Bases: indicnlp.normalize.indic_normalize.BaseNormalizer

Normalizer for the Tamil script. In addition to the basic normalization by the superclass:

  • Replaces the reserved character for poorna virama (if used) with the recommended generic Indic scripts poorna virama
  • Canonicalizes two-part dependent vowel signs
  • Replaces colon ':' by visarga if the colon follows a character in this script

normalize(text)[source]

Method to be implemented for normalization for each script

class indicnlp.normalize.indic_normalize.TeluguNormalizer(lang='te', remove_nuktas=False, nasals_mode='do_nothing', do_normalize_chandras=False, do_normalize_vowel_ending=False)[source]

Bases: indicnlp.normalize.indic_normalize.BaseNormalizer

Normalizer for the Telugu script. In addition to the basic normalization by the superclass:

  • Replaces the reserved character for poorna virama (if used) with the recommended generic Indic scripts poorna virama
  • Canonicalizes two-part dependent vowel signs
  • Replaces colon ':' by visarga if the colon follows a character in this script

get_char_stats(text)[source]
normalize(text)[source]

Method to be implemented for normalization for each script

class indicnlp.normalize.indic_normalize.UrduNormalizer(lang, remove_nuktas=True)[source]

Bases: indicnlp.normalize.indic_normalize.NormalizerI

Uses the UrduHack library. https://docs.urduhack.com/en/stable/_modules/urduhack/normalization/character.html#normalize

normalize(text)[source]

script Package

indic_scripts Module

indicnlp.script.indic_scripts.ALL_PHONETIC_DATA = None

Phonetic data for all languages except Tamil

indicnlp.script.indic_scripts.ALL_PHONETIC_VECTORS = None

Phonetic vectors for all languages except Tamil

indicnlp.script.indic_scripts.PHONETIC_VECTOR_LENGTH = 38

Length of the phonetic feature vector

indicnlp.script.indic_scripts.TAMIL_PHONETIC_DATA = None

Phonetic data for Tamil

indicnlp.script.indic_scripts.TAMIL_PHONETIC_VECTORS = None

Phonetic vectors for Tamil

indicnlp.script.indic_scripts.get_offset(c, lang)[source]
indicnlp.script.indic_scripts.get_phonetic_feature_vector(c, lang)[source]
indicnlp.script.indic_scripts.get_phonetic_feature_vector_offset(offset, lang)[source]
indicnlp.script.indic_scripts.get_phonetic_info(lang)[source]
indicnlp.script.indic_scripts.get_property_value(v, prop_name)[source]
indicnlp.script.indic_scripts.get_property_vector(v, prop_name)[source]
indicnlp.script.indic_scripts.in_coordinated_range(c, lang)[source]
indicnlp.script.indic_scripts.in_coordinated_range_offset(c_offset)[source]

Applicable to Brahmi derived Indic scripts

indicnlp.script.indic_scripts.init()[source]

To be called by the library loader; do not call it from your program.

indicnlp.script.indic_scripts.invalid_vector()[source]
indicnlp.script.indic_scripts.is_anusvaar(v)[source]
indicnlp.script.indic_scripts.is_consonant(v)[source]
indicnlp.script.indic_scripts.is_dependent_vowel(v)[source]
indicnlp.script.indic_scripts.is_halant(v)[source]
indicnlp.script.indic_scripts.is_indiclang_char(c, lang)[source]

Applicable to Brahmi derived Indic scripts Note that DANDA and DOUBLE_DANDA have the same Unicode codepoint for all Indic scripts

indicnlp.script.indic_scripts.is_misc(v)[source]
indicnlp.script.indic_scripts.is_nukta(v)[source]
indicnlp.script.indic_scripts.is_plosive(v)[source]
indicnlp.script.indic_scripts.is_supported_language(lang)[source]
indicnlp.script.indic_scripts.is_valid(v)[source]
indicnlp.script.indic_scripts.is_vowel(v)[source]
indicnlp.script.indic_scripts.lcsr(srcw, tgtw, slang, tlang)[source]

Compute the Longest Common Subsequence Ratio (LCSR) between two strings at the character level.

srcw: source language string
tgtw: target language string
slang: source language
tlang: target language
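LCSR is conventionally defined as |LCS| / max(|srcw|, |tgtw|); a plain-Python sketch (the library's exact normalization may differ):

```python
def lcs_length(s, t):
    # classic dynamic program for longest common subsequence length
    dp = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i, a in enumerate(s, 1):
        for j, b in enumerate(t, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if a == b else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(s)][len(t)]

def lcsr(srcw, tgtw):
    return lcs_length(srcw, tgtw) / max(len(srcw), len(tgtw))
```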

indicnlp.script.indic_scripts.lcsr_any(srcw, tgtw)[source]

LCSR computation if both languages have the same script

indicnlp.script.indic_scripts.lcsr_indic(srcw, tgtw, slang, tlang)[source]

Compute the Longest Common Subsequence Ratio (LCSR) between two strings at the character level. This works for Indic scripts by mapping both languages to a common script.

srcw: source language string
tgtw: target language string
slang: source language
tlang: target language

indicnlp.script.indic_scripts.offset_to_char(off, lang)[source]

Applicable to Brahmi derived Indic scripts

indicnlp.script.indic_scripts.or_vectors(v1, v2)[source]
indicnlp.script.indic_scripts.xor_vectors(v1, v2)[source]

english_script Module

indicnlp.script.english_script.ENGLISH_PHONETIC_DATA = None

Phonetic data for English

indicnlp.script.english_script.ENGLISH_PHONETIC_VECTORS = None

Phonetic vectors for English

indicnlp.script.english_script.ID_ARPABET_MAP = {}

Mapping from phoneme IDs to ARPABET phonemes

indicnlp.script.english_script.PHONETIC_VECTOR_LENGTH = 38

Length of the phonetic feature vector

indicnlp.script.english_script.enc_to_offset(c)[source]
indicnlp.script.english_script.enc_to_phoneme(ph)[source]
indicnlp.script.english_script.get_phonetic_feature_vector(p, lang)[source]
indicnlp.script.english_script.get_phonetic_info(lang)[source]
indicnlp.script.english_script.in_range(offset)[source]
indicnlp.script.english_script.init()[source]

To be called by the library loader; do not call it from your program.

indicnlp.script.english_script.invalid_vector()[source]
indicnlp.script.english_script.offset_to_phoneme(ph_id)[source]
indicnlp.script.english_script.phoneme_to_enc(ph)[source]
indicnlp.script.english_script.phoneme_to_offset(ph)[source]

phonetic_sim Module

indicnlp.script.phonetic_sim.cosine(v1, v2)[source]
indicnlp.script.phonetic_sim.create_similarity_matrix(sim_func, slang, tlang, normalize=True)[source]
indicnlp.script.phonetic_sim.dice(v1, v2)[source]
indicnlp.script.phonetic_sim.dotprod(v1, v2)[source]
indicnlp.script.phonetic_sim.equal(v1, v2)[source]
indicnlp.script.phonetic_sim.jaccard(v1, v2)[source]
indicnlp.script.phonetic_sim.sim1(v1, v2, base=5.0)[source]
indicnlp.script.phonetic_sim.softmax(v1, v2)[source]
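These similarity functions operate on phonetic feature vectors; the standard definitions of a few of them can be sketched as follows (dotprod, cosine, jaccard, and dice shown; the library's versions may scale differently):

```python
import math

def dotprod(v1, v2):
    return sum(a * b for a, b in zip(v1, v2))

def cosine(v1, v2):
    # dot product normalized by the vector magnitudes
    return dotprod(v1, v2) / math.sqrt(dotprod(v1, v1) * dotprod(v2, v2))

def jaccard(v1, v2):
    # intersection over union of the set bits
    inter = sum(1 for a, b in zip(v1, v2) if a and b)
    union = sum(1 for a, b in zip(v1, v2) if a or b)
    return inter / union

def dice(v1, v2):
    # twice the intersection over the total set-bit count
    inter = sum(1 for a, b in zip(v1, v2) if a and b)
    return 2 * inter / (sum(v1) + sum(v2))
```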

syllable Package

syllabifier Module

indicnlp.syllable.syllabifier.char_backoff(syllables_list, vocab)[source]
indicnlp.syllable.syllabifier.denormalize_malayalam(word, word_mask)[source]
indicnlp.syllable.syllabifier.denormalize_punjabi(word, word_mask)[source]
indicnlp.syllable.syllabifier.normalize_malayalam(word)[source]
indicnlp.syllable.syllabifier.normalize_punjabi(word)[source]
indicnlp.syllable.syllabifier.orthographic_simple_syllabify(word, lang, vocab=None)[source]
indicnlp.syllable.syllabifier.orthographic_syllabify(word, lang, vocab=None)[source]
indicnlp.syllable.syllabifier.orthographic_syllabify_improved(word, lang, vocab=None)[source]

tokenize Package

indic_tokenize Module

Tokenizer for Indian languages. Currently, simple punctuation-based tokenizers are supported (see trivial_tokenize). Major Indian language punctuations are handled.

indicnlp.tokenize.indic_tokenize.trivial_tokenize(text, lang='hi')[source]

Trivial tokenizer for Indian languages written in Brahmi-derived or Arabic scripts

A trivial tokenizer which just tokenizes on punctuation boundaries. Major punctuation marks specific to Indian languages are handled. These punctuation characters were identified from the Unicode database.

Parameters:
  • text (str) – text to tokenize
  • lang (str) – ISO 639-2 language code
Returns:

list of tokens

Return type:

list
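The punctuation-boundary tokenization can be sketched in a few lines (a simplified stand-in covering only a small punctuation set, including the purna virama '।' and deergha virama '॥'):

```python
import re

def trivial_tokenize_indic(text):
    # pad each punctuation mark with spaces, then split on whitespace
    text = re.sub(r'([।॥,;!?.])', r' \1 ', text)
    return text.split()
```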

indicnlp.tokenize.indic_tokenize.trivial_tokenize_indic(text)[source]

Tokenize a string for Indian languages written in Brahmi-derived scripts

A trivial tokenizer which just tokenizes on punctuation boundaries. This also includes punctuation marks used in Indian language scripts (the purna virama and the deergha virama). This is a language-independent tokenizer.

Parameters:text (str) – text to tokenize
Returns:list of tokens
Return type:list
indicnlp.tokenize.indic_tokenize.trivial_tokenize_urdu(text)[source]

Tokenize an Urdu string

A trivial tokenizer which just tokenizes on punctuation boundaries. This also includes punctuation marks for the Urdu script. These punctuation characters were identified from the Unicode database for the Arabic script by looking for punctuation symbols.

Parameters:text (str) – text to tokenize
Returns:list of tokens
Return type:list

indic_detokenize Module

De-tokenizer for Indian languages.

indicnlp.tokenize.indic_detokenize.trivial_detokenize(text, lang='hi')[source]

detokenize string for languages of the Indian subcontinent

A trivial detokenizer which:

  • decides whether punctuation attaches to left/right or both
  • handles number sequences
  • handles quotes smartly (deciding left or right attachment)
Parameters:text (str) – tokenized text to process
Returns:detokenized string
Return type:str
Raises:IndicNlpException – If language is not supported
indicnlp.tokenize.indic_detokenize.trivial_detokenize_indic(text)[source]

detokenize string for Indian language scripts using Brahmi-derived scripts

A trivial detokenizer which:

  • decides whether punctuation attaches to left/right or both
  • handles number sequences
  • handles quotes smartly (deciding left or right attachment)
Parameters:text (str) – tokenized text to process
Returns:detokenized string
Return type:str
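A minimal sketch of left-attachment for sentence-final punctuation, the simplest of the detokenization rules listed above (the real detokenizer also handles quotes and number sequences):

```python
import re

def attach_punct_left(text):
    # sentence-final punctuation attaches to the token on its left
    return re.sub(r'\s+([।॥,;!?.])', r'\1', text)
```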

sentence_tokenize Module

Sentence splitter for Indian languages. Contains a rule-based sentence splitter that can understand common non-breaking phrases in many Indian languages.

indicnlp.tokenize.sentence_tokenize.is_acronym_abbvr(text, lang)[source]

Is the text a non-breaking phrase

Parameters:
  • text (str) – text to check for non-breaking phrase
  • lang (str) – ISO 639-2 language code
Returns:

true if text is a non-breaking phrase

Return type:

boolean

indicnlp.tokenize.sentence_tokenize.is_latin_or_numeric(character)[source]

Check if a character is a Latin character (uppercase or lowercase) or a number.

Parameters:character (str) – The character to be checked.
Returns:True if the character is a Latin character or a number, False otherwise.
Return type:bool
indicnlp.tokenize.sentence_tokenize.sentence_split(text, lang, delim_pat='auto')[source]

split the text into sentences

A rule-based sentence splitter for Indian languages written in Brahmi-derived scripts. The text is split at sentence delimiter boundaries. The delimiters can be configured by passing appropriate parameters.

The sentence splitter can identify non-breaking phrases such as single letters and common abbreviations/honorifics for some Indian languages.

Parameters:
  • text (str) – text to split into sentence
  • lang (str) – ISO 639-2 language code
  • delim_pat (str) – regular expression to identify sentence delimiter characters. If set to ‘auto’, the delimiter pattern is chosen automatically based on the language and text.
Returns:

list of sentences identified from the input text

Return type:

list
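The delimiter-based splitting can be sketched as follows (a simplified stand-in with no non-breaking-phrase handling; the delimiter pattern shown is illustrative):

```python
import re

def sentence_split(text, delim_pat='[।?!]'):
    # split at delimiters, keeping each delimiter attached to its sentence
    parts = re.split(f'({delim_pat})', text)
    sents = [(parts[i] + parts[i + 1]).strip() for i in range(0, len(parts) - 1, 2)]
    if parts[-1].strip():
        sents.append(parts[-1].strip())
    return [s for s in sents if s]
```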

transliterate Package

sinhala_transliterator Module

class indicnlp.transliterate.sinhala_transliterator.SinhalaDevanagariTransliterator[source]

Bases: object

A Devanagari to Sinhala transliterator based on explicit Unicode Mapping

static devanagari_to_sinhala(text)[source]
devnag_sinhala_map = {'ऀ': 'ං', 'ँ': 'ං', 'ं': 'ං', 'ः': 'ඃ', 'ऄ': '\u0d84', 'अ': 'අ', 'आ': 'ආ', 'इ': 'ඉ', 'ई': 'ඊ', 'उ': 'උ', 'ऊ': 'ඌ', 'ऋ': 'ඍ', 'ऌ': 'ඏ', 'ऍ': 'ඈ', 'ऎ': 'එ', 'ए': 'ඒ', 'ऐ': 'ඓ', 'ऒ': 'ඔ', 'ओ': 'ඕ', 'औ': 'ඖ', 'क': 'ක', 'ख': 'ඛ', 'ग': 'ග', 'घ': 'ඝ', 'ङ': 'ඞ', 'च': 'ච', 'छ': 'ඡ', 'ज': 'ජ', 'झ': 'ඣ', 'ञ': 'ඤ', 'ट': 'ට', 'ठ': 'ඨ', 'ड': 'ඩ', 'ढ': 'ඪ', 'ण': 'ණ', 'त': 'ත', 'थ': 'ථ', 'द': 'ද', 'ध': 'ධ', 'न': 'න', 'ऩ': 'න', 'प': 'ප', 'फ': 'ඵ', 'ब': 'බ', 'भ': 'භ', 'म': 'ම', 'य': 'ය', 'र': 'ර', 'ल': 'ල', 'ळ': 'ළ', 'व': 'ව', 'श': 'ශ', 'ष': 'ෂ', 'स': 'ස', 'ह': 'හ', 'ा': 'ා', 'ि': 'ි', 'ी': 'ී', 'ु': 'ු', 'ू': 'ූ', 'ृ': 'ෘ', 'ॆ': 'ෙ', 'े': 'ේ', 'ै': 'ෛ', 'ॉ': 'ෑ', 'ॊ': 'ො', 'ो': 'ෝ', 'ौ': 'ෞ', '्': '්'}
sinhala_devnag_map = {'ං': 'ं', 'ඃ': 'ः', '\u0d84': 'ऄ', 'අ': 'अ', 'ආ': 'आ', 'ඇ': 'ऍ', 'ඈ': 'ऍ', 'ඉ': 'इ', 'ඊ': 'ई', 'උ': 'उ', 'ඌ': 'ऊ', 'ඍ': 'ऋ', 'ඏ': 'ऌ', 'එ': 'ऎ', 'ඒ': 'ए', 'ඓ': 'ऐ', 'ඔ': 'ऒ', 'ඕ': 'ओ', 'ඖ': 'औ', 'ක': 'क', 'ඛ': 'ख', 'ග': 'ग', 'ඝ': 'घ', 'ඞ': 'ङ', 'ඟ': 'ङ', 'ච': 'च', 'ඡ': 'छ', 'ජ': 'ज', 'ඣ': 'झ', 'ඤ': 'ञ', 'ඥ': 'ञ', 'ඦ': 'ञ', 'ට': 'ट', 'ඨ': 'ठ', 'ඩ': 'ड', 'ඪ': 'ढ', 'ණ': 'ण', 'ඬ': 'ण', 'ත': 'त', 'ථ': 'थ', 'ද': 'द', 'ධ': 'ध', 'න': 'न', '\u0db2': 'न', 'ඳ': 'न', 'ප': 'प', 'ඵ': 'फ', 'බ': 'ब', 'භ': 'भ', 'ම': 'म', 'ය': 'य', 'ර': 'र', 'ල': 'ल', 'ව': 'व', 'ශ': 'श', 'ෂ': 'ष', 'ස': 'स', 'හ': 'ह', 'ළ': 'ळ', '්': '्', 'ා': 'ा', 'ැ': 'ॉ', 'ෑ': 'ॉ', 'ි': 'ि', 'ී': 'ी', 'ු': 'ु', 'ූ': 'ू', 'ෘ': 'ृ', 'ෙ': 'ॆ', 'ේ': 'े', 'ෛ': 'ै', 'ො': 'ॊ', 'ෝ': 'ो', 'ෞ': 'ौ'}
static sinhala_to_devanagari(text)[source]
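Since the transliterator is an explicit character map, usage reduces to a per-character lookup. A sketch using a small subset of devnag_sinhala_map above:

```python
# Subset of devnag_sinhala_map above
DEVNAG_TO_SINHALA = {'क': 'ක', 'ल': 'ල', 'ा': 'ා'}

def devanagari_to_sinhala(text):
    # unmapped characters pass through unchanged
    return ''.join(DEVNAG_TO_SINHALA.get(c, c) for c in text)
```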

unicode_transliterate Module

class indicnlp.transliterate.unicode_transliterate.ItransTransliterator[source]

Bases: object

Transliterator between Indian scripts and ITRANS

static from_itrans(text, lang)[source]

TODO: Document this method properly
TODO: A little hack is used to handle schwa; needs to be documented
TODO: Check for robustness

static to_itrans(text, lang_code)[source]
class indicnlp.transliterate.unicode_transliterate.UnicodeIndicTransliterator[source]

Bases: object

Base class for rule-based transliteration among Indian languages.

Script-pair-specific transliterators should derive from this class and override the transliterate() method. They can call the superclass transliterate() method for the common transliteration.

static transliterate(text, lang1_code, lang2_code)[source]

Convert text in the source language script (lang1) to the target language script (lang2)

text: text to transliterate
lang1_code: language 1 code
lang2_code: language 2 code
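For Brahmi-derived scripts, this transliteration amounts to preserving each character's offset while switching the script's Unicode base. A sketch (the script bases and the 0x00-0x6F coordinated range are assumed from the Unicode block layout, not copied from the library):

```python
SCRIPT_BASE = {'hi': 0x0900, 'bn': 0x0980, 'ta': 0x0B80}  # assumed block bases

def transliterate(text, lang1_code, lang2_code):
    out = []
    for c in text:
        off = ord(c) - SCRIPT_BASE[lang1_code]
        # only characters in the coordinated range are mapped across scripts
        out.append(chr(SCRIPT_BASE[lang2_code] + off) if 0 <= off <= 0x6F else c)
    return ''.join(out)
```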

indicnlp.transliterate.unicode_transliterate.init()[source]

To be called by library loader, do not call it in your program

acronym_transliterator Module

class indicnlp.transliterate.acronym_transliterator.LatinToIndicAcronymTransliterator[source]

Bases: object

LATIN_ALPHABET = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
LATIN_TO_DEVANAGARI_TRANSTABLE = {97: 'ए', 98: 'बी', 99: 'सी', 100: 'डी', 101: 'ई', 102: 'एफ', 103: 'जी', 104: 'एच', 105: 'आई', 106: 'जे', 107: 'के', 108: 'एल', 109: 'एम', 110: 'एन', 111: 'ओ', 112: 'पी', 113: 'क्यू', 114: 'आर', 115: 'एस', 116: 'टी', 117: 'यू', 118: 'वी', 119: 'डब्ल्यू', 120: 'एक्स', 121: 'वाय', 122: 'जेड'}
static generate_latin_acronyms(num_acronyms, min_len=2, max_len=6, strategy='random')[source]

generate Latin acronyms in lower case

static get_transtable()[source]
static transliterate(w, lang)[source]
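Because the transtable keys are Latin codepoints, it plugs directly into str.translate. A sketch with a subset of LATIN_TO_DEVANAGARI_TRANSTABLE:

```python
# Subset of LATIN_TO_DEVANAGARI_TRANSTABLE above; keys are Latin codepoints
LATIN_TO_DEVANAGARI = {97: 'ए', 98: 'बी', 99: 'सी'}

def transliterate_acronym(w):
    # str.translate maps each codepoint through the table
    return w.lower().translate(LATIN_TO_DEVANAGARI)
```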

script_unifier Module

class indicnlp.transliterate.script_unifier.AggressiveScriptUnifier(common_lang='hi', nasals_mode='to_nasal_consonants')[source]

Bases: object

transform(text, lang)[source]
class indicnlp.transliterate.script_unifier.BasicScriptUnifier(common_lang='hi', nasals_mode='do_nothing')[source]

Bases: object

transform(text, lang)[source]
class indicnlp.transliterate.script_unifier.NaiveScriptUnifier(common_lang='hi')[source]

Bases: object

transform(text, lang)[source]

Commandline

usage: cliparser.py [-h]
                    {tokenize,detokenize,sentence_split,normalize,morph,syllabify,wc,indic2roman,roman2indic,script_unify,script_convert}
                    ...

Positional Arguments

subcommand

Possible choices: tokenize, detokenize, sentence_split, normalize, morph, syllabify, wc, indic2roman, roman2indic, script_unify, script_convert

Invoke each operation with one of the subcommands

Sub-commands

tokenize

tokenizer help

cliparser.py tokenize [-h] [-l LANG] [infile] [outfile]

Positional Arguments

infile

Input File path

Default: <_io.TextIOWrapper name=’<stdin>’ mode=’r’ encoding=’UTF-8’>

outfile

Output File path

Default: <_io.TextIOWrapper name=’<stdout>’ mode=’w’ encoding=’UTF-8’>

Named Arguments

-l, --lang Language

detokenize

de-tokenizer help

cliparser.py detokenize [-h] [-l LANG] [infile] [outfile]

Positional Arguments

infile

Input File path

Default: <_io.TextIOWrapper name=’<stdin>’ mode=’r’ encoding=’UTF-8’>

outfile

Output File path

Default: <_io.TextIOWrapper name=’<stdout>’ mode=’w’ encoding=’UTF-8’>

Named Arguments

-l, --lang Language

sentence_split

sentence split help

cliparser.py sentence_split [-h] [-l LANG] [infile] [outfile]

Positional Arguments

infile

Input File path

Default: <_io.TextIOWrapper name=’<stdin>’ mode=’r’ encoding=’UTF-8’>

outfile

Output File path

Default: <_io.TextIOWrapper name=’<stdout>’ mode=’w’ encoding=’UTF-8’>

Named Arguments

-l, --lang Language

normalize

normalizer help

cliparser.py normalize [-h] [-l LANG] [infile] [outfile]

Positional Arguments

infile

Input File path

Default: <_io.TextIOWrapper name=’<stdin>’ mode=’r’ encoding=’UTF-8’>

outfile

Output File path

Default: <_io.TextIOWrapper name=’<stdout>’ mode=’w’ encoding=’UTF-8’>

Named Arguments

-l, --lang Language

morph

morph help

cliparser.py morph [-h] [-l LANG] [infile] [outfile]

Positional Arguments

infile

Input File path

Default: <_io.TextIOWrapper name=’<stdin>’ mode=’r’ encoding=’UTF-8’>

outfile

Output File path

Default: <_io.TextIOWrapper name=’<stdout>’ mode=’w’ encoding=’UTF-8’>

Named Arguments

-l, --lang Language

syllabify

syllabify help

cliparser.py syllabify [-h] [-l LANG] [infile] [outfile]

Positional Arguments

infile

Input File path

Default: <_io.TextIOWrapper name=’<stdin>’ mode=’r’ encoding=’UTF-8’>

outfile

Output File path

Default: <_io.TextIOWrapper name=’<stdout>’ mode=’w’ encoding=’UTF-8’>

Named Arguments

-l, --lang Language

wc

wc help

cliparser.py wc [-h] [infile]

Positional Arguments

infile

Input File path

Default: <_io.TextIOWrapper name=’<stdin>’ mode=’r’ encoding=’UTF-8’>

indic2roman

indic2roman help

cliparser.py indic2roman [-h] [-l LANG] [infile] [outfile]

Positional Arguments

infile

Input File path

Default: <_io.TextIOWrapper name=’<stdin>’ mode=’r’ encoding=’UTF-8’>

outfile

Output File path

Default: <_io.TextIOWrapper name=’<stdout>’ mode=’w’ encoding=’UTF-8’>

Named Arguments

-l, --lang Language

roman2indic

roman2indic help

cliparser.py roman2indic [-h] [-l LANG] [infile] [outfile]

Positional Arguments

infile

Input File path

Default: <_io.TextIOWrapper name=’<stdin>’ mode=’r’ encoding=’UTF-8’>

outfile

Output File path

Default: <_io.TextIOWrapper name=’<stdout>’ mode=’w’ encoding=’UTF-8’>

Named Arguments

-l, --lang Language

script_unify

script_unify help

cliparser.py script_unify [-h] [-l LANG] [-m {naive,basic,aggressive}]
                          [-c COMMON_LANG]
                          [infile] [outfile]

Positional Arguments

infile

Input File path

Default: <_io.TextIOWrapper name=’<stdin>’ mode=’r’ encoding=’UTF-8’>

outfile

Output File path

Default: <_io.TextIOWrapper name=’<stdout>’ mode=’w’ encoding=’UTF-8’>

Named Arguments

-l, --lang Language
-m, --mode

Possible choices: naive, basic, aggressive

Script unification mode

Default: “basic”

-c, --common_lang

Common language in which all languages are represented

Default: “hi”

script_convert

script convert help

cliparser.py script_convert [-h] [-s SRCLANG] [-t TGTLANG] [infile] [outfile]

Positional Arguments

infile

Input File path

Default: <_io.TextIOWrapper name=’<stdin>’ mode=’r’ encoding=’UTF-8’>

outfile

Output File path

Default: <_io.TextIOWrapper name=’<stdout>’ mode=’w’ encoding=’UTF-8’>

Named Arguments

-s, --srclang Source Language
-t, --tgtlang Target Language