normalize Package

indic_normalize Module

class indicnlp.normalize.indic_normalize.BaseNormalizer(lang, remove_nuktas=False, nasals_mode='do_nothing', do_normalize_chandras=False, do_normalize_vowel_ending=False)[source]

Bases: indicnlp.normalize.indic_normalize.NormalizerI

correct_visarga(text, visarga_char, char_range)[source]
get_char_stats(text)[source]
normalize(text)[source]

Method to be implemented for normalization for each script

class indicnlp.normalize.indic_normalize.BengaliNormalizer(lang='bn', remove_nuktas=False, nasals_mode='do_nothing', do_normalize_chandras=False, do_normalize_vowel_ending=False, do_remap_assamese_chars=False)[source]

Bases: indicnlp.normalize.indic_normalize.BaseNormalizer

Normalizer for the Bengali script. In addition to the basic normalization performed by the superclass, it:

* Replaces composite characters containing nuktas with their decomposed form
* Replaces the reserved character for poorna virama (if used) with the recommended generic Indic-script poorna virama
* Canonicalizes two-part dependent vowels
* Replaces the pipe character ‘|’ with the poorna virama character
* Replaces a colon ‘:’ with a visarga if the colon follows a character in this script

NUKTA = '়'
normalize(text)[source]

Method to be implemented for normalization for each script
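The two-part dependent vowel step can be illustrated with a minimal sketch. It is not the library's implementation: the mapping table and the function name `canonicalize_bengali_vowels` are assumptions made for illustration, built from the canonical decompositions of the Bengali vowel signs O and AU in the Unicode standard.

```python
# Hypothetical table: a vowel sign typed as two code points is mapped to
# the single precomposed code point defined by Unicode.
TWO_PART_VOWELS = {
    "\u09c7\u09be": "\u09cb",  # VOWEL SIGN E + VOWEL SIGN AA -> VOWEL SIGN O
    "\u09c7\u09d7": "\u09cc",  # VOWEL SIGN E + AU LENGTH MARK -> VOWEL SIGN AU
}


def canonicalize_bengali_vowels(text):
    """Replace each two-part vowel sequence with its single-code-point form."""
    for parts, single in TWO_PART_VOWELS.items():
        text = text.replace(parts, single)
    return text
```

For these two sequences, `unicodedata.normalize("NFC", text)` from the Python standard library produces the same composition, so an explicit table is mainly useful when only this one step is wanted.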

class indicnlp.normalize.indic_normalize.DevanagariNormalizer(lang='hi', remove_nuktas=False, nasals_mode='do_nothing', do_normalize_chandras=False, do_normalize_vowel_ending=False)[source]

Bases: indicnlp.normalize.indic_normalize.BaseNormalizer

Normalizer for the Devanagari script. In addition to the basic normalization performed by the superclass, it:

* Replaces composite characters containing nuktas with their decomposed form
* Replaces the pipe character ‘|’ with the poorna virama character
* Replaces a colon ‘:’ with a visarga if the colon follows a character in this script

NUKTA = '़'
get_char_stats(text)[source]
normalize(text)[source]

Method to be implemented for normalization for each script
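The three Devanagari-specific steps listed above can be sketched in plain Python. This is an illustration under stated assumptions, not the library's code: the function name `normalize_devanagari_sketch`, the decomposition table, and the choice of U+0900 to U+097F as "a character in this script" are all assumptions. The table itself follows the canonical decompositions of the precomposed nukta letters in the Unicode standard.

```python
import re

NUKTA = "\u093c"          # DEVANAGARI SIGN NUKTA
POORNA_VIRAMA = "\u0964"  # DEVANAGARI DANDA
VISARGA = "\u0903"        # DEVANAGARI SIGN VISARGA

# Precomposed nukta letter -> base consonant (the nukta is appended below).
NUKTA_BASES = {
    "\u0929": "\u0928",  # NNNA -> NA
    "\u0931": "\u0930",  # RRA  -> RA
    "\u0934": "\u0933",  # LLLA -> LLA
    "\u0958": "\u0915",  # QA   -> KA
    "\u0959": "\u0916",  # KHHA -> KHA
    "\u095a": "\u0917",  # GHHA -> GA
    "\u095b": "\u091c",  # ZA   -> JA
    "\u095c": "\u0921",  # DDDHA -> DDA
    "\u095d": "\u0922",  # RHA  -> DDHA
    "\u095e": "\u092b",  # FA   -> PHA
    "\u095f": "\u092f",  # YYA  -> YA
}

# Assumption: "a character in this script" means the Devanagari block.
_COLON_AFTER_DEVANAGARI = re.compile("([\u0900-\u097f]):")


def normalize_devanagari_sketch(text):
    # 1. Decompose composite nukta characters into base consonant + nukta.
    for composite, base in NUKTA_BASES.items():
        text = text.replace(composite, base + NUKTA)
    # 2. A colon typed directly after a Devanagari character becomes a visarga.
    text = _COLON_AFTER_DEVANAGARI.sub(lambda m: m.group(1) + VISARGA, text)
    # 3. A pipe typed in place of the danda becomes the poorna virama.
    return text.replace("|", POORNA_VIRAMA)
```

The colon rule runs before the pipe rule so that a colon following a pipe is not mistaken for one following a Devanagari character.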

class indicnlp.normalize.indic_normalize.GujaratiNormalizer(lang='gu', remove_nuktas=False, nasals_mode='do_nothing', do_normalize_chandras=False, do_normalize_vowel_ending=False)[source]

Bases: indicnlp.normalize.indic_normalize.BaseNormalizer

Normalizer for the Gujarati script. In addition to the basic normalization performed by the superclass, it:

* Replaces the reserved character for poorna virama (if used) with the recommended generic Indic-script poorna virama
* Replaces a colon ‘:’ with a visarga if the colon follows a character in this script

NUKTA = '઼'
normalize(text)[source]

Method to be implemented for normalization for each script

class indicnlp.normalize.indic_normalize.GurmukhiNormalizer(lang='pa', remove_nuktas=False, nasals_mode='do_nothing', do_normalize_chandras=False, do_normalize_vowel_ending=False, do_canonicalize_addak=False, do_canonicalize_tippi=False, do_replace_vowel_bases=False)[source]

Bases: indicnlp.normalize.indic_normalize.BaseNormalizer

Normalizer for the Gurmukhi script. In addition to the basic normalization performed by the superclass, it:

* Replaces composite characters containing nuktas with their decomposed form
* Replaces the reserved character for poorna virama (if used) with the recommended generic Indic-script poorna virama
* Replaces the pipe character ‘|’ with the poorna virama character
* Replaces a colon ‘:’ with a visarga if the colon follows a character in this script

NUKTA = '਼'
VOWEL_NORM_MAPS = {'ਅਾ': 'ਆ', 'ਅੈ': 'ਐ', 'ਅੌ': 'ਔ', 'ੲਿ': 'ਇ', 'ੲੀ': 'ਈ', 'ੲੇ': 'ਏ', 'ੳੁ': 'ਉ', 'ੳੂ': 'ਊ', 'ੳੋ': 'ਓ'}
normalize(text)[source]

Method to be implemented for normalization for each script
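The VOWEL_NORM_MAPS attribute pairs a vowel-bearer letter plus a dependent vowel sign with the equivalent independent vowel. A minimal sketch of applying such a map follows. The function name `replace_vowel_bases` is hypothetical, and the assumption that this step is what the `do_replace_vowel_bases` flag enables is not confirmed by the text above. The table repeats the documented map using escapes so the code points are unambiguous.

```python
# Same pairs as the documented VOWEL_NORM_MAPS, written as escapes.
VOWEL_NORM_MAPS = {
    "\u0a05\u0a3e": "\u0a06",  # A + sign AA   -> AA
    "\u0a05\u0a48": "\u0a10",  # A + sign AI   -> AI
    "\u0a05\u0a4c": "\u0a14",  # A + sign AU   -> AU
    "\u0a72\u0a3f": "\u0a07",  # IRI + sign I  -> I
    "\u0a72\u0a40": "\u0a08",  # IRI + sign II -> II
    "\u0a72\u0a47": "\u0a0f",  # IRI + sign EE -> EE
    "\u0a73\u0a41": "\u0a09",  # URA + sign U  -> U
    "\u0a73\u0a42": "\u0a0a",  # URA + sign UU -> UU
    "\u0a73\u0a4b": "\u0a13",  # URA + sign OO -> OO
}


def replace_vowel_bases(text):
    """Rewrite bearer + vowel-sign sequences as independent vowels."""
    for sequence, vowel in VOWEL_NORM_MAPS.items():
        text = text.replace(sequence, vowel)
    return text
```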

class indicnlp.normalize.indic_normalize.IndicNormalizerFactory[source]

Bases: object

Factory class to create language specific normalizers.

get_normalizer(language, **kwargs)[source]

Call the get_normalizer function to get the language-specific normalizer.

Parameters:

* language: language code
* remove_nuktas: boolean, whether the normalizer should remove nukta characters

is_language_supported(language)[source]

Is the language supported?
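A typical call sequence through the factory might look like the sketch below. The wrapper `normalize_text` and its fallback behaviour (returning the input unchanged when the library is not installed or the language is unsupported) are assumptions made for illustration. Only `IndicNormalizerFactory`, `is_language_supported`, `get_normalizer`, and `normalize` come from the reference above.

```python
try:
    from indicnlp.normalize.indic_normalize import IndicNormalizerFactory
except ImportError:  # the library is optional in this sketch
    IndicNormalizerFactory = None


def normalize_text(text, lang, **options):
    """Normalize `text` for language code `lang`.

    Hypothetical helper: returns `text` unchanged if the library is
    missing or the factory reports the language as unsupported.
    """
    if IndicNormalizerFactory is None:
        return text
    factory = IndicNormalizerFactory()
    if not factory.is_language_supported(lang):
        return text
    # Keyword options such as remove_nuktas are forwarded to the normalizer.
    normalizer = factory.get_normalizer(lang, **options)
    return normalizer.normalize(text)
```

A call such as `normalize_text(line, "hi", remove_nuktas=False)` then yields the normalized line for Hindi text.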

class indicnlp.normalize.indic_normalize.KannadaNormalizer(lang='kn', remove_nuktas=False, nasals_mode='do_nothing', do_normalize_chandras=False, do_normalize_vowel_ending=False)[source]

Bases: indicnlp.normalize.indic_normalize.BaseNormalizer

Normalizer for the Kannada script. In addition to the basic normalization performed by the superclass, it:

* Replaces the reserved character for poorna virama (if used) with the recommended generic Indic-script poorna virama
* Canonicalizes two-part dependent vowel signs
* Replaces a colon ‘:’ with a visarga if the colon follows a character in this script

normalize(text)[source]

Method to be implemented for normalization for each script

class indicnlp.normalize.indic_normalize.MalayalamNormalizer(lang='ml', remove_nuktas=False, nasals_mode='do_nothing', do_normalize_chandras=False, do_normalize_vowel_ending=False, do_canonicalize_chillus=False, do_correct_geminated_T=False)[source]

Bases: indicnlp.normalize.indic_normalize.BaseNormalizer

Normalizer for the Malayalam script. In addition to the basic normalization performed by the superclass, it:

* Replaces the reserved character for poorna virama (if used) with the recommended generic Indic-script poorna virama
* Canonicalizes two-part dependent vowel signs
* Changes the old encoding of chillus (used until Unicode 5.0) to the new encoding
* Replaces a colon ‘:’ with a visarga if the colon follows a character in this script

CHILLU_CHAR_MAP = {'ൺ': 'ണ', 'ൻ': 'ന', 'ർ': 'ര', 'ൽ': 'ല', 'ൾ': 'ള', 'ൿ': 'ക'}
normalize(text)[source]

Method to be implemented for normalization for each script
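CHILLU_CHAR_MAP pairs each atomic chillu code point with its base consonant. One plausible way to use such a map, not confirmed by the text above, is to rewrite every atomic chillu as the base consonant followed by the virama, so that both encodings of a chillu end up as the same sequence. The function name `canonicalize_chillus` and the direction of the rewrite are assumptions.

```python
VIRAMA = "\u0d4d"  # MALAYALAM SIGN VIRAMA

# Same pairs as the documented CHILLU_CHAR_MAP, written as escapes.
CHILLU_CHAR_MAP = {
    "\u0d7a": "\u0d23",  # CHILLU NN -> NNA
    "\u0d7b": "\u0d28",  # CHILLU N  -> NA
    "\u0d7c": "\u0d30",  # CHILLU RR -> RA
    "\u0d7d": "\u0d32",  # CHILLU L  -> LA
    "\u0d7e": "\u0d33",  # CHILLU LL -> LLA
    "\u0d7f": "\u0d15",  # CHILLU K  -> KA
}


def canonicalize_chillus(text):
    """Assumed behaviour: atomic chillu -> base consonant + virama."""
    for chillu, base in CHILLU_CHAR_MAP.items():
        text = text.replace(chillu, base + VIRAMA)
    return text
```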

class indicnlp.normalize.indic_normalize.NormalizerI[source]

Bases: object

The normalizer classes do the following:

* Some characters have multiple Unicode code points. The normalizer chooses a single standard representation
* Some control characters are deleted
* Certain typical mistakes that occur when typing with a Latin keyboard are corrected by the module

Base class for normalizers. Performs some common normalization, which includes:

* Removal of the byte order mark, word joiner, etc.
* Removal of ZERO_WIDTH_NON_JOINER and ZERO_WIDTH_JOINER
* Replacement of ZERO_WIDTH_SPACE and NO_BREAK_SPACE by spaces

Script-specific normalizers should derive from this class and override the normalize() method. They can call the superclass's normalize() method to apply the common normalization.
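The common normalization described above can be sketched with a single translation table. This is an illustration rather than the library's implementation: the function name `strip_common_artifacts` is hypothetical, and treating the soft hyphen as one of the characters covered by "etc." is an assumption. The code points are the ones listed in the class attributes.

```python
BYTE_ORDER_MARK = "\ufeff"
BYTE_ORDER_MARK_2 = "\ufffe"
SOFT_HYPHEN = "\xad"
WORD_JOINER = "\u2060"
ZERO_WIDTH_JOINER = "\u200d"
ZERO_WIDTH_NON_JOINER = "\u200c"
ZERO_WIDTH_SPACE = "\u200b"
NO_BREAK_SPACE = "\xa0"

# Characters deleted outright (soft hyphen included by assumption).
_REMOVE = (
    BYTE_ORDER_MARK,
    BYTE_ORDER_MARK_2,
    SOFT_HYPHEN,
    WORD_JOINER,
    ZERO_WIDTH_JOINER,
    ZERO_WIDTH_NON_JOINER,
)
# Characters replaced by an ordinary space.
_TO_SPACE = (ZERO_WIDTH_SPACE, NO_BREAK_SPACE)

_TABLE = {ord(ch): None for ch in _REMOVE}
_TABLE.update({ord(ch): " " for ch in _TO_SPACE})


def strip_common_artifacts(text):
    """Delete invisible control characters and map special spaces to ' '."""
    return text.translate(_TABLE)
```

`str.translate` makes one pass over the string, which avoids chaining eight separate `replace` calls.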

BYTE_ORDER_MARK = '\ufeff'
BYTE_ORDER_MARK_2 = '\ufffe'
NO_BREAK_SPACE = '\xa0'
SOFT_HYPHEN = '\xad'
WORD_JOINER = '\u2060'
ZERO_WIDTH_JOINER = '\u200d'
ZERO_WIDTH_NON_JOINER = '\u200c'
ZERO_WIDTH_SPACE = '\u200b'
normalize(text)[source]

class indicnlp.normalize.indic_normalize.OriyaNormalizer(lang='or', remove_nuktas=False, nasals_mode='do_nothing', do_normalize_chandras=False, do_normalize_vowel_ending=False, do_remap_wa=False)[source]

Bases: indicnlp.normalize.indic_normalize.BaseNormalizer

Normalizer for the Oriya script. In addition to the basic normalization performed by the superclass, it:

* Replaces composite characters containing nuktas with their decomposed form
* Replaces the reserved character for poorna virama (if used) with the recommended generic Indic-script poorna virama
* Canonicalizes two-part dependent vowels
* Replaces ‘va’ with ‘ba’
* Replaces the pipe character ‘|’ with the poorna virama character
* Replaces a colon ‘:’ with a visarga if the colon follows a character in this script

NUKTA = '଼'
VOWEL_NORM_MAPS = {'ଅା': 'ଆ', 'ଏୗ': 'ଐ', 'ଓୗ': 'ଔ'}
normalize(text)[source]

Method to be implemented for normalization for each script

class indicnlp.normalize.indic_normalize.TamilNormalizer(lang='ta', remove_nuktas=False, nasals_mode='do_nothing', do_normalize_chandras=False, do_normalize_vowel_ending=False)[source]

Bases: indicnlp.normalize.indic_normalize.BaseNormalizer

Normalizer for the Tamil script. In addition to the basic normalization performed by the superclass, it:

* Replaces the reserved character for poorna virama (if used) with the recommended generic Indic-script poorna virama
* Canonicalizes two-part dependent vowel signs
* Replaces a colon ‘:’ with a visarga if the colon follows a character in this script

normalize(text)[source]

Method to be implemented for normalization for each script

class indicnlp.normalize.indic_normalize.TeluguNormalizer(lang='te', remove_nuktas=False, nasals_mode='do_nothing', do_normalize_chandras=False, do_normalize_vowel_ending=False)[source]

Bases: indicnlp.normalize.indic_normalize.BaseNormalizer

Normalizer for the Telugu script. In addition to the basic normalization performed by the superclass, it:

* Replaces the reserved character for poorna virama (if used) with the recommended generic Indic-script poorna virama
* Canonicalizes two-part dependent vowel signs
* Replaces a colon ‘:’ with a visarga if the colon follows a character in this script

get_char_stats(text)[source]
normalize(text)[source]

Method to be implemented for normalization for each script

class indicnlp.normalize.indic_normalize.UrduNormalizer(lang, remove_nuktas=True)[source]

Bases: indicnlp.normalize.indic_normalize.NormalizerI

Uses the UrduHack library. https://docs.urduhack.com/en/stable/_modules/urduhack/normalization/character.html#normalize

normalize(text)[source]