indicnlp Package¶
common
Module¶
-
exception indicnlp.common.IndicNlpException(msg)
Bases: Exception
Exceptions thrown by Indic NLP Library components are instances of this class. The ‘msg’ attribute contains the exception details.
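A minimal sketch of how such an exception might be raised and inspected. The class below is a stand-in defined locally so the snippet runs without the library, and `require_supported_lang` is a hypothetical helper, not part of indicnlp.

```python
class IndicNlpException(Exception):
    """Local stand-in: keeps the error details in the `msg` attribute."""

    def __init__(self, msg):
        super().__init__(msg)
        self.msg = msg


def require_supported_lang(lang, supported=("hi", "bn")):
    # Hypothetical helper: reject language codes outside an assumed list.
    if lang not in supported:
        raise IndicNlpException("Language {} not supported".format(lang))
    return lang


try:
    require_supported_lang("xx")
except IndicNlpException as exc:
    print(exc.msg)
```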
langinfo
Module¶
-
indicnlp.langinfo.in_coordinated_range(c_offset)
Applicable to Brahmi-derived Indic scripts.
loader
Module¶
Subpackages¶
morph Package¶
unsupervised_morph
Module¶
-
class indicnlp.morph.unsupervised_morph.MorphAnalyzerI
Bases: object
Interface for morphological analyzers.
-
class indicnlp.morph.unsupervised_morph.UnsupervisedMorphAnalyzer(lang, add_marker=False)
Bases: indicnlp.morph.unsupervised_morph.MorphAnalyzerI
Unsupervised morphological analyzer built using Morfessor 2.0.
normalize Package¶
indic_normalize
Module¶
-
class indicnlp.normalize.indic_normalize.BaseNormalizer(lang, remove_nuktas=False, nasals_mode='do_nothing', do_normalize_chandras=False, do_normalize_vowel_ending=False)
-
class indicnlp.normalize.indic_normalize.BengaliNormalizer(lang='bn', remove_nuktas=False, nasals_mode='do_nothing', do_normalize_chandras=False, do_normalize_vowel_ending=False, do_remap_assamese_chars=False)
Bases: indicnlp.normalize.indic_normalize.BaseNormalizer
Normalizer for the Bengali script. In addition to the basic normalization performed by the superclass, it:
- replaces composite characters containing nuktas with their decomposed form
- replaces the reserved character for poorna virama (if used) with the recommended generic Indic-script poorna virama
- canonicalizes two-part dependent vowels
- replaces the pipe character ‘|’ with the poorna virama character
- replaces a colon ‘:’ with visarga if the colon follows a character in this script
-
NUKTA
= '়'¶
-
-
class indicnlp.normalize.indic_normalize.DevanagariNormalizer(lang='hi', remove_nuktas=False, nasals_mode='do_nothing', do_normalize_chandras=False, do_normalize_vowel_ending=False)
Bases: indicnlp.normalize.indic_normalize.BaseNormalizer
Normalizer for the Devanagari script. In addition to the basic normalization performed by the superclass, it:
- replaces composite characters containing nuktas with their decomposed form
- replaces the pipe character ‘|’ with the poorna virama character
- replaces a colon ‘:’ with visarga if the colon follows a character in this script
-
NUKTA
= '़'¶
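The three Devanagari-specific steps listed above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the library's implementation: `normalize_devanagari_sketch` is a hypothetical name, and the nukta table covers only four precomposed letters.

```python
import re

NUKTA = "\u093c"          # Devanagari sign nukta
POORNA_VIRAMA = "\u0964"  # danda
VISARGA = "\u0903"

# Subset of precomposed nukta letters mapped to their base consonants
# (assumed for illustration; the library may handle more).
COMPOSITE_NUKTA = {
    "\u0958": "\u0915",  # qa   -> ka + nukta
    "\u0959": "\u0916",  # khha -> kha + nukta
    "\u095a": "\u0917",  # ghha -> ga + nukta
    "\u095b": "\u091c",  # za   -> ja + nukta
}


def normalize_devanagari_sketch(text):
    # 1. Decompose precomposed nukta letters into base consonant + nukta.
    for composite, base in COMPOSITE_NUKTA.items():
        text = text.replace(composite, base + NUKTA)
    # 2. Replace the pipe character with the poorna virama.
    text = text.replace("|", POORNA_VIRAMA)
    # 3. Replace a colon with visarga only after a Devanagari character.
    text = re.sub("(?<=[\u0900-\u097f]):", VISARGA, text)
    return text
```

The lookbehind in step 3 leaves colons in Latin text (such as `a:b`) untouched.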
-
-
class indicnlp.normalize.indic_normalize.GujaratiNormalizer(lang='gu', remove_nuktas=False, nasals_mode='do_nothing', do_normalize_chandras=False, do_normalize_vowel_ending=False)
Bases: indicnlp.normalize.indic_normalize.BaseNormalizer
Normalizer for the Gujarati script. In addition to the basic normalization performed by the superclass, it:
- replaces the reserved character for poorna virama (if used) with the recommended generic Indic-script poorna virama
- replaces a colon ‘:’ with visarga if the colon follows a character in this script
-
NUKTA
= '઼'¶
-
-
class indicnlp.normalize.indic_normalize.GurmukhiNormalizer(lang='pa', remove_nuktas=False, nasals_mode='do_nothing', do_normalize_chandras=False, do_normalize_vowel_ending=False, do_canonicalize_addak=False, do_canonicalize_tippi=False, do_replace_vowel_bases=False)
Bases: indicnlp.normalize.indic_normalize.BaseNormalizer
Normalizer for the Gurmukhi script. In addition to the basic normalization performed by the superclass, it:
- replaces composite characters containing nuktas with their decomposed form
- replaces the reserved character for poorna virama (if used) with the recommended generic Indic-script poorna virama
- replaces the pipe character ‘|’ with the poorna virama character
- replaces a colon ‘:’ with visarga if the colon follows a character in this script
-
NUKTA
= '਼'¶
-
VOWEL_NORM_MAPS
= {'ਅਾ': 'ਆ', 'ਅੈ': 'ਐ', 'ਅੌ': 'ਔ', 'ੲਿ': 'ਇ', 'ੲੀ': 'ਈ', 'ੲੇ': 'ਏ', 'ੳੁ': 'ਉ', 'ੳੂ': 'ਊ', 'ੳੋ': 'ਓ'}¶
-
-
class indicnlp.normalize.indic_normalize.IndicNormalizerFactory
Bases: object
Factory class to create language-specific normalizers.
-
class indicnlp.normalize.indic_normalize.KannadaNormalizer(lang='kn', remove_nuktas=False, nasals_mode='do_nothing', do_normalize_chandras=False, do_normalize_vowel_ending=False)
Bases: indicnlp.normalize.indic_normalize.BaseNormalizer
Normalizer for the Kannada script. In addition to the basic normalization performed by the superclass, it:
- replaces the reserved character for poorna virama (if used) with the recommended generic Indic-script poorna virama
- canonicalizes two-part dependent vowel signs
- replaces a colon ‘:’ with visarga if the colon follows a character in this script
-
class indicnlp.normalize.indic_normalize.MalayalamNormalizer(lang='ml', remove_nuktas=False, nasals_mode='do_nothing', do_normalize_chandras=False, do_normalize_vowel_ending=False, do_canonicalize_chillus=False, do_correct_geminated_T=False)
Bases: indicnlp.normalize.indic_normalize.BaseNormalizer
Normalizer for the Malayalam script. In addition to the basic normalization performed by the superclass, it:
- replaces the reserved character for poorna virama (if used) with the recommended generic Indic-script poorna virama
- canonicalizes two-part dependent vowel signs
- changes the old encoding of chillus (used until Unicode 5.0) to the new encoding
- replaces a colon ‘:’ with visarga if the colon follows a character in this script
-
CHILLU_CHAR_MAP
= {'ൺ': 'ണ', 'ൻ': 'ന', 'ർ': 'ര', 'ൽ': 'ല', 'ൾ': 'ള', 'ൿ': 'ക'}¶
-
-
class indicnlp.normalize.indic_normalize.NormalizerI
Bases: object
The normalizer classes do the following:
- Some characters have multiple Unicode codepoints. The normalizer chooses a single standard representation.
- Some control characters are deleted.
- While typing using the Latin keyboard, certain typical mistakes occur, which are corrected by the module.
Base class for normalizers. Performs some common normalization, which includes:
- Byte order mark, word joiner, etc. removal
- ZERO_WIDTH_NON_JOINER and ZERO_WIDTH_JOINER removal
- ZERO_WIDTH_SPACE and NO_BREAK_SPACE replaced by spaces
Script-specific normalizers should derive from this class and override the normalize() method. They can call the superclass normalize() method to apply the common normalization.
-
BYTE_ORDER_MARK
= '\ufeff'¶
-
BYTE_ORDER_MARK_2
= '\ufffe'¶
-
NO_BREAK_SPACE
= '\xa0'¶
-
SOFT_HYPHEN
= '\xad'¶
-
WORD_JOINER
= '\u2060'¶
-
ZERO_WIDTH_JOINER
= '\u200d'¶
-
ZERO_WIDTH_NON_JOINER
= '\u200c'¶
-
ZERO_WIDTH_SPACE
= '\u200b'¶
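A rough sketch of the common normalization described above, using the constants listed for this class. `base_normalize_sketch` is a hypothetical name, and treating SOFT_HYPHEN as part of the "etc." removal group is an assumption.

```python
BYTE_ORDER_MARK = "\ufeff"
BYTE_ORDER_MARK_2 = "\ufffe"
WORD_JOINER = "\u2060"
SOFT_HYPHEN = "\xad"
ZERO_WIDTH_JOINER = "\u200d"
ZERO_WIDTH_NON_JOINER = "\u200c"
ZERO_WIDTH_SPACE = "\u200b"
NO_BREAK_SPACE = "\xa0"

# Characters deleted outright (SOFT_HYPHEN is an assumed member of this group).
REMOVE = (BYTE_ORDER_MARK, BYTE_ORDER_MARK_2, WORD_JOINER, SOFT_HYPHEN,
          ZERO_WIDTH_JOINER, ZERO_WIDTH_NON_JOINER)
# Characters replaced by an ordinary space.
TO_SPACE = (ZERO_WIDTH_SPACE, NO_BREAK_SPACE)


def base_normalize_sketch(text):
    for ch in REMOVE:
        text = text.replace(ch, "")
    for ch in TO_SPACE:
        text = text.replace(ch, " ")
    return text
```

A script-specific normalizer would typically run this step first and then apply its own rules.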
-
-
class indicnlp.normalize.indic_normalize.OriyaNormalizer(lang='or', remove_nuktas=False, nasals_mode='do_nothing', do_normalize_chandras=False, do_normalize_vowel_ending=False, do_remap_wa=False)
Bases: indicnlp.normalize.indic_normalize.BaseNormalizer
Normalizer for the Oriya script. In addition to the basic normalization performed by the superclass, it:
- replaces composite characters containing nuktas with their decomposed form
- replaces the reserved character for poorna virama (if used) with the recommended generic Indic-script poorna virama
- canonicalizes two-part dependent vowels
- replaces ‘va’ with ‘ba’
- replaces the pipe character ‘|’ with the poorna virama character
- replaces a colon ‘:’ with visarga if the colon follows a character in this script
-
NUKTA
= '଼'¶
-
VOWEL_NORM_MAPS
= {'ଅା': 'ଆ', 'ଏୗ': 'ଐ', 'ଓୗ': 'ଔ'}¶
-
-
class indicnlp.normalize.indic_normalize.TamilNormalizer(lang='ta', remove_nuktas=False, nasals_mode='do_nothing', do_normalize_chandras=False, do_normalize_vowel_ending=False)
Bases: indicnlp.normalize.indic_normalize.BaseNormalizer
Normalizer for the Tamil script. In addition to the basic normalization performed by the superclass, it:
- replaces the reserved character for poorna virama (if used) with the recommended generic Indic-script poorna virama
- canonicalizes two-part dependent vowel signs
- replaces a colon ‘:’ with visarga if the colon follows a character in this script
-
class indicnlp.normalize.indic_normalize.TeluguNormalizer(lang='te', remove_nuktas=False, nasals_mode='do_nothing', do_normalize_chandras=False, do_normalize_vowel_ending=False)
Bases: indicnlp.normalize.indic_normalize.BaseNormalizer
Normalizer for the Telugu script. In addition to the basic normalization performed by the superclass, it:
- replaces the reserved character for poorna virama (if used) with the recommended generic Indic-script poorna virama
- canonicalizes two-part dependent vowel signs
- replaces a colon ‘:’ with visarga if the colon follows a character in this script
-
class indicnlp.normalize.indic_normalize.UrduNormalizer(lang, remove_nuktas=True)
Bases: indicnlp.normalize.indic_normalize.NormalizerI
Uses the UrduHack library: https://docs.urduhack.com/en/stable/_modules/urduhack/normalization/character.html#normalize
script Package¶
indic_scripts
Module¶
-
indicnlp.script.indic_scripts.ALL_PHONETIC_DATA = None
Phonetic data for all languages except Tamil
-
indicnlp.script.indic_scripts.ALL_PHONETIC_VECTORS = None
Phonetic vectors for all languages except Tamil
-
indicnlp.script.indic_scripts.PHONETIC_VECTOR_LENGTH = 38
Length of the phonetic vector
-
indicnlp.script.indic_scripts.TAMIL_PHONETIC_DATA = None
Phonetic data for Tamil
-
indicnlp.script.indic_scripts.TAMIL_PHONETIC_VECTORS = None
Phonetic vectors for Tamil
-
indicnlp.script.indic_scripts.in_coordinated_range_offset(c_offset)
Applicable to Brahmi-derived Indic scripts.
-
indicnlp.script.indic_scripts.init()
To be called by the library loader; do not call it in your program.
-
indicnlp.script.indic_scripts.is_indiclang_char(c, lang)
Applicable to Brahmi-derived Indic scripts. Note that DANDA and DOUBLE_DANDA have the same Unicode codepoint for all Indic scripts.
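The idea behind these helpers can be sketched as follows. The major Brahmi-derived scripts occupy Unicode blocks 128 code points apart, with corresponding letters at the same offset in each block. The table and function names below are assumptions for illustration and do not reproduce the library's internals.

```python
# Start of each script's Unicode block, keyed by language code (assumed table).
SCRIPT_BLOCK_START = {
    "hi": 0x0900,  # Devanagari
    "bn": 0x0980,  # Bengali
    "pa": 0x0A00,  # Gurmukhi
    "gu": 0x0A80,  # Gujarati
    "or": 0x0B00,  # Oriya
    "ta": 0x0B80,  # Tamil
    "te": 0x0C00,  # Telugu
    "kn": 0x0C80,  # Kannada
    "ml": 0x0D00,  # Malayalam
}
BLOCK_SIZE = 0x80
DANDA, DOUBLE_DANDA = "\u0964", "\u0965"  # shared by all these scripts


def get_offset_sketch(c, lang):
    # Position of the character relative to the start of the script block.
    return ord(c) - SCRIPT_BLOCK_START[lang]


def is_indiclang_char_sketch(c, lang):
    offset = get_offset_sketch(c, lang)
    return 0 <= offset < BLOCK_SIZE or c in (DANDA, DOUBLE_DANDA)


def convert_char_sketch(c, src_lang, tgt_lang):
    # Same offset, different block: a naive script conversion.
    return chr(get_offset_sketch(c, src_lang) + SCRIPT_BLOCK_START[tgt_lang])
```

Because the danda characters live in the Devanagari block but are used by every script, the membership check treats them as a special case.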
-
indicnlp.script.indic_scripts.lcsr(srcw, tgtw, slang, tlang)
Compute the Longest Common Subsequence Ratio (LCSR) between two strings at the character level.
srcw: source language string
tgtw: target language string
slang: source language
tlang: target language
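For reference, a minimal sketch of a character-level LCSR, assuming the common definition: the length of the longest common subsequence divided by the length of the longer string. The function names are hypothetical, and the library's return value may differ (for example, it may also return the string lengths).

```python
def lcs_length(a, b):
    # Dynamic programming over two rows of the LCS table.
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcsr_sketch(srcw, tgtw):
    # Empty input has no common subsequence; avoid dividing by zero.
    if not srcw or not tgtw:
        return 0.0
    return lcs_length(srcw, tgtw) / max(len(srcw), len(tgtw))
```

For two strings in different Indic scripts, both would first be mapped to a common script before this comparison.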
-
indicnlp.script.indic_scripts.lcsr_any(srcw, tgtw)
LCSR computation if both languages have the same script.
-
indicnlp.script.indic_scripts.lcsr_indic(srcw, tgtw, slang, tlang)
Compute the Longest Common Subsequence Ratio (LCSR) between two strings at the character level. This works for Indic scripts by mapping both languages to a common script.
srcw: source language string
tgtw: target language string
slang: source language
tlang: target language
english_script
Module¶
-
indicnlp.script.english_script.ENGLISH_PHONETIC_DATA = None
Phonetic data for English
-
indicnlp.script.english_script.ENGLISH_PHONETIC_VECTORS = None
Phonetic vectors for English
-
indicnlp.script.english_script.ID_ARPABET_MAP = {}
Phonetic data for English
-
indicnlp.script.english_script.PHONETIC_VECTOR_LENGTH = 38
Length of the phonetic vector
tokenize Package¶
indic_tokenize
Module¶
Tokenizer for Indian languages. Currently, simple punctuation-based tokenizers are supported (see trivial_tokenize). Major Indian language punctuations are handled.
-
indicnlp.tokenize.indic_tokenize.trivial_tokenize(text, lang='hi')
Trivial tokenizer for Indian languages using Brahmi or Arabic scripts.
A trivial tokenizer that splits only on punctuation boundaries. Major punctuation marks specific to Indian languages are handled. These punctuation characters were identified from the Unicode database.
Parameters:
- text (str) – text to tokenize
- lang (str) – two-letter language code (for example 'hi')
Returns: list of tokens
Return type: list
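A minimal sketch of punctuation-boundary tokenization, assuming ASCII punctuation plus the danda (U+0964) and double danda (U+0965). The library's actual punctuation set is larger, and `trivial_tokenize_sketch` is a hypothetical name.

```python
import re
import string

# ASCII punctuation plus danda and double danda (assumed subset).
PUNCT = string.punctuation + "\u0964\u0965"
PUNCT_PATTERN = re.compile("([" + re.escape(PUNCT) + "])")


def trivial_tokenize_sketch(text):
    # Surround every punctuation character with spaces, then split on whitespace.
    spaced = PUNCT_PATTERN.sub(r" \1 ", text)
    return spaced.split()
```

Each punctuation character becomes its own token, so consecutive punctuation marks are split apart rather than grouped.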
-
indicnlp.tokenize.indic_tokenize.trivial_tokenize_indic(text)
Tokenize a string for Indian languages written in Brahmi-derived scripts.
A trivial tokenizer that splits only on punctuation boundaries. This also includes punctuation for the Indian language scripts (the purna virama and the deergha virama). This is a language-independent tokenizer.
Parameters: text (str) – text to tokenize
Returns: list of tokens
Return type: list
-
indicnlp.tokenize.indic_tokenize.trivial_tokenize_urdu(text)
Tokenize an Urdu string.
A trivial tokenizer that splits only on punctuation boundaries. This also includes punctuation for the Urdu script. These punctuation characters were identified from the Unicode database for the Arabic script by looking for punctuation symbols.
Parameters: text (str) – text to tokenize
Returns: list of tokens
Return type: list
indic_detokenize
Module¶
De-tokenizer for Indian languages.
-
indicnlp.tokenize.indic_detokenize.trivial_detokenize(text, lang='hi')
Detokenize a string for languages of the Indian subcontinent.
A trivial detokenizer which:
- decides whether punctuation attaches to left/right or both
- handles number sequences
- handles quotes smartly (deciding left or right attachment)
Parameters: text (str) – tokenized text to process
Returns: detokenized string
Return type: str
Raises: IndicNlpException – if the language is not supported
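The left/right attachment rule can be illustrated with a small sketch. It covers only the first bullet above (number sequences and quotes are left out), and the two character sets are assumptions rather than the library's own lists.

```python
# Punctuation that attaches to the token on its left (assumed set).
LEFT_ATTACH = set(".,;:!?)]}%\u0964\u0965")
# Punctuation that attaches to the token on its right (assumed set).
RIGHT_ATTACH = set("([{#$")


def trivial_detokenize_sketch(text):
    out = []
    glue_next = False  # True when the previous token attaches to what follows
    for tok in text.split():
        if out and (tok in LEFT_ATTACH or glue_next):
            out[-1] += tok
        else:
            out.append(tok)
        glue_next = tok in RIGHT_ATTACH
    return " ".join(out)
```

Quotes need extra state because the same character opens and closes a span, which is why the real detokenizer treats them separately.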
-
indicnlp.tokenize.indic_detokenize.trivial_detokenize_indic(text)
Detokenize a string for Indian languages written in Brahmi-derived scripts.
A trivial detokenizer which:
- decides whether punctuation attaches to left/right or both
- handles number sequences
- handles quotes smartly (deciding left or right attachment)
Parameters: text (str) – tokenized text to process
Returns: detokenized string
Return type: str
sentence_tokenize
Module¶
Sentence splitter for Indian languages. Contains a rule-based sentence splitter that can understand common non-breaking phrases in many Indian languages.
-
indicnlp.tokenize.sentence_tokenize.is_acronym_abbvr(text, lang)
Check whether the text is a non-breaking phrase.
Parameters:
- text (str) – text to check for non-breaking phrase
- lang (str) – two-letter language code (for example 'hi')
Returns: True if text is a non-breaking phrase
Return type: bool
-
indicnlp.tokenize.sentence_tokenize.is_latin_or_numeric(character)
Check if a character is a Latin character (uppercase or lowercase) or a number.
Parameters: character (str) – the character to be checked
Returns: True if the character is a Latin character or a number, False otherwise
Return type: bool
-
indicnlp.tokenize.sentence_tokenize.sentence_split(text, lang, delim_pat='auto')
Split the text into sentences.
A rule-based sentence splitter for Indian languages written in Brahmi-derived scripts. The text is split at sentence delimiter boundaries. The delimiters can be configured by passing appropriate parameters.
The sentence splitter can identify non-breaking phrases such as single letters and common abbreviations/honorifics for some Indian languages.
Parameters:
- text (str) – text to split into sentences
- lang (str) – two-letter language code (for example 'hi')
- delim_pat (str) – regular expression to identify sentence delimiter characters. If set to ‘auto’, the delimiter pattern is chosen automatically based on the language and text.
Returns: list of sentences identified from the input text
Return type: list
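A simplified sketch of rule-based splitting with non-breaking phrases. The delimiter pattern, the `non_breaking` parameter, and the function name are assumptions for illustration; the library chooses its delimiters and abbreviation lists per language.

```python
import re

# Assumed delimiter set: period, exclamation, question mark, danda, double danda.
DELIM_PAT = re.compile(r"[.!?\u0964\u0965]")


def sentence_split_sketch(text, non_breaking=()):
    sentences = []
    begin = 0
    for m in DELIM_PAT.finditer(text):
        candidate = text[begin:m.end()].strip()
        words = candidate[:-1].split()
        last_word = words[-1] if words else ""
        # Do not break after a single letter or a listed abbreviation.
        if m.group() == "." and (len(last_word) == 1 or last_word in non_breaking):
            continue
        if candidate:
            sentences.append(candidate)
        begin = m.end()
    tail = text[begin:].strip()
    if tail:
        sentences.append(tail)
    return sentences


print(sentence_split_sketch("Dr. A. Kumar came. He left!", non_breaking=("Dr",)))
```

Skipping a delimiter leaves `begin` unchanged, so the abbreviation stays inside the sentence that is eventually emitted.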
transliterate Package¶
sinhala_transliterator
Module¶
-
class indicnlp.transliterate.sinhala_transliterator.SinhalaDevanagariTransliterator
Bases: object
A Devanagari to Sinhala transliterator based on explicit Unicode mapping.
-
devnag_sinhala_map
= {'ऀ': 'ං', 'ँ': 'ං', 'ं': 'ං', 'ः': 'ඃ', 'ऄ': '\u0d84', 'अ': 'අ', 'आ': 'ආ', 'इ': 'ඉ', 'ई': 'ඊ', 'उ': 'උ', 'ऊ': 'ඌ', 'ऋ': 'ඍ', 'ऌ': 'ඏ', 'ऍ': 'ඈ', 'ऎ': 'එ', 'ए': 'ඒ', 'ऐ': 'ඓ', 'ऒ': 'ඔ', 'ओ': 'ඕ', 'औ': 'ඖ', 'क': 'ක', 'ख': 'ඛ', 'ग': 'ග', 'घ': 'ඝ', 'ङ': 'ඞ', 'च': 'ච', 'छ': 'ඡ', 'ज': 'ජ', 'झ': 'ඣ', 'ञ': 'ඤ', 'ट': 'ට', 'ठ': 'ඨ', 'ड': 'ඩ', 'ढ': 'ඪ', 'ण': 'ණ', 'त': 'ත', 'थ': 'ථ', 'द': 'ද', 'ध': 'ධ', 'न': 'න', 'ऩ': 'න', 'प': 'ප', 'फ': 'ඵ', 'ब': 'බ', 'भ': 'භ', 'म': 'ම', 'य': 'ය', 'र': 'ර', 'ल': 'ල', 'ळ': 'ළ', 'व': 'ව', 'श': 'ශ', 'ष': 'ෂ', 'स': 'ස', 'ह': 'හ', 'ा': 'ා', 'ि': 'ි', 'ी': 'ී', 'ु': 'ු', 'ू': 'ූ', 'ृ': 'ෘ', 'ॆ': 'ෙ', 'े': 'ේ', 'ै': 'ෛ', 'ॉ': 'ෑ', 'ॊ': 'ො', 'ो': 'ෝ', 'ौ': 'ෞ', '्': '්'}¶
-
sinhala_devnag_map
= {'ං': 'ं', 'ඃ': 'ः', '\u0d84': 'ऄ', 'අ': 'अ', 'ආ': 'आ', 'ඇ': 'ऍ', 'ඈ': 'ऍ', 'ඉ': 'इ', 'ඊ': 'ई', 'උ': 'उ', 'ඌ': 'ऊ', 'ඍ': 'ऋ', 'ඏ': 'ऌ', 'එ': 'ऎ', 'ඒ': 'ए', 'ඓ': 'ऐ', 'ඔ': 'ऒ', 'ඕ': 'ओ', 'ඖ': 'औ', 'ක': 'क', 'ඛ': 'ख', 'ග': 'ग', 'ඝ': 'घ', 'ඞ': 'ङ', 'ඟ': 'ङ', 'ච': 'च', 'ඡ': 'छ', 'ජ': 'ज', 'ඣ': 'झ', 'ඤ': 'ञ', 'ඥ': 'ञ', 'ඦ': 'ञ', 'ට': 'ट', 'ඨ': 'ठ', 'ඩ': 'ड', 'ඪ': 'ढ', 'ණ': 'ण', 'ඬ': 'ण', 'ත': 'त', 'ථ': 'थ', 'ද': 'द', 'ධ': 'ध', 'න': 'न', '\u0db2': 'न', 'ඳ': 'न', 'ප': 'प', 'ඵ': 'फ', 'බ': 'ब', 'භ': 'भ', 'ම': 'म', 'ය': 'य', 'ර': 'र', 'ල': 'ल', 'ව': 'व', 'ශ': 'श', 'ෂ': 'ष', 'ස': 'स', 'හ': 'ह', 'ළ': 'ळ', '්': '्', 'ා': 'ा', 'ැ': 'ॉ', 'ෑ': 'ॉ', 'ි': 'ि', 'ී': 'ी', 'ු': 'ु', 'ූ': 'ू', 'ෘ': 'ृ', 'ෙ': 'ॆ', 'ේ': 'े', 'ෛ': 'ै', 'ො': 'ॊ', 'ෝ': 'ो', 'ෞ': 'ौ'}¶
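Transliteration with an explicit mapping table reduces to a per-character dictionary lookup. The sketch below uses a handful of entries taken from devnag_sinhala_map above; `map_chars_sketch` is a hypothetical helper, not a method of the class.

```python
# A few entries copied from devnag_sinhala_map (subset for illustration).
DEVNAG_SINHALA_SUBSET = {
    "क": "ක",
    "म": "ම",
    "ल": "ල",
}


def map_chars_sketch(text, mapping):
    # Characters without an entry pass through unchanged.
    return "".join(mapping.get(ch, ch) for ch in text)


print(map_chars_sketch("कमल", DEVNAG_SINHALA_SUBSET))
```

Note that sinhala_devnag_map sends several Sinhala letters to the same Devanagari letter, so a round trip through both tables is not guaranteed to restore the original text.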
-
unicode_transliterate
Module¶
-
class indicnlp.transliterate.unicode_transliterate.ItransTransliterator
Bases: object
Transliterator between Indian scripts and ITRANS.
-
class indicnlp.transliterate.unicode_transliterate.UnicodeIndicTransliterator
Bases: object
Base class for rule-based transliteration among Indian languages.
Script-pair-specific transliterators should derive from this class and override the transliterate() method. They can call the superclass transliterate() method to apply the common transliteration.
acronym_transliterator
Module¶
-
class indicnlp.transliterate.acronym_transliterator.LatinToIndicAcronymTransliterator
Bases: object
-
LATIN_ALPHABET
= ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']¶
-
LATIN_TO_DEVANAGARI_TRANSTABLE
= {97: 'ए', 98: 'बी', 99: 'सी', 100: 'डी', 101: 'ई', 102: 'एफ', 103: 'जी', 104: 'एच', 105: 'आई', 106: 'जे', 107: 'के', 108: 'एल', 109: 'एम', 110: 'एन', 111: 'ओ', 112: 'पी', 113: 'क्यू', 114: 'आर', 115: 'एस', 116: 'टी', 117: 'यू', 118: 'वी', 119: 'डब्ल्यू', 120: 'एक्स', 121: 'वाय', 122: 'जेड'}¶
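The keys of LATIN_TO_DEVANAGARI_TRANSTABLE are code points, which is the form `str.translate` expects. A short sketch using three entries from the table follows; the function name is hypothetical, and any further step that converts the Devanagari result to another script is not shown.

```python
# Three entries from the table above: code point of the letter -> spelled-out name.
TRANSTABLE_SUBSET = {
    ord("b"): "बी",
    ord("i"): "आई",
    ord("t"): "टी",
}


def acronym_to_devanagari_sketch(acronym):
    # Lowercase first because the table is keyed by lowercase letters.
    return acronym.lower().translate(TRANSTABLE_SUBSET)


print(acronym_to_devanagari_sketch("IIT"))
```

Letters missing from the subset are left as they are, since `str.translate` only rewrites characters present in the table.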
-
Commandline¶
usage: cliparser.py [-h]
{tokenize,detokenize,sentence_split,normalize,morph,syllabify,wc,indic2roman,roman2indic,script_unify,script_convert}
...
Positional Arguments¶
subcommand | Possible choices: tokenize, detokenize, sentence_split, normalize, morph, syllabify, wc, indic2roman, roman2indic, script_unify, script_convert Invoke each operation with one of the subcommands |
Sub-commands¶
tokenize¶
tokenizer help
cliparser.py tokenize [-h] [-l LANG] [infile] [outfile]
Positional Arguments¶
infile | Input file path. Default: standard input (UTF-8) |
outfile | Output file path. Default: standard output (UTF-8) |
Named Arguments¶
-l, --lang | Language |
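The usage lines in this section follow the usual argparse subcommand layout. Below is a sketch of how one such subcommand could be declared; `build_parser_sketch` is a hypothetical name, only `tokenize` is wired up, and opening the files is left out.

```python
import argparse
import sys


def build_parser_sketch():
    parser = argparse.ArgumentParser(prog="cliparser.py")
    subparsers = parser.add_subparsers(dest="subcommand")

    tokenize = subparsers.add_parser("tokenize", help="tokenizer help")
    # Optional positionals: fall back to the standard streams when omitted.
    tokenize.add_argument("infile", nargs="?", default=sys.stdin,
                          help="Input File path")
    tokenize.add_argument("outfile", nargs="?", default=sys.stdout,
                          help="Output File path")
    tokenize.add_argument("-l", "--lang", help="Language")
    return parser


args = build_parser_sketch().parse_args(["tokenize", "-l", "hi"])
print(args.subcommand, args.lang)
```

Because the defaults are stream objects, generated help pages show their repr, which is where defaults such as `<_io.TextIOWrapper name='<stdin>' ...>` come from.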
detokenize¶
de-tokenizer help
cliparser.py detokenize [-h] [-l LANG] [infile] [outfile]
Positional Arguments¶
infile | Input file path. Default: standard input (UTF-8) |
outfile | Output file path. Default: standard output (UTF-8) |
Named Arguments¶
-l, --lang | Language |
sentence_split¶
sentence split help
cliparser.py sentence_split [-h] [-l LANG] [infile] [outfile]
Positional Arguments¶
infile | Input file path. Default: standard input (UTF-8) |
outfile | Output file path. Default: standard output (UTF-8) |
Named Arguments¶
-l, --lang | Language |
normalize¶
normalizer help
cliparser.py normalize [-h] [-l LANG] [infile] [outfile]
Positional Arguments¶
infile | Input file path. Default: standard input (UTF-8) |
outfile | Output file path. Default: standard output (UTF-8) |
Named Arguments¶
-l, --lang | Language |
morph¶
morph help
cliparser.py morph [-h] [-l LANG] [infile] [outfile]
Positional Arguments¶
infile | Input file path. Default: standard input (UTF-8) |
outfile | Output file path. Default: standard output (UTF-8) |
Named Arguments¶
-l, --lang | Language |
syllabify¶
syllabify help
cliparser.py syllabify [-h] [-l LANG] [infile] [outfile]
Positional Arguments¶
infile | Input file path. Default: standard input (UTF-8) |
outfile | Output file path. Default: standard output (UTF-8) |
Named Arguments¶
-l, --lang | Language |
wc¶
wc help
cliparser.py wc [-h] [infile]
Positional Arguments¶
infile | Input file path. Default: standard input (UTF-8) |
indic2roman¶
indic2roman help
cliparser.py indic2roman [-h] [-l LANG] [infile] [outfile]
Positional Arguments¶
infile | Input file path. Default: standard input (UTF-8) |
outfile | Output file path. Default: standard output (UTF-8) |
Named Arguments¶
-l, --lang | Language |
roman2indic¶
roman2indic help
cliparser.py roman2indic [-h] [-l LANG] [infile] [outfile]
Positional Arguments¶
infile | Input file path. Default: standard input (UTF-8) |
outfile | Output file path. Default: standard output (UTF-8) |
Named Arguments¶
-l, --lang | Language |
script_unify¶
script_unify help
cliparser.py script_unify [-h] [-l LANG] [-m {naive,basic,aggressive}]
[-c COMMON_LANG]
[infile] [outfile]
Positional Arguments¶
infile | Input file path. Default: standard input (UTF-8) |
outfile | Output file path. Default: standard output (UTF-8) |
Named Arguments¶
-l, --lang | Language |
-m, --mode | Possible choices: naive, basic, aggressive Script unification mode Default: “basic” |
-c, --common_lang | Common language in which all languages are represented Default: “hi” |
script_convert¶
script convert help
cliparser.py script_convert [-h] [-s SRCLANG] [-t TGTLANG] [infile] [outfile]
Positional Arguments¶
infile | Input file path. Default: standard input (UTF-8) |
outfile | Output file path. Default: standard output (UTF-8) |
Named Arguments¶
-s, --srclang | Source Language |
-t, --tgtlang | Target Language |