Character mapping script from arabic to latin. This original style does not have same mapping. In other words, all mappings are the one-to-one correspondances. Thus, this script can map without loosing the original character information.
For latin characters, I used Latin Supplement, Latin Extended-A, Latin Extended-B characters. To make available a keyboard that includes Latin Supplement, Latin Extended-A, Latin Extended-B, please see http://symbolcodes.tlt.psu.edu/accents/codemacext.html
To get romanized characters, codes like below in if __name__=='__main__'
ex_sent=u'من کتاب دادم';
ins=transliter(ex_sent);
print ins.main();
If you wanna get Arabic sequence form unicode sequence, use
ex_sent_2=u'mn tw ra dÝdm';
ins_2=transliter(ex_sent_2);
print ins_2.unicode_to_arabic();
Note: I recommend to remove all diacritics before you convert to latin characters. To remove them, please use def clean_up()
This class converts arabic characters to ratin characters. Not all characters are supported, but basic characters are completly converted. I think that enought to express normal writing style.
In the function original_unicode(), character mapping chart is below.(2013/7/23) See also: http://jrgraphix.net/research/unicode_blocks.php?block=12
Arabic_character | Arabic_character_commonly_called | unicode_number(Hex) | mapped_ratin |
---|---|---|---|
ا | alef | U+0627 | a |
آ | alef_madda_above | U+0622 | ā |
ب | be | U+06FF | b |
پ | pe | U+067E | p |
ت | te | U+062A | t |
ث | se | U+062B | ç |
ج | jim | U+062C | j |
چ | ˇce | U+0686 | č |
ح | he | U+062D | ħ |
خ | xe | U+062E | x |
د | dal | U+062F | d |
ذ | zal | U+0630 | đ |
ر | re | U+0631 | r |
ز | ze | U+0632 | z |
ژ | ˇze | U+0698 | ž |
س | sin | U+0633 | s |
ش | ˇsin | U+0634 | š |
ص | sad | U+0635 | ş |
ض | zad | U+0636 | ź |
ط | ta | U+0637 | ţ |
ظ | za | U+0638 | ẓ |
ع | 'eyn | U+0639 | ' |
غ | qeyn | U+063A | q |
ف | fe | U+0641 | f |
ق | qaf | U+0642 | ŕ |
ک | persian-kaf | U+06A9 | K |
arabic-kaf | U+0643 | K | |
گ | gaf | U+06AF | g |
ل | lam | U+0644 | l |
م | mim | U+0645 | m |
ن | nun | U+0646 | n |
و | vav | U+0645 | w |
ه | he | U+0647 | e |
ۀ | he_yeh_above | U+06C0 | X |
ی | persian-ye | U+06CC | y |
ي | arabic-ye | U+064A | Y |
ى | arabic_alef_maksusa | U+0649 | Ý |
، | arabic_comma | U+060C | , |
؛ | arabic_semicolon | U+061B | ; |
؟ | arabic_question | U+061F | ? |
ة | arabic_heh_hamza_above | U+0629 | T |
٪ | arabic_percent | U+066A | % |
Zero-Width-Non-Joiner | U+200C | _ | |
arabic_hamza_above | U+0654 | ú | |
arabic_hamza_below | U+0655 | E | |
arabic_alef_hamza_above | U+0623 | á | |
arabic_hamza | U+0621 | ° | |
٫ | arabic_decimal_separator | U+066B | ⎖ |
- v0.01: First Version
- v0.02: fixed bug that ث س have the same mapping 'se'. changed ث to se¥_1 and س to se¥_2.
- v0.03: added 'ARABIC_ALEF_MAKSUSA' and its mapping to 'Ý'. 'PERSIAN_YEH' and its mapping is changed to 'y'.
- v0.04: changed a mapping of 'persian_kaf' to 'k'. The mappig 'ه' to 'h' is disused. Instead, 'e' is used for 'ه'. 'arabic_fathatan' is disused.
- v0.05: Added new map for arabic tatweel and arabic decimal separator. An arabic tatweel is \u0640 in unicode and is reffered as keyname 'arabic_tatweel' in code. An arabic decimal separator is \u066b in unicode, and is reffered as keyname 'arabic_decimal_separator' in code.
- v0.06: some character + diacritic is added. u+0625, u+0629, u+06c0, u+0695, u+0624
- v0.07: arabic fathatan is added
- v0.08: A bug that clean_up function does not work solved.
- v0.09: ۀ is added
- v0.10: Per_Per2 module is added with in this system. About Per_per2.rb, please look at here