A simple project to identify language of a text
It is a simple project of language identification using n-gram method written on Python.
The n-gram method is implemented in lang_iden.py
file.
Other method based on counting stopwords is available in baseline.py
file for comparison. To run this file, nltk
needs to be installed first.
The implementation is one-shot run, however it is very easy to extend for more general purpose.
python lang_iden.py --h
For instance
python lang_iden.py --n=2 --snippet_len=10
By default the program will run with the texts provided in train_data
and test_data
directory.
If you want to predict a particular text:
>>>python
>>>from lang_iden import *
>>>lang_profiles = train (n = 2) #using 2-gram
>>>predict (lang_profiles, "This is a new text that I want to predict")
{'fr': 3004550.0, 'de': 3003701.0, 'en': 3001772.0, 'it': 3005339.0}
The text should be written in English, because the distance to English profile (3001772) is minimum.
You can get this value
distances = predict (lang_profiles, "This is a new text that I want to predict")
min(distances, key = distances.get)
Right now, English, German, Italian and French are supported. If you want to add more language, just follow the structure of train_data
folder.
For instance, if you want to add Portugese.
- create a folder
pt
insidetrain_data
. - place one or more Portuges texts (in
.txt
format) in this new folder.
That's it.