lang_identification

A simple project to identify the language of a text.

Introduction

This is a simple language identification project, written in Python, that uses the n-gram method.

The n-gram method is implemented in lang_iden.py.
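As a rough illustration of the general idea, here is a minimal sketch of character n-gram profiles compared with a rank-based "out-of-place" distance. This is an assumption about how such a method can work, not a copy of lang_iden.py: the function names, the `top_k` cutoff, the penalty for missing n-grams, and the tiny training texts are all hypothetical.

```python
from collections import Counter


def ngram_profile(text, n=2, top_k=300):
    """Return the top_k most frequent character n-grams, most frequent first."""
    text = text.lower()
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]


def out_of_place_distance(doc_profile, lang_profile):
    """Sum of rank differences; n-grams absent from lang_profile cost a fixed penalty."""
    rank = {gram: i for i, gram in enumerate(lang_profile)}
    penalty = len(lang_profile)  # hypothetical choice of penalty
    return sum(abs(i - rank[g]) if g in rank else penalty
               for i, g in enumerate(doc_profile))


# Tiny, made-up training texts, for illustration only.
profiles = {
    "en": ngram_profile("the quick brown fox jumps over the lazy dog and then the cat"),
    "fr": ngram_profile("le renard brun saute par dessus le chien paresseux et le chat"),
}
doc = ngram_profile("the dog and the fox")
distances = {lang: out_of_place_distance(doc, p) for lang, p in profiles.items()}
print(min(distances, key=distances.get))
```

The language whose profile gives the smallest distance is taken as the prediction. Real profiles would be built from far larger texts than the two sentences used here.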

Another method, based on counting stopwords, is available in baseline.py for comparison. To run that file, nltk must be installed first.
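The stopword idea can be sketched without nltk, assuming a small hand-written stopword set per language. The sets and the `stopword_scores` helper below are hypothetical and far smaller than a real stopword list; baseline.py itself may work differently.

```python
# Hand-written, deliberately tiny stopword sets (hypothetical; a real run
# would use much fuller lists, such as those shipped with nltk).
STOPWORDS = {
    "en": {"the", "is", "a", "that", "to", "and", "of", "in", "i"},
    "fr": {"le", "la", "est", "un", "une", "que", "et", "de", "je"},
    "de": {"der", "die", "das", "ist", "ein", "und", "zu", "ich", "nicht"},
    "it": {"il", "la", "è", "un", "una", "che", "e", "di", "io"},
}


def stopword_scores(text):
    """Count, for each language, how many words of the text are stopwords."""
    words = text.lower().split()
    return {lang: sum(w in sw for w in words) for lang, sw in STOPWORDS.items()}


scores = stopword_scores("This is a new text that I want to predict")
print(scores)                       # {'en': 5, 'fr': 0, 'de': 0, 'it': 0}
print(max(scores, key=scores.get))  # en
```

Here the highest count wins, in contrast to the n-gram method, where the smallest distance wins.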

Usage

The implementation is a one-shot run, but it is easy to extend for more general use. To see the available options:

python lang_iden.py --h

For instance:

python lang_iden.py --n=2 --snippet_len=10

By default, the program runs on the texts provided in the train_data and test_data directories.

If you want to predict a particular text:

$ python
>>> from lang_iden import *
>>> lang_profiles = train(n=2)  # using 2-grams
>>> predict(lang_profiles, "This is a new text that I want to predict")
{'fr': 3004550.0, 'de': 3003701.0, 'en': 3001772.0, 'it': 3005339.0}

The text is most likely written in English, because the distance to the English profile (3001772.0) is the smallest.

You can get the code of the closest language directly:

distances = predict (lang_profiles, "This is a new text that I want to predict")
min(distances, key = distances.get)
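If a full ranking is more useful than a single answer, the dictionary returned by predict can be sorted by distance. The sketch below reuses the distances from the example output above; `best_language` and `ranked_languages` are hypothetical helper names, not part of lang_iden.py.

```python
# Distances copied from the example output above.
distances = {"fr": 3004550.0, "de": 3003701.0, "en": 3001772.0, "it": 3005339.0}


def best_language(distances):
    """Return the language code with the smallest distance."""
    return min(distances, key=distances.get)


def ranked_languages(distances):
    """Return (language, distance) pairs, closest profile first."""
    return sorted(distances.items(), key=lambda item: item[1])


print(best_language(distances))         # en
print(ranked_languages(distances)[:2])  # [('en', 3001772.0), ('de', 3003701.0)]
```

Looking at the gap between the first and second entries gives a rough sense of how clear-cut the prediction is.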

Adding language

Right now, English, German, Italian and French are supported. To add more languages, follow the structure of the train_data folder.

For instance, to add Portuguese:

Create a folder pt inside train_data.
Place one or more Portuguese texts (in .txt format) in this new folder.

That's it.
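The two steps above can also be scripted. This is a minimal sketch assuming the layout described here (one subfolder per language code, each holding .txt files); `add_language`, `list_training_files`, and the `sample_N.txt` file names are hypothetical, and the example writes to a temporary directory rather than the real train_data folder.

```python
import tempfile
from pathlib import Path


def add_language(train_dir, lang_code, texts):
    """Create train_dir/lang_code and write each text to its own .txt file."""
    lang_dir = Path(train_dir) / lang_code
    lang_dir.mkdir(parents=True, exist_ok=True)
    for i, text in enumerate(texts, start=1):
        (lang_dir / f"sample_{i}.txt").write_text(text, encoding="utf-8")
    return lang_dir


def list_training_files(train_dir):
    """Map each language folder name to its sorted list of .txt files."""
    return {
        d.name: sorted(d.glob("*.txt"))
        for d in sorted(Path(train_dir).iterdir())
        if d.is_dir()
    }


# Use a temporary directory so the real train_data folder is left untouched.
with tempfile.TemporaryDirectory() as tmp:
    add_language(tmp, "pt", ["Este é um texto em português."])
    files = list_training_files(tmp)
    print({lang: [p.name for p in paths] for lang, paths in files.items()})
    # {'pt': ['sample_1.txt']}
```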
