POS_tagger_lstm

POS tagger trained on the Sequoia corpus with an LSTM (PyTorch)

Given a sequence of words (a sentence), predict the corresponding sequence of part-of-speech tags.

Model 1: LSTM using one-hot vectors to encode words
Model 2: LSTM using pretrained word embedding vectors to encode words
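
The two models differ only in how each word is turned into a vector before it reaches the LSTM. The sketch below is one possible way to write this in PyTorch, not the code of this repository: the class name `LSTMTagger`, the layer sizes, the word indices and the random stand-in for the pretrained vectors are all hypothetical, and no training loop is shown.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LSTMTagger(nn.Module):
    """Maps a sequence of word vectors to one vector of tag scores per word."""

    def __init__(self, input_dim, hidden_dim, n_tags):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, n_tags)

    def forward(self, vectors):
        # vectors: (batch, sentence_length, input_dim)
        hidden_states, _ = self.lstm(vectors)
        # result: (batch, sentence_length, n_tags)
        return self.out(hidden_states)


# Hypothetical sizes: 50 known words, 100-dimensional embeddings, 17 tags.
vocab_size, emb_dim, n_tags = 50, 100, 17
word_ids = torch.tensor([[3, 14, 7, 21]])  # one sentence of four word indices

# Model 1: each word is a one-hot vector of size vocab_size.
one_hot_inputs = F.one_hot(word_ids, num_classes=vocab_size).float()
model_1 = LSTMTagger(vocab_size, 64, n_tags)
scores_1 = model_1(one_hot_inputs)

# Model 2: each word is looked up in a fixed embedding table.
# A random table stands in for the real pretrained vectors here.
pretrained = torch.randn(vocab_size, emb_dim)
embedding = nn.Embedding.from_pretrained(pretrained, freeze=True)
model_2 = LSTMTagger(emb_dim, 64, n_tags)
scores_2 = model_2(embedding(word_ids))

# The predicted tag of each word is the index of its highest score.
predicted_tags = scores_2.argmax(dim=-1)
```

With this layout the tagger itself is identical in both cases; only `input_dim` changes (the vocabulary size for one-hot vectors, 100 for the pretrained embeddings).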

The baseline for this NLP task (part-of-speech tagging) is the most frequent part of speech.
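
A most-frequent-tag baseline can be sketched in a few lines of plain Python. The version below assumes one common reading of that baseline: each known word receives the tag it carries most often in the training data, and unknown words receive the most frequent tag overall. The function names and the toy sentences are hypothetical and are not taken from this repository or from Sequoia.

```python
from collections import Counter, defaultdict


def train_baseline(tagged_sentences):
    """Count tags per word and overall, then keep the most frequent ones."""
    per_word = defaultdict(Counter)
    overall = Counter()
    for sentence in tagged_sentences:
        for word, tag in sentence:
            per_word[word][tag] += 1
            overall[tag] += 1
    word_to_tag = {w: c.most_common(1)[0][0] for w, c in per_word.items()}
    default_tag = overall.most_common(1)[0][0]
    return word_to_tag, default_tag


def tag_sentence(words, word_to_tag, default_tag):
    """Tag known words with their usual tag, unknown words with the default."""
    return [word_to_tag.get(w, default_tag) for w in words]


# Toy training data (hypothetical).
train = [
    [("le", "DET"), ("chat", "NOUN"), ("dort", "VERB")],
    [("la", "DET"), ("ferme", "NOUN"), ("est", "VERB"), ("grande", "ADJ")],
    [("il", "PRON"), ("ferme", "VERB"), ("la", "DET"), ("porte", "NOUN")],
    [("la", "DET"), ("ferme", "NOUN"), ("de", "ADP"), ("Marie", "PROPN")],
    [("chats", "NOUN"), ("et", "CCONJ"), ("chiens", "NOUN")],
]
word_to_tag, default_tag = train_baseline(train)
tags = tag_sentence(["la", "ferme", "brûle"], word_to_tag, default_tag)
```

Here "ferme" is tagged as a noun because it is a noun in two of its three training occurrences, and the unseen word "brûle" falls back to the overall most frequent tag.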

Requirements

Python 3.8.5
PyTorch

You need to download the French word embeddings "vecs100-linear-frwiki", trained by M. Coavoux with word2vec (skip-gram model) on the Wikipedia dump frwiki-20140804-corpus.xml.bz2 (650 million words, downloaded from http://linguatools.org/tools/corpora/wikipedia-monolingual-corpora/):

http://www.linguist.univ-paris-diderot.fr/~mcandito/vecs100-linear-frwiki.bz2

Put this file in the root directory of the repository.

NB: Other word embeddings can be used, provided that they have dimension 100 and are stored in a text file with one word embedding per line. Each line must contain the token (word) first, followed by the float values of its vector, all separated by spaces.
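
The expected file layout can be illustrated with a small reader. This is a minimal sketch under the format described above, not the loader used by this repository: the function name `load_embeddings` is hypothetical, and the toy lines use 3 dimensions instead of 100 to stay readable.

```python
def load_embeddings(lines, dim=100):
    """Read 'token v1 v2 ... v_dim' lines into a dict: token -> list of floats."""
    vectors = {}
    for line in lines:
        fields = line.rstrip().split(" ")
        if len(fields) != dim + 1:
            # Skip blank lines and lines that do not match the expected size.
            continue
        vectors[fields[0]] = [float(value) for value in fields[1:]]
    return vectors


# Toy input with 3-dimensional vectors (hypothetical values).
toy_lines = [
    "chat 0.1 0.2 0.3\n",
    "chien -0.5 0.0 1.5\n",
    "\n",
]
vectors = load_embeddings(toy_lines, dim=3)
```

Because the function accepts any iterable of lines, an open text file can be passed directly. For a compressed file, `bz2.open(path, "rt", encoding="utf-8")` from the standard library yields lines in the same way; the UTF-8 encoding is an assumption about the file.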

Corpus

https://deep-sequoia.inria.fr/

Corentin Ribeyre, Marie Candito, and Djamé Seddah. 2014. Semi-Automatic Deep Syntactic Annotations of the French Treebank. Proceedings of the 13th International Workshop on Treebanks and Linguistic Theories. Tübingen Universität, Tübingen, Germany.
