TensorFlow implementation of Word2Vec, a classic model for learning distributed word representations from large unlabeled corpora.
- Prepare your data: your data should be one or more text files where each line contains a sentence and words are delimited by spaces.
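As a quick sanity check, here is a minimal sketch of a corpus file in that format (the file name and sentences are made up for illustration):

```python
# Write a tiny example corpus in the expected format:
# one sentence per line, words separated by single spaces.
corpus = [
    "the quick brown fox jumps over the lazy dog",
    "word embeddings capture distributional semantics",
]
with open("toy_corpus.txt", "w") as f:
    for sentence in corpus:
        f.write(sentence + "\n")

# Each line can later be split on whitespace into tokens.
with open("toy_corpus.txt") as f:
    sentences = [line.split() for line in f]
print(sentences[0][:3])  # ['the', 'quick', 'brown']
```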
- This implementation allows you to train the model with the skip-gram or continuous bag-of-words architecture (`--arch`), and to perform training using negative sampling or hierarchical softmax (`--algm`). To see a full list of parameters, run `python run_training.py --help`.
- For example, you can train your model with the following command:
```
python run_training.py --filenames=input/wiki1.txt,input/wiki2.txt --out_dir=output/ --window_size=5 --embed_size=300 --arch=skip_gram --algm=negative_sampling --batch_size=256
```
- The vocabulary words and word embeddings will be saved to `vocab.txt` and `embed.npy` in the folder specified by `--out_dir` in the previous step (`embed.npy` can be loaded using `np.load`).
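A minimal sketch of loading these two files back, assuming `vocab.txt` has one word per line aligned with the rows of `embed.npy` (toy stand-in files are generated here so the snippet is self-contained):

```python
import numpy as np

# Create toy stand-ins for the files run_training.py produces
# (the one-word-per-line, row-aligned layout is an assumption).
np.save("embed.npy", np.random.rand(3, 300).astype(np.float32))
with open("vocab.txt", "w") as f:
    f.write("the\ncat\nsat\n")

embeddings = np.load("embed.npy")          # shape: (vocab_size, embed_size)
with open("vocab.txt") as f:
    vocab = [line.strip() for line in f]

# Map each word to its embedding row.
word2vec = dict(zip(vocab, embeddings))
print(word2vec["cat"].shape)  # (300,)
```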
- The package used to load the evaluation datasets uses `setuptools`. You can install it by running:

```
python setup.py install
```
- If you have problems during this installation, you may first need to install the dependencies:

```
pip install -r requirements.txt
```
- To run the similarity evaluation, use the following command:

```
python embedding_eval.py -e embedding/embed.npy -v vocabulary/vocab.txt -sv results/ -s
```

- You will find your results in the folder specified by `-sv`.
- To run the analogy evaluation, use the following command:

```
python embedding_eval.py -e embedding/embed.npy -v vocabulary/vocab.txt -sv results/ -a
```

- You will find your results in the folder specified by `-sv`.
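Analogy evaluation rests on vector arithmetic: the answer to "king is to man as ? is to woman" is the vocabulary word closest to `king - man + woman`. A sketch with hand-crafted toy vectors chosen so the analogy holds exactly (real embeddings only approximate this):

```python
import numpy as np

# Hand-made 2-D toy vectors; illustration only.
emb = {
    "king":  np.array([1.0, 1.0]),
    "man":   np.array([1.0, 0.0]),
    "woman": np.array([0.0, 1.0]),
    "queen": np.array([0.0, 2.0]),
    "apple": np.array([3.0, 0.1]),  # distractor
}

query = emb["king"] - emb["man"] + emb["woman"]

def nearest(query, emb, exclude):
    # Rank candidate words by cosine similarity to the query vector,
    # excluding the three words that form the question.
    def cos(u, v):
        return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return max((w for w in emb if w not in exclude),
               key=lambda w: cos(query, emb[w]))

print(nearest(query, emb, exclude={"king", "man", "woman"}))  # queen
```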
- To display word similarities and analogies graphically, run the following command. You can further customize this script according to your own needs:

```
python show_similarities.py -e embedding/embed.npy -v vocabulary/vocab.txt
```
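Visualizing embeddings typically means projecting the high-dimensional vectors down to 2-D first; a sketch of such a projection using PCA via SVD (the plotting itself, and how `show_similarities.py` actually does this, are left aside):

```python
import numpy as np

# Toy stand-in for a loaded embed.npy: 50 words, 300 dimensions.
rng = np.random.default_rng(0)
embeddings = rng.standard_normal((50, 300))

# PCA via SVD: center the data, then project onto the top-2
# right singular vectors to get 2-D coordinates for plotting.
centered = embeddings - embeddings.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
coords_2d = centered @ vt[:2].T
print(coords_2d.shape)  # (50, 2)
```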
- Download Stanford's Contextual Word Similarities (SCWS) dataset at http://ai.stanford.edu/~ehhuang/ and unzip it.
- Run the word sense disambiguation script with the following command:

```
python word_sense_disambiguation.py -e embedding/embed.npy -v vocabulary/vocab.txt --save_path results/ --rating_path SCWS/ratings.txt
```
- If you want to tune more parameters, run `python word_sense_disambiguation.py --help` to see a list of them.
- You will find your results in the folder specified by `--save_path`.
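The core idea behind context-based sense disambiguation (as in the SCWS setting) is to average the vectors of the surrounding context words and pick the sense whose vector lies closest to that average. A sketch with hand-made toy vectors; the sense names and numbers are invented for illustration:

```python
import numpy as np

def cos(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Toy sense vectors for the ambiguous word "bank" (hand-made).
senses = {
    "bank_river":   np.array([1.0, 0.0]),
    "bank_finance": np.array([0.0, 1.0]),
}

# Toy vectors for context words such as "money" and "loan".
context_vecs = [np.array([0.1, 0.9]), np.array([0.2, 0.8])]

# Average the context vectors, then pick the closest sense.
context = np.mean(context_vecs, axis=0)
best = max(senses, key=lambda s: cos(context, senses[s]))
print(best)  # bank_finance
```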
- https://github.com/chao-ji/tf-word2vec/blob/master/README.md
- https://github.com/kudkudak/word-embeddings-benchmarks
- https://github.com/logicalfarhad/word-sense-disambiguation/blob/master/word_sense.ipynb
- https://www.researchgate.net/publication/301403994_A_Unified_Model_for_Word_Sense_Representation_and_Disambiguation