forked from facebookresearch/fastText
Summary: See title.

Reviewed By: JoelMarcey

Differential Revision: D6459842

fbshipit-source-id: 149f1998df917d41d94b79a491a00b42b2e4bb0c
1 parent fde065c · commit 2fa6ae9

Showing 414 changed files with 29,504 additions and 0 deletions.
```
@@ -123,6 +123,27 @@ jobs:
          command: |
            . .circleci/gcc_test.sh
  "website":
    machine: true
    machine:
      node:
        version: 6.10.3
      npm:
        version: 3.10.10

    test:
      override:
        - "true"

    deployment:
      website:
        branch: master
        commands:
          - git config --global user.email "[email protected]"
          - git config --global user.name "Website Deployment Script"
          - echo "machine github.com login cpuhrsch password $GITHUB_TOKEN" > ~/.netrc
          - cd website && npm install && GIT_USER=cpuhrsch npm run publish-gh-pages

workflows:
  version: 2
  build:
@@ -136,3 +157,4 @@ workflows:
      - "gcc6"
      - "gcc7"
      - "gcclatest"
      - "website"
```
@@ -0,0 +1,6 @@
---
id: api
title: API
---

We automatically generate our [API documentation](/docs/en/html/index.html) with doxygen.
@@ -0,0 +1,66 @@
---
id: cheatsheet
title: Cheatsheet
---

## Word representation learning

In order to learn word vectors, do:

```bash
$ ./fasttext skipgram -input data.txt -output model
```
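fastText also provides a continuous-bag-of-words model as an alternative to skipgram; it is invoked the same way:

```bash
$ ./fasttext cbow -input data.txt -output model
```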
## Obtaining word vectors

Print word vectors for a text file `queries.txt` containing words:

```bash
$ ./fasttext print-word-vectors model.bin < queries.txt
```

## Text classification

In order to train a text classifier, do:

```bash
$ ./fasttext supervised -input train.txt -output model
```
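The training file is expected to contain one example per line, with each label prefixed by `__label__` (the default prefix; see the `-label` option). A minimal, purely illustrative `train.txt` could look like:

```
__label__positive I really enjoyed this movie !
__label__negative The plot was dull and predictable .
```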
Once the model is trained, you can evaluate it by computing the precision and recall at k (P@k and R@k) on a test set using:

```bash
$ ./fasttext test model.bin test.txt 1
```

In order to obtain the k most likely labels for a piece of text, use:

```bash
$ ./fasttext predict model.bin test.txt k
```

In order to obtain the k most likely labels and their associated probabilities for a piece of text, use:

```bash
$ ./fasttext predict-prob model.bin test.txt k
```

If you want to compute vector representations of sentences or paragraphs, please use:

```bash
$ ./fasttext print-sentence-vectors model.bin < text.txt
```

## Quantization

In order to create a `.ftz` file with a smaller memory footprint, do:

```bash
$ ./fasttext quantize -output model
```
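If the resulting file is still too large, the quantization options from the list of options (such as `-cutoff`, `-retrain` and `-qnorm`) can be combined; the following is only a sketch with illustrative values, assuming `train.txt` is the original training file:

```bash
$ ./fasttext quantize -output model -input train.txt -qnorm -retrain -epoch 1 -cutoff 100000
```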
All other commands, such as `test`, also work with this model:

```bash
$ ./fasttext test model.ftz test.txt
```
@@ -0,0 +1,6 @@
---
id: dataset
title: Datasets
---

[Download YFCC100M Dataset](https://fb-public.box.com/s/htfdbrvycvroebv9ecaezaztocbcnsdn)
@@ -0,0 +1,28 @@
---
id: english-vectors
title: English word vectors
---

This page gathers several pre-trained word vectors trained using fastText. More details will be added later.

### Download pre-trained word vectors

Pre-trained word vectors learned on different sources can be downloaded below:

1. [wiki-news-300d-1M.vec.zip](https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki-news-300d-1M.vec.zip): 1 million word vectors trained on Wikipedia 2017, the UMBC webbase corpus and the statmt.org news dataset (16B tokens).
2. [wiki-news-300d-1M-subword.vec.zip](https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki-news-300d-1M-subword.vec.zip): 1 million word vectors trained with subword information on Wikipedia 2017, the UMBC webbase corpus and the statmt.org news dataset (16B tokens).
3. [crawl-300d-2M.vec.zip](https://s3-us-west-1.amazonaws.com/fasttext-vectors/crawl-300d-2M.vec.zip): 2 million word vectors trained on Common Crawl (600B tokens).

### Format

The first line of the file contains the number of words in the vocabulary and the size of the vectors.
Each subsequent line contains a word followed by its vector, as in the default fastText text format.
Each value is space separated. Words are ordered by descending frequency.
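For illustration only, a hypothetical file with a two-word vocabulary and 4-dimensional vectors would look like this (the values are made up):

```
2 4
the 0.0231 -0.1187 0.0421 0.0052
of -0.0674 0.0218 0.1093 -0.0140
```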
### License

These word vectors are distributed under the [*Creative Commons Attribution-Share-Alike License 3.0*](https://creativecommons.org/licenses/by-sa/3.0/).

### References

We are preparing a publication describing how these models were trained.
@@ -0,0 +1,53 @@
---
id: faqs
title: FAQ
---

## What is fastText? Are there tutorials?

fastText is a library for text classification and representation learning. It transforms text into continuous vectors that can later be used on any language-related task. A few tutorials are available.

## Why are my fastText models so big?

fastText uses a hashtable for either word or character ngrams. The size of the hashtable directly impacts the size of a model. To reduce the size of the model, it is possible to reduce the size of this table with the option '-hash'; for example, a good value is 20000. Another option that greatly impacts the size of a model is the size of the vectors (`-dim`). This dimension can be reduced to save space, but this can significantly impact performance. If that still produces a model that is too big, one can further reduce the size of a trained model with the quantization option:

```bash
./fasttext quantize -output model
```
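As a sketch only: in the option list shipped with this release, the hashtable size appears as `-bucket` and the vector size as `-dim`, so a smaller supervised model could be trained with something like (illustrative values):

```bash
./fasttext supervised -input train.txt -output model -bucket 20000 -dim 50
```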
## What would be the best way to represent word phrases rather than words?

Currently, the best approach to represent word phrases or sentences is to take a bag of words of the word vectors. Additionally, for phrases like “New York”, preprocessing the data so that it becomes a single token “New_York” can greatly help.
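A minimal preprocessing sketch for the “New York” example (a plain substitution; a real pipeline would use a proper phrase-detection step, and the file names are only placeholders):

```bash
sed 's/New York/New_York/g' data.txt > data.preprocessed.txt
```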
## Why does fastText produce vectors even for unknown words?

One of the key features of fastText word representations is the ability to produce vectors for any word, even made-up ones.
Indeed, fastText word vectors are built from the vectors of the character substrings contained in the word.
This makes it possible to build vectors even for misspelled words or concatenations of words.
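For example, assuming a trained `model.bin`, querying a misspelled word still returns a vector built from its character ngrams:

```bash
echo "enviroment" | ./fasttext print-word-vectors model.bin
```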
## Why is the hierarchical softmax slightly worse in performance than the full softmax?

The hierarchical softmax is an approximation of the full softmax loss that makes it possible to train on a large number of classes efficiently. This often comes at the cost of a few percent of accuracy.
Note also that this loss is intended for unbalanced classes, that is, when some classes are more frequent than others. If your dataset has a balanced number of examples per class, it is worth trying the negative sampling loss (`-loss ns -neg 100`).
However, negative sampling will still be very slow at test time, since the full softmax is computed then.
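For example, to try the negative sampling loss suggested above (flags taken from the list of options):

```bash
./fasttext supervised -input train.txt -output model -loss ns -neg 100
```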
## Can we run the fastText program on a GPU?

fastText only works on CPU, for accessibility. That being said, fastText has also been implemented in the Caffe2 library, which can be run on GPU.

## Can I use fastText with Python? Or other languages?

There are a few unofficial wrappers for Python or Lua available on GitHub.

## Can I use fastText with continuous data?

fastText works on discrete tokens and thus cannot be directly used on continuous tokens. However, one can discretize continuous tokens to use fastText on them, for example by rounding values to a specific precision ("12.3" becomes "12").
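As an illustrative sketch only (not an official preprocessing tool), numeric tokens could be truncated before training, e.g. with awk:

```bash
awk '{ for (i = 1; i <= NF; i++) if ($i ~ /^[0-9]+\.[0-9]+$/) $i = int($i); print }' data.txt > data.discrete.txt
```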
## There are misspellings in the dictionary. Should we improve text normalization?

If the words are infrequent, there is no need to worry.

## My compiler / architecture can't build fastText. What should I do?

Try a newer version of your compiler. We try to maintain compatibility with older versions of gcc and many platforms; however, sometimes maintaining backwards compatibility becomes very hard. In general, compilers and toolchains that ship with LTS versions of major Linux distributions should be fair game. In any case, create an issue with your compiler version and architecture, and we'll try to implement compatibility.
@@ -0,0 +1,47 @@
---
id: language-identification
title: Language identification
---

### Description

We distribute two models for language identification, which can recognize 176 languages (see the list of ISO codes below). These models were trained on data from [Wikipedia](https://www.wikipedia.org/), [Tatoeba](https://tatoeba.org/eng/) and [SETimes](http://nlp.ffzg.hr/resources/corpora/setimes/), used under [CC-BY-SA](http://creativecommons.org/licenses/by-sa/3.0/).

We distribute two versions of the models:

* [lid.176.bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/supervised_models/lid.176.bin), which is faster and slightly more accurate, but has a file size of 126MB;
* [lid.176.ftz](https://s3-us-west-1.amazonaws.com/fasttext-vectors/supervised_models/lid.176.ftz), which is the compressed version of the model, with a file size of 917kB.

These models were trained on UTF-8 data, and therefore expect UTF-8 as input.
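For example, the models can be used from the command line with `predict`; passing `-` as the test file makes fastText read the text from standard input:

```bash
echo "Bonjour tout le monde" | ./fasttext predict lid.176.ftz -
# prints the predicted language label, e.g. __label__fr
```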
### License

The models are distributed under the [*Creative Commons Attribution-Share-Alike License 3.0*](https://creativecommons.org/licenses/by-sa/3.0/).

### List of supported languages

```
af als am an ar arz as ast av az azb ba bar bcl be bg bh bn bo bpy br bs bxr ca cbk ce ceb ckb co cs cv cy da de diq dsb dty dv el eml en eo es et eu fa fi fr frr fy ga gd gl gn gom gu gv he hi hif hr hsb ht hu hy ia id ie ilo io is it ja jbo jv ka kk km kn ko krc ku kv kw ky la lb lez li lmo lo lrc lt lv mai mg mhr min mk ml mn mr mrj ms mt mwl my myv mzn nah nap nds ne new nl nn no oc or os pa pam pfl pl pms pnb ps pt qu rm ro ru rue sa sah sc scn sco sd sh si sk sl so sq sr su sv sw ta te tg th tk tl tr tt tyv ug uk ur uz vec vep vi vls vo wa war wuu xal xmf yi yo yue zh
```

### References

If you use these models, please cite the following papers:

[1] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, *Bag of Tricks for Efficient Text Classification* (https://arxiv.org/abs/1607.01759)

```
@article{joulin2016bag,
  title={Bag of Tricks for Efficient Text Classification},
  author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1607.01759},
  year={2016}
}
```

[2] A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jégou, T. Mikolov, *FastText.zip: Compressing text classification models* (https://arxiv.org/abs/1612.03651)

```
@article{joulin2016fasttext,
  title={FastText.zip: Compressing text classification models},
  author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Douze, Matthijs and J{\'e}gou, Herv{\'e} and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1612.03651},
  year={2016}
}
```
@@ -0,0 +1,50 @@
---
id: options
title: List of options
---

Invoke a command without arguments to list available arguments and their default values:

```bash
$ ./fasttext supervised
Empty input or output path.

The following arguments are mandatory:
  -input              training file path
  -output             output file path

The following arguments are optional:
  -verbose            verbosity level [2]

The following arguments for the dictionary are optional:
  -minCount           minimal number of word occurences [5]
  -minCountLabel      minimal number of label occurences [0]
  -wordNgrams         max length of word ngram [1]
  -bucket             number of buckets [2000000]
  -minn               min length of char ngram [3]
  -maxn               max length of char ngram [6]
  -t                  sampling threshold [0.0001]
  -label              labels prefix [__label__]

The following arguments for training are optional:
  -lr                 learning rate [0.05]
  -lrUpdateRate       change the rate of updates for the learning rate [100]
  -dim                size of word vectors [100]
  -ws                 size of the context window [5]
  -epoch              number of epochs [5]
  -neg                number of negatives sampled [5]
  -loss               loss function {ns, hs, softmax} [ns]
  -thread             number of threads [12]
  -pretrainedVectors  pretrained word vectors for supervised learning []
  -saveOutput         whether output params should be saved [0]

The following arguments for quantization are optional:
  -cutoff             number of words and ngrams to retain [0]
  -retrain            finetune embeddings if a cutoff is applied [0]
  -qnorm              quantizing the norm separately [0]
  -qout               quantizing the classifier [0]
  -dsub               size of each sub-vector [2]
```

Defaults may vary by mode. (Word-representation modes `skipgram` and `cbow` use a default `-minCount` of 5.)
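For example, several of these options can be combined in a single call (the values are illustrative, not recommendations):

```bash
$ ./fasttext supervised -input train.txt -output model -epoch 25 -lr 1.0 -wordNgrams 2 -dim 100 -loss hs
```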