integrate source for website
Summary: See title.

Reviewed By: JoelMarcey

Differential Revision: D6459842

fbshipit-source-id: 149f1998df917d41d94b79a491a00b42b2e4bb0c
cpuhrsch authored and facebook-github-bot committed Dec 6, 2017
1 parent fde065c commit 2fa6ae9
Showing 414 changed files with 29,504 additions and 0 deletions.
22 changes: 22 additions & 0 deletions .circleci/config.yml
@@ -123,6 +123,27 @@ jobs:
          command: |
            . .circleci/gcc_test.sh
  "website":
    machine: true
    machine:
      node:
        version: 6.10.3
      npm:
        version: 3.10.10

    test:
      override:
        - "true"

    deployment:
      website:
        branch: master
        commands:
          - git config --global user.email "[email protected]"
          - git config --global user.name "Website Deployment Script"
          - echo "machine github.com login cpuhrsch password $GITHUB_TOKEN" > ~/.netrc
          - cd website && npm install && GIT_USER=cpuhrsch npm run publish-gh-pages

workflows:
  version: 2
  build:
@@ -136,3 +157,4 @@ workflows:
- "gcc6"
- "gcc7"
- "gcclatest"
- "website"
6 changes: 6 additions & 0 deletions docs/api.md
@@ -0,0 +1,6 @@
---
id: api
title: API
---

We automatically generate our [API documentation](/docs/en/html/index.html) with doxygen.
66 changes: 66 additions & 0 deletions docs/cheatsheet.md
@@ -0,0 +1,66 @@
---
id: cheatsheet
title: Cheatsheet
---

## Word representation learning

In order to learn word vectors do:

```bash
$ ./fasttext skipgram -input data.txt -output model
```

## Obtaining word vectors

Print word vectors for a text file `queries.txt` containing words.

```bash
$ ./fasttext print-word-vectors model.bin < queries.txt
```

## Text classification

In order to train a text classifier do:

```bash
$ ./fasttext supervised -input train.txt -output model
```

Once the model has been trained, you can evaluate it by computing the precision and recall at k (P@k and R@k) on a test set using:

```bash
$ ./fasttext test model.bin test.txt 1
```

In order to obtain the k most likely labels for a piece of text, use:

```bash
$ ./fasttext predict model.bin test.txt k
```

In order to obtain the k most likely labels and their associated probabilities for a piece of text, use:

```bash
$ ./fasttext predict-prob model.bin test.txt k
```

If you want to compute vector representations of sentences or paragraphs, please use:

```bash
$ ./fasttext print-sentence-vectors model.bin < text.txt
```

## Quantization

In order to create a `.ftz` file with a smaller memory footprint do:

```bash
$ ./fasttext quantize -output model
```

All other commands, such as `test`, also work with this model:

```bash
$ ./fasttext test model.ftz test.txt
```
6 changes: 6 additions & 0 deletions docs/dataset.md
@@ -0,0 +1,6 @@
---
id: dataset
title: Datasets
---

[Download YFCC100M Dataset](https://fb-public.box.com/s/htfdbrvycvroebv9ecaezaztocbcnsdn)
28 changes: 28 additions & 0 deletions docs/english-vectors.md
@@ -0,0 +1,28 @@
---
id: english-vectors
title: English word vectors
---

This page gathers several pre-trained word vectors trained using fastText. More details will be added later.

### Download pre-trained word vectors

Pre-trained word vectors learned on different sources can be downloaded below:

1. [wiki-news-300d-1M.vec.zip](https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki-news-300d-1M.vec.zip): 1 million word vectors trained on Wikipedia 2017, UMBC webbase corpus and statmt.org news dataset (16B tokens).
2. [wiki-news-300d-1M-subword.vec.zip](https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki-news-300d-1M-subword.vec.zip): 1 million word vectors trained with subword information on Wikipedia 2017, UMBC webbase corpus and statmt.org news dataset (16B tokens).
3. [crawl-300d-2M.vec.zip](https://s3-us-west-1.amazonaws.com/fasttext-vectors/crawl-300d-2M.vec.zip): 2 million word vectors trained on Common Crawl (600B tokens).

### Format

The first line of the file contains the number of words in the vocabulary and the size of the vectors.
Each line contains a word followed by its vector, as in the default fastText text format.
The values are space-separated, and words are ordered by descending frequency.
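
For example, a quick way to inspect this format from the shell (assuming `wiki-news-300d-1M.vec` has been downloaded and unzipped) is:

```bash
# The first line holds the vocabulary size and the vector dimension (300 for these files).
head -n 1 wiki-news-300d-1M.vec
# The second line starts with the most frequent word; show it and its first three values.
head -n 2 wiki-news-300d-1M.vec | tail -n 1 | cut -d' ' -f1-4
```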

### License

These word vectors are distributed under the [*Creative Commons Attribution-Share-Alike License 3.0*](https://creativecommons.org/licenses/by-sa/3.0/).

### References

We are preparing a publication describing how these models were trained.
53 changes: 53 additions & 0 deletions docs/faqs.md
@@ -0,0 +1,53 @@
---
id: faqs
title: FAQ
---

## What is fastText? Are there tutorials?

FastText is a library for text classification and representation. It transforms text into continuous vectors that can later be used on any language-related task. A few tutorials are available.

## Why are my fastText models so big?

fastText uses a hash table for word and character ngrams. The size of the hash table directly impacts the size of a model. To reduce the size of the model, it is possible to reduce the size of this table with the option `-hash`; for example, a good value is 20000. Another option that greatly impacts model size is the size of the vectors (`-dim`). This dimension can be reduced to save space, but it can significantly impact performance. If that still produces a model that is too big, one can further reduce the size of a trained model with the quantization option:
```bash
./fasttext quantize -output model
```
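
As an illustrative sketch (the file name `train.txt` and the specific `-hash`, `-dim` and `-cutoff` values are placeholders, not recommendations), these options can be combined as follows:

```bash
# Shrink a supervised model: smaller hash table and lower dimension at training
# time, then quantize the trained model (retraining ngrams with a cutoff).
./fasttext supervised -input train.txt -output model -hash 20000 -dim 50
./fasttext quantize -input train.txt -output model -cutoff 100000 -retrain -qnorm
```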

## What would be the best way to represent word phrases rather than words?

Currently, the best approach to represent word phrases or sentences is to take a bag of words of the word vectors. Additionally, for phrases like “New York”, preprocessing the data so that the phrase becomes a single token “New_York” can greatly help.
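
As an illustrative sketch (not an official recipe; `model.bin` and `text.txt` are placeholders), one could merge known phrases into single tokens and then let `print-sentence-vectors` average the word vectors of each line:

```bash
# Turn the phrase "New York" into a single token, then print one vector per line of text.
sed 's/New York/New_York/g' text.txt | ./fasttext print-sentence-vectors model.bin
```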

## Why does fastText produce vectors even for unknown words?

One of the key features of fastText word representations is the ability to produce vectors for any word, even made-up ones.
Indeed, fastText word vectors are built from the vectors of the character substrings contained in the word.
This makes it possible to build vectors even for misspelled words or concatenations of words.
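
For example, assuming a trained `model.bin`, even a misspelled word still gets a vector built from its character ngrams:

```bash
# "enviroment" is a misspelling, yet fastText prints the word followed by a vector for it.
echo "enviroment" | ./fasttext print-word-vectors model.bin
```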

## Why is the hierarchical softmax slightly worse in performance than the full softmax?

The hierarchical softmax is an approximation of the full softmax loss that allows training on a large number of classes efficiently. This often comes at the cost of a few percent of accuracy.
Note also that this loss is intended for unbalanced classes, that is, when some classes are more frequent than others. If your dataset has a balanced number of examples per class, it is worth trying the negative sampling loss (`-loss ns -neg 100`).
However, negative sampling will still be very slow at test time, since the full softmax is computed.
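
A minimal sketch of that negative sampling setup (with placeholder file names) could be:

```bash
# Supervised training with the negative sampling loss and 100 negatives per example.
./fasttext supervised -input train.txt -output model -loss ns -neg 100
```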

## Can we run the fastText program on a GPU?

FastText only works on CPU, which keeps it accessible. That being said, fastText has been implemented in the Caffe2 library, which can be run on GPU.

## Can I use fastText with Python? Or other languages?

There are a few unofficial wrappers for Python or Lua available on GitHub.

## Can I use fastText with continuous data?

FastText works on discrete tokens and thus cannot be directly used on continuous data. However, one can discretize continuous values to use fastText on them, for example by rounding them to a fixed precision ("12.3" becomes "12").
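
As a sketch of such a discretization step (assuming a hypothetical file `raw.txt` of space-separated numeric values):

```bash
# Truncate every numeric token to an integer, so that "12.3" becomes "12".
awk '{ for (i = 1; i <= NF; i++) printf "%d%s", $i, (i < NF ? " " : "\n") }' raw.txt > discrete.txt
```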

## There are misspellings in the dictionary. Should we improve text normalization?
If the words are infrequent, there is no need to worry.

## My compiler / architecture can't build fastText. What should I do?
Try a newer version of your compiler. We try to maintain compatibility with older versions of gcc and many platforms, but sometimes maintaining backwards compatibility becomes very hard. In general, compilers and toolchains that ship with LTS versions of major Linux distributions should be fair game. In any case, create an issue with your compiler version and architecture and we'll try to implement compatibility.




47 changes: 47 additions & 0 deletions docs/language-identification.md
@@ -0,0 +1,47 @@
---
id: language-identification
title: Language identification
---

### Description

We distribute two models for language identification, which can recognize 176 languages (see the list of ISO codes below). These models were trained on data from [Wikipedia](https://www.wikipedia.org/), [Tatoeba](https://tatoeba.org/eng/) and [SETimes](http://nlp.ffzg.hr/resources/corpora/setimes/), used under [CC-BY-SA](http://creativecommons.org/licenses/by-sa/3.0/).

We distribute two versions of the models:

* [lid.176.bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/supervised_models/lid.176.bin), which is faster and slightly more accurate, but has a file size of 126 MB;
* [lid.176.ftz](https://s3-us-west-1.amazonaws.com/fasttext-vectors/supervised_models/lid.176.ftz), which is the compressed version of the model, with a file size of 917 kB.

These models were trained on UTF-8 data, and therefore expect UTF-8 as input.
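
As a usage sketch (assuming `lid.176.ftz` has been downloaded to the working directory), UTF-8 text can be piped to `predict`:

```bash
# "-" tells predict to read the text to classify from standard input;
# the output is a language label such as __label__fr.
echo "Bonjour tout le monde" | ./fasttext predict lid.176.ftz -
```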

### License

The models are distributed under the [*Creative Commons Attribution-Share-Alike License 3.0*](https://creativecommons.org/licenses/by-sa/3.0/).

### List of supported languages
```
af als am an ar arz as ast av az azb ba bar bcl be bg bh bn bo bpy br bs bxr ca cbk ce ceb ckb co cs cv cy da de diq dsb dty dv el eml en eo es et eu fa fi fr frr fy ga gd gl gn gom gu gv he hi hif hr hsb ht hu hy ia id ie ilo io is it ja jbo jv ka kk km kn ko krc ku kv kw ky la lb lez li lmo lo lrc lt lv mai mg mhr min mk ml mn mr mrj ms mt mwl my myv mzn nah nap nds ne new nl nn no oc or os pa pam pfl pl pms pnb ps pt qu rm ro ru rue sa sah sc scn sco sd sh si sk sl so sq sr su sv sw ta te tg th tk tl tr tt tyv ug uk ur uz vec vep vi vls vo wa war wuu xal xmf yi yo yue zh
```

### References

If you use these models, please cite the following papers:

[1] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, *Bag of Tricks for Efficient Text Classification* (https://arxiv.org/abs/1607.01759)
```
@article{joulin2016bag,
title={Bag of Tricks for Efficient Text Classification},
author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Mikolov, Tomas},
journal={arXiv preprint arXiv:1607.01759},
year={2016}
}
```
[2] A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jégou, T. Mikolov, *FastText.zip: Compressing text classification models* (https://arxiv.org/abs/1612.03651)
```
@article{joulin2016fasttext,
title={FastText.zip: Compressing text classification models},
author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Douze, Matthijs and J{\'e}gou, H{\'e}rve and Mikolov, Tomas},
journal={arXiv preprint arXiv:1612.03651},
year={2016}
}
```
50 changes: 50 additions & 0 deletions docs/options.md
@@ -0,0 +1,50 @@
---
id: options
title: List of options
---

Invoke a command without arguments to list available arguments and their default values:

```bash
$ ./fasttext supervised
Empty input or output path.

The following arguments are mandatory:
-input training file path
-output output file path

The following arguments are optional:
-verbose verbosity level [2]

The following arguments for the dictionary are optional:
-minCount minimal number of word occurences [5]
-minCountLabel minimal number of label occurences [0]
-wordNgrams max length of word ngram [1]
-bucket number of buckets [2000000]
-minn min length of char ngram [3]
-maxn max length of char ngram [6]
-t sampling threshold [0.0001]
-label labels prefix [__label__]

The following arguments for training are optional:
-lr learning rate [0.05]
-lrUpdateRate change the rate of updates for the learning rate [100]
-dim size of word vectors [100]
-ws size of the context window [5]
-epoch number of epochs [5]
-neg number of negatives sampled [5]
-loss loss function {ns, hs, softmax} [ns]
-thread number of threads [12]
-pretrainedVectors pretrained word vectors for supervised learning []
-saveOutput whether output params should be saved [0]

The following arguments for quantization are optional:
-cutoff number of words and ngrams to retain [0]
-retrain finetune embeddings if a cutoff is applied [0]
-qnorm quantizing the norm separately [0]
-qout quantizing the classifier [0]
-dsub size of each sub-vector [2]
```
Defaults may vary by mode. (Word-representation modes `skipgram` and `cbow` use a default `-minCount` of 5.)
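
As an illustrative example (the file names and values below are placeholders, not recommendations), several of these options can be combined on a single command line:

```bash
# Supervised training overriding a few of the defaults listed above.
./fasttext supervised -input train.txt -output model -dim 50 -epoch 25 -wordNgrams 2 -loss hs
```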