diff --git a/docs/supervised-tutorial.md b/docs/supervised-tutorial.md
index cdc4f29f3..1a45f8197 100644
--- a/docs/supervised-tutorial.md
+++ b/docs/supervised-tutorial.md
@@ -51,7 +51,7 @@ The commands supported by fasttext are:
```
-In this tutorial, we mainly use the `supervised`, `test` and `predict` subcommands, which corresponds to learning (and using) text classifier. For an introduction to the other functionalities of fastText, please see the [tutorial about learning word vectors](https://github.com/facebookresearch/fastText/blob/master/tutorials/unsupervised-learning.md).
+In this tutorial, we mainly use the `supervised`, `test` and `predict` subcommands, which correspond to learning (and using) a text classifier. For an introduction to the other functionalities of fastText, please see the [tutorial about learning word vectors](https://fasttext.cc/docs/en/unsupervised-tutorial.html).

## Getting and preparing the data

diff --git a/docs/unsupervised-tutorials.md b/docs/unsupervised-tutorials.md
index 12ddcfce6..bc79d7761 100644
--- a/docs/unsupervised-tutorials.md
+++ b/docs/unsupervised-tutorials.md
@@ -4,9 +4,9 @@ title: Word representations
---
A popular idea in modern machine learning is to represent words by vectors. These vectors capture hidden information about a language, like word analogies or semantics. They are also used to improve the performance of text classifiers.

-In this tutorial, we show how to build these word vectors with the fastText tool. To download and install fastText, follow the first steps of [the tutorial on text classification](https://github.com/facebookresearch/fastText/blob/master/tutorials/supervised-learning.md).
+In this tutorial, we show how to build these word vectors with the fastText tool. To download and install fastText, follow the first steps of [the tutorial on text classification](https://fasttext.cc/docs/en/supervised-tutorial.html).

-# Getting the data
+## Getting the data

In order to compute word vectors, you need a large text corpus. Depending on the corpus, the word vectors will capture different information. In this tutorial, we focus on Wikipedia's articles, but other sources could be considered, like news or Webcrawl (more examples [here](http://statmt.org/)). To download a raw dump of Wikipedia, run the following command:

@@ -22,7 +22,7 @@ $ wget -c http://mattmahoney.net/dc/enwik9.zip -P data
$ unzip data/enwik9.zip -d data
```
-A raw Wikipedia dump contains a lot of HTML / XML data. We pre-process it with the wikifil.pl script bundled with fastText (this script was originally developed by Matt Mahoney, and can be found on his [website](http://mattmahoney.net/) )
+A raw Wikipedia dump contains a lot of HTML / XML data. We pre-process it with the wikifil.pl script bundled with fastText (this script was originally developed by Matt Mahoney and can be found on his [website](http://mattmahoney.net/)).

```bash
$ perl wikifil.pl data/enwik9 > data/fil9
```
@@ -37,7 +37,7 @@ anarchism originated as a term of abuse first used against early working class

The text is nicely pre-processed and can be used to learn our word vectors.

-# Training word vectors
+## Training word vectors

Learning word vectors on this data can now be achieved with a single command:

@@ -68,7 +68,7 @@ one 0.32731 0.044409 -0.46484 0.14716 0.7431 0.24684 -0.11301 0.51721 0.73262 ..

The first line is a header containing the number of words and the dimensionality of the vectors.
The subsequent lines are the word vectors for all words in the vocabulary, sorted by decreasing frequency.

-## Advanced readers: skipgram versus cbow
+### Advanced readers: skipgram versus cbow

fastText provides two models for computing word representations: skipgram and cbow ('**c**ontinuous-**b**ag-**o**f-**w**ords').

@@ -76,7 +76,7 @@ The skipgram model learns to predict a target word thanks to a nearby word. On t
Let us illustrate this difference with an example: given the sentence *'Poets have been mysteriously silent on the subject of cheese'* and the target word '*silent*', a skipgram model tries to predict the target using a random close-by word, like '*subject*' or '*mysteriously*'. The cbow model takes all the words in a surrounding window, like {*been*, *mysteriously*, *on*, *the*}, and uses the sum of their vectors to predict the target. The figure below summarizes this difference with another example.

-![cbow vs skipgram](https://github.com/facebookresearch/fastText/blob/master/tutorials/cbo_vs_skipgram.png)
+![cbow vs skipgram](https://fasttext.cc/img/cbo_vs_skipgram.png)
To train a cbow model with fastText, you run the following command:

```bash
./fasttext cbow -input data/fil9 -output result/fil9
```

In practice, we observe that skipgram models work better with subword information than cbow.

-## Advanced readers: playing with the parameters
+### Advanced readers: playing with the parameters

So far, we have run fastText with the default parameters, but depending on the data, these parameters may not be optimal. Let us give an introduction to some of the key parameters for word vectors.

@@ -110,7 +110,7 @@ $ ./fasttext skipgram -input data/fil9 -output result/fil9 -thread 4

-# Printing word vectors
+## Printing word vectors

Searching and printing word vectors directly from the `fil9.vec` file is cumbersome. Fortunately, there is a `print-word-vectors` functionality in fastText.

@@ -134,7 +134,7 @@ $ echo "enviroment" | ./fasttext print-word-vectors result/fil9.bin

You still get a word vector for it! But how good is it? Let's find out in the next sections!

-# Nearest neighbor queries
+## Nearest neighbor queries

A simple way to check the quality of a word vector is to look at its nearest neighbors. This gives an intuition of the type of semantic information the vectors are able to capture.

@@ -146,7 +146,7 @@ Pre-computing word vectors... done.
```

- Then we are prompted to type our query word, let us try *asparagus* :
+Then we are prompted to type our query word; let us try *asparagus*:

```bash
Query word? asparagus
@@ -196,11 +196,11 @@ ecotourism 0.697081

Thanks to the information contained within the word, the vector of our misspelled word matches reasonable words! It is not perfect, but the main information has been captured.

-## Advanced reader: measure of similarity
+### Advanced reader: measure of similarity

In order to find nearest neighbors, we need to compute a similarity score between words. Our words are represented by continuous word vectors and we can thus apply simple similarities to them. In particular, we use the cosine of the angle between two vectors. This similarity is computed for all words in the vocabulary, and the 10 most similar words are shown. Of course, if the word appears in the vocabulary, it will appear on top, with a similarity of 1.

-# Word analogies
+## Word analogies

In a similar spirit, one can play around with word analogies.
For example, we can see if our model can guess which word is to France what Berlin is to Germany.

@@ -241,7 +241,7 @@ famicom 0.745298

Our model considers that the *nintendo* analogy of a *psx* is the *gamecube*, which seems reasonable. Of course, the quality of the analogies depends on the dataset used to train the model, and one can only hope to cover the fields present in that dataset.

-# Importance of character n-grams
+## Importance of character n-grams

Using subword-level information is particularly interesting for building vectors for unknown words. For example, the word *gearshift* does not exist on Wikipedia but we can still query its closest existing words:

@@ -304,6 +304,6 @@ hospitality 0.701426

The nearest neighbors capture different variations of the word *accommodation*. We also get semantically related words such as *amenities* or *lodging*.

-# Conclusion
+## Conclusion

-In this tutorial, we show how to obtain word vectors from Wikipedia. This can be done for any language and you can find pre-trained models with the default setting for 294 of them [here](https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md)
+In this tutorial, we show how to obtain word vectors from Wikipedia. This can be done for any language, and we provide [pre-trained models](https://fasttext.cc/docs/en/pretrained-vectors.html) with the default settings for 294 of them.
diff --git a/tutorials/supervised-learning.md b/tutorials/supervised-learning.md
deleted file mode 100644
index 202bb9c79..000000000
--- a/tutorials/supervised-learning.md
+++ /dev/null
@@ -1,286 +0,0 @@
-# Learning a text classifier using fastText
-
-Text classification is a core problem to many applications, like spam detection, sentiment analysis or smart replies. In this tutorial, we describe how to build a text classifier with the fastText tool.
-
-## What is text classification?
-
-The goal of text classification is to assign documents (such as emails, posts, text messages, product reviews, etc...) to one or multiple categories. Such categories can be review scores, spam v.s. non-spam, or the language in which the document was typed. Nowadays, the dominant approach to build such classifiers is machine learning, that is learning classification rules from examples. In order to build such classifiers, we need labeled data, which consists of documents and their corresponding categories (or tags, or labels).
-
-As an example, we build a classifier which automatically classifies stackexchange questions about cooking into one of several possible tags, such as `pot`, `bowl` or `baking`.
-
-## Installing fastText
-
-The first step of this tutorial is to install and build fastText. It only requires a c++ compiler with good support of c++11.
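If you are not sure whether your compiler is recent enough, here is a quick optional check (a small sketch assuming GCC or Clang is the system compiler; the exact version string will differ from machine to machine):

```bash
# Print the default C++ compiler version; a reasonably recent g++ (>= 4.7.2)
# or clang (>= 3.3) provides the C++11 support that fastText needs.
$ c++ --version
```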
- -Let us start by downloading the [most recent release](https://github.com/facebookresearch/fastText/releases): - -``` -$ wget https://github.com/facebookresearch/fastText/archive/v0.1.0.zip -$ unzip v0.1.0.zip -``` - -Move to the fastText directory and build it: - -``` -$ cd fastText-0.1.0 -$ make -``` - -Running the binary without any argument will print the high level documentation, showing the different usecases supported by fastText: - -``` ->> ./fasttext -usage: fasttext - -The commands supported by fasttext are: - - supervised train a supervised classifier - quantize quantize a model to reduce the memory usage - test evaluate a supervised classifier - predict predict most likely labels - predict-prob predict most likely labels with probabilities - skipgram train a skipgram model - cbow train a cbow model - print-word-vectors print word vectors given a trained model - print-sentence-vectors print sentence vectors given a trained model - nn query for nearest neighbors - analogies query for analogies - -``` - -In this tutorial, we mainly use the `supervised`, `test` and `predict` subcommands, which corresponds to learning (and using) text classifier. For an introduction to the other functionalities of fastText, please see the [tutorial about learning word vectors](https://github.com/facebookresearch/fastText/blob/master/tutorials/unsupervised-learning.md). - -## Getting and preparing the data - -As mentioned in the introduction, we need labeled data to train our supervised classifier. In this tutorial, we are interested in building a classifier to automatically recognize the topic of a stackexchange question about cooking. Let's download examples of questions from [the cooking section of Stackexchange](http://cooking.stackexchange.com/), and their associated tags: - -``` ->> wget https://s3-us-west-1.amazonaws.com/fasttext-vectors/cooking.stackexchange.tar.gz && tar xvzf cooking.stackexchange.tar.gz ->> head cooking.stackexchange.txt -``` - -Each line of the text file contains a list of labels, followed by the corresponding document. All the labels start by the `__label__` prefix, which is how fastText recognize what is a label or what is a word. The model is then trained to predict the labels given the word in the document. - -Before training our first classifier, we need to split the data into train and validation. We will use the validation set to evaluate how good the learned classifier is on new data. - -``` ->> wc cooking.stackexchange.txt - 15404 169582 1401900 cooking.stackexchange.txt -``` - -Our full dataset contains 15404 examples. Let's split it into a training set of 12404 examples and a validation set of 3000 examples: - -``` ->> head -n 12404 cooking.stackexchange.txt > cooking.train ->> tail -n 3000 cooking.stackexchange.txt > cooking.valid -``` - -## Our first classifier - -We are now ready to train our first classifier: - -``` ->> ./fasttext supervised -input cooking.train -output model_cooking -Read 0M words -Number of words: 14598 -Number of labels: 734 -Progress: 100.0% words/sec/thread: 75109 lr: 0.000000 loss: 5.708354 eta: 0h0m -``` - -The `-input` command line option indicates the file containing the training examples, while the `-output` option indicates where to save the model. At the end of training, a file `model_cooking.bin`, containing the trained classifier, is created in the current directory. 
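Because the classifier is saved to disk, it can be reused later without retraining. For instance (a minimal sketch assuming the `fasttext` binary and `model_cooking.bin` are both in the current directory), a question can be piped straight into the model from the shell:

```bash
# "-" tells fastText to read the text to classify from standard input;
# by default, predict prints the single most likely label for the piped question.
$ echo "Which baking dish is best to bake a banana bread ?" | ./fasttext predict model_cooking.bin -
```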
- -It is possible to directly test our classifier interactively, by running the command: - -``` ->> ./fasttext predict model_cooking.bin - -``` - -and then typing a sentence. Let's first try the sentence: - -*Which baking dish is best to bake a banana bread ?* - -The predicted tag is `baking` which fits well to this question. Let us now try a second example: - -*Why not put knives in the dishwasher?* - -The label predicted by the model is `food-safety`, which is not relevant. Somehow, the model seems to fail on simple examples. To get a better sense of its quality, let's test it on the validation data by running: - -``` ->> ./fasttext test model_cooking.bin cooking.valid -N 3000 -P@1 0.124 -R@1 0.0541 -Number of examples: 3000 -``` - -The output of fastText are the precision at one (`P@1`) and the recall at one (`R@1`). We can also compute the precision at five and recall at five with: - -``` ->> ./fasttext test model_cooking.bin cooking.valid 5 -N 3000 -P@5 0.0668 -R@5 0.146 -Number of examples: 3000 -``` - -### Advanced reader: precision and recall - -The precision is the number of correct labels among the labels predicted by fastText. The recall is the number of labels that successfully were predicted, among all the real labels. Let's take an example to make this more clear: - -*Why not put knives in the dishwasher?* - -On Stack Exchange, this sentence is labeled with three tags: `equipment`, `cleaning` and `knives`. The top five labels predicted by the model can be obtained with: - -``` ->> ./fasttext predict model_cooking.bin - 5 -``` - -are `food-safety`, `baking`, `equipment`, `substitutions` and `bread`. - -Thus, one out of five labels predicted by the model is correct, giving a precision of 0.20. Out of the three real labels, only one is predicted by the model, giving a recall of 0.33. - -For more details, see [the related Wikipedia page](https://en.wikipedia.org/wiki/Precision_and_recall). - -## Making the model better - -The model obtained by running fastText with the default arguments is pretty bad at classifying new questions. Let's try to improve the performance, by changing the default parameters. - -### preprocessing the data - -Looking at the data, we observe that some words contain uppercase letter or punctuation. One of the first step to improve the performance of our model is to apply some simple pre-processing. A crude normalization can be obtained using command line tools such as `sed` and `tr`: - -``` ->> cat cooking.stackexchange.txt | sed -e "s/\([.\!?,'/()]\)/ \1 /g" | tr "[:upper:]" "[:lower:]" > cooking.preprocessed.txt ->> head -n 12404 cooking.preprocessed.txt > cooking.train ->> tail -n 3000 cooking.preprocessed.txt > cooking.valid -``` - -Let's train a new model on the pre-processed data: - -``` ->> ./fasttext supervised -input cooking.train -output model_cooking -Read 0M words -Number of words: 9012 -Number of labels: 734 -Progress: 100.0% words/sec/thread: 82041 lr: 0.000000 loss: 5.671649 eta: 0h0m h-14m - ->> ./fasttext test model_cooking.bin cooking.valid -N 3000 -P@1 0.164 -R@1 0.0717 -Number of examples: 3000 -``` - -We observe that thanks to the pre-processing, the vocabulary is smaller (from 14k words to 9k). The precision is also starting to go up by 4%! - -### more epochs and larger learning rate - -By default, fastText sees each training example only five times during training, which is pretty small, given that our training set only have 12k training examples. 
The number of times each examples is seen (also known as the number of epochs), can be increased using the `-epoch` option: - -``` ->> ./fasttext supervised -input cooking.train -output model_cooking -epoch 25 -Read 0M words -Number of words: 9012 -Number of labels: 734 -Progress: 100.0% words/sec/thread: 77633 lr: 0.000000 loss: 7.147976 eta: 0h0m -``` - -Let's test the new model: - -``` ->> ./fasttext test model_cooking.bin cooking.valid -N 3000 -P@1 0.501 -R@1 0.218 -Number of examples: 3000 -``` - -This is much better! Another way to change the learning speed of our model is to increase (or decrease) the learning rate of the algorithm. This corresponds to how much the model changes after processing each example. A learning rate of 0 would means that the model does not change at all, and thus, does not learn anything. Good values of the learning rate are in the range `0.1 - 1.0`. - -``` ->> ./fasttext supervised -input cooking.train -output model_cooking -lr 1.0 -Read 0M words -Number of words: 9012 -Number of labels: 734 -Progress: 100.0% words/sec/thread: 81469 lr: 0.000000 loss: 6.405640 eta: 0h0m - ->> ./fasttext test model_cooking.bin cooking.valid -N 3000 -P@1 0.563 -R@1 0.245 -Number of examples: 3000 -``` - -Even better! Let's try both together: - -``` ->> ./fasttext supervised -input cooking.train -output model_cooking -lr 1.0 -epoch 25 -Read 0M words -Number of words: 9012 -Number of labels: 734 -Progress: 100.0% words/sec/thread: 76394 lr: 0.000000 loss: 4.350277 eta: 0h0m - ->> ./fasttext test model_cooking.bin cooking.valid -N 3000 -P@1 0.585 -R@1 0.255 -Number of examples: 3000 -``` - -Let us now add a few more features to improve even further our performance! - -### word n-grams - -Finally, we can improve the performance of a model by using word bigrams, instead of just unigrams. This is especially important for classification problems where word order is important, such as sentiment analysis. - -``` ->> ./fasttext supervised -input cooking.train -output model_cooking -lr 1.0 -epoch 25 -wordNgrams 2 -Read 0M words -Number of words: 9012 -Number of labels: 734 -Progress: 100.0% words/sec/thread: 75366 lr: 0.000000 loss: 3.226064 eta: 0h0m - ->> ./fasttext test model_cooking.bin cooking.valid -N 3000 -P@1 0.599 -R@1 0.261 -Number of examples: 3000 -``` - -With a few steps, we were able to go from a precision at one of 12.4% to 59.9%. Important steps included: - -* preprocessing the data ; -* changing the number of epochs (using the option `-epoch`, standard range `[5 - 50]`) ; -* changing the learning rate (using the option `-lr`, standard range `[0.1 - 1.0]`) ; -* using word n-grams (using the option `-wordNgrams`, standard range `[1 - 5]`). - -### Advanced readers: What is a Bigram? - -A 'unigram' refers to a single undividing unit, or token, usually used as an input to a model. For example a unigram can a word or a letter depending on the model. In fastText, we work at the word level and thus unigrams are words. - -Similarly we denote by 'bigram' the concatenation of 2 consecutive tokens or words. Similarly we often talk about n-gram to refer to the concatenation any n consecutive tokens. - -For example, in the sentence, 'Last donut of the night', the unigrams are 'last', 'donut', 'of', 'the' and 'night'. The bigrams are: 'Last donut', 'donut of', 'of the' and 'the night'. - -Bigrams are particularly interesting because, for most sentences, you can reconstruct the order of the words just by looking at a bag of n-grams. 
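To make this concrete, here is a small shell one-liner (plain `awk`, not part of fastText) that lists the word bigrams of the example sentence used above:

```bash
# Split the sentence on whitespace and print each pair of consecutive words.
$ echo "Last donut of the night" | awk '{for (i = 1; i < NF; i++) print $i, $(i+1)}'
Last donut
donut of
of the
the night
```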
- -Let us illustrate this by a simple exercise, given the following bigrams, try to reconstruct the original sentence: 'all out', 'I am', 'of bubblegum', 'out of' and 'am all'. -It is common to refer to a word as a unigram. - -## Scaling things up - -Since we are training our model on a few thousands of examples, the training only takes a few seconds. But training models on larger datasets, with more labels can start to be too slow. A potential solution to make the training faster is to use the hierarchical softmax, instead of the regular softmax [Add a quick explanation of the hierarchical softmax]. This can be done with the option `-loss hs`: - -``` ->> ./fasttext supervised -input cooking.train -output model_cooking -lr 1.0 -epoch 25 -wordNgrams 2 -bucket 200000 -dim 50 -loss hs -Read 0M words -Number of words: 9012 -Number of labels: 734 -Progress: 100.0% words/sec/thread: 2199406 lr: 0.000000 loss: 1.718807 eta: 0h0m -``` - -Training should now take less than a second. - -## Conclusion - -In this tutorial, we gave a brief overview of how to use fastText to train powerful text classifiers. We had a light overview of some of the most important options to tune. diff --git a/tutorials/unsupervised-learning.md b/tutorials/unsupervised-learning.md deleted file mode 100644 index 6bc00d33e..000000000 --- a/tutorials/unsupervised-learning.md +++ /dev/null @@ -1,307 +0,0 @@ -# Learning word representations using fastText - -A popular idea in modern machine learning is to represent words by vectors. These vectors capture hidden information about a language, like word analogies or semantic. It is also used to improve performance of text classifiers. - -In this tutorial, we show how to build these word vectors with the fastText tool. To download and install fastText, follow the first steps of [the tutorial on text classification](https://github.com/facebookresearch/fastText/blob/master/tutorials/supervised-learning.md). - -# Getting the data - -In order to compute word vectors, you need a large text corpus. Depending on the corpus, the word vectors will capture different information. In this tutorial, we focus on Wikipedia's articles but other sources could be considered, like news or Webcrawl (more examples [here](http://statmt.org/)). To download a raw dump of Wikipedia, run the following command: - -``` -wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2 -``` - -Downloading the Wikipedia corpus takes some time. Instead, lets restrict our study to the first 1 billion bytes of English Wikipedia. They can be found on Matt Mahoney's [website](http://mattmahoney.net/): - -``` -$ mkdir data -$ wget -c http://mattmahoney.net/dc/enwik9.zip -P data -$ unzip data/enwik9.zip -d data -``` - -A raw Wikipedia dump contains a lot of HTML / XML data. We pre-process it with the wikifil.pl script bundled with fastText (this script was originally developed by Matt Mahoney, and can be found on his [website](http://mattmahoney.net/) ) - -``` -$ perl wikifil.pl data/enwik9 > data/fil9 -``` - -We can check the file by running the following command: - -``` -$ head -c 80 data/fil9 -anarchism originated as a term of abuse first used against early working class -``` - -The text is nicely pre-processed and can be used to learn our word vectors. 
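Before training, it can also be worth taking a quick look at the size of the filtered corpus (an optional check; the exact counts depend on the Wikipedia dump you downloaded, so no numbers are quoted here):

```bash
# Rough corpus statistics for the filtered text.
$ wc -c data/fil9   # size in bytes
$ wc -w data/fil9   # number of whitespace-separated tokens
```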
- -# Training word vectors - -Learning word vectors on this data can now be achieved with a single command: - -``` -$ mkdir result -$ ./fasttext skipgram -input data/fil9 -output result/fil9 -``` - -To decompose this command line: ./fastext calls the binary fastText executable (see how to install fastText here) with the 'skipgram' model (it can also be 'cbow'). We then specify the requires options '-input' for the location of the data and '-output' for the location where the word representations will be saved. - -While fastText is running, the progress and estimated time to completion is shown on your screen. Once the program finishes, there should be two files in the result directory: - -``` -$ ls -l result --rw-r-r-- 1 bojanowski 1876110778 978480850 Dec 20 11:01 fil9.bin --rw-r-r-- 1 bojanowski 1876110778 190004182 Dec 20 11:01 fil9.vec -``` - -The `fil9.bin` file is a binary file that stores the whole fastText model and can be subsequently loaded. The `fil9.vec` file is a text file that contains the word vectors, one per line for each word in the vocabulary: - -``` -$ head -n 4 result/fil9.vec -218316 100 -the -0.10363 -0.063669 0.032436 -0.040798 0.53749 0.00097867 0.10083 0.24829 ... -of -0.0083724 0.0059414 -0.046618 -0.072735 0.83007 0.038895 -0.13634 0.60063 ... -one 0.32731 0.044409 -0.46484 0.14716 0.7431 0.24684 -0.11301 0.51721 0.73262 ... -``` - -The first line is a header containing the number of words and the dimensionality of the vectors. The subsequent lines are the word vectors for all words in the vocabulary, sorted by decreasing frequency. - -## Advanced readers: skipgram versus cbow - -fastText provides two models for computing word representations: skipgram and cbow ('**c**ontinuous-**b**ag-**o**f-**w**ords'). - -The skipgram model learns to predict a target word thanks to a nearby word. On the other hand, the cbow model predicts the target word according to its context. The context is represented as a bag of the words contained in a fixed size window around the target word. - -Let us illustrate this difference with an example: given the sentence *'Poets have been mysteriously silent on the subject of cheese'* and the target word '*silent*', a skipgram model tries to predict the target using a random close-by word, like '*subject' *or* '*mysteriously*'**. *The cbow model takes all the words in a surrounding window, like {*been, *mysteriously*, on, the*}, and uses the sum of their vectors to predict the target. The figure below summarizes this difference with another example. - -![cbow vs skipgram](https://github.com/facebookresearch/fastText/blob/master/tutorials/cbo_vs_skipgram.png) -To train a cbow model with fastText, you run the following command: - -``` -./fasttext cbow -input data/fil9 -output result/fil9 -``` - - -In practice, we observe that skipgram models works better with subword information than cbow. - -## Advanced readers: playing with the parameters - -So far, we run fastText with the default parameters, but depending on the data, these parameters may not be optimal. Let us give an introduction to some of the key parameters for word vectors. - -The most important parameters of the model are its dimension and the range of size for the subwords. The dimension (*dim*) controls the size of the vectors, the larger they are the more information they can capture but requires more data to be learned. But, if they are too large, they are harder and slower to train. By default, we use 100 dimensions, but any value in the 100-300 range is as popular. 
The subwords are all the substrings contained in a word between the minimum size (*minn*) and the maximal size (*maxn*). By default, we take all the subword between 3 and 6 characters, but other range could be more appropriate to different languages: - -``` -$ ./fasttext skipgram -input data/fil9 -output result/fil9 -minn 2 -maxn 5 -dim 300 -``` - -Depending on the quantity of data you have, you may want to change the parameters of the training. The *epoch* parameter controls how many time will loop over your data. By default, we loop over the dataset 5 times. If you dataset is extremely massive, you may want to loop over it less often. Another important parameter is the learning rate -*lr*). The higher the learning rate is, the faster the model converge to a solution but at the risk of overfitting to the dataset. The default value is 0.05 which is a good compromise. If you want to play with it we suggest to stay in the range of [0.01, 1]: - -``` -$ ./fasttext skipgram -input data/fil9 -output result/fil9 -epoch 1 -lr 0.5 -``` - -Finally , fastText is multi-threaded and uses 12 threads by default. If you have less CPU cores (say 4), you can easily set the number of threads using the *thread* flag: - -``` -$ ./fasttext skipgram -input data/fil9 -output result/fil9 -thread 4 -``` - - - -# Printing word vectors - -Searching and printing word vectors directly from the `fil9.vec` file is cumbersome. Fortunately, there is a `print-word-vectors` functionality in fastText. - -For examples, we can print the word vectors of words *asparagus,* *pidgey* and *yellow* with the following command: - -``` -$ echo "asparagus pidgey yellow" | ./fasttext print-word-vectors result/fil9.bin -asparagus 0.46826 -0.20187 -0.29122 -0.17918 0.31289 -0.31679 0.17828 -0.04418 ... -pidgey -0.16065 -0.45867 0.10565 0.036952 -0.11482 0.030053 0.12115 0.39725 ... -yellow -0.39965 -0.41068 0.067086 -0.034611 0.15246 -0.12208 -0.040719 -0.30155 ... -``` - -A nice feature is that you can also query for words that did not appear in your data! Indeed words are represented by the sum of its substrings. As long as the unknown word is made of known substrings, there is a representation of it! - -As an example let's try with a misspelled word: - -``` -$ echo "enviroment" | ./fasttext print-word-vectors result/fil9.bin -``` - -You still get a word vector for it! But how good it is? Let s find out in the next sections! - - -# Nearest neighbor queries - -A simple way to check the quality of a word vector is to look at its nearest neighbors. This give an intuition of the type of semantic information the vectors are able to capture. - -This can be achieve with the *nn *functionality. For example, we can query the 10 nearest neighbors of a word by running the following command: - -``` -$ ./fasttext nn result/fil9.bin -Pre-computing word vectors... done. -``` - - - Then we are prompted to type our query word, let us try *asparagus* : - -``` -Query word? asparagus -beetroot 0.812384 -tomato 0.806688 -horseradish 0.805928 -spinach 0.801483 -licorice 0.791697 -lingonberries 0.781507 -asparagales 0.780756 -lingonberry 0.778534 -celery 0.774529 -beets 0.773984 -``` - -Nice! It seems that vegetable vectors are similar. Note that the nearest neighbor is the word *asparagus* itself, this means that this word appeared in the dataset. What about pokemons? - -``` -Query word? 
pidgey -pidgeot 0.891801 -pidgeotto 0.885109 -pidge 0.884739 -pidgeon 0.787351 -pok 0.781068 -pikachu 0.758688 -charizard 0.749403 -squirtle 0.742582 -beedrill 0.741579 -charmeleon 0.733625 -``` - -Different evolution of the same Pokemon have close-by vectors! But what about our misspelled word, is its vector close to anything reasonable? Let s find out: - -``` -Query word? enviroment -enviromental 0.907951 -environ 0.87146 -enviro 0.855381 -environs 0.803349 -environnement 0.772682 -enviromission 0.761168 -realclimate 0.716746 -environment 0.702706 -acclimatation 0.697196 -ecotourism 0.697081 -``` - -Thanks to the information contained within the word, the vector of our misspelled word matches to reasonable words! It is not perfect but the main information has been captured. - -## Advanced reader: measure of similarity - -In order to find nearest neighbors, we need to compute a similarity score between words. Our words are represented by continuous word vectors and we can thus apply simple similarities to them. In particular we use the cosine of the angles between two vectors. This similarity is computed for all words in the vocabulary, and the 10 most similar words are shown. Of course, if the word appears in the vocabulary, it will appear on top, with a similarity of 1. - -# Word analogies - -In a similar spirit, one can play around with word analogies. For example, we can see if our model can guess what is to France, what Berlin is to Germany. - -This can be done with the *analogies *functionality. It takes a word triplet (like *Germany Berlin France*) and outputs the analogy: - -``` -$ ./fasttext analogies result/fil9.bin -Pre-computing word vectors... done. -Query triplet (A - B + C)? berlin germany france -paris 0.896462 -bourges 0.768954 -louveciennes 0.765569 -toulouse 0.761916 -valenciennes 0.760251 -montpellier 0.752747 -strasbourg 0.744487 -meudon 0.74143 -bordeaux 0.740635 -pigneaux 0.736122 -``` - -The answer provides by our model is *Paris*, which is correct. Let's have a look at a less obvious example: - -``` -Query triplet (A - B + C)? psx sony nintendo -gamecube 0.803352 -nintendogs 0.792646 -playstation 0.77344 -sega 0.772165 -gameboy 0.767959 -arcade 0.754774 -playstationjapan 0.753473 -gba 0.752909 -dreamcast 0.74907 -famicom 0.745298 -``` - -Our model considers that the *nintendo* analogy of a *psx* is the *gamecube*, which seems reasonable. Of course the quality of the analogies depend on the dataset used to train the model and one can only hope to cover fields only in the dataset. - - -# Importance of character n-grams - -Using subword-level information is particularly interesting to build vectors for unknown words. For example, the word *gearshift* does not exist on Wikipedia but we can still query its closest existing words: - -``` -Query word? gearshift -gearing 0.790762 -flywheels 0.779804 -flywheel 0.777859 -gears 0.776133 -driveshafts 0.756345 -driveshaft 0.755679 -daisywheel 0.749998 -wheelsets 0.748578 -epicycles 0.744268 -gearboxes 0.73986 -``` - -Most of the retrieved words share substantial substrings but a few are actually quite different, like *cogwheel*. You can try other words like *sunbathe* or *grandnieces*. - -Now that we have seen the interest of subword information for unknown words, let s check how it compares to a model that do not use subword information. 
To train a model without no subwords, just run the following command: - -``` -$ ./fasttext skipgram -input data/fil9 -output result/fil9-none -maxn 0 -``` - -The results are saved in result/fil9-non.vec and result/fil9-non.bin. - -To illustrate the difference, let us take an uncommon word in Wikipedia, like *accomodation* which is a misspelling of *accommodation**.* Here is the nearest neighbors obtained without no subwords: - -``` -$ ./fasttext nn result/fil9-none.bin -Query word? accomodation -sunnhordland 0.775057 -accomodations 0.769206 -administrational 0.753011 -laponian 0.752274 -ammenities 0.750805 -dachas 0.75026 -vuosaari 0.74172 -hostelling 0.739995 -greenbelts 0.733975 -asserbo 0.732465 -``` - -The result does not make much sense, most of these words are unrelated. On the other hand, using subword information gives the following list of nearest neighbors: - -``` -Query word? accomodation -accomodations 0.96342 -accommodation 0.942124 -accommodations 0.915427 -accommodative 0.847751 -accommodating 0.794353 -accomodated 0.740381 -amenities 0.729746 -catering 0.725975 -accomodate 0.703177 -hospitality 0.701426 -``` - -The nearest neighbors capture different variation around the word *accommodation*. We also get semantically related words such as *amenities* or *lodging*. - -# Conclusion - -In this tutorial, we show how to obtain word vectors from Wikipedia. This can be done for any language and you can find pre-trained models with the default setting for 294 of them [here](https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md) diff --git a/website/blog/2017-05-02-blog-post.md b/website/blog/2017-05-02-blog-post.md index 0e2f9fa2b..736456afd 100755 --- a/website/blog/2017-05-02-blog-post.md +++ b/website/blog/2017-05-02-blog-post.md @@ -54,7 +54,7 @@ fastText is designed to be extremely fast. This guarantees the responsiveness th In the second tutorial, fastText is used to learn word representations from Wikipedia pages. The tutorial steps through simple ways to test the quality of a model. Queries return a word’s nearest neighbors or given a related pair example, analogies produce the most closely related words to a a queried word. For example, a model can predict that Paris is related to France in the same way as Berlin to Germany. Even words that the model has not been trained on can be tested! fastText looks at groups of characters that build-up the word to produce its representation to find likely candidates for misspelled words and made-up words like ”shiftgear.” -Students and developers interested in machine learning can get right to work with the newly released self-paced tutorials [available on Github](https://github.com/facebookresearch/fastText/tree/master/tutorials). The tutorials are straightforward and do not require advanced knowledge in machine learning. The tutorials also offer insights into other features of the fastText library for more advanced developers. +Students and developers interested in machine learning can get right to work with the newly released self-paced tutorials [available on our website](https://fasttext.cc/docs/en/supervised-tutorial.html). The tutorials are straightforward and do not require advanced knowledge in machine learning. The tutorials also offer insights into other features of the fastText library for more advanced developers. Use cases include experimentation, prototyping, and production. fastText can be used as a command line, linked to a C++ application, or used as a library. 
Community-contributed Python and Lua APIs are also available.
diff --git a/tutorials/cbo_vs_skipgram.png b/website/static/img/cbo_vs_skipgram.png
similarity index 100%
rename from tutorials/cbo_vs_skipgram.png
rename to website/static/img/cbo_vs_skipgram.png
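For readers who want to go straight from the tutorials to a script, here is a minimal end-to-end sketch of the supervised workflow, reusing only commands and settings that appear in the text classification tutorial above (adjust file names and hyperparameters for your own data):

```bash
#!/usr/bin/env bash
set -e

# Lower-case the text and separate punctuation from words, as in the tutorial.
cat cooking.stackexchange.txt | sed -e "s/\([.\!?,'/()]\)/ \1 /g" | tr "[:upper:]" "[:lower:]" > cooking.preprocessed.txt

# Split the labeled data into a training and a validation set.
head -n 12404 cooking.preprocessed.txt > cooking.train
tail -n 3000 cooking.preprocessed.txt > cooking.valid

# Train with the settings that worked well above, then evaluate P@1 and R@1.
./fasttext supervised -input cooking.train -output model_cooking -lr 1.0 -epoch 25 -wordNgrams 2
./fasttext test model_cooking.bin cooking.valid
```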