The adaptive-softmax project is a Torch implementation of the efficient softmax approximation for graphics processing units (GPUs), described in the paper "Efficient softmax approximation for GPUs" (http://arxiv.org/abs/1609.04309).
This method is useful for training language models with large vocabularies. We provide a script to train large recurrent neural network language models, in order to reproduce the results of the paper.
This project depends on the following packages:
In order to train a recurrent neural network language model with default parameters, run

    th train_big_lstm.lua -data DATA_DIR

where `DATA_DIR` is a directory containing three text files, `train.txt`, `valid.txt` and `test.txt`.
In order to train a language model on PTB, run the command

    th train_big_lstm.lua -data PATH/TO/PTB -nhid 512 -isz 512 -dropout 0.5 -usecudnn -cutoff 2000
In order to train a language model on text8, run the command

    th train_big_lstm.lua -data PATH/TO/TEXT8 -nhid 512 -isz 512 -dropout 0.25 -batchsize 128 -usecudnn -cutoff 2000,10000
In order to train a language model on the billion word benchmark, run the command

    th train_big_lstm.lua -data PATH/TO/BILLION/WORD -nhid 2048 -isz 256 -dropout 0.01 -batchsize 128 -testbatchsize 128 -threshold 2 -usecudnn -cutoff 4000,40000,200000
We now briefly discuss how to use the adaptive softmax in your own projects.
We provide a Torch layer called `nn.AdaptiveSoftMax` and a corresponding criterion, called `nn.AdaptiveLoss`, which must be used when training with the adaptive softmax. The vocabulary must be sorted by decreasing frequency, so that frequent words correspond to small indices.
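For example, the word-to-index mapping could be built as in the following sketch, where `word_count` is an assumed Lua table mapping each word of the training corpus to its number of occurrences (this helper is not part of the repository):

    -- Sketch: assign indices by decreasing frequency, so the most frequent
    -- word gets index 1. `word_count` is an assumed { word = count } table.
    local words = {}
    for w, _ in pairs( word_count ) do
       table.insert( words, w )
    end
    table.sort( words, function( a, b ) return word_count[a] > word_count[b] end )

    local word2idx = {}
    for rank, w in ipairs( words ) do
       word2idx[w] = rank
    end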
The constructor of the `nn.AdaptiveSoftMax` layer takes two arguments: `hidden_size`, which is the size of the input of the adaptive softmax, and `cutoff`, which is a table indicating the limits of the different clusters. The constructor of the `nn.AdaptiveLoss` criterion takes the `cutoff` table as its only argument.
    local nword = 44372
    local hidden_size = 256
    local cutoff = { 2000, 10000, nword }
    local decoder = nn.AdaptiveSoftMax( hidden_size, cutoff )
    local criterion = nn.AdaptiveLoss( cutoff )
In the previous example, we created an adaptive softmax with three clusters. The first cluster contains the words from 1 to 2,000, the second cluster contains the words from 2,001 to 10,000, and the last cluster contains the words from 10,001 to `nword`.
The `forward` method of the adaptive softmax takes a 2D tensor as input, and outputs a table of 2D tensors of scores, one tensor per cluster. In order to be efficient, `nn.AdaptiveSoftMax` does not compute the scores for all the words of the vocabulary for all the examples. It is thus necessary to call the `setTarget` method of the `AdaptiveSoftMax` layer before each forward pass:

    decoder:setTarget( target )
where `target` is a 1D tensor of word indices. This ensures that the adaptive softmax will compute the scores for the corresponding targets. It is also possible to call the method `getLogProb`, which computes the log probabilities for all the words of the vocabulary, given a 2D tensor of hidden vectors.
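Putting these pieces together, a single training step could look like the sketch below. The batch size, tensor names and random data are assumptions made for illustration only; in practice the module, criterion and tensors would be moved to the GPU (e.g. with `:cuda()`), as done in `train_big_lstm.lua`.

    -- Sketch of one training step with the adaptive softmax (illustrative only).
    -- Assumes nn.AdaptiveSoftMax and nn.AdaptiveLoss have been loaded from this repository.
    local nword       = 44372
    local hidden_size = 256
    local cutoff      = { 2000, 10000, nword }

    local decoder   = nn.AdaptiveSoftMax( hidden_size, cutoff )
    local criterion = nn.AdaptiveLoss( cutoff )

    -- assumed batch: 32 hidden vectors and their target word indices
    local hidden = torch.randn( 32, hidden_size )
    local target = torch.LongTensor( 32 ):random( 1, nword )

    decoder:setTarget( target )                -- select which scores to compute
    local scores = decoder:forward( hidden )   -- table of 2D score tensors, one per cluster
    local loss   = criterion:forward( scores, target )

    -- backward pass
    local dscores = criterion:backward( scores, target )
    decoder:backward( hidden, dscores )

    -- full distribution, e.g. for evaluation: a 32 x nword tensor of log probabilities
    local logprob = decoder:getLogProb( hidden )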
See the CONTRIBUTING file for how to help out.
adaptive-softmax is BSD-licensed. We also provide an additional patent grant.