Made off-by-one adjustments for specials tokens #41
`preprocess.tokenize()` pads texts with -2 (the SKIP index), which puts SKIP in the corpus vocabulary and in `counts_loose`. `_loose_keys_ordered()` then prepends the special tokens (OOV and SKIP) while making `keys_loose`, thus allocating two array entries to SKIP (instead of 1 as desired, I assume).
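For illustration, here is a minimal Python sketch of that double allocation. The lists below are plain stand-ins for the corpus internals, not the actual lda2vec objects, and OOV = -1 is assumed alongside the SKIP = -2 mentioned above:

```python
from collections import Counter

OOV, SKIP = -1, -2

# Hypothetical stand-in for the padding done by preprocess.tokenize():
# short texts get padded with SKIP (-2), so SKIP enters the loose counts.
tokens = [5, 7, 5, SKIP, SKIP]
counts_loose = Counter(tokens)               # {5: 2, -2: 2, 7: 1} -- SKIP already present

# Hypothetical stand-in for _loose_keys_ordered(): the specials are prepended
# unconditionally, so SKIP gets a second entry on top of the one in counts_loose.
keys_loose = [OOV, SKIP] + sorted(counts_loose, key=counts_loose.get, reverse=True)

print(keys_loose)                            # [-1, -2, 5, -2, 7] -- SKIP appears twice
print(len(keys_loose), len(set(keys_loose))) # 5 4 -> off by one
```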
This becomes a problem when you try to train a model using all of the words in the vocabulary. In `lda2vec_run.py`, the assignment `model.sampler.W.data[:, :] = vectors[:n_vocab, :]` fails, since `W` is created with one more row than there are unique words + specials: `n_keys` is derived from the length of the concatenated array created in `_loose_keys_ordered()`, and not from the number of unique words in the vocabulary as counted by `counts_loose`.
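And a minimal sketch of the resulting shape mismatch, with NumPy arrays standing in for `W` and `vectors` (the counts and the 300-dimensional embedding are made up for illustration; only the one-row discrepancy matters):

```python
import numpy as np

n_unique_words = 4                      # unique loose keys in the corpus, SKIP included
n_specials = 2                          # OOV and SKIP prepended by _loose_keys_ordered()

# SKIP is counted once among the loose keys and again among the prepended specials,
# so the concatenated key array is one entry longer than the true vocabulary.
n_keys = n_unique_words + n_specials    # 6 rows allocated for W
n_vocab = n_keys - 1                    # 5 = actual number of distinct tokens

W = np.zeros((n_keys, 300))             # embedding matrix sized from n_keys
vectors = np.random.rand(n_vocab, 300)  # pretrained vectors, one per distinct token

# ValueError: could not broadcast input array from shape (5,300) into shape (6,300)
W[:, :] = vectors[:n_vocab, :]
```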