Made off-by-one adjustments for specials tokens #41
`preprocess.tokenize()` pads texts with -2 (the SKIP index), which puts SKIP in the corpus vocabulary and in `counts_loose`. `_loose_keys_ordered()` then prepends the special tokens (OOV and SKIP) while making `keys_loose`, thus allocating two array entries to SKIP (instead of 1 as desired, I assume).
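For illustration, here is a minimal Python sketch of that double allocation. The lists below are plain stand-ins for the corpus internals, not the actual lda2vec objects, and OOV = -1 is assumed alongside the SKIP = -2 mentioned above:

```python
from collections import Counter

OOV, SKIP = -1, -2

# Hypothetical stand-in for the padding done by preprocess.tokenize():
# short texts get padded with SKIP (-2), so SKIP enters the loose counts.
tokens = [5, 7, 5, SKIP, SKIP]
counts_loose = Counter(tokens)               # {5: 2, -2: 2, 7: 1} -- SKIP already present

# Hypothetical stand-in for _loose_keys_ordered(): the specials are prepended
# unconditionally, so SKIP gets a second entry on top of the one in counts_loose.
keys_loose = [OOV, SKIP] + sorted(counts_loose, key=counts_loose.get, reverse=True)

print(keys_loose)                            # [-1, -2, 5, -2, 7] -- SKIP appears twice
print(len(keys_loose), len(set(keys_loose))) # 5 4 -> off by one
```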
This becomes a problem when you try to train a model using all of the words in the vocabulary. In `lda2vec_run.py`, the assignment `model.sampler.W.data[:, :] = vectors[:n_vocab, :]` fails, since `W` is created with one more row than there are unique words + specials: `n_keys` is derived from the length of the concatenated array created in `_loose_keys_ordered()`, and not from the number of unique words in the vocabulary as counted by `counts_loose`.
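And a minimal sketch of the resulting shape mismatch, with NumPy arrays standing in for `W` and `vectors` (the counts and the 300-dimensional embedding are made up for illustration; only the one-row discrepancy matters):

```python
import numpy as np

n_unique_words = 4                      # unique loose keys in the corpus, SKIP included
n_specials = 2                          # OOV and SKIP prepended by _loose_keys_ordered()

# SKIP is counted once among the loose keys and again among the prepended specials,
# so the concatenated key array is one entry longer than the true vocabulary.
n_keys = n_unique_words + n_specials    # 6 rows allocated for W
n_vocab = n_keys - 1                    # 5 = actual number of distinct tokens

W = np.zeros((n_keys, 300))             # embedding matrix sized from n_keys
vectors = np.random.rand(n_vocab, 300)  # pretrained vectors, one per distinct token

# ValueError: could not broadcast input array from shape (5,300) into shape (6,300)
W[:, :] = vectors[:n_vocab, :]
```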