
Made off-by-one adjustments for special tokens #41

Open · wants to merge 1 commit into master

Conversation

agtsai-i

preprocess.tokenize() pads texts with -2 (the SKIP index), which puts SKIP into the corpus vocabulary and into counts_loose.
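
To illustrate, here's a minimal sketch of the effect (not the library's actual code): SKIP = -2 comes from the description above, while the token ids and max_length are made up for illustration.

```python
from collections import Counter

SKIP = -2
max_length = 6

doc_tokens = [101, 57, 33]  # token ids for a short document (made up)
padded = doc_tokens + [SKIP] * (max_length - len(doc_tokens))

counts_loose = Counter(padded)  # stand-in for the corpus' loose counts
print(counts_loose)  # Counter({-2: 3, 101: 1, 57: 1, 33: 1}) -> SKIP now counted as a word
```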

_loose_keys_ordered() then prepends the special tokens (OOV and SKIP) when building keys_loose, thus allocating two array entries to SKIP (instead of one, as I assume was intended).
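
A hypothetical sketch of that double allocation, reusing the toy counts from above (OOV = -1 and SKIP = -2 are assumed values for the specials):

```python
from collections import Counter

OOV, SKIP = -1, -2
counts_loose = Counter({101: 1, 57: 1, 33: 1, SKIP: 3})  # SKIP already present via padding

# Specials are prepended even though SKIP is already a loose key:
keys_loose = [OOV, SKIP] + sorted(counts_loose, key=counts_loose.get, reverse=True)
print(keys_loose)              # [-1, -2, -2, 101, 57, 33]
print(keys_loose.count(SKIP))  # 2 -> two entries allocated to SKIP instead of one
```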

This becomes a problem when you try to train a model using all of the words in the vocabulary. In lda2vec_run.py,

```python
model.sampler.W.data[:, :] = vectors[:n_vocab, :]
```

fails because W is created with one more row than there are unique words + specials: n_keys is derived from the length of the concatenated array built in _loose_keys_ordered(), not from the number of unique words in the vocabulary tracked by counts_loose.
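
Here's a toy illustration of the resulting shape mismatch (the vocabulary sizes and the 300-dim embedding width are made up, not taken from the repo):

```python
import numpy as np

n_unique_words = 3                         # 101, 57, 33
n_specials = 2                             # OOV, SKIP
n_vocab = n_unique_words + n_specials      # 5
n_keys = n_vocab + 1                       # 6: SKIP allocated twice

W = np.zeros((n_keys, 300), dtype=np.float32)         # embedding matrix, one extra row
vectors = np.zeros((n_vocab, 300), dtype=np.float32)  # pretrained word vectors

try:
    W[:, :] = vectors[:n_vocab, :]
except ValueError as e:
    print(e)  # could not broadcast input array from shape (5,300) into shape (6,300)
```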
