ConceptNetNumberbatch word embeddings support #14
base: master
Conversation
Thanks for this. I think we might want to add more smarts to the return type; I do not like the current one. It is also clear to me that as we add more embedding types, we will want a more generic API.
Thanks for the feedback. A few remarks:
I am not sure what you mean, the tests delete their downloads automatically.
Yes, ideally, we would just test on mini-datasets.
On many unix-like systems …
@zgornel Thanks! I'll start taking a look at this too.
Can you clarify what you mean by this? My understanding is that while it's quite possible to use the fasttext library to train a classifier for a language identification task (like they show here), the pretrained fasttext embeddings themselves are all monolingual, i.e. each language is trained separately and the embedding space is not shared among languages, with any OOV interpolation also being language-specific, as it is computed from subword character n-grams. Maybe I'm missing your point, though.
I agree. But to me, it seems like this is a separate feature that this PR doesn't (necessarily) depend on. @oxinabox do you have anything specific in mind already? Otherwise, maybe we should open another issue to discuss what a generic API might look like.
That is my understanding too.
src/conceptnet.jl
Outdated
```julia
cnt = 0
indices = Int[]
for (index, row) in enumerate(data)
    word, embedding = _parseline(row)
```
I think you can probably get a small efficiency gain here if you wait to actually parse the rest of the line as floats until you know that you are looking at a "keep word" (inside the `if`).
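A minimal sketch of the suggestion, assuming a toy "word v1 v2 …" line format; `parse_kept_lines` and the `keep` predicate are illustrative names, not the package's actual API:

```julia
# Split off the word first, and only parse the floats for lines whose word
# we actually keep. Skipping is then just string work, no float parsing.
function parse_kept_lines(lines, keep)
    embeddings = Dict{String, Vector{Float64}}()
    for line in lines
        parts = split(line)
        word = String(parts[1])
        keep(word) || continue                      # cheap skip: no float parsing yet
        embeddings[word] = parse.(Float64, parts[2:end])
    end
    return embeddings
end

lines = ["cat 0.1 0.2", "dog 0.3 0.4", "zebra 0.5 0.6"]
emb = parse_kept_lines(lines, w -> w != "zebra")
```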
Yes, thx!
src/conceptnet.jl
Outdated
```julia
open(file, "r") do fid
    vocab_size, vector_size = map(x->parse(Int,x), split(readline(fid)))
    max_stored_vocab_size = _get_vocab_size(vocab_size, max_vocab_size)
    data = readlines(fid)
```
`readlines` loads the whole file into memory. I think it would be better to remove this line and iterate through the file with `enumerate(eachline(fid))` instead.
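A minimal sketch of that streaming pattern, with a made-up file and a hard-coded cap of 2 standing in for `max_stored_vocab_size`:

```julia
# Write a tiny "<vocab> <dim>" header plus three embedding lines to a temp
# file, then stream it back with eachline instead of readlines.
path, io = mktemp()
write(io, "3 2\ncat 0.1 0.2\ndog 0.3 0.4\nzebra 0.5 0.6\n")
close(io)

words = String[]
open(path, "r") do fid
    vocab_size, vector_size = parse.(Int, split(readline(fid)))
    for (index, line) in enumerate(eachline(fid))   # lazy: one line at a time
        index > 2 && break                          # stop at the stored-vocab cap
        push!(words, String(first(split(line))))
    end
end
```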
Indeed, will do ;)
That's it, I was referring to the pretrained model which can be downloaded here. Since the multilingual ConceptNet file uses word keys of the form `/c/<language>/word`, language identification is relevant here. @oxinabox I was not aware that Languages.jl has language identification, that's great.
This pull adds support for ConceptNetNumberbatch. Three distinct file formats are available and supported:
- a `.txt` file, word and embeddings on each line
- a compressed `.txt` file, word and embeddings on each line
- an HDF5 file, embeddings stored as `Vector{Int8}`
ConceptNet word keys for the multilingual datasets are of the form `/c/<language>/word`, which makes direct access a bit unwieldy: searching, for example, for `word` fails. Misspellings, i.e. `word.`, `wordd`, fail as well. A more heuristic method of retrieving the best match would be advised at this point :)