
adding spacy-universal-sentence-encoder #5534

Merged
merged 3 commits into explosion:master on Jun 8, 2020

Conversation

MartinoMensio
Contributor

Description

This PR adds to the spaCy Universe the wrapper I created for using the Universal Sentence Encoder, hosted on TensorFlow Hub (https://tfhub.dev/google/collections/universal-sentence-encoder/1), inside spaCy.

It uses pipeline components to replace the vectors of documents, spans, and tokens with a hook that computes the vector from the TensorFlow Hub model.
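As a rough illustration of that hook mechanism (a simplified sketch, not the wrapper's actual code: the `embed` function below is only a placeholder for the TensorFlow Hub model), a spaCy 2.x-style component can override the vector hooks like this:

```python
import numpy
import spacy

def use_vector_hooks(doc):
    def embed(text):
        # placeholder: the real wrapper would run the Universal Sentence
        # Encoder from TF Hub on `text` and return its 512-d embedding
        return numpy.ones(512, dtype="float32")

    # route Doc/Span/Token .vector through the embedding function
    doc.user_hooks["vector"] = lambda d: embed(d.text)
    doc.user_span_hooks["vector"] = lambda span: embed(span.text)
    doc.user_token_hooks["vector"] = lambda token: embed(token.text)
    return doc

nlp = spacy.blank("en")
nlp.add_pipe(use_vector_hooks)  # spaCy 2.x-style component registration
doc = nlp("An example sentence")
print(doc.vector.shape)  # (512,)
```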

For more details, see https://github.com/MartinoMensio/spacy-universal-sentence-encoder-tfhub/
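For context, intended usage looks roughly like this; the `load_model` helper and the `en_use_md` model name are assumed to match the project's README at the time and may have changed since:

```python
import spacy_universal_sentence_encoder

# Load a pipeline whose Doc/Span/Token vectors come from the
# Universal Sentence Encoder on TensorFlow Hub.
nlp = spacy_universal_sentence_encoder.load_model('en_use_md')

doc_1 = nlp("The sky above the port was grey.")
doc_2 = nlp("The sky over the harbour looked grey.")

# similarity() uses the hooked USE vectors rather than spaCy's defaults
print(doc_1.similarity(doc_2))
```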

Types of change

Checklist

  • I have submitted the spaCy Contributor Agreement.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

@svlandeg svlandeg added the docs Documentation and website label Jun 2, 2020
@adrianeboyd
Contributor

Thanks for this contribution!

@adrianeboyd adrianeboyd merged commit de00f96 into explosion:master Jun 8, 2020
adrianeboyd pushed a commit that referenced this pull request Jun 8, 2020
* adding spacy-universal-sentence-encoder

* update affiliation

* updated code example
@ruidaiphd

ruidaiphd commented Aug 2, 2020

Hello Martino,

I really like the idea of combining spaCy with Google's models, thanks! I am running into a seemingly random deadlock problem. I am an amateur developer and not very sure about the posting rules here, so I hope it doesn't matter much that I also posted the same question on GitHub.

I am basically doing an n*(n-1)/2 comparison of 50K papers by their titles and abstracts on a server with 32 cores. I tried it single-threaded (without Pool) and had no problems over a few tens of thousands of comparisons, but the deadlock happens almost right away when I run the following code. Do you have any suggestions? BTW, I also tried processes=1, and it does not work either.

from multiprocessing import Pool

from tqdm import tqdm

def simCount(row):
    # row holds two paper records; compute the similarity of their texts
    return [row[0], row[3], row[2], row[5], nlp(row[1]).similarity(nlp(row[4]))]

with Pool(processes=25) as p:
    with tqdm(total=count, desc='Testing') as pbar:
        for idx_left, row_left in _sim_tst.iterrows():
            # ... some pandas frame arrangement ...
            for simscore in p.imap_unordered(simCount, _4sim.values.tolist()):
                ssrn_simscore.append(simscore)
                pbar.update()

Many thanks!

@MartinoMensio
Contributor Author

Hi @ray4wit,

Since this issue relates to a problem with serialisation of Doc extension attributes, which is specific to the added project, I would suggest keeping the discussion here: MartinoMensio/spacy-universal-sentence-encoder#6

Martino
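As a general note for anyone hitting the same deadlock: when a spaCy pipeline wraps a TensorFlow Hub model, a common workaround is to load the pipeline inside each worker process rather than sharing the parent process's nlp object across a fork. A minimal sketch, assuming the `load_model` helper and the `en_use_md` model name from the project's README (the thread's actual resolution lives in the issue linked above):

```python
from multiprocessing import Pool

import spacy_universal_sentence_encoder  # assumed package name

nlp = None  # one pipeline per worker process, created in the initializer

def init_worker():
    # Load the TF Hub-backed pipeline inside the worker so the TensorFlow
    # state is never shared across a fork from the parent process.
    global nlp
    nlp = spacy_universal_sentence_encoder.load_model('en_use_md')

def sim_count(row):
    # row layout mirrors the snippet above: the texts sit at indices 1 and 4
    return [row[0], row[3], row[2], row[5],
            nlp(row[1]).similarity(nlp(row[4]))]

if __name__ == '__main__':
    rows = [
        ['id1', 'Deep learning for text', 'x',
         'id2', 'Neural models of language', 'y'],
    ]
    with Pool(processes=4, initializer=init_worker) as p:
        scores = list(p.imap_unordered(sim_count, rows))
    print(scores)
```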
