Document Similarity using Word2Vec

Calculate the similarity between documents using a pre-trained word2vec model.

Usage

Load a pre-trained word2vec model. Note: If you don't have one, you can use Google's pre-trained word2vec model.

from gensim.models.keyedvectors import KeyedVectors
model_path = './data/GoogleNews-vectors-negative300.bin'
w2v_model = KeyedVectors.load_word2vec_format(model_path, binary=True)

Once the model is loaded, it can be passed to the DocSim class to calculate document similarities.
```
from DocSim import DocSim
ds = DocSim(w2v_model)
```

Calculate the similarity score between a source document and a list of target documents.

source_doc = 'how to delete an invoice'
target_docs = ['delete a invoice', 'how do i remove an invoice', 'purge an invoice']

# Returns all 3 target docs, each with its similarity score
sim_scores = ds.calculate_similarity(source_doc, target_docs)

print(sim_scores)

Output is as follows:

  [ {'score': 0.99999994, 'doc': 'delete a invoice'}, 
  {'score': 0.79869318, 'doc': 'how do i remove an invoice'}, 
  {'score': 0.71488398, 'doc': 'purge an invoice'} ]
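
This README does not say how the scores are computed. A common approach with word2vec, and only an assumption about what DocSim does internally, is to average the word vectors of each document and compare the averages with cosine similarity. The sketch below illustrates that idea with made-up 3-dimensional vectors; `toy_vectors`, `doc_vector` and `cosine_similarity` are hypothetical names, not part of DocSim.

```python
import numpy as np

# Toy 3-dimensional embeddings, invented for illustration only.
# A real word2vec model (e.g. GoogleNews) uses 300 dimensions.
toy_vectors = {
    "delete": np.array([0.9, 0.1, 0.0]),
    "remove": np.array([0.8, 0.2, 0.1]),
    "invoice": np.array([0.1, 0.9, 0.2]),
    "weather": np.array([0.0, 0.1, 0.9]),
}


def doc_vector(doc, vectors):
    """Average the vectors of the words in `doc` that have an embedding."""
    words = [w for w in doc.lower().split() if w in vectors]
    if not words:
        return None  # no known words, so no vector can be built
    return np.mean([vectors[w] for w in words], axis=0)


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom else 0.0


source = doc_vector("delete invoice", toy_vectors)
for target in ["remove invoice", "weather"]:
    score = cosine_similarity(source, doc_vector(target, toy_vectors))
    print(target, round(score, 3))
```

With these toy vectors, 'remove invoice' scores much closer to the source than 'weather', which mirrors the ranking behaviour shown in the output above.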

Note: You can optionally pass a threshold argument to the calculate_similarity() method to return only the target documents whose similarity score is above the threshold.
```
sim_scores = ds.calculate_similarity(source_doc, target_docs, threshold=0.7)
```
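
To make the effect of the threshold concrete, here is a minimal sketch of the filtering step applied to the example scores shown earlier. `filter_by_threshold` is a hypothetical helper, not part of DocSim, and it assumes a strict "greater than" comparison and a descending sort; the library may handle ties and ordering differently.

```python
def filter_by_threshold(scores, threshold=0.0):
    """Keep entries scoring strictly above `threshold`, highest score first."""
    kept = [s for s in scores if s["score"] > threshold]
    return sorted(kept, key=lambda s: s["score"], reverse=True)


# Scores copied from the example output in this README.
sim_scores = [
    {"score": 0.99999994, "doc": "delete a invoice"},
    {"score": 0.79869318, "doc": "how do i remove an invoice"},
    {"score": 0.71488398, "doc": "purge an invoice"},
]

# With a threshold of 0.75, 'purge an invoice' is dropped.
print(filter_by_threshold(sim_scores, threshold=0.75))
```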

Requirements

Python 3 only
gensim : to load the word2vec model
numpy : to calculate similarity scores

License

The MIT License
