Keyword extraction using TextRank

The aim of this project is to measure similarity between companies using their descriptions. Several hand-crafted methods have been used to create a similarity score, usually based on the number of keywords two descriptions have in common, sometimes with weights on the keywords. An example of a hand-crafted method is included in the "new_similarities.py" file.
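To make the keyword-overlap idea concrete, here is a minimal sketch of such a score. It is not the code in "new_similarities.py"; the function name `keyword_similarity`, the `weights` mapping, and the choice to normalize by the union of both keyword sets are assumptions made for illustration.

```python
def keyword_similarity(keywords_a, keywords_b, weights=None):
    """Score two keyword collections by their (optionally weighted) overlap.

    Hypothetical helper: shared keywords are summed by weight and divided by
    the total weight of all keywords seen in either description.
    """
    a, b = set(keywords_a), set(keywords_b)
    if weights is None:
        weights = {}
    # Each keyword contributes its weight, defaulting to 1.0 when unlisted.
    shared = sum(weights.get(k, 1.0) for k in a & b)
    total = sum(weights.get(k, 1.0) for k in a | b)
    return shared / total if total else 0.0


company_a = ["payments", "fintech", "mobile"]
company_b = ["fintech", "payments", "lending"]

# Unweighted: 2 shared keywords out of 4 distinct ones.
score = keyword_similarity(company_a, company_b)

# Weighted: "fintech" counts three times as much as the other keywords.
weighted_score = keyword_similarity(company_a, company_b, weights={"fintech": 3.0})
```

With no weights this reduces to the Jaccard index of the two keyword sets; raising the weight of a shared keyword pushes the score up.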

Another approach is to vectorize the company descriptions, so that similarity can be computed simply as the dot product of two vectors (the vectors are normalized). Several vectorization methods were tried; the two most successful were simple LSA (TF-IDF + SVD) and a custom method based on the TextRank algorithm.
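The following sketch illustrates why normalized vectors allow similarity to be a plain dot product. It builds TF-IDF vectors with the standard library only and deliberately omits the SVD step of LSA; a real pipeline would more likely use a library such as scikit-learn for both steps. The helper names `tfidf_vectors` and `dot`, the whitespace tokenizer, the smoothed IDF formula, and the sample descriptions are all assumptions.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one L2-normalized sparse TF-IDF vector (term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF keeps a nonzero weight for terms found in every document.
        vec = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else vec)
    return vectors


def dot(u, v):
    """Dot product of two sparse vectors; for unit vectors this is the cosine similarity."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(t, 0.0) for t, w in u.items())


descriptions = [
    "mobile payments platform for small businesses",
    "online payments platform for merchants",
    "genomics research tools for laboratories",
]
vecs = tfidf_vectors(descriptions)
sim_close = dot(vecs[0], vecs[1])  # two payments companies
sim_far = dot(vecs[0], vecs[2])    # payments vs. genomics
```

Because every vector has unit length, the dot product of a vector with itself is 1 and all scores are directly comparable across pairs of companies.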

Requirements

Python (3), with the packages specified in requirements.txt.

TextRank

The vectorization based on TextRank is slightly involved and therefore needs some explanation. First, a Word2Vec model is trained on a large corpus to obtain word vectors. Then the TextRank algorithm (a small modification of PageRank) is applied to each company description to extract the 10 most important words in that description. The Word2Vec vectors of these words are averaged (weighted or unweighted) to produce a single vector for each description.
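The steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the code in this repository: the co-occurrence window of 2, the damping factor of 0.85, the fixed iteration count, and the function names are illustrative choices, stop-word removal and other preprocessing are left out, and a small hand-written dictionary `toy_vectors` stands in for a trained Word2Vec model.

```python
from collections import defaultdict


def textrank_keywords(tokens, top_k=10, window=2, damping=0.85, iterations=50):
    """Rank words by running PageRank on an undirected co-occurrence graph."""
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        # Link each word to the words that follow it within the window.
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)
    if not neighbors:
        return []
    scores = {w: 1.0 for w in neighbors}
    for _ in range(iterations):
        # Each word receives an equal share of the score of every neighbor.
        scores = {
            w: (1 - damping)
            + damping * sum(scores[n] / len(neighbors[n]) for n in neighbors[w])
            for w in neighbors
        }
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return [(w, scores[w]) for w in ranked[:top_k]]


def description_vector(keywords, word_vectors, weighted=True):
    """Average the vectors of the keywords, optionally weighted by TextRank score."""
    pairs = [
        (word_vectors[w], score if weighted else 1.0)
        for w, score in keywords
        if w in word_vectors  # skip words missing from the vocabulary
    ]
    if not pairs:
        return None
    total = sum(weight for _, weight in pairs)
    dim = len(pairs[0][0])
    return [sum(vec[i] * weight for vec, weight in pairs) / total for i in range(dim)]


tokens = "payments platform mobile payments merchants online payments api".split()
keywords = textrank_keywords(tokens, top_k=3)

# Hypothetical 2-dimensional vectors standing in for Word2Vec output.
toy_vectors = {"payments": [1.0, 0.0], "mobile": [0.0, 1.0], "merchants": [0.5, 0.5]}
vector = description_vector(keywords, toy_vectors)
```

Passing `weighted=False` gives the plain average of the keyword vectors. The resulting description vector would still need to be scaled to unit length before the dot product is used as a similarity.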

Similarity is then computed as a dot product, as with the LSA vectors.
