498KeywordExtraction

Usage

To run this code, run "python KeywordExtraction.py <dataset-path> <stoplist-path>"

About

This repo was created by Andrew Chen, Brian Lim, Chris Raveendra, and Matt Kim as a final project for the Natural Language Processing class at the University of Michigan.

The code extracts keywords, the words that best capture the main idea, from input text.

Keyword extraction has valuable uses in text mining and information retrieval; it can also be used to tag articles so that everyday readers can find documents relevant to their interests.

Methodology

Our project combines methods from several research papers with our own additions. They are as follows:

RAKE

As described in the Rose et al. paper here, the RAKE algorithm separates a text at its stopwords. We use the resulting phrases as keyword candidates for further processing.
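The splitting step can be sketched as follows. This is a minimal illustration, not the code in rake.py: the function name `candidate_phrases` is hypothetical, and treating punctuation as an additional phrase boundary follows the RAKE paper rather than anything stated in this README.

```python
import re

def candidate_phrases(text, stopwords):
    """Split text into candidate phrases at stopwords and punctuation."""
    stop = {w.lower() for w in stopwords}
    phrases = []
    # Assumption: punctuation also ends a phrase, so split on it first.
    for fragment in re.split(r"[^\w\s'-]+", text.lower()):
        current = []
        for word in fragment.split():
            if word in stop:
                # A stopword closes the phrase collected so far.
                if current:
                    phrases.append(" ".join(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(" ".join(current))
    return phrases
```

For example, with the stopwords "is", "a", and "in", the sentence "Keyword extraction is a task in natural language processing." yields the candidates "keyword extraction", "task", and "natural language processing".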

Part of Speech Tagging

Based on the observation that keywords usually contain only nouns, verbs, and adjectives, we tag each word in the keyword candidates with its part of speech using a Naive Bayes algorithm and discard any candidate that contains a word that is not a noun, verb, or adjective.
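The filtering rule can be sketched as below. The tagger itself is left out: `tag_of` is a hypothetical callable standing in for the Naive Bayes tagger, and the Penn Treebank-style tag prefixes (NN, VB, JJ) are an assumption, since this README does not say which tag set the training data uses.

```python
# Assumed Penn Treebank-style prefixes for nouns, verbs, and adjectives.
ALLOWED_PREFIXES = ("NN", "VB", "JJ")

def keep_candidate(phrase, tag_of):
    """Return True if every word in the phrase is a noun, verb, or adjective."""
    return all(tag_of(word).startswith(ALLOWED_PREFIXES) for word in phrase.split())

def filter_candidates(phrases, tag_of):
    """Drop candidates that contain any word with a disallowed tag."""
    return [p for p in phrases if keep_candidate(p, tag_of)]

# Toy stand-in for the tagger, for illustration only.
toy_tags = {"keyword": "NN", "extraction": "NN", "runs": "VBZ", "quickly": "RB", "fast": "JJ"}
toy_tagger = lambda word: toy_tags.get(word, "UNK")

kept = filter_candidates(["keyword extraction", "runs quickly", "fast"], toy_tagger)
```

Here "runs quickly" is discarded because "quickly" is tagged as an adverb.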

Scoring Keywords

We give scores to the keywords by adding the results of the following two algorithms:

Co-Occurrence Graph

We then score these filtered keywords using a TextRank algorithm, described in the Mihalcea and Tarau paper here. TextRank builds a graph with the words as vertices and the co-occurrence relations between them as edges. Words with more connections receive higher scores.
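A minimal sketch of the idea, assuming an undirected, unweighted graph in which two words are linked when they appear within a small window of each other. The window size, damping factor, and iteration count are illustrative defaults rather than values taken from CoOccurrence.py, and the sketch stops at per-word scores because this README does not say how they are combined for multi-word candidates.

```python
from collections import defaultdict

def build_graph(words, window=2):
    """Link each word to the next `window` words that follow it."""
    graph = defaultdict(set)
    for i, word in enumerate(words):
        for other in words[i + 1 : i + 1 + window]:
            if other != word:
                graph[word].add(other)
                graph[other].add(word)
    return graph

def textrank(graph, damping=0.85, iterations=50):
    """Run a fixed number of PageRank-style updates over the graph."""
    scores = {node: 1.0 for node in graph}
    for _ in range(iterations):
        scores = {
            node: (1 - damping)
            + damping * sum(scores[n] / len(graph[n]) for n in graph[node])
            for node in graph
        }
    return scores

words = "hub x hub y hub z".split()
scores = textrank(build_graph(words, window=1))
```

In this toy input, "hub" co-occurs with three different words while every other word co-occurs only with "hub", so "hub" ends up with the highest score.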

Forward Frequency

We calculate what we call the "forward frequency" for each keyword candidate. It is inspired by the TF-IDF scores used in information retrieval: it is the product of the keyword's frequency in the text and the number of paragraphs that contain it. Keywords that appear in more paragraphs are more likely to reflect the main idea of the whole paper.
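The product described above can be sketched as follows. This is an illustration under stated assumptions, not the code in forward_frequency.py: occurrences are counted by case-insensitive substring matching, which would also match a candidate inside a longer word, and paragraphs are assumed to be already split into a list of strings.

```python
def forward_frequency(candidate, paragraphs):
    """Total occurrences of the candidate times the number of paragraphs containing it."""
    needle = candidate.lower()
    counts = [paragraph.lower().count(needle) for paragraph in paragraphs]
    frequency = sum(counts)
    paragraphs_containing = sum(1 for c in counts if c > 0)
    return frequency * paragraphs_containing

paragraphs = [
    "graph ranking uses a graph",
    "a graph of words",
    "no match here",
]
score = forward_frequency("graph", paragraphs)  # 3 occurrences * 2 paragraphs = 6
```

A candidate that occurs three times in a single paragraph would score 3, while one that occurs three times across two paragraphs scores 6, which is how the measure favors terms spread through the document.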

We then pick the 5 keywords that have the highest scores as our observed results.
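The final selection can be sketched as a sum followed by a sort. The function name `top_keywords` and the alphabetical tie-breaking are assumptions for illustration. The two scores are on different scales, and this README does not say whether they are normalized before being added, so the sketch adds them as they are.

```python
def top_keywords(graph_scores, frequency_scores, k=5):
    """Add the two scores per candidate and return the k highest-scoring candidates."""
    candidates = set(graph_scores) | set(frequency_scores)
    combined = {
        c: graph_scores.get(c, 0.0) + frequency_scores.get(c, 0.0)
        for c in candidates
    }
    # Sort by descending score; break ties alphabetically for a stable result.
    return sorted(combined, key=lambda c: (-combined[c], c))[:k]

graph_scores = {"graph": 1.5, "ranking": 0.5, "words": 0.5}
frequency_scores = {"graph": 6, "ranking": 1, "match": 1}
best = top_keywords(graph_scores, frequency_scores, k=2)
```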

