Automatic text summarization is the task of producing a text summary "from one or more texts, that conveys important information in the original text(s), and that is no longer than half of the original text(s) and usually, significantly less than that" (Radev and McKeown, 2002). We adapt a recent centroid-based text summarization model that exploits the compositionality of word embeddings to obtain a single vector representation of the most meaningful words in a given text. We propose using Latent Dirichlet Allocation (LDA), a probabilistic generative model for collections of discrete data, to better select the topic words of a document when constructing the centroid vector. The LDA implementation yields overall more coherent summaries, suggesting that topic models can improve upon the general centroid-based method.
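The core idea of the centroid-based method can be sketched as follows: sum the embeddings of a document's topic words into a centroid vector, then rank sentences by the cosine similarity between their own summed embeddings and that centroid. This is a minimal illustration only; the toy embedding values and vocabulary are invented, and the hand-picked topic words stand in for the paper's LDA-based (or tf-idf-based) topic-word selection over real word2vec vectors.

```python
import numpy as np

# Toy 3-d word embeddings (values invented for illustration; the actual
# model uses pretrained GoogleNews word2vec vectors).
EMB = {
    "storm":  np.array([0.9, 0.1, 0.0]),
    "damage": np.array([0.8, 0.2, 0.1]),
    "wind":   np.array([0.7, 0.0, 0.2]),
    "recipe": np.array([0.0, 0.9, 0.8]),
    "flour":  np.array([0.1, 0.8, 0.9]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def centroid_summary(sentences, topic_words, n=1):
    """Score each sentence by cosine similarity between its summed word
    embeddings and the centroid (sum of topic-word embeddings);
    return the n best-scoring sentences."""
    centroid = np.sum([EMB[w] for w in topic_words], axis=0)

    def sent_vec(s):
        vecs = [EMB[w] for w in s.lower().split() if w in EMB]
        return np.sum(vecs, axis=0) if vecs else np.zeros(3)

    ranked = sorted(sentences, key=lambda s: cosine(sent_vec(s), centroid),
                    reverse=True)
    return ranked[:n]

sents = ["the storm caused wind damage", "the recipe needs flour"]
print(centroid_summary(sents, ["storm", "wind"]))
# -> ['the storm caused wind damage']
```

In the full model, the topic words would instead be the high-probability words of the document's dominant LDA topics, which is the substitution this work proposes.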
- Paper: Centroid-based Text Summarization through Compositionality of Word Embeddings https://www.aclweb.org/anthology/W17-1003.pdf
- Repo: https://github.com/gaetangate/text-summarizer
- Download the GoogleNews word2vec vectors from https://github.com/mmihaltz/word2vec-GoogleNews-vectors and place them in the `data_clean` folder.
- Copy all directories from `duc2004\testdata\tasks1and2\t1.2\docs` to `data_raw/articles` (the DUC data is not distributed in this repo due to licensing restrictions).
- Move the files from `duc2004\results\ROUGE\eval\peers\2` to `data_raw/summaries`.
- Run `data_raw/import_corpus.py`.
- Copy `data_raw/corpus.pkl` to `cloned_summarizer/text_summarizer`.
- Models are available in `src`. Example experiments are available in `Evaluate_DUC.ipynb`.
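The manual copy/move steps above could also be scripted. A sketch, using the paths given in this README (the `stage` helper is hypothetical, not part of the repo):

```python
import shutil
from pathlib import Path

def stage(src_dir, dst_dir, move=False):
    """Copy (or move) every entry under src_dir into dst_dir,
    creating dst_dir if needed. Returns the number of entries staged."""
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    count = 0
    for item in src.iterdir():
        target = dst / item.name
        if move:
            shutil.move(str(item), str(target))
        elif item.is_dir():
            # dirs_exist_ok requires Python 3.8+
            shutil.copytree(item, target, dirs_exist_ok=True)
        else:
            shutil.copy2(item, target)
        count += 1
    return count

# Paths as listed in the steps above:
# stage(r"duc2004\testdata\tasks1and2\t1.2\docs", "data_raw/articles")
# stage(r"duc2004\results\ROUGE\eval\peers\2", "data_raw/summaries", move=True)
# then, after running data_raw/import_corpus.py:
# shutil.copy2("data_raw/corpus.pkl", "cloned_summarizer/text_summarizer/")
```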