Automatic text summarization is the task of producing a text summary "from one or more texts, that conveys important information in the original text(s), and that is no longer than half of the original text(s) and usually, significantly less than that" (Radev and McKeown, 2002). We adapt a recent centroid-based text summarization model that exploits the compositionality of word embeddings to obtain a single vector representation of the most meaningful words in a given text. We propose using Latent Dirichlet Allocation (LDA), a probabilistic generative model for collections of discrete data, to better select the topic words of a document when constructing the centroid vector. The LDA implementation yields overall more coherent summaries, suggesting that topic models can improve upon the general centroid-based method.
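The core idea of the centroid-based method can be sketched as follows: sum the embeddings of a document's topic words into a centroid vector, then rank sentences by the cosine similarity between their own summed embeddings and that centroid. This is a minimal illustration only; the toy embedding values and vocabulary are invented, and the hand-picked topic words stand in for the paper's LDA-based (or tf-idf-based) topic-word selection over real word2vec vectors.

```python
import numpy as np

# Toy 3-d word embeddings (values invented for illustration; the actual
# model uses pretrained GoogleNews word2vec vectors).
EMB = {
    "storm":  np.array([0.9, 0.1, 0.0]),
    "damage": np.array([0.8, 0.2, 0.1]),
    "wind":   np.array([0.7, 0.0, 0.2]),
    "recipe": np.array([0.0, 0.9, 0.8]),
    "flour":  np.array([0.1, 0.8, 0.9]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def centroid_summary(sentences, topic_words, n=1):
    """Score each sentence by cosine similarity between its summed word
    embeddings and the centroid (sum of topic-word embeddings);
    return the n best-scoring sentences."""
    centroid = np.sum([EMB[w] for w in topic_words], axis=0)

    def sent_vec(s):
        vecs = [EMB[w] for w in s.lower().split() if w in EMB]
        return np.sum(vecs, axis=0) if vecs else np.zeros(3)

    ranked = sorted(sentences, key=lambda s: cosine(sent_vec(s), centroid),
                    reverse=True)
    return ranked[:n]

sents = ["the storm caused wind damage", "the recipe needs flour"]
print(centroid_summary(sents, ["storm", "wind"]))
# -> ['the storm caused wind damage']
```

In the full model, the topic words would instead be the high-probability words of the document's dominant LDA topics, which is the substitution this work proposes.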
- Paper: Centroid-based Text Summarization through Compositionality of Word Embeddings https://www.aclweb.org/anthology/W17-1003.pdf
- Repo: https://github.com/gaetangate/text-summarizer
- Download the GoogleNews word2vec vectors from https://github.com/mmihaltz/word2vec-GoogleNews-vectors and place them in the `data_clean` folder.
- Copy all directories from `duc2004\testdata\tasks1and2\t1.2\docs` to `data_raw/articles` (the DUC data is not distributed in this repo due to licensing restrictions).
- Move the files from `duc2004\results\ROUGE\eval\peers\2` to `data_raw/summaries`.
- Run `data_raw/import_corpus.py`.
- Copy `data_raw/corpus.pkl` to `cloned_summarizer/text_summarizer`.
- Models are available in `src`. Example experiments are available in `Evaluate_DUC.ipynb`.
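The manual copy/move steps above could also be scripted. A sketch, using the paths given in this README (the `stage` helper is hypothetical, not part of the repo):

```python
import shutil
from pathlib import Path

def stage(src_dir, dst_dir, move=False):
    """Copy (or move) every entry under src_dir into dst_dir,
    creating dst_dir if needed. Returns the number of entries staged."""
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    count = 0
    for item in src.iterdir():
        target = dst / item.name
        if move:
            shutil.move(str(item), str(target))
        elif item.is_dir():
            # dirs_exist_ok requires Python 3.8+
            shutil.copytree(item, target, dirs_exist_ok=True)
        else:
            shutil.copy2(item, target)
        count += 1
    return count

# Paths as listed in the steps above:
# stage(r"duc2004\testdata\tasks1and2\t1.2\docs", "data_raw/articles")
# stage(r"duc2004\results\ROUGE\eval\peers\2", "data_raw/summaries", move=True)
# then, after running data_raw/import_corpus.py:
# shutil.copy2("data_raw/corpus.pkl", "cloned_summarizer/text_summarizer/")
```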