Quynh M. Nguyen a, b and Kyle Cranmer a, c
a Physics Department, New York University, New York 10003
b Applied Math Lab, Courant Institute, New York University, New York 10012
c Center for Data Science, New York University, New York 10011
This project runs dynamic embedded topic modeling on abstracts of arXiv articles to discover how topics in STEM change over time. It is an implementation of the Dynamic Embedded Topic Model (DETM) by Adji B. Dieng, Francisco J. R. Ruiz, and David M. Blei of Columbia University.
Visit https://www.kaggle.com/Cornell-University/arxiv to get arxiv-metadata-oai-snapshot.json, which contains about 2 million records. Each record has about a dozen fields; we are interested in abstract, categories, and update_date.
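As a rough illustration of how the three fields can be pulled out of the snapshot, here is a minimal sketch. It assumes the file stores one JSON object per line and that categories is a space-separated string; the function name iter_records and its category argument are hypothetical and are not taken from arxivtools.

```python
import json


def iter_records(path, category=None):
    """Yield (abstract, categories, update_date) for each record.

    Assumes one JSON object per line (an assumption about the snapshot
    layout) and a space-separated "categories" string.
    """
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            cats = record["categories"].split()
            # Optionally keep only records tagged with one category.
            if category is not None and category not in cats:
                continue
            yield record["abstract"], cats, record["update_date"]
```

Because the function is a generator, the snapshot is streamed line by line and never loaded into memory as a whole, which matters for a file with about 2 million records. For example, iter_records(path, category="hep-ph") would yield only hep-ph abstracts.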
Modify the path to arxiv-metadata-oai-snapshot.json in arxivtools/word2vec.py and run:
python arxivtools/word2vec.py
This will read in the abstracts, remove punctuation, remove the stop words listed in arxivtools/stops.txt, remove rare words that appear in fewer than 30 abstracts and common words that appear in more than 70% of abstracts, and produce vector representations of all remaining words (default embedding dimension = 300) using the original settings from the Mikolov et al. 2013 NIPS paper. The results are saved as embeddings.txt, where each line is a word followed by 300 numbers. The process takes about an hour per 150,000 abstracts on a laptop.
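The filtering described above can be sketched in a few lines of plain Python. This is not the code in arxivtools/word2vec.py; the function names are hypothetical, and lowercasing and replacing punctuation with spaces are assumptions made for the example. The thresholds mirror the ones stated above (at least 30 abstracts, at most 70% of abstracts).

```python
import string
from collections import Counter

# Map every ASCII punctuation character to a space.
_PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def tokenize(text, stops):
    """Lowercase, strip punctuation, split on whitespace, drop stop words."""
    words = text.lower().translate(_PUNCT_TABLE).split()
    return [w for w in words if w not in stops]


def filter_by_document_frequency(docs, min_df=30, max_df=0.7):
    """Keep words found in at least min_df docs and at most max_df of all docs."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each word once per document
    limit = max_df * len(docs)
    keep = {w for w, c in df.items() if min_df <= c <= limit}
    return [[w for w in doc if w in keep] for doc in docs], keep
```

Document frequency counts the number of abstracts containing a word, not the total number of occurrences, which is why each abstract is reduced to a set before counting.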
Clone our fork of the original DETM repository
This is the main repo for DETM. We have made some changes to fix runtime errors, match the settings in the paper, and adapt to the arXiv metadata file, but made no changes to the model:
git clone https://github.com/quynhneo/DETM
The environment can be set up with pip or conda. For example, using conda:
conda create --name detm --file requirements.txt
conda activate detm
This step converts each abstract to a bag of words (a bag of integer tokens, to be exact) with a timestamp, and splits the data into train, validation, and test sets. These are stored in .mat files. It also creates the vocabulary of all the abstracts, stored in vocab.txt. This is just a list of words, without vectors. The vectors are taken from embeddings.txt, so ideally the two lists contain the same words, or vocab is a large subset of embeddings.
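To make the "bag of integer tokens" idea concrete, here is a minimal sketch. It is not the code in scripts/data_undebates.py; the function names are hypothetical, and using one time slice per year derived from update_date is an assumption made for the example.

```python
from collections import Counter


def build_vocab(docs):
    """Return the sorted list of distinct words across all tokenized docs."""
    return sorted({w for doc in docs for w in doc})


def to_bag_of_words(doc, word2id):
    """Convert a tokenized doc to parallel lists of token ids and counts."""
    counts = Counter(word2id[w] for w in doc if w in word2id)
    ids = sorted(counts)
    return ids, [counts[i] for i in ids]


def timestamp_index(update_date, first_year):
    """Map a 'YYYY-MM-DD' date to a zero-based yearly time slice (assumed)."""
    return int(update_date[:4]) - first_year
```

Words outside the vocabulary are dropped silently, so the order of words in the abstract is lost and only the per-token counts remain.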
Modify the path to arxiv-metadata-oai-snapshot.json in scripts/data_undebates.py and run:
python scripts/data_undebates.py
This takes about 5 minutes per 150,000 abstracts on a laptop. With default settings, the output is saved in script/split_paragraph_False/min_df_30.
To run with all default settings, make changes in two lines:
https://github.com/quynhneo/DETM/blob/master/main.py#L34: the parent folder of the preprocessed data folder min_df_30.
https://github.com/quynhneo/DETM/blob/master/main.py#L35: the path to the prefit embeddings file embeddings.txt.
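Before launching training, it may be worth checking how much of the vocabulary actually has a prefit vector. The sketch below is not part of the repository. It assumes vocab.txt holds one word per line and relies on the embeddings.txt layout described earlier (a word followed by its numbers on each line); all function names are hypothetical.

```python
def load_embedding_words(path):
    """Return the set of words that have a vector in an embeddings file."""
    words = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) > 1:  # a word plus at least one number
                words.add(parts[0])
    return words


def load_vocab(path):
    """Read a vocabulary file, assuming one word per line."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


def coverage(vocab, embedding_words):
    """Fraction of vocabulary words that have a prefit embedding."""
    if not vocab:
        return 0.0
    return sum(w in embedding_words for w in vocab) / len(vocab)
```

A coverage close to 1.0 indicates that the two preprocessing steps were run on the same set of abstracts with compatible filtering.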
Run with all default settings:
python main.py
This stage takes much longer and should be run on a GPU (CPU mode is too slow even on a 16-core node).
More instructions for running on a cluster with CUDA are here
The output will be 3 .mat files in results.
Edit beta_file in plot_word_evolution.py to be the path to the file ending in _beta in results and run:
python plot_word_evolution.py
The plot below shows results for DETM trained on the hep-ph (high energy physics phenomenology) category, containing 150,000 abstracts. Six out of 50 topics are shown here. For each topic, the probabilities of some selected words (in most cases, words with high probability) are plotted against time (2007-2020).
In topics 33 and 34, the peak probability of the word 750 coincides with the flurry of papers on a possible discovery of new physics at 750 GeV around 2015-2016, which turned out to be just a statistical fluke. Topic 38 shows the increase in higgs around the time of the discovery of the Higgs boson in 2012.
The above plots are from running 400 epochs on 150,000 hep-ph abstracts. We used one Nvidia RTX8000 GPU, and the runtime was 13 hours.
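The word-probability curves discussed above amount to slicing the learned topic-word distributions along the time axis. The following sketch shows that slicing on its own, without plotting. It assumes the values from the _beta file are indexed as topic, then time step, then vocabulary position; that ordering, and both function names, are assumptions for illustration rather than a description of plot_word_evolution.py.

```python
def word_trajectory(beta, vocab, topic, word):
    """Probability of `word` under `topic` at every time step.

    beta is assumed to be indexable as beta[topic][time][vocab_index].
    """
    v = vocab.index(word)
    return [float(beta[topic][t][v]) for t in range(len(beta[topic]))]


def top_words(beta, vocab, topic, t, n=5):
    """The n most probable words of `topic` at time step t."""
    probs = beta[topic][t]
    order = sorted(range(len(vocab)), key=lambda v: probs[v], reverse=True)
    return [vocab[v] for v in order[:n]]
```

Both functions use only plain indexing, so they work on nested lists as well as on array types that support the same bracket indexing.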