
SMM4H - Team RxSpace ⭐


Competition Details

This repository contains code for tackling Task 4 of the SMM4H 2020 shared task.

The Social Media Mining for Health Applications (#SMM4H) Shared Task involves natural language processing (NLP) challenges of using social media data for health research, including informal, colloquial expressions and misspellings of clinical concepts, noise, data sparsity, ambiguity, and multilingual posts. For each of the five tasks, participating teams are provided with a set of annotated tweets for developing systems, followed by a three-day window during which they run their systems on unlabeled test data and upload their predictions to CodaLab. Information about registration, data access, paper submissions, and presentations can be found below.

Task 4: Automatic characterization of chatter related to prescription medication abuse in tweets

This new, multi-class classification task involves distinguishing, among tweets that mention at least one prescription opioid, benzodiazepine, atypical anti-psychotic, central nervous system stimulant or GABA analogue, tweets that report potential abuse/misuse (annotated as “A”) from those that report non-abuse/-misuse consumption (annotated as “C”), merely mention the medication (annotated as “M”), or are unrelated (annotated as “U”).

Timeline

  • Training data available: January 15, 2020 (may be sooner for some tasks)
  • Test data available: April 2, 2020
  • System predictions for test data due: April 5, 2020 (23:59 CodaLab server time)
  • System description paper submission deadline: May 5, 2020
  • Notification of acceptance of system description papers: June 10, 2020
  • Camera-ready papers due: June 30, 2020
  • Workshop: September 13, 2020
  • All deadlines, except for system predictions (see above), are 23:59 UTC (“anywhere on Earth”).

Team

Team members

Our Approach

  • Our approach can be broken up into 3 main sections: preprocessing, model architectures, and ensembling
  • Pre-processing: tokenization plus pre-trained embeddings, and creating our own pre-trained word representations
    • Word Embeddings:
      • GloVe (Pennington et al., 2014), Word2Vec (Mikolov et al., 2013), fastText (Bojanowski et al., 2016):

        • params:
          • dim: 50, 100, 200, 300
      • Language Model: ELMo (Peters et al., 2018), BERT, SciBERT:

        • params: default
          
    • Model Architectures:
      • fastText baseline

      • AllenNLP SciBERT text classifier

      • CNN text classifiers

      • train multiple models on different train/validation splits, with different embeddings, different features, and even entirely different architectures

    • we also train with different data splits
    • for all splits other than the originally provided train and dev sets, we stratify by class, e.g.:
      • Data split 1:
        • utilizing the split provided by SMM4H
        • Train: original train.csv (N = 10,537)
        • Dev: original validation.csv (N = 2,635)
      • Data split 2:
        • using a 70:30 split
        • Train:
        • Dev:
      • Data split 3:
        • holding out 15% of the data from the dev side
        • Train: 65%
        • Dev: 20%
        • Hold-out: 15%
        • the hold-out set is used to tune the decision thresholds
        
    • Ensemble (a sketch follows this list):
      • Voting:
        • models trained on different splits, weighted according to dev-set performance
        • validation metrics fine-tuned for include overall_f1 and abuse_f1
        • fixed threshold = 0.5
        • for the unfixed-threshold variant, fine-tune the threshold on the hold-out set
        • weight models according to their best per-class F1 on validation
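
A minimal sketch of the weighted soft-voting scheme described above, assuming each model produces a per-class probability matrix; the names here (ensemble_predict, model_probs, abuse_threshold) are illustrative, not the repo's actual API:

```python
import numpy as np

CLASSES = ["a", "c", "m", "u"]

def ensemble_predict(model_probs, weights, abuse_threshold=0.5):
    """Weighted average of per-model class probabilities.

    model_probs     -- list of (n_samples, 4) arrays, one per model
    weights         -- per-model weights, e.g. each model's dev-set F1
    abuse_threshold -- if P(a) clears this, predict "a" regardless of argmax
    """
    weights = np.asarray(weights, dtype=float) / np.sum(weights)
    avg = sum(w * p for w, p in zip(weights, model_probs))
    preds = np.argmax(avg, axis=1)
    # Threshold override for the minority abuse class: tuned on the 15%
    # hold-out split in the unfixed-threshold variant, fixed at 0.5 otherwise.
    a_idx = CLASSES.index("a")
    preds[avg[:, a_idx] >= abuse_threshold] = a_idx
    return [CLASSES[i] for i in preds]
```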

Requirements

  • Important packages/frameworks utilized include spaCy, fastText, ekphrasis, AllenNLP, PyTorch, and Snorkel
  • To use the AllenNLP configs (nlp_configs/text_classification.json) with pre-trained SciBERT, download the weights with the commands below:
wget https://s3-us-west-2.amazonaws.com/ai2-s2-research/scibert/pytorch_models/scibert_scivocab_uncased.tar
tar -xvf scibert_scivocab_uncased.tar
  • Exact requirements can be found in the requirements.txt file
  • For specific processing done in Jupyter notebooks, please see the packages listed in the beginning cells of each notebook

Repo Layout

* notebooks - Jupyter notebooks covering important steps, including embedding preprocessing, preprocessing for our AllenNLP models, Snorkel labeling functions, and evaluation/exploratory analysis, plus our baseline fastText model (preprocessing, training, and saving): process-emb.ipynb, preprocessing-jsonl.ipynb, snorkel.ipynb, fasttext-supervised-model.ipynb
* rx_twitterspace - allennlp library with our dataset loaders, predictors, and models
* nlp_configs - allennlp model experiment configurations
* preds - directory with predictions
* data-orig - directory with original raw data as provided from the SMM4H official task
* docs - more documentation (md and html files)
* saved-models - directory where trained models are saved
* preproc - bash scripts for important setup and pre-processing, such as converting fastText embeddings for spaCy and compiling the fastText library

Text Corpora

Supervised Learning

  • Original train/validation split:
    • We use train.csv and validation.csv as provided by the competition.

Train: 10,537 samples

| class | count | %     |
|-------|-------|-------|
| m     | 5488  | 52.08 |
| c     | 2940  | 27.90 |
| a     | 1685  | 15.99 |
| u     | 424   | 4.02  |

Validation/dev: 2,635 samples

| class | count | %     |
|-------|-------|-------|
| m     | 1353  | 51.35 |
| c     | 730   | 27.70 |
| a     | 448   | 17.00 |
| u     | 104   | 3.95  |
  • Multiple Splits:
    • For our ensemble of multiple text classification models, we train models on different 70:30 splits of the combined train + validation data, shuffled and stratified by class (a sketch follows below)
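
A minimal sketch of one such shuffled, class-stratified re-split using scikit-learn; the column name "class" and the seed are assumptions, not the repo's actual schema:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Combine the original competition train and validation sets.
combined = pd.concat([pd.read_csv("train.csv"), pd.read_csv("validation.csv")])

train_df, dev_df = train_test_split(
    combined,
    test_size=0.30,              # 70% train / 30% dev
    stratify=combined["class"],  # keep the a/c/m/u proportions in both splits
    shuffle=True,
    random_state=42,             # vary the seed for different ensemble splits
)
```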

Unsupervised Learning

We created word embeddings using health-related social media posts from Twitter and other public datasets. We used ekphrasis and the NLTK tweet tokenizer for sentence splitting and tokenization. Preprocessing can be found in the preprocessing notebook.

| Sources         | Sentences/Tweets | Tokens |
|-----------------|------------------|--------|
| Twitter (SMM4H) |                  |        |
| Drug Reviews    |                  |        |
| Wikipedia       |                  |        |
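
A minimal sketch of the tokenize-then-train pipeline described above, using NLTK's tweet tokenizer and fastText's unsupervised mode; file names and hyperparameters are illustrative:

```python
import fasttext
from nltk.tokenize import TweetTokenizer

tok = TweetTokenizer(preserve_case=False)

# Write one whitespace-tokenized post per line, as fastText expects.
with open("corpus.txt", "w") as out:
    for post in open("raw_posts.txt"):
        out.write(" ".join(tok.tokenize(post)) + "\n")

# Skip-gram embeddings; dim matches the 50/100/200/300 grid listed above.
model = fasttext.train_unsupervised("corpus.txt", model="skipgram", dim=100)
model.save_model("rx-embeddings.bin")
```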

Embeddings

Snorkel

Labeling Functions

  • We used the Snorkel framework for two major tasks: labeling functions and data augmentation
  • labeling function creation: see the snorkel.ipynb notebook

##TODO: add link for data augmentation
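
A minimal sketch of a Snorkel labeling function in the style this section describes; the keyword list and integer label scheme are illustrative, not taken from snorkel.ipynb:

```python
from snorkel.labeling import labeling_function, PandasLFApplier

ABSTAIN, A, C, M, U = -1, 0, 1, 2, 3

@labeling_function()
def lf_abuse_keywords(x):
    """Vote "A" (potential abuse/misuse) when abuse-flavored phrases appear;
    abstain otherwise so other labeling functions can decide."""
    text = x.text.lower()
    if any(kw in text for kw in ("to get high", "popped a", "snorted")):
        return A
    return ABSTAIN

# Apply all labeling functions to a pandas DataFrame of tweets:
# applier = PandasLFApplier(lfs=[lf_abuse_keywords])
# label_matrix = applier.apply(df)
```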

Model Training

  • baseline fastText supervised classifier
  • AllenNLP + PyTorch frameworks
  • model1 configuration: nlp_configs/text_classification.json
  • spaCy model
  • To run model training with this configuration:
allennlp train nlp_configs/text_classification.json --serialization-dir saved-models/<your-model-dir> --include-package rx_twitterspace
  • Experiments run so far use exactly what is in nlp_configs/text_classification.json, with data preprocessed in notebooks into a directory called data-classification-jsonl, and the validation metric set to best average F1 across all classes:
allennlp train nlp_configs/text_classification.json --serialization-dir saved-models/model1 --include-package rx_twitterspace
  • end of stdout logging from training:
2020-03-25 06:51:35,274 - INFO - allennlp.models.archival - archiving weights and vocabulary to saved-models/model1/model.tar.gz
2020-03-25 06:51:55,764 - INFO - allennlp.common.util - Metrics: {
  "best_epoch": 8,
  "peak_cpu_memory_MB": 1759.670272,
  "training_duration": "4:16:26.044395",
  "training_start_epoch": 0,
  "training_epochs": 17,
  "epoch": 17,
  "training_m_P": 0.9747639894485474,
  "training_m_R": 0.9783163070678711,
  "training_m_F1": 0.9765369296073914,
  "training_c_P": 0.9627350568771362,
  "training_c_R": 0.9578231573104858,
  "training_c_F1": 0.9602728486061096,
  "training_a_P": 0.923259973526001,
  "training_a_R": 0.9210682511329651,
  "training_a_F1": 0.9221628308296204,
  "training_u_P": 0.9810874462127686,
  "training_u_R": 0.9787735939025879,
  "training_u_F1": 0.9799291491508484,
  "training_average_F1": 0.9597254395484924,
  "training_accuracy": 0.9634620859827275,
  "training_loss": 0.10317539691212446,
  "training_cpu_memory_MB": 1759.670272,
  "validation_m_P": 0.8063355088233948,
  "validation_m_R": 0.8277900815010071,
  "validation_m_F1": 0.8169219493865967,
  "validation_c_P": 0.7048114538192749,
  "validation_c_R": 0.7424657344818115,
  "validation_c_F1": 0.7231488227844238,
  "validation_a_P": 0.5392670035362244,
  "validation_a_R": 0.4598214328289032,
  "validation_a_F1": 0.4963855445384979,
  "validation_u_P": 0.8315789699554443,
  "validation_u_R": 0.7596153616905212,
  "validation_u_F1": 0.7939698100090027,
  "validation_average_F1": 0.7076065316796303,
  "validation_accuracy": 0.7388994307400379,
  "validation_loss": 1.2185947988406722,
  "best_validation_m_P": 0.8156182169914246,
  "best_validation_m_R": 0.8337028622627258,
  "best_validation_m_F1": 0.8245614171028137,
  "best_validation_c_P": 0.6991150379180908,
  "best_validation_c_R": 0.7575342655181885,
  "best_validation_c_F1": 0.7271531820297241,
  "best_validation_a_P": 0.5498652458190918,
  "best_validation_a_R": 0.4553571343421936,
  "best_validation_a_F1": 0.49816855788230896,
  "best_validation_u_P": 0.8666666746139526,
  "best_validation_u_R": 0.75,
  "best_validation_u_F1": 0.8041236996650696,
  "best_validation_average_F1": 0.7135017141699791,
  "best_validation_accuracy": 0.7449715370018976,
  "best_validation_loss": 0.8092704885695354,
  "test_m_P": 0.8156182169914246,
  "test_m_R": 0.8337028622627258,
  "test_m_F1": 0.8245614171028137,
  "test_c_P": 0.6991150379180908,
  "test_c_R": 0.7575342655181885,
  "test_c_F1": 0.7271531820297241,
  "test_a_P": 0.5498652458190918,
  "test_a_R": 0.4553571343421936,
  "test_a_F1": 0.49816855788230896,
  "test_u_P": 0.8666666746139526,
  "test_u_R": 0.75,
  "test_u_F1": 0.8041236996650696,
  "test_average_F1": 0.7135017141699791,
  "test_accuracy": 0.7449715370018976,
  "test_loss": 0.8023609224572239
}

Evaluation

Embeddings

  • Analogy & similarity
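
A minimal sketch of intrinsic embedding evaluation via similarity and analogy queries, using gensim; the vectors file name and the query terms are illustrative:

```python
from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format("rx-embeddings.vec")

# Nearest neighbours for a drug mention (similarity check).
print(kv.most_similar("xanax", topn=5))

# Analogy: xanax - benzodiazepine + opioid ~ ?
print(kv.most_similar(positive=["xanax", "opioid"],
                      negative=["benzodiazepine"], topn=5))
```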

Text classification

  • Run python eval-official.py to see the evaluation on predictions made by our fastText baseline model, which preprocessed text using ekphrasis:
              precision    recall  f1-score   support

           a       0.55      0.35      0.43       448
           c       0.67      0.69      0.68       730
           m       0.76      0.85      0.80      1353
           u       0.87      0.68      0.76       104

    accuracy                           0.72      2635
   macro avg       0.71      0.64      0.67      2635
weighted avg       0.70      0.72      0.70      2635

Out of the box with fasttext.train_supervised("tweets.train"):

              precision    recall  f1-score   support

           a       0.59      0.27      0.37       448
           c       0.65      0.68      0.67       730
           m       0.74      0.88      0.80      1353
           u       0.87      0.58      0.69       104

    accuracy                           0.71      2635
   macro avg       0.71      0.60      0.63      2635
weighted avg       0.70      0.71      0.69      2635
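
A minimal sketch of that out-of-the-box baseline; the file names are illustrative, and the training file must be in fastText's __label__ format (e.g. "__label__a <tweet text>"):

```python
import fasttext

# Train with default hyperparameters, as in the report above.
model = fasttext.train_supervised("tweets.train")

# Predict the label of a single preprocessed tweet.
labels, probs = model.predict("example tweet text here")
print(labels, probs)

# fastText's built-in evaluation: (n_examples, precision@1, recall@1).
print(model.test("tweets.valid"))
```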

Future Work

  • Efficiently incorporating more sources:
    • DrugBank
    • UMLS
  • Creating more labeling functions
  • Incorporating linguistic features, e.g., word shape and POS tags

Tags

  • data augmentation, weak supervision, noisy labeling, word embeddings, text classification, multi-label, multi-class, scalability, ensemble
