- Competition Details
- Team Members ✨ 📧
- Our Approach 🔖
- Requirements
- Repo Layout
- Text Corpora 📚
- Embeddings
- Snorkel
- Model Training
- Evaluation 📈
- References
- Tags
- Future Work 🔮
This repository contains code for tackling Task 4 of the SMM4H 2020 Shared Task.
The Social Media Mining for Health Applications (#SMM4H) Shared Task involves natural language processing (NLP) challenges of using social media data for health research, including informal, colloquial expressions and misspellings of clinical concepts, noise, data sparsity, ambiguity, and multilingual posts. For each of the five tasks below, participating teams will be provided with a set of annotated tweets for developing systems, followed by a three-day window during which they will run their systems on unlabeled test data and upload the predictions of their systems to CodaLab. Information about registration, data access, paper submissions, and presentations can be found below.
Task 4: Automatic characterization of chatter related to prescription medication abuse in tweets
This new, multi-class classification task involves distinguishing, among tweets that mention at least one prescription opioid, benzodiazepine, atypical anti-psychotic, central nervous system stimulant or GABA analogue, tweets that report potential abuse/misuse (annotated as “A”) from those that report non-abuse/-misuse consumption (annotated as “C”), merely mention the medication (annotated as “M”), or are unrelated (annotated as “U”).
- Training data available: January 15, 2020 (may be sooner for some tasks)
- Test data available: April 2, 2020
- System predictions for test data due: April 5, 2020 (23:59 CodaLab server time)
- System description paper submission deadline: May 5, 2020
- Notification of acceptance of system description papers: June 10, 2020
- Camera-ready papers due: June 30, 2020
- Workshop: September 13, 2020
- All deadlines, except for system predictions (see above), are 23:59 UTC (“anywhere on Earth”).
- Isabel Metzger - [email protected]
- Allison Black - [email protected]
- Rajat Chandra - [email protected]
- Rishi Bhargava - [email protected]
- Emir Haskovic - [email protected]
- Mark Rutledge - [email protected]
- Natasha Zaliznyak - [email protected]
- Whitley Yi - [email protected]
- Our approach can be broken up into three main sections: pre-processing, model architectures, and ensembling
- Pre-processing:
    - tokenization + using pre-trained embeddings / creating our own pre-trained word representations
- Word Embeddings:
    - GloVe (Pennington et al., 2014), Word2Vec (Mikolov et al., 2013), fastText (Bojanowski et al., 2016):
        - params:
            - dim: 50, 100, 200, 300
    - Language Models: ELMo (Peters et al., 2018), BERT, SciBERT:
        - params: default
- Model Architectures:
    - fastText baseline
    - AllenNLP SciBERT text classifier
    - CNN text classifiers
    - train multiple models based on different training/validation sets, different embeddings, different features, and even totally different architectures
    - we also train with different data splits
- *For all splits not using the originally provided train and dev sets, we stratify by class.*
- Data split 1:
    - utilizing the split provided by SMM4H
    - Train: orig train.csv (N = 10,537)
    - Dev: orig validation.csv (N = 2,636)
- Data split 2:
    - using a 70% | 30% split
    - Train: 70%
    - Dev: 30%
- Data split 3:
    - using a 15% hold-out taken from the dev set
    - Train: 65%
    - Dev: 20%
    - Hold-out: 15%
- *The hold-out set is used to tune the thresholds*
- Ensemble
    - Voting:
        - models trained on different splits, weighted according to dev-set performance
        - validation metrics fine-tuned for include overall_f1 and abuse_f1
        - fixed threshold = 0.5
        - fine-tune the threshold on the hold-out set for the unfixed threshold
        - weight models according to best class F1 on validation
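As a minimal sketch of the weighted voting with a tunable abuse threshold described above (the model probabilities and weights here are hypothetical, not our actual dev-set numbers):

```python
from collections import defaultdict

CLASSES = ["a", "c", "m", "u"]

def weighted_vote(model_probs, model_weights, abuse_threshold=0.5):
    """Combine per-model class probabilities using per-model weights.

    model_probs: list of dicts mapping class -> probability, one per model.
    model_weights: list of floats (e.g., each model's dev-set F1).
    abuse_threshold: if the weighted 'a' score clears this, predict 'a'
    regardless of the argmax; the threshold can be tuned on a hold-out set.
    """
    scores = defaultdict(float)
    total = sum(model_weights)
    for probs, w in zip(model_probs, model_weights):
        for cls in CLASSES:
            scores[cls] += w * probs[cls] / total
    if scores["a"] >= abuse_threshold:
        return "a"
    return max(CLASSES, key=lambda c: scores[c])

# Two hypothetical models disagreeing on a borderline tweet:
p1 = {"a": 0.30, "c": 0.60, "m": 0.08, "u": 0.02}
p2 = {"a": 0.55, "c": 0.30, "m": 0.10, "u": 0.05}

# Lowering the abuse threshold flips the borderline prediction to 'a':
print(weighted_vote([p1, p2], [0.70, 0.72], abuse_threshold=0.40))  # a
print(weighted_vote([p1, p2], [0.70, 0.72], abuse_threshold=0.50))  # c
```

Lowering the threshold trades precision for recall on the "a" class, which is why we tune it on the hold-out set rather than the dev set used for model weighting.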
- Important packages/frameworks utilized include spaCy, fastText, ekphrasis, AllenNLP, PyTorch, and Snorkel
- To use the allennlp configs (nlp_configs/text_classification.json) with pre-trained SciBERT, download the weights with the commands below:
wget https://s3-us-west-2.amazonaws.com/ai2-s2-research/scibert/pytorch_models/scibert_scivocab_uncased.tar
tar -xvf scibert_scivocab_uncased.tar
- Exact requirements can be found in the requirements.txt file
- For specific processing done in jupyter notebooks, please find the packages listed in the beginning cells of each notebook
* notebooks - jupyter notebooks covering important steps, including embedding preprocessing, preprocessing for our allennlp models, snorkel labeling functions, evaluation/exploratory analysis, and our baseline fastText model (preprocessing, training, and saving): process-emb.ipynb, preprocessing-jsonl.ipynb, snorkel.ipynb, fasttext-supervised-model.ipynb
* rx_twitterspace - allennlp library with our dataset loaders, predictors, and models
* nlp_configs - allennlp model experiment configurations
* preds - directory with predictions
* data-orig - directory with original raw data as provided from the SMM4H official task
* docs - more documentation (md and html files)
* saved-models - directory where saved models are
* preproc - bash scripts for setup and pre-processing, such as converting fastText embeddings for spaCy and compiling the fastText library
- Original train/validation split:
- We use the train.csv and validation.csv as provided by our competition (train size = 10,537 samples)

Train:

class | counts | class % |
---|---|---|
m | 5488 | 52.08 |
c | 2940 | 27.90 |
a | 1685 | 15.99 |
u | 424 | 4.02 |

Validation:

class | counts | class % |
---|---|---|
m | 1353 | 51.35 |
c | 730 | 27.70 |
a | 448 | 17.00 |
u | 104 | 3.95 |
- Multiple Splits:
- For our ensemble of multiple text classification models, we train models on different splits (70:30) of the combined train + val data, shuffled and stratified by class
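A stratified 70:30 split like the one described can be sketched in plain Python (the class proportions below mirror the rough shape of the data, not the exact counts):

```python
import random
from collections import defaultdict

def stratified_split(labels, train_frac=0.70, seed=0):
    """Shuffle and split example indices while preserving class proportions."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    rng = random.Random(seed)
    train_idx, dev_idx = [], []
    for y, idxs in by_class.items():
        rng.shuffle(idxs)
        cut = int(round(train_frac * len(idxs)))
        train_idx.extend(idxs[:cut])
        dev_idx.extend(idxs[cut:])
    return train_idx, dev_idx

# Toy label distribution roughly shaped like the task data:
labels = ["m"] * 50 + ["c"] * 30 + ["a"] * 15 + ["u"] * 5
train, dev = stratified_split(labels)
print(len(train), len(dev))  # 70 30
```

Re-running with different seeds yields the different 70:30 splits used to train the ensemble members.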
We created word embeddings using health social media posts from Twitter and other public datasets. We used ekphrasis and the NLTK tweet tokenizer for tokenization and sentence splitting. Preprocessing can be found in the preprocessing notebook.
Sources | Sentences/Tweets | Tokens |
---|---|---|
Twitter (SMM4H) | ||
Drug Reviews | ||
Wikipedia | ||
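For illustration, a rough stand-in for the tweet tokenization step (the real pipeline uses ekphrasis and the NLTK tweet tokenizer, which handle many more cases; this regex is a simplified sketch):

```python
import re

# Simplified tweet tokenizer: keeps URLs, mentions, and hashtags as single
# tokens, handles simple contractions, and splits remaining punctuation.
TOKEN_RE = re.compile(
    r"https?://\S+"        # URLs
    r"|[@#]\w+"            # mentions and hashtags
    r"|\w+(?:'\w+)?"       # words, with simple contractions like can't
    r"|[^\w\s]"            # any other single punctuation mark
)

def tokenize_tweet(text):
    return TOKEN_RE.findall(text.lower())

print(tokenize_tweet("Took 2 #xanax last night... can't sleep @user http://t.co/x"))
```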
- We used the Snorkel framework for two major tasks: labeling functions and data augmentation
- labeling function creation Notebook
##TODO: add link for data augmentation
- data augmentation notebook
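To give a flavor of the labeling functions, here is a minimal sketch written without the snorkel dependency (in snorkel each rule would be wrapped with `@labeling_function()` and the votes combined with a `LabelModel`; the keyword cues below are hypothetical, not the ones in our notebook):

```python
# Snorkel-style labeling functions return a class vote or ABSTAIN.
ABSTAIN, ABUSE, CONSUMPTION, MENTION = -1, 0, 1, 2

def lf_abuse_slang(tweet):
    # Hypothetical slang cues suggesting potential misuse.
    cues = ("pop", "snort", "high off", "get me some")
    return ABUSE if any(c in tweet.lower() for c in cues) else ABSTAIN

def lf_prescribed(tweet):
    # Hypothetical cues suggesting legitimate consumption.
    cues = ("prescribed", "my doctor", "refill", "dose")
    return CONSUMPTION if any(c in tweet.lower() for c in cues) else ABSTAIN

def apply_lfs(tweet, lfs):
    """Apply each labeling function and collect its vote."""
    return [lf(tweet) for lf in lfs]

votes = apply_lfs("My doctor upped my dose of xanax", [lf_abuse_slang, lf_prescribed])
print(votes)  # [-1, 1]: abuse LF abstains, consumption LF fires
```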
- baseline fastText supervised classifier
- model-bin
- Notebook
- allennlp + PyTorch frameworks
- model1
- configuration
- To run the model training with this configuration:
- spacy model
allennlp train nlp_configs/text_classification.json --serialization-dir saved-models/<your-model-dir> --include-package rx_twitterspace
- Experiments run so far include using exactly what is in nlp_configs/text_classification.json, where the data is preprocessed in notebooks into a directory called data-classification-jsonl, and using the validation metric of best average F1 across all classes:
allennlp train nlp_configs/text_classification.json --serialization-dir saved-models/model1 --include-package rx_twitterspace
- end of stdout logging from training:
2020-03-25 06:51:35,274 - INFO - allennlp.models.archival - archiving weights and vocabulary to saved-models/model1/model.tar.gz
2020-03-25 06:51:55,764 - INFO - allennlp.common.util - Metrics:
{
"best_epoch": 8,
"peak_cpu_memory_MB": 1759.670272,
"training_duration": "4:16:26.044395",
"training_start_epoch": 0,
"training_epochs": 17,
"epoch": 17,
"training_m_P": 0.9747639894485474,
"training_m_R": 0.9783163070678711,
"training_m_F1": 0.9765369296073914,
"training_c_P": 0.9627350568771362,
"training_c_R": 0.9578231573104858,
"training_c_F1": 0.9602728486061096,
"training_a_P": 0.923259973526001,
"training_a_R": 0.9210682511329651,
"training_a_F1": 0.9221628308296204,
"training_u_P": 0.9810874462127686,
"training_u_R": 0.9787735939025879,
"training_u_F1": 0.9799291491508484,
"training_average_F1": 0.9597254395484924,
"training_accuracy": 0.9634620859827275,
"training_loss": 0.10317539691212446,
"training_cpu_memory_MB": 1759.670272,
"validation_m_P": 0.8063355088233948,
"validation_m_R": 0.8277900815010071,
"validation_m_F1": 0.8169219493865967,
"validation_c_P": 0.7048114538192749,
"validation_c_R": 0.7424657344818115,
"validation_c_F1": 0.7231488227844238,
"validation_a_P": 0.5392670035362244,
"validation_a_R": 0.4598214328289032,
"validation_a_F1": 0.4963855445384979,
"validation_u_P": 0.8315789699554443,
"validation_u_R": 0.7596153616905212,
"validation_u_F1": 0.7939698100090027,
"validation_average_F1": 0.7076065316796303,
"validation_accuracy": 0.7388994307400379,
"validation_loss": 1.2185947988406722,
"best_validation_m_P": 0.8156182169914246,
"best_validation_m_R": 0.8337028622627258,
"best_validation_m_F1": 0.8245614171028137,
"best_validation_c_P": 0.6991150379180908,
"best_validation_c_R": 0.7575342655181885,
"best_validation_c_F1": 0.7271531820297241,
"best_validation_a_P": 0.5498652458190918,
"best_validation_a_R": 0.4553571343421936,
"best_validation_a_F1": 0.49816855788230896,
"best_validation_u_P": 0.8666666746139526,
"best_validation_u_R": 0.75,
"best_validation_u_F1": 0.8041236996650696,
"best_validation_average_F1": 0.7135017141699791,
"best_validation_accuracy": 0.7449715370018976,
"best_validation_loss": 0.8092704885695354,
"test_m_P": 0.8156182169914246,
"test_m_R": 0.8337028622627258,
"test_m_F1": 0.8245614171028137,
"test_c_P": 0.6991150379180908,
"test_c_R": 0.7575342655181885,
"test_c_F1": 0.7271531820297241,
"test_a_P": 0.5498652458190918,
"test_a_R": 0.4553571343421936,
"test_a_F1": 0.49816855788230896,
"test_u_P": 0.8666666746139526,
"test_u_R": 0.75,
"test_u_F1": 0.8041236996650696,
"test_average_F1": 0.7135017141699791,
"test_accuracy": 0.7449715370018976,
"test_loss": 0.8023609224572239
}
- Analogy & similarity
- Run python eval-official.py to see the evaluation of predictions made by our fastText baseline model, which preprocessed text using ekphrasis:
precision recall f1-score support
a 0.55 0.35 0.43 448
c 0.67 0.69 0.68 730
m 0.76 0.85 0.80 1353
u 0.87 0.68 0.76 104
accuracy 0.72 2635
macro avg 0.71 0.64 0.67 2635
weighted avg 0.70 0.72 0.70 2635
Out of the box with fasttext.train_supervised(tweets.train)
precision recall f1-score support
a 0.59 0.27 0.37 448
c 0.65 0.68 0.67 730
m 0.74 0.88 0.80 1353
u 0.87 0.58 0.69 104
accuracy 0.71 2635
macro avg 0.71 0.60 0.63 2635
weighted avg 0.70 0.71 0.69 2635
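The macro-averaged scores in the reports above follow directly from per-class true-positive/false-positive/false-negative counts; a minimal sketch (the counts below are toy numbers, not the competition's):

```python
def prf(tp, fp, fn):
    """Precision, recall, and F1 from raw counts for one class."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def macro_f1(per_class_counts):
    """Unweighted mean of per-class F1 scores.

    per_class_counts: {label: (tp, fp, fn)}
    """
    f1s = [prf(*counts)[2] for counts in per_class_counts.values()]
    return sum(f1s) / len(f1s)

# Toy counts for the four classes (not the real competition numbers):
counts = {"a": (35, 29, 65), "c": (50, 25, 25), "m": (90, 28, 16), "u": (7, 1, 3)}
print(round(macro_f1(counts), 3))  # 0.669
```

Note that F1 can equivalently be computed as 2*tp / (2*tp + fp + fn), which avoids the intermediate precision/recall division.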
- Efficiently incorporating more sources:
- DrugBank
- UMLS
- Creating more labeling functions
- Incorporating linguistic features, e.g., wordshape and POS
- data augmentation, weak supervision, noisy labeling, word embeddings, text classification, multi-label, multi-class, scalability, ensemble