
SMM4H - Team RxSpace ⭐


Competition Details

This repository contains code for tackling Task 4 of the SMM4H 2020 shared task.

The Social Media Mining for Health Applications (#SMM4H) Shared Task involves natural language processing (NLP) challenges of using social media data for health research, including informal, colloquial expressions and misspellings of clinical concepts, noise, data sparsity, ambiguity, and multilingual posts. For each of the five tasks, participating teams are provided with a set of annotated tweets for developing systems, followed by a three-day window during which they run their systems on unlabeled test data and upload their predictions to CodaLab. Information about registration, data access, paper submissions, and presentations can be found below.

Task 4: Automatic characterization of chatter related to prescription medication abuse in tweets

This new, multi-class classification task involves distinguishing, among tweets that mention at least one prescription opioid, benzodiazepine, atypical anti-psychotic, central nervous system stimulant or GABA analogue, tweets that report potential abuse/misuse (annotated as “A”) from those that report non-abuse/-misuse consumption (annotated as “C”), merely mention the medication (annotated as “M”), or are unrelated (annotated as “U”).

Timeline

  • Training data available: January 15, 2020 (may be sooner for some tasks)
  • Test data available: April 2, 2020
  • System predictions for test data due: April 5, 2020 (23:59 CodaLab server time)
  • System description paper submission deadline: May 5, 2020
  • Notification of acceptance of system description papers: June 10, 2020
  • Camera-ready papers due: June 30, 2020
  • Workshop: September 13, 2020
  • All deadlines, except for system predictions (see above), are 23:59 UTC (“anywhere on Earth”).

Team

Team members

Our Approach

  • Our approach can be broken up into 3 main sections: preprocessing, model architectures, and ensembling
  • Pre-processing: tokenization plus pre-trained embeddings, and creating our own pre-trained word representations
    • Word Embeddings:
      • GloVe (Pennington et al., 2014), Word2Vec (Mikolov et al., 2013), fastText (Bojanowski et al., 2016):

        • params:
          • dim: 50, 100, 200, 300
      • Language Model: ELMo (Peters et al., 2018), BERT, SciBERT:

        • params: default
          
    • Model Architectures:
      • fastText baseline

      • AllenNLP SciBERT text classifier

      • CNN text classifiers

      • train multiple models on different train/validation splits, with different embeddings, different features, and even entirely different architectures

    • we also train with different data splits
    • for all splits other than the originally provided train and dev sets, we stratify by class, e.g.:
      • Data split 1:
        • utilizing the split provided by SMM4H
        • Train: original train.csv (N = 10,537)
        • Dev: original validation.csv (N = 2,635)
      • Data split 2:
        • using a 70:30 split
        • Train:
        • Dev:
      • Data split 3:
        • holding out 15% of the data from the dev side
        • Train: 65%
        • Dev: 20%
        • Hold-out: 15%
        • the hold-out set is used to tune the decision thresholds
        
    • Ensemble (a sketch follows this list):
      • Voting:
        • models trained on different splits, weighted according to dev-set performance
        • validation metrics fine-tuned for include overall_f1 and abuse_f1
        • fixed threshold = 0.5
        • for the unfixed-threshold variant, fine-tune the threshold on the hold-out set
        • weight models according to their best per-class F1 on validation
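
A minimal sketch of the weighted soft-voting scheme described above, assuming each model produces a per-class probability matrix; the names here (ensemble_predict, model_probs, abuse_threshold) are illustrative, not the repo's actual API:

```python
import numpy as np

CLASSES = ["a", "c", "m", "u"]

def ensemble_predict(model_probs, weights, abuse_threshold=0.5):
    """Weighted average of per-model class probabilities.

    model_probs     -- list of (n_samples, 4) arrays, one per model
    weights         -- per-model weights, e.g. each model's dev-set F1
    abuse_threshold -- if P(a) clears this, predict "a" regardless of argmax
    """
    weights = np.asarray(weights, dtype=float) / np.sum(weights)
    avg = sum(w * p for w, p in zip(weights, model_probs))
    preds = np.argmax(avg, axis=1)
    # Threshold override for the minority abuse class: tuned on the 15%
    # hold-out split in the unfixed-threshold variant, fixed at 0.5 otherwise.
    a_idx = CLASSES.index("a")
    preds[avg[:, a_idx] >= abuse_threshold] = a_idx
    return [CLASSES[i] for i in preds]
```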

Requirements

  • Important packages/frameworks utilized include spaCy, fastText, ekphrasis, AllenNLP, PyTorch, and Snorkel
  • To use the AllenNLP configs (nlp_configs/text_classification.json) with pre-trained SciBERT, download the weights with the commands below:
wget https://s3-us-west-2.amazonaws.com/ai2-s2-research/scibert/pytorch_models/scibert_scivocab_uncased.tar
tar -xvf scibert_scivocab_uncased.tar
  • Exact requirements can be found in the requirements.txt file
  • For specific processing done in Jupyter notebooks, please see the packages listed in the beginning cells of each notebook

Repo Layout

* notebooks - Jupyter notebooks covering important steps, including embedding preprocessing, preprocessing for our AllenNLP models, Snorkel labeling functions, and evaluation/exploratory analysis, plus our baseline fastText model (preprocessing, training, and saving): process-emb.ipynb, preprocessing-jsonl.ipynb, snorkel.ipynb, fasttext-supervised-model.ipynb
* rx_twitterspace - allennlp library with our dataset loaders, predictors, and models
* nlp_configs - allennlp model experiment configurations
* preds - directory with predictions
* data-orig - directory with original raw data as provided from the SMM4H official task
* docs - more documentation (md and html files)
* saved-models - directory where trained models are saved
* preproc - bash scripts for important setup and pre-processing, such as converting fastText embeddings for spaCy and compiling the fastText library

Text Corpora

Supervised Learning

  • Original train/validation split:
    • We use train.csv and validation.csv as provided by the competition.

Train: 10,537 samples

| class | count | %     |
|-------|-------|-------|
| m     | 5488  | 52.08 |
| c     | 2940  | 27.90 |
| a     | 1685  | 15.99 |
| u     | 424   | 4.02  |

Validation/dev: 2,635 samples

| class | count | %     |
|-------|-------|-------|
| m     | 1353  | 51.35 |
| c     | 730   | 27.70 |
| a     | 448   | 17.00 |
| u     | 104   | 3.95  |
  • Multiple Splits:
    • For our ensemble of multiple text classification models, we train models on different 70:30 splits of the combined train + validation data, shuffled and stratified by class (a sketch follows below)
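
A minimal sketch of one such shuffled, class-stratified re-split using scikit-learn; the column name "class" and the seed are assumptions, not the repo's actual schema:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Combine the original competition train and validation sets.
combined = pd.concat([pd.read_csv("train.csv"), pd.read_csv("validation.csv")])

train_df, dev_df = train_test_split(
    combined,
    test_size=0.30,              # 70% train / 30% dev
    stratify=combined["class"],  # keep the a/c/m/u proportions in both splits
    shuffle=True,
    random_state=42,             # vary the seed for different ensemble splits
)
```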

Unsupervised Learning

We created word embeddings using health-related social media posts from Twitter and other public datasets. We used ekphrasis and the NLTK tweet tokenizer for sentence splitting and tokenization. Preprocessing can be found in the preprocessing notebook.

| Sources         | Sentences/Tweets | Tokens |
|-----------------|------------------|--------|
| Twitter (SMM4H) |                  |        |
| Drug Reviews    |                  |        |
| Wikipedia       |                  |        |
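
A minimal sketch of the tokenize-then-train pipeline described above, using NLTK's tweet tokenizer and fastText's unsupervised mode; file names and hyperparameters are illustrative:

```python
import fasttext
from nltk.tokenize import TweetTokenizer

tok = TweetTokenizer(preserve_case=False)

# Write one whitespace-tokenized post per line, as fastText expects.
with open("corpus.txt", "w") as out:
    for post in open("raw_posts.txt"):
        out.write(" ".join(tok.tokenize(post)) + "\n")

# Skip-gram embeddings; dim matches the 50/100/200/300 grid listed above.
model = fasttext.train_unsupervised("corpus.txt", model="skipgram", dim=100)
model.save_model("rx-embeddings.bin")
```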

Embeddings

Snorkel

Labeling Functions

  • We used the Snorkel framework for two major tasks: labeling functions and data augmentation
  • labeling function creation: see the snorkel.ipynb notebook

##TODO: add link for data augmentation
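
A minimal sketch of a Snorkel labeling function in the style this section describes; the keyword list and integer label scheme are illustrative, not taken from snorkel.ipynb:

```python
from snorkel.labeling import labeling_function, PandasLFApplier

ABSTAIN, A, C, M, U = -1, 0, 1, 2, 3

@labeling_function()
def lf_abuse_keywords(x):
    """Vote "A" (potential abuse/misuse) when abuse-flavored phrases appear;
    abstain otherwise so other labeling functions can decide."""
    text = x.text.lower()
    if any(kw in text for kw in ("to get high", "popped a", "snorted")):
        return A
    return ABSTAIN

# Apply all labeling functions to a pandas DataFrame of tweets:
# applier = PandasLFApplier(lfs=[lf_abuse_keywords])
# label_matrix = applier.apply(df)
```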

Model Training

  • baseline fastText supervised classifier
  • AllenNLP + PyTorch frameworks
  • model1 configuration: nlp_configs/text_classification.json
  • spaCy model
  • To run model training with this configuration:
allennlp train nlp_configs/text_classification.json --serialization-dir saved-models/<your-model-dir> --include-package rx_twitterspace
  • Experiments run so far use exactly what is in nlp_configs/text_classification.json, with data preprocessed in notebooks into a directory called data-classification-jsonl, and the validation metric set to best average F1 across all classes:
allennlp train nlp_configs/text_classification.json --serialization-dir saved-models/model1 --include-package rx_twitterspace
  • end of stdout logging from training:
2020-03-25 06:51:35,274 - INFO - allennlp.models.archival - archiving weights and vocabulary to saved-models/model1/model.tar.gz
2020-03-25 06:51:55,764 - INFO - allennlp.common.util - Metrics: {
  "best_epoch": 8,
  "peak_cpu_memory_MB": 1759.670272,
  "training_duration": "4:16:26.044395",
  "training_start_epoch": 0,
  "training_epochs": 17,
  "epoch": 17,
  "training_m_P": 0.9747639894485474,
  "training_m_R": 0.9783163070678711,
  "training_m_F1": 0.9765369296073914,
  "training_c_P": 0.9627350568771362,
  "training_c_R": 0.9578231573104858,
  "training_c_F1": 0.9602728486061096,
  "training_a_P": 0.923259973526001,
  "training_a_R": 0.9210682511329651,
  "training_a_F1": 0.9221628308296204,
  "training_u_P": 0.9810874462127686,
  "training_u_R": 0.9787735939025879,
  "training_u_F1": 0.9799291491508484,
  "training_average_F1": 0.9597254395484924,
  "training_accuracy": 0.9634620859827275,
  "training_loss": 0.10317539691212446,
  "training_cpu_memory_MB": 1759.670272,
  "validation_m_P": 0.8063355088233948,
  "validation_m_R": 0.8277900815010071,
  "validation_m_F1": 0.8169219493865967,
  "validation_c_P": 0.7048114538192749,
  "validation_c_R": 0.7424657344818115,
  "validation_c_F1": 0.7231488227844238,
  "validation_a_P": 0.5392670035362244,
  "validation_a_R": 0.4598214328289032,
  "validation_a_F1": 0.4963855445384979,
  "validation_u_P": 0.8315789699554443,
  "validation_u_R": 0.7596153616905212,
  "validation_u_F1": 0.7939698100090027,
  "validation_average_F1": 0.7076065316796303,
  "validation_accuracy": 0.7388994307400379,
  "validation_loss": 1.2185947988406722,
  "best_validation_m_P": 0.8156182169914246,
  "best_validation_m_R": 0.8337028622627258,
  "best_validation_m_F1": 0.8245614171028137,
  "best_validation_c_P": 0.6991150379180908,
  "best_validation_c_R": 0.7575342655181885,
  "best_validation_c_F1": 0.7271531820297241,
  "best_validation_a_P": 0.5498652458190918,
  "best_validation_a_R": 0.4553571343421936,
  "best_validation_a_F1": 0.49816855788230896,
  "best_validation_u_P": 0.8666666746139526,
  "best_validation_u_R": 0.75,
  "best_validation_u_F1": 0.8041236996650696,
  "best_validation_average_F1": 0.7135017141699791,
  "best_validation_accuracy": 0.7449715370018976,
  "best_validation_loss": 0.8092704885695354,
  "test_m_P": 0.8156182169914246,
  "test_m_R": 0.8337028622627258,
  "test_m_F1": 0.8245614171028137,
  "test_c_P": 0.6991150379180908,
  "test_c_R": 0.7575342655181885,
  "test_c_F1": 0.7271531820297241,
  "test_a_P": 0.5498652458190918,
  "test_a_R": 0.4553571343421936,
  "test_a_F1": 0.49816855788230896,
  "test_u_P": 0.8666666746139526,
  "test_u_R": 0.75,
  "test_u_F1": 0.8041236996650696,
  "test_average_F1": 0.7135017141699791,
  "test_accuracy": 0.7449715370018976,
  "test_loss": 0.8023609224572239
}

Evaluation

Embeddings

  • Analogy & similarity
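
A minimal sketch of intrinsic embedding evaluation via similarity and analogy queries, using gensim; the vectors file name and the query terms are illustrative:

```python
from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format("rx-embeddings.vec")

# Nearest neighbours for a drug mention (similarity check).
print(kv.most_similar("xanax", topn=5))

# Analogy: xanax - benzodiazepine + opioid ~ ?
print(kv.most_similar(positive=["xanax", "opioid"],
                      negative=["benzodiazepine"], topn=5))
```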

Text classification

  • Run python eval-official.py to see the evaluation on predictions made by our fastText baseline model, which preprocessed text using ekphrasis:
              precision    recall  f1-score   support

           a       0.55      0.35      0.43       448
           c       0.67      0.69      0.68       730
           m       0.76      0.85      0.80      1353
           u       0.87      0.68      0.76       104

    accuracy                           0.72      2635
   macro avg       0.71      0.64      0.67      2635
weighted avg       0.70      0.72      0.70      2635

Out of the box with fasttext.train_supervised("tweets.train"):

              precision    recall  f1-score   support

           a       0.59      0.27      0.37       448
           c       0.65      0.68      0.67       730
           m       0.74      0.88      0.80      1353
           u       0.87      0.58      0.69       104

    accuracy                           0.71      2635
   macro avg       0.71      0.60      0.63      2635
weighted avg       0.70      0.71      0.69      2635
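
A minimal sketch of that out-of-the-box baseline; the file names are illustrative, and the training file must be in fastText's __label__ format (e.g. "__label__a <tweet text>"):

```python
import fasttext

# Train with default hyperparameters, as in the report above.
model = fasttext.train_supervised("tweets.train")

# Predict the label of a single preprocessed tweet.
labels, probs = model.predict("example tweet text here")
print(labels, probs)

# fastText's built-in evaluation: (n_examples, precision@1, recall@1).
print(model.test("tweets.valid"))
```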

Future Work

  • Efficiently incorporating more sources:
    • DrugBank
    • UMLS
  • Creating more labeling functions
  • Incorporating linguistic features, e.g., word shape and POS tags

Tags

  • data augmentation, weak supervision, noisy labeling, word embeddings, text classification, multi-label, multi-class, scalability, ensemble
