ICD-10 codes classification with BERT and other models for CLEF eHealth Task 1 (CLEF 2019).

odnodn/multilabel-classification-bert-icd10

MLT-DFKI at CLEF eHealth Task 1: Multi-label Classification with BERT

Code for our submission at CLEF eHealth Task 1: Multilingual Information Extraction. For details, check here.

Requirements

If you are using the new transformers library, it is recommended to create a separate virtual environment, since this code was written against the older version (note that there will be no issues even if both versions coexist):

pip install pytorch-pretrained-bert

For migration to the new library, look here. For baseline experiments, install scikit-learn as well.

Data

Raw data can be found under exps-data/data/*.txt (this was provided by task organizers).

Preprocessed data can be found under exps-data/data/{train, dev, test}_data.pkl as pickled files. English translations are also provided for reproducibility (the Google Translate API was used to obtain them).
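A minimal sketch for inspecting the pickled splits. The file names follow the pattern above; the layout of the unpickled objects is not documented here, so the sketch only reports each object's type and length instead of assuming a structure. `load_split` and `describe` are hypothetical helper names, not part of the repository.

```python
import pickle
from pathlib import Path


def load_split(data_dir, split):
    """Load one pickled split; `split` is one of 'train', 'dev', 'test'."""
    path = Path(data_dir) / f"{split}_data.pkl"
    with path.open("rb") as f:
        # Only unpickle files you trust: pickle can execute arbitrary code.
        return pickle.load(f)


def describe(obj):
    """Return a short summary without assuming the object's layout."""
    size = len(obj) if hasattr(obj, "__len__") else "n/a"
    return f"{type(obj).__name__} (len={size})"


if __name__ == "__main__":
    for split in ("train", "dev", "test"):
        try:
            data = load_split("exps-data/data", split)
        except FileNotFoundError:
            print(f"{split}: file not found")
        else:
            print(f"{split}: {describe(data)}")
```

If unpickling fails with an import error, the files may reference classes from the repository, in which case running the script from the repository root is a reasonable first thing to try.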

ICD-10 metadata can be found under exps-data/codes_and_titles_{de, en}.txt, where each line is tab-delimited as [ICD Code Description] \t [ICD Code].
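The tab-delimited layout described above (description first, code second) could be read roughly as follows. This is a sketch: `parse_codes_and_titles` is a hypothetical helper, and the sample lines are illustrative rather than copied from the repository's files.

```python
def parse_codes_and_titles(lines):
    """Map ICD code -> description from '<description>\\t<code>' lines."""
    code_to_title = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        if "\t" not in line:
            raise ValueError(f"expected a tab-delimited line, got: {line!r}")
        # Split on the last tab so a stray tab inside the description survives.
        title, code = line.rsplit("\t", 1)
        code_to_title[code.strip()] = title.strip()
    return code_to_title


# Illustrative input only; real use would pass an open file handle, e.g.
# open("exps-data/codes_and_titles_en.txt", encoding="utf-8").
sample = ["Cholera\tA00\n", "Typhoid and paratyphoid fevers\tA01\n"]
print(parse_codes_and_titles(sample))
```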

Pre-trained Models

For static word embeddings, we used English and German vectors provided by fastText. For domain specific vectors, we used PubMed word2vec (only for English).
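For reference, fastText distributes its text-format vectors as `.vec` files: a header line with the vocabulary size and dimension, then one token per line followed by its values. The sketch below parses that text format with the standard library only; `read_vec` is a hypothetical helper, and it does not cover the binary PubMed word2vec file, which needs a word2vec-aware reader.

```python
import io


def read_vec(handle, limit=None):
    """Parse word vectors from a text .vec stream.

    First line: "<vocab_size> <dim>". Each following line: a token
    followed by <dim> space-separated floats.
    """
    header = handle.readline().split()
    dim = int(header[1])
    vectors = {}
    for i, line in enumerate(handle):
        if limit is not None and i >= limit:
            break  # read only the first `limit` entries
        parts = line.rstrip().split(" ")
        token, values = parts[0], parts[1:]
        if len(values) != dim:
            continue  # skip malformed lines
        vectors[token] = [float(v) for v in values]
    return dim, vectors


# Tiny in-memory example with made-up values.
demo = io.StringIO("2 3\nthe 0.1 0.2 0.3\nder 0.4 0.5 0.6\n")
dim, vectors = read_vec(demo)
print(dim, sorted(vectors))
```

The `limit` argument is there because the full files are large; reading only the most frequent entries is often enough for a quick check.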

For contextualized word embeddings, we used BERT-base-cased and BioBERT for English and Multilingual-BERT-base-cased for German.

Store all the models under one directory, referred to below as $MODELS.

Running BERT Models

Set the model path (e.g. for BioBERT):

export BERT_MODEL=$MODELS/pubmed_pmc_470k

Convert TF checkpoint to PyTorch model

This script is provided by the transformers library, but it may have changed in the new version, so it is recommended to use the one installed with pytorch-pretrained-bert:

python convert_tf_checkpoint_to_pytorch.py \
    --tf_checkpoint_path $BERT_MODEL/biobert_model.ckpt \
    --bert_config_file $BERT_MODEL/bert_config.json \
    --pytorch_dump_path $BERT_MODEL/pytorch_model.bin

Fine-tune the model

Configure the paths:

export DATA_DIR=exps-data/data
export BERT_EXPS_DIR=tmp/bert-exps-dir
export CUDA_VISIBLE_DEVICES=0,1,2,3

Run the model:

python bert_multilabel_run_classifier.py \
    --data_dir $DATA_DIR \
    --use_data en \
    --bert_model $BERT_MODEL \
    --task_name clef \
    --output_dir $BERT_EXPS_DIR/output \
    --cache_dir $BERT_EXPS_DIR/cache \
    --max_seq_length 256 \
    --num_train_epochs 20.0 \
    --do_train \
    --do_eval \
    --train_batch_size 64

Results for the English BERT models (BioBERT, BERT-base-cased) can be reproduced with 20 epochs; for multilingual BERT, use 25 epochs.
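To keep the per-language epoch settings in one place, the fine-tuning command could be assembled programmatically. This is only a sketch: it reuses the flags shown in this README, `build_train_command` is a hypothetical helper, the model path in the example is a placeholder, and passing `de` to `--use_data` for the multilingual run is an assumption (only `en` appears above).

```python
import shlex

# Epoch counts taken from the text above; mapping "de" to the multilingual
# BERT setting is an assumption (only --use_data en is shown in this README).
EPOCHS_BY_LANG = {"en": 20.0, "de": 25.0}


def build_train_command(lang, bert_model, data_dir, exps_dir):
    """Assemble the argument list for one fine-tuning run (not executed here)."""
    return [
        "python", "bert_multilabel_run_classifier.py",
        "--data_dir", data_dir,
        "--use_data", lang,
        "--bert_model", bert_model,
        "--task_name", "clef",
        "--output_dir", f"{exps_dir}/output",
        "--cache_dir", f"{exps_dir}/cache",
        "--max_seq_length", "256",
        "--num_train_epochs", str(EPOCHS_BY_LANG[lang]),
        "--do_train",
        "--do_eval",
        "--train_batch_size", "64",
    ]


# Placeholder paths for illustration.
cmd = build_train_command("en", "models/pubmed_pmc_470k", "exps-data/data", "tmp/bert-exps-dir")
print(shlex.join(cmd))
```

The returned list can be handed to `subprocess.run(cmd, check=True)` once the paths point at real directories.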

Inference

Run predictions (switch the input files between test and dev manually in the processor):

python bert_multilabel_run_classifier.py \
    --data_dir $DATA_DIR \
    --use_data en \
    --bert_model $BERT_EXPS_DIR/output \
    --task_name clef \
    --output_dir $BERT_EXPS_DIR/output \
    --cache_dir $BERT_EXPS_DIR/cache \
    --max_seq_length 256 \
    --do_eval

Evaluate

Use the official evaluation.py script to evaluate:

python evaluation.py --ids_file=$DATA_DIR/ids_development.txt \
                     --anns_file=$DATA_DIR/anns_train_dev.txt \
                     --dev_file=$BERT_EXPS_DIR/output/preds_development.txt \
                     --out_file=$BERT_EXPS_DIR/output/eval_output.txt
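The official script remains the reference for reported numbers. For a quick sanity check of predictions, micro-averaged precision, recall and F1 over per-document code sets can be computed as sketched below. This is an illustration of a common multi-label metric, not a reimplementation of evaluation.py, whose exact definitions are not described here; `micro_prf` and the document ids and codes in the example are made up.

```python
def micro_prf(gold, pred):
    """Micro-averaged precision, recall, F1.

    gold, pred: dicts mapping a document id to a set of ICD codes.
    """
    tp = fp = fn = 0
    for doc_id in gold.keys() | pred.keys():
        g = gold.get(doc_id, set())
        p = pred.get(doc_id, set())
        tp += len(g & p)   # codes predicted and correct
        fp += len(p - g)   # codes predicted but not in gold
        fn += len(g - p)   # gold codes that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Made-up example: 2 correct codes, 1 spurious, 1 missed.
gold = {"d1": {"A00", "B01"}, "d2": {"C02"}}
pred = {"d1": {"A00"}, "d2": {"C02", "D03"}}
print(micro_prf(gold, pred))
```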

Running Other Models

Change configurations here (no CLI yet). Main parameters are:

lang: can be one of {en, de}

load_pretrain_ft: whether to use fastText pre-trained embeddings; works for both languages.

load_pretrain_pubmed: whether to use PubMed embeddings; works for English only.

pretrain_file: path to pre-trained vectors, should be one of path/to/cc.{en, de}.300.vec when load_pretrain_ft=True and path/to/pubmed2018_w2v_400D.bin when load_pretrain_pubmed=True.

model_name: name of the model; can be one of {cnn, han, slstm, clstm}.
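Since there is no CLI yet, a small check before launching a run can catch inconsistent settings. The sketch below encodes only the constraints listed above. It assumes the settings are available as a plain dict, which may not match how the repository stores them, and `validate_config` is a hypothetical helper.

```python
VALID_LANGS = {"en", "de"}
VALID_MODELS = {"cnn", "han", "slstm", "clstm"}


def validate_config(cfg):
    """Return a list of problems found in a config dict (empty if none)."""
    problems = []
    lang = cfg.get("lang")
    pretrain_file = cfg.get("pretrain_file") or ""
    if lang not in VALID_LANGS:
        problems.append(f"lang must be one of {sorted(VALID_LANGS)}")
    if cfg.get("model_name") not in VALID_MODELS:
        problems.append(f"model_name must be one of {sorted(VALID_MODELS)}")
    if cfg.get("load_pretrain_pubmed") and lang != "en":
        problems.append("load_pretrain_pubmed works for English only")
    # File-type checks follow the example paths given above.
    if cfg.get("load_pretrain_ft") and not pretrain_file.endswith(".vec"):
        problems.append("load_pretrain_ft expects a .vec file")
    if cfg.get("load_pretrain_pubmed") and not pretrain_file.endswith(".bin"):
        problems.append("load_pretrain_pubmed expects a .bin file")
    return problems


cfg = {
    "lang": "de",
    "model_name": "han",
    "load_pretrain_ft": True,
    "load_pretrain_pubmed": False,
    "pretrain_file": "path/to/cc.de.300.vec",
}
print(validate_config(cfg))
```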

For other hyperparameters, check here.

After all the models have been tested and their results placed under one directory (the folder names have to be checked manually), use predict.py to reproduce the numbers found in Results.txt.
