Code for our submission at CLEF eHealth Task 1: Multilingual Information Extraction. For details, check here.
If you are using the new transformers library, it is recommended to create a virtual environment, because this code was written against the older version (there are no issues if both versions co-exist):
pip install pytorch-pretrained-bert
For migration to the new library, look here. For baseline experiments, install scikit-learn as well.
Raw data can be found under exps-data/data/*.txt (this was provided by the task organizers).
Preprocessed data can be found under exps-data/data/{train, dev, test}_data.pkl as pickled files. English translations are also provided for reproducibility (the Google Translate API was used to get the translations).
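A minimal sketch of loading one of the pickled splits. The helper name load_split is hypothetical and not part of this repository, and the structure of the unpickled object is not described here, so inspect it before relying on any particular layout.

```python
import pickle
from pathlib import Path


def load_split(data_dir, split):
    """Load {split}_data.pkl from data_dir (hypothetical helper, not in the repo)."""
    if split not in {"train", "dev", "test"}:
        raise ValueError(f"unknown split: {split!r}")
    path = Path(data_dir) / f"{split}_data.pkl"
    with path.open("rb") as f:
        # The object layout is not documented here; check type(data) first.
        return pickle.load(f)


# Example usage (assumes the repository layout described above):
# data = load_split("exps-data/data", "train")
# print(type(data))
```

Only unpickle files from a source you trust, since pickle can execute arbitrary code on load.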
ICD-10 metadata can be found under exps-data/codes_and_titles_{de, en}.txt, where each line is tab-delimited as [ICD Code Description] \t [ICD Code].
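The tab-delimited layout above can be read with a few lines of Python. This is a sketch under stated assumptions: the function name parse_icd_titles is hypothetical, the sample rows are illustrative rather than copied from the repository files, and UTF-8 encoding is assumed when reading the real files.

```python
from typing import Dict, Iterable


def parse_icd_titles(lines: Iterable[str]) -> Dict[str, str]:
    """Map each ICD code to its description (hypothetical helper)."""
    code_to_title = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        # Description comes first and the code last; splitting from the right
        # keeps any tab inside the description from breaking the parse.
        title, code = line.rsplit("\t", 1)
        code_to_title[code.strip()] = title.strip()
    return code_to_title


# Illustrative rows in the [description] \t [code] layout
sample = ["Cholera\tA00\n", "Typhoid fever\tA01.0\n"]
print(parse_icd_titles(sample))
```

For the real files, pass an open file handle instead of the sample list, for example open("exps-data/codes_and_titles_en.txt", encoding="utf-8").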
For static word embeddings, we used English and German vectors provided by fastText. For domain specific vectors, we used PubMed word2vec (only for English).
For contextualized word embeddings, we used BERT-base-cased and BioBERT for English, and Multilingual-BERT-base-cased for German.
Store all the models under a directory MODELS.
Set the path to the model (e.g. BioBERT):
export BERT_MODEL=$MODELS/pubmed_pmc_470k
Convert the TensorFlow checkpoint to PyTorch. The conversion script is provided by the transformers library, but it may have changed in the new version, so it is recommended to use the one installed with pytorch-pretrained-bert:
python convert_tf_checkpoint_to_pytorch.py \
--tf_checkpoint_path $BERT_MODEL/biobert_model.ckpt \
--bert_config_file $BERT_MODEL/bert_config.json \
--pytorch_dump_path $BERT_MODEL/pytorch_model.bin
Configure the paths:
export DATA_DIR=exps-data/data
export BERT_EXPS_DIR=tmp/bert-exps-dir
export CUDA_VISIBLE_DEVICES=0,1,2,3
Run the model:
python bert_multilabel_run_classifier.py \
--data_dir $DATA_DIR \
--use_data en \
--bert_model $BERT_MODEL \
--task_name clef \
--output_dir $BERT_EXPS_DIR/output \
--cache_dir $BERT_EXPS_DIR/cache \
--max_seq_length 256 \
--num_train_epochs 20.0 \
--do_train \
--do_eval \
--train_batch_size 64
Results for the English BERT models (BioBERT, BERT-base-cased) can be reproduced with 20 epochs, and results for multilingual BERT with 25 epochs.
Run predictions (change files to test/dev manually in processor):
python bert_multilabel_run_classifier.py \
--data_dir $DATA_DIR \
--use_data en \
--bert_model $BERT_EXPS_DIR/output \
--task_name clef \
--output_dir $BERT_EXPS_DIR/output \
--cache_dir $BERT_EXPS_DIR/cache \
--max_seq_length 256 \
--do_eval
Use the official evaluation.py script to evaluate:
python evaluation.py --ids_file=$DATA_DIR/ids_development.txt \
--anns_file=$DATA_DIR/anns_train_dev.txt \
--dev_file=$BERT_EXPS_DIR/output/preds_development.txt \
--out_file=$BERT_EXPS_DIR/output/eval_output.txt
Change configurations here (no CLI yet). Main parameters are:
lang: can be one of {en, de}.
load_pretrain_ft: whether to use fastText pre-trained embeddings; works for both languages.
load_pretrain_pubmed: whether to use PubMed embeddings; works for English only.
pretrain_file: path to pre-trained vectors; should be one of path/to/cc.{en, de}.300.vec when load_pretrain_ft=True and path/to/pubmed2018_w2v_400D.bin when load_pretrain_pubmed=True.
model_name: name of the model; can be one of {cnn, han, slstm, clstm}.
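Since there is no CLI yet, it can help to sanity-check a configuration before starting a run. The sketch below encodes only the constraints listed above. The function check_config is hypothetical and not part of this repository, and the requirement that the fastText file name matches the chosen language is an assumption drawn from the cc.{en, de}.300.vec pattern.

```python
def check_config(lang, model_name, load_pretrain_ft=False,
                 load_pretrain_pubmed=False, pretrain_file=None):
    """Return a list of problems found in the settings (hypothetical helper)."""
    errors = []
    if lang not in {"en", "de"}:
        errors.append("lang must be one of {en, de}")
    if model_name not in {"cnn", "han", "slstm", "clstm"}:
        errors.append("model_name must be one of {cnn, han, slstm, clstm}")
    if load_pretrain_pubmed and lang != "en":
        errors.append("PubMed embeddings work for English only")
    if (load_pretrain_ft or load_pretrain_pubmed) and not pretrain_file:
        errors.append("pretrain_file is required when loading pre-trained vectors")
    if pretrain_file:
        # Assumption: the fastText file should match the selected language.
        if load_pretrain_ft and not pretrain_file.endswith(f"cc.{lang}.300.vec"):
            errors.append(f"expected a path ending in cc.{lang}.300.vec")
        if load_pretrain_pubmed and not pretrain_file.endswith("pubmed2018_w2v_400D.bin"):
            errors.append("expected a path ending in pubmed2018_w2v_400D.bin")
    return errors


# A German run that requests PubMed vectors is flagged
print(check_config("de", "cnn", load_pretrain_pubmed=True,
                   pretrain_file="path/to/pubmed2018_w2v_400D.bin"))
```

An empty list means none of the listed constraints were violated; it does not validate the other hyperparameters.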
For other hyperparameters, check here.
After all the models have been tested and their results placed under one directory (the folder names have to be checked manually), use predict.py to reproduce the numbers found in Results.txt.