Code implementation of BagBERT: BERT-based bagging-stacking
```bash
python -m pip install -r requirements.txt
```
### sample

Create CSV samples from the training dataset.

```bash
python bagbert sample data/BC7-LitCovid-Train.csv
```

Positional arguments:
- `path`: Training dataset path.

Optional arguments:
- `-h, --help`: Show this help message and exit.
- `-o, --output [OUTPUT]`: Output directory path.
- `-m, --modes MODES [MODES ...]`: Sampling modes. Default "all" stands for "fields", "mask" and "augment".
- `-f, --fields FIELDS [FIELDS ...]`: List of field orders. Default "all" stands for "tak" and "tka" (see the sketch below).
- `-a, --augment [AUGMENT]`: Model name for the context augmentation mode.
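For reference, "tak" and "tka" denote the order in which the title, abstract and keywords fields are concatenated into a single text. A minimal sketch of the idea, assuming the CSV exposes `title`, `abstract` and `keywords` columns (the actual column names and separator used by the sampling code may differ):

```python
# Illustrative sketch only: how a "tak" / "tka" field order could be assembled.
# The column names below are assumptions, not the repository's API.
FIELDS = {"t": "title", "a": "abstract", "k": "keywords"}

def build_text(row: dict, order: str = "tak") -> str:
    """Concatenate the selected fields in the requested order."""
    return " ".join(str(row[FIELDS[c]]) for c in order)

row = {"title": "A study", "abstract": "We analyse...", "keywords": "covid-19; nlp"}
print(build_text(row, "tka"))  # "tka" places the keywords before the abstract
```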
### train

Train one model on one sample.

```bash
python bagbert train experiments/pubmedbert-tak data/train-tak.csv data/val.csv
```

Positional arguments:
- `model`: Model path (a folder containing a config.json file).
- `train`: Training dataset path.
- `val`: Validation dataset path.

Optional arguments:
- `-h, --help`: Show this help message and exit.
- `-f, --fields [FIELDS]`: Selected field order. Default "tak" for title-abstract-keywords.
- `-c, --clean [CLEAN]`: Mask terms related to COVID-19. 0: False (default), 1: Remove, 2: Mask token (see the sketch below).
- `-e, --epochs [EPOCHS]`: Maximum number of epochs if training is not stopped earlier. Default 1000.
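The `--clean` option targets terms related to COVID-19, which appear in nearly every LitCovid document. A rough sketch of what the three modes could do; the term list and mask token below are assumptions, not the repository's implementation:

```python
# Illustrative sketch only: possible behaviour of the --clean modes.
import re

# Assumed term list; the repository may use a different one.
COVID_TERMS = re.compile(r"covid[- ]?19|sars[- ]?cov[- ]?2|coronavirus", re.IGNORECASE)

def clean(text: str, mode: int = 0, mask_token: str = "[MASK]") -> str:
    if mode == 1:                                   # 1: remove the matched terms
        return COVID_TERMS.sub("", text)
    if mode == 2:                                   # 2: replace them with a mask token
        return COVID_TERMS.sub(mask_token, text)
    return text                                     # 0: leave the text unchanged (default)
```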
### select

Select the k best sub-models of each model, based on the Hamming loss.

```bash
python bagbert select experiments
```

Positional arguments:
- `models`: Experiments directory path.

Optional arguments:
- `-h, --help`: Show this help message and exit.
- `-m, --min [MIN]`: Minimum number of sub-models per model. Default 1.
- `-M, --max [MAX]`: Maximum number of sub-models per model. Default 5.
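The selection criterion is the Hamming loss, i.e. the fraction of individually wrong labels over all topics and documents (lower is better). A minimal reminder with scikit-learn; the repository's own ranking code may differ:

```python
# Hamming loss on multi-label predictions: 2 wrong labels out of 8 -> 0.25.
import numpy as np
from sklearn.metrics import hamming_loss

y_true = np.array([[1, 0, 1, 0], [0, 1, 0, 0]])  # gold topic annotations
y_pred = np.array([[1, 0, 0, 0], [0, 1, 0, 1]])  # thresholded model predictions
print(hamming_loss(y_true, y_pred))              # 0.25
```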
### predict

Predict by averaging the inferences of the selected sub-models.

```bash
python bagbert predict experiments data/test.csv
```

Positional arguments:
- `models`: Experiments directory path.
- `path`: Dataset path.

Optional arguments:
- `-h, --help`: Show this help message and exit.
- `-o, --output [OUT]`: Output pickle filename. Default "predictions.pkl" (see the loading example below).
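The predictions are serialized with pickle and can be loaded back with the standard library. The exact structure of the object (e.g. an array of per-topic scores) is an assumption here:

```python
# Load the serialized predictions; inspect the object to see what it contains.
import pickle

with open("predictions.pkl", "rb") as f:
    predictions = pickle.load(f)

print(type(predictions))
```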
Due to the large size of a single sub-model (~450 MB), we cannot provide all trained sub-models (~21 GB). However, the initial weights of each model are available on the Hugging Face Hub. The model classes in `model.py` inherit their methods from the `transformers` module. The initial weights are:
- PubMedBERT, from *Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing* (Yu Gu et al.)
- Covid-SciBERT (pretrained by Tanmay Thakur)
- Clinical BERT, from *Publicly Available Clinical BERT Embeddings* (Emily Alsentzer et al.)
- BioMed RoBERTa, from *Don't Stop Pretraining: Adapt Language Models to Domains and Tasks* (Suchin Gururangan et al.)
To load initial weights that were trained with PyTorch into TensorFlow, use the `from_pt` argument, then save the model in TensorFlow format:
```python
from bagbert.model import BERTTopic

model = BERTTopic.from_pretrained('pytorch/model/path', from_pt=True)
model.save_pretrained('experiments/model_name')
```
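The saved directory then contains the config.json and TensorFlow weights expected by the `train` command, so it should be possible to reload the model directly from it (the folder name above is only an example):

```python
# Reload the converted TensorFlow checkpoint from the experiment directory.
from bagbert.model import BERTTopic

model = BERTTopic.from_pretrained('experiments/model_name')
```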
If you find BagBERT useful in your research, please cite the following paper:
```bibtex
@misc{rakotoson2021bagbert,
  title={BagBERT: BERT-based bagging-stacking for multi-topic classification},
  author={Loïc Rakotoson and Charles Letaillieur and Sylvain Massip and Fréjus Laleye},
  year={2021},
  eprint={2111.05808},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```