Code implementation of BagBERT: BERT-based bagging-stacking
```bash
python -m pip install -r requirements.txt
```
### sample

Create CSV samples from the training dataset.

```bash
python bagbert sample data/BC7-LitCovid-Train.csv
```

Positional arguments:
- `path`: Training dataset path.

Optional arguments:
- `-h, --help`: Show this help message and exit.
- `-o, --output [OUTPUT]`: Output directory path.
- `-m, --modes MODES [MODES ...]`: Sampling modes. Default "all" stands for "fields", "mask" and "augment".
- `-f, --fields FIELDS [FIELDS ...]`: List of field orders. Default "all" stands for "tak" and "tka" (see the sketch below).
- `-a, --augment [AUGMENT]`: Model name for the context augmentation mode.
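For reference, "tak" and "tka" denote the order in which the title, abstract and keywords fields are concatenated into a single text. A minimal sketch of the idea, assuming the CSV exposes `title`, `abstract` and `keywords` columns (the actual column names and separator used by the sampling code may differ):

```python
# Illustrative sketch only: how a "tak" / "tka" field order could be assembled.
# The column names below are assumptions, not the repository's API.
FIELDS = {"t": "title", "a": "abstract", "k": "keywords"}

def build_text(row: dict, order: str = "tak") -> str:
    """Concatenate the selected fields in the requested order."""
    return " ".join(str(row[FIELDS[c]]) for c in order)

row = {"title": "A study", "abstract": "We analyse...", "keywords": "covid-19; nlp"}
print(build_text(row, "tka"))  # "tka" places the keywords before the abstract
```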
### train

Train one model on one sample.

```bash
python bagbert train experiments/pubmedbert-tak data/train-tak.csv data/val.csv
```

Positional arguments:
- `model`: Model path (a folder containing a config.json file).
- `train`: Training dataset path.
- `val`: Validation dataset path.

Optional arguments:
- `-h, --help`: Show this help message and exit.
- `-f, --fields [FIELDS]`: Selected field order. Default "tak" for title-abstract-keywords.
- `-c, --clean [CLEAN]`: Mask terms related to COVID-19. 0: False (default), 1: Remove, 2: Mask token (see the sketch below).
- `-e, --epochs [EPOCHS]`: Maximum number of epochs if training is not stopped earlier. Default 1000.
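The `--clean` option targets terms related to COVID-19, which appear in nearly every LitCovid document. A rough sketch of what the three modes could do; the term list and mask token below are assumptions, not the repository's implementation:

```python
# Illustrative sketch only: possible behaviour of the --clean modes.
import re

# Assumed term list; the repository may use a different one.
COVID_TERMS = re.compile(r"covid[- ]?19|sars[- ]?cov[- ]?2|coronavirus", re.IGNORECASE)

def clean(text: str, mode: int = 0, mask_token: str = "[MASK]") -> str:
    if mode == 1:                                   # 1: remove the matched terms
        return COVID_TERMS.sub("", text)
    if mode == 2:                                   # 2: replace them with a mask token
        return COVID_TERMS.sub(mask_token, text)
    return text                                     # 0: leave the text unchanged (default)
```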
### select

Select the k best sub-models of each model, based on the Hamming loss.

```bash
python bagbert select experiments
```

Positional arguments:
- `models`: Experiments directory path.

Optional arguments:
- `-h, --help`: Show this help message and exit.
- `-m, --min [MIN]`: Minimum number of sub-models per model. Default 1.
- `-M, --max [MAX]`: Maximum number of sub-models per model. Default 5.
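The selection criterion is the Hamming loss, i.e. the fraction of individually wrong labels over all topics and documents (lower is better). A minimal reminder with scikit-learn; the repository's own ranking code may differ:

```python
# Hamming loss on multi-label predictions: 2 wrong labels out of 8 -> 0.25.
import numpy as np
from sklearn.metrics import hamming_loss

y_true = np.array([[1, 0, 1, 0], [0, 1, 0, 0]])  # gold topic annotations
y_pred = np.array([[1, 0, 0, 0], [0, 1, 0, 1]])  # thresholded model predictions
print(hamming_loss(y_true, y_pred))              # 0.25
```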
### predict

Predict by averaging the inferences of the selected sub-models.

```bash
python bagbert predict experiments data/test.csv
```

Positional arguments:
- `models`: Experiments directory path.
- `path`: Dataset path.

Optional arguments:
- `-h, --help`: Show this help message and exit.
- `-o, --output [OUT]`: Output pickle filename. Default "predictions.pkl" (see the loading example below).
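The predictions are serialized with pickle and can be loaded back with the standard library. The exact structure of the object (e.g. an array of per-topic scores) is an assumption here:

```python
# Load the serialized predictions; inspect the object to see what it contains.
import pickle

with open("predictions.pkl", "rb") as f:
    predictions = pickle.load(f)

print(type(predictions))
```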
Due to the large size of a single sub-model (~450 MB), we cannot provide all trained sub-models (~21 GB). However, the initial weights of each model are available on the Hugging Face Hub. The model classes in `model.py` inherit their methods from the `transformers` module. The initial weights are:
- PubMedBERT, from *Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing* (Yu Gu et al.)
- Covid-SciBERT (pretrained by Tanmay Thakur)
- Clinical BERT, from *Publicly Available Clinical BERT Embeddings* (Emily Alsentzer et al.)
- BioMed RoBERTa, from *Don't Stop Pretraining: Adapt Language Models to Domains and Tasks* (Suchin Gururangan et al.)
To load initial weights that were trained with PyTorch into TensorFlow, use the `from_pt` argument, then save the model in TensorFlow format:
```python
from bagbert.model import BERTTopic

model = BERTTopic.from_pretrained('pytorch/model/path', from_pt=True)
model.save_pretrained('experiments/model_name')
```
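The saved directory then contains the config.json and TensorFlow weights expected by the `train` command, so it should be possible to reload the model directly from it (the folder name above is only an example):

```python
# Reload the converted TensorFlow checkpoint from the experiment directory.
from bagbert.model import BERTTopic

model = BERTTopic.from_pretrained('experiments/model_name')
```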
If you find BagBERT useful in your research, please cite the following paper:
```bibtex
@misc{rakotoson2021bagbert,
  title={BagBERT: BERT-based bagging-stacking for multi-topic classification},
  author={Loïc Rakotoson and Charles Letaillieur and Sylvain Massip and Fréjus Laleye},
  year={2021},
  eprint={2111.05808},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```