
BagBERT: BERT-based bagging-stacking for multi-topic classification


Code implementation of BagBERT: BERT-based bagging-stacking for multi-topic classification.

Requirements

python -m pip install -r requirements.txt

Commands

sample: Create CSV samples from the training dataset.

python bagbert sample data/BC7-LitCovid-Train.csv


positional arguments:
  path                              Training dataset path.

optional arguments:
  -h, --help                        Show this help message and exit
  -o, --output [OUTPUT]             Output dir path.
  -m, --modes MODES [MODES ...]     Sampling mode. Default "all" stands for "fields", "mask" and "augment".
  -f, --fields FIELDS [FIELDS ...]  List of field orders. Default "all" stands for "tak" and "tka".
  -a, --augment [AUGMENT]           Model name for context augmentation mode.
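
For example, the following call (paths are illustrative; the flags are those listed above) writes only the "fields" and "mask" samples for both field orders to a chosen output directory:

python bagbert sample data/BC7-LitCovid-Train.csv -o data/samples -m fields mask -f tak tka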

train: Train one model for one sample.

python bagbert train experiments/pubmedbert-tak data/train-tak.csv data/val.csv


positional arguments:
  model                     Model path (Folder with config.json file).
  train                     Training dataset path.
  val                       Validation dataset path.

optional arguments:
  -h, --help                Show this help message and exit
  -f, --fields [FIELDS]     Selected field order. Default "tak" for title-abstract-keywords.
  -c, --clean  [CLEAN]      Mask terms related to COVID-19. 0: False (default), 1: Remove, 2: Mask token.
  -e, --epochs [EPOCHS]     Maximum number of epochs if not stopped. Default 1000.
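
As an illustration (paths and flag values are hypothetical), the command below trains on the "tka" field order, replaces COVID-19 terms with the mask token, and caps training at 100 epochs:

python bagbert train experiments/pubmedbert-tka data/train-tka.csv data/val.csv -f tka -c 2 -e 100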

select: Select k sub-models based on the Hamming loss.

python bagbert select experiments


positional arguments:
  models                Experiments directory path.

optional arguments:
  -h, --help            Show this help message and exit
  -m, --min [MIN]       Minimum number of sub-models (k) per model. Default 1.
  -M, --max [MAX]       Maximum number of sub-models (k) per model. Default 5.
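
For example, to keep between 2 and 4 sub-models per model (bounds chosen arbitrarily):

python bagbert select experiments -m 2 -M 4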

predict: Predict by averaging inferences.

python bagbert predict experiments data/test.csv


positional arguments:
  models                Experiments directory path.
  path                  Dataset path.

optional arguments:
  -h, --help            Show this help message and exit.
  -o, --output [OUT]    Output pickle filename. Default "predictions.pkl".
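
For example, writing the averaged predictions to a custom filename (the name is arbitrary):

python bagbert predict experiments data/test.csv -o test_predictions.pkl

The resulting file is a standard pickle and can be inspected in Python; this minimal sketch makes no assumption about the object stored inside:

import pickle

# Load the predictions written by the command above
with open('test_predictions.pkl', 'rb') as f:
    predictions = pickle.load(f)

print(type(predictions))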

Weights

Since a single sub-model is large (~450 MB), we cannot provide all trained sub-models (~21 GB in total). However, the initial weights of each model are available on the Hugging Face Hub. The model classes in model.py inherit the model methods from the transformers module.

To load initial weights that were trained with PyTorch into TensorFlow, use the from_pt argument, then save the model in TensorFlow format:

from bagbert.model import BERTTopic

# Convert the PyTorch checkpoint to TensorFlow and save it for training
model = BERTTopic.from_pretrained('pytorch/model/path', from_pt=True)
model.save_pretrained('experiments/model_name')
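
Once saved, the sub-model can be reloaded directly from the experiment folder, without from_pt (the directory name is just the one chosen above):

from bagbert.model import BERTTopic

# Reload the TensorFlow weights saved above
model = BERTTopic.from_pretrained('experiments/model_name')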

Citation

If you find BagBERT useful in your research, please cite the following paper:

@misc{rakotoson2021bagbert,
  title={BagBERT: BERT-based bagging-stacking for multi-topic classification}, 
  author={Loïc Rakotoson and Charles Letaillieur and Sylvain Massip and Fréjus Laleye},
  year={2021},
  eprint={2111.05808},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
