🙈 An unofficial implementation of SoftMaskedBert based on huggingface/transformers.
English | 简体中文
- install Python 3.6+
- run the following command in a terminal.
pip install -r requirements.txt
- download the SIGHAN data from http://nlp.ee.ncu.edu.tw/resource/csc.html
- unzip the file and copy all the `.sgml` files to `data/`
- copy `SIGHAN15_CSC_TestInput.txt` and `SIGHAN15_CSC_TestTruth.txt` to `data/`
- download https://github.com/wdimmy/Automatic-Corpus-Generation/blob/master/corpus/train.sgml and place it in `data/`
- check that the following files are in `data/`:
  `train.sgml`, `B1_training.sgml`, `C1_training.sgml`, `SIGHAN15_CSC_A2_Training.sgml`, `SIGHAN15_CSC_B2_Training.sgml`, `SIGHAN15_CSC_TestInput.txt`, `SIGHAN15_CSC_TestTruth.txt`
- run the following command to process the data
python main.py --mode preproc
- download the BERT checkpoint (`pytorch_model.bin`) from https://huggingface.co/bert-base-chinese/tree/main and place it in `checkpoint/`
- run the following command to train the model.
python main.py --mode train
- run the following command to test the model.
python main.py --mode test
- you can use the following command to get help on the arguments (an example invocation and a short sketch of the loss follow the argument list below).
python main.py --help
--hard_device HARD_DEVICE
hardware to run on, cpu or cuda
--gpu_index GPU_INDEX
gpu index, one of [0,1,2,3]
--load_checkpoint [LOAD_CHECKPOINT]
whether to load previously saved weights, one of [t,f]
--bert_checkpoint BERT_CHECKPOINT
--model_save_path MODEL_SAVE_PATH
--epochs EPOCHS number of training epochs
--batch_size BATCH_SIZE
batch size
--warmup_epochs WARMUP_EPOCHS
number of warmup epochs, must be smaller than the number of training epochs
--lr LR learning rate
--accumulate_grad_batches ACCUMULATE_GRAD_BATCHES
number of batches over which gradients are accumulated
--mode MODE run mode, controls training, testing, or data preprocessing, one of [train, test, preproc]
--loss_weight LOSS_WEIGHT
lambda in the paper, i.e. the weight of the correction loss (see the sketch below)
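- for example, a training run on GPU 0 might look like the following (the argument values here are only illustrative, not tuned recommendations):
python main.py --mode train --hard_device cuda --gpu_index 0 --epochs 10 --warmup_epochs 2 --batch_size 32 --accumulate_grad_batches 4 --lr 2e-5 --loss_weight 0.8
  with these settings the effective batch size is batch_size × accumulate_grad_batches = 32 × 4 = 128.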
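The `--loss_weight` argument corresponds to the λ in the Soft-Masked BERT paper: the overall objective is a weighted sum of the correction and detection losses, L = λ·L_c + (1−λ)·L_d, and the per-token detection probabilities are also used to soft-mask the input embeddings before correction. Below is a minimal PyTorch sketch of these two ideas; the function and tensor names are assumptions for illustration, not taken from this repository's code.

```python
import torch
import torch.nn.functional as F

def soft_mask_embeddings(input_emb, mask_emb, p_err):
    """Soft masking from the paper: e'_i = p_i * e_mask + (1 - p_i) * e_i."""
    # input_emb: (batch, seq_len, hidden) input token embeddings
    # mask_emb:  (hidden,) embedding of the [MASK] token
    # p_err:     (batch, seq_len) probability that each token is an error
    p = p_err.unsqueeze(-1)                      # (batch, seq_len, 1)
    return p * mask_emb + (1.0 - p) * input_emb  # broadcasts over the hidden dim

def combined_loss(correction_logits, target_ids, detection_probs, error_labels, loss_weight=0.8):
    """L = lambda * L_correction + (1 - lambda) * L_detection, lambda = --loss_weight."""
    # correction_logits: (batch, seq_len, vocab) output of the BERT-based corrector
    # target_ids:        (batch, seq_len) gold token ids
    # detection_probs:   (batch, seq_len) sigmoid outputs of the detection network
    # error_labels:      (batch, seq_len) 1 where the input token is wrong, else 0
    loss_c = F.cross_entropy(correction_logits.transpose(1, 2), target_ids)
    loss_d = F.binary_cross_entropy(detection_probs, error_labels.float())
    return loss_weight * loss_c + (1.0 - loss_weight) * loss_d

if __name__ == "__main__":
    # shape check with random tensors (hidden size of bert-base-chinese is 768)
    batch, seq_len, hidden = 2, 8, 768
    emb, mask_emb = torch.randn(batch, seq_len, hidden), torch.randn(hidden)
    p_err = torch.rand(batch, seq_len)
    print(soft_mask_embeddings(emb, mask_emb, p_err).shape)  # torch.Size([2, 8, 768])
```

In the paper the detection network is a bidirectional GRU over the BERT input embeddings, and its per-token error probabilities drive both the soft masking and the detection loss.
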
component | precision | recall | f1 |
---|---|---|---|
Detection | 0.8417 | 0.8274 | 0.8345 |
Correction | 0.9487 | 0.8739 | 0.9106 |

accuracy | precision | recall | f1 |
---|---|---|---|
0.8145 | 0.8674 | 0.7361 | 0.7964 |