SoftMaskedBert-PyTorch

🙈 An unofficial implementation of SoftMaskedBert based on huggingface/transformers.

English | 简体中文

prepare env

  1. install Python 3.6+
  2. run the following command in a terminal (a quick sanity-check sketch follows this list):
pip install -r requirements.txt
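
If you want to confirm the environment is ready before training, a minimal check like the following can help. This is an illustrative sketch, not part of the repo; it assumes requirements.txt pulls in torch and transformers, which the project builds on:

```python
# sanity_check.py -- quick environment check (illustrative, not part of the repo)
import sys

assert sys.version_info >= (3, 6), "Python 3.6+ is required"

import torch          # core deep-learning framework
import transformers   # huggingface/transformers, used for the BERT backbone

print("python        :", sys.version.split()[0])
print("torch         :", torch.__version__)
print("transformers  :", transformers.__version__)
print("cuda available:", torch.cuda.is_available())
```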

prepare data

  1. download the SIGHAN data from http://nlp.ee.ncu.edu.tw/resource/csc.html
  2. unzip the files and copy all the .sgml files to data/
  3. copy SIGHAN15_CSC_TestInput.txt and SIGHAN15_CSC_TestTruth.txt to data/
  4. download https://github.com/wdimmy/Automatic-Corpus-Generation/blob/master/corpus/train.sgml to data/
  5. check that the following files are in data/
    train.sgml
    B1_training.sgml
    C1_training.sgml
    SIGHAN15_CSC_A2_Training.sgml
    SIGHAN15_CSC_B2_Training.sgml
    SIGHAN15_CSC_TestInput.txt
    SIGHAN15_CSC_TestTruth.txt
    
  6. run the following command to preprocess the data (a sketch of the kind of parsing this step performs follows this list):
python main.py --mode preproc
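
Roughly, preprocessing turns the .sgml annotations into (wrong sentence, correct sentence) pairs. The sketch below is illustrative only; the tag names follow the Automatic-Corpus-Generation train.sgml layout and are an assumption here (the SIGHAN .sgml files use a different but similar schema), and main.py --mode preproc does the real work:

```python
# preproc_sketch.py -- illustrative only; main.py --mode preproc is the real pipeline.
import re

def parse_train_sgml(path):
    """Yield (wrong_sentence, correct_sentence) pairs from train.sgml."""
    text = open(path, encoding="utf-8").read()
    for block in re.findall(r"<SENTENCE>(.*?)</SENTENCE>", text, re.S):
        m = re.search(r"<TEXT>(.*?)</TEXT>", block, re.S)
        if m is None:
            continue
        wrong = m.group(1).strip()
        correct = list(wrong)
        # each <MISTAKE> records an error position (assumed 1-based here)
        # together with the character that should replace it
        for loc, fix in re.findall(
            r"<LOCATION>(\d+)</LOCATION>.*?<CORRECTION>(.*?)</CORRECTION>",
            block, re.S,
        ):
            correct[int(loc) - 1] = fix.strip()
        yield wrong, "".join(correct)
```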

prepare bert checkpoint

  1. download the BERT checkpoint (pytorch_model.bin) from https://huggingface.co/bert-base-chinese/tree/main and place it in checkpoint/ (a loading sketch follows)
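
For reference, this is how such a checkpoint is typically loaded with transformers. The repo wires this up itself; the sketch assumes checkpoint/ also holds config.json and vocab.txt from the same Hugging Face page:

```python
# load_bert_sketch.py -- illustrative; assumes checkpoint/ also contains
# config.json and vocab.txt from https://huggingface.co/bert-base-chinese
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("checkpoint/")
bert = BertModel.from_pretrained("checkpoint/")

# recent transformers versions return a model output object
inputs = tokenizer("机器学习真有趣", return_tensors="pt")
outputs = bert(**inputs)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, 768)
```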

run

  1. run the following command to train the model.
python main.py --mode train
  2. run the following command to test the model.
python main.py --mode test
  3. run the following command to get help for all arguments (a sketch of how --loss_weight enters the training loss follows the listing).
python main.py --help
  --hard_device HARD_DEVICE
                        hardware, cpu or cuda
  --gpu_index GPU_INDEX
                        GPU index, one of [0,1,2,3]
  --load_checkpoint [LOAD_CHECKPOINT]
                        whether to load previously saved weights, one of [t,f]
  --bert_checkpoint BERT_CHECKPOINT
  --model_save_path MODEL_SAVE_PATH
  --epochs EPOCHS       number of training epochs
  --batch_size BATCH_SIZE
                        batch size
  --warmup_epochs WARMUP_EPOCHS
                        number of warmup epochs, must be less than the number of training epochs
  --lr LR               learning rate
  --accumulate_grad_batches ACCUMULATE_GRAD_BATCHES
                        number of batches to accumulate gradients over
  --mode MODE           run mode, controlling training, testing, or data preprocessing, one of [train, test, preproc]
  --loss_weight LOSS_WEIGHT
                        lambda in the paper, i.e. the weight of the correction loss
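
For context on --loss_weight: in the Soft-Masked BERT paper (reference 1), a detection network predicts an error probability p_i for each character, the input embedding is softly masked as e'_i = p_i * e_mask + (1 - p_i) * e_i, and the overall loss is L = lambda * L_correction + (1 - lambda) * L_detection. A minimal PyTorch sketch of that combination; the function and tensor names here are invented for illustration and are not the repo's API:

```python
# soft_masked_loss_sketch.py -- illustrative; names are invented, not the repo's API.
import torch
import torch.nn as nn

def soft_mask(embeddings, error_prob, mask_embedding):
    """Softly replace each token embedding with the [MASK] embedding in
    proportion to its predicted error probability:
    e'_i = p_i * e_mask + (1 - p_i) * e_i."""
    p = error_prob.unsqueeze(-1)                  # (batch, seq_len, 1)
    return p * mask_embedding + (1.0 - p) * embeddings

def combined_loss(det_logits, det_labels, cor_logits, cor_labels, loss_weight):
    """L = lambda * L_correction + (1 - lambda) * L_detection."""
    det_loss = nn.functional.binary_cross_entropy_with_logits(
        det_logits, det_labels.float())           # per-character error detection
    cor_loss = nn.functional.cross_entropy(
        cor_logits.view(-1, cor_logits.size(-1)), # (batch * seq_len, vocab)
        cor_labels.view(-1))                      # correct character ids
    return loss_weight * cor_loss + (1.0 - loss_weight) * det_loss
```

A larger loss_weight therefore pushes training toward getting corrections right, at the cost of weaker supervision for the detection network.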

experimental results

char level

| component  | precision | recall | f1     |
|------------|-----------|--------|--------|
| Detection  | 0.8417    | 0.8274 | 0.8345 |
| Correction | 0.9487    | 0.8739 | 0.9106 |

sentence level

| accuracy | precision | recall | f1     |
|----------|-----------|--------|--------|
| 0.8145   | 0.8674    | 0.7361 | 0.7964 |
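
The exact metric definitions live in main.py and may differ, but sentence-level CSC results like the table above are commonly computed as follows (a sketch under that assumption, with invented names):

```python
# metrics_sketch.py -- common sentence-level CSC metrics; the repo's exact
# definitions are in main.py and may differ from this sketch.
def sentence_metrics(srcs, preds, golds):
    """srcs: input sentences; preds: model outputs; golds: references."""
    tp = fp = n_err = correct = 0
    for src, pred, gold in zip(srcs, preds, golds):
        correct += pred == gold
        if gold != src:
            n_err += 1             # the sentence really contains errors
        if pred != src:            # the model proposed corrections
            if pred == gold:
                tp += 1            # ...and fixed the sentence completely
            else:
                fp += 1
    acc = correct / len(golds)
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / n_err if n_err else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return acc, p, r, f1
```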

references

  1. Spelling Error Correction with Soft-Masked BERT (Zhang et al., ACL 2020)
  2. http://ir.itc.ntnu.edu.tw/lre/sighan8csc.html
  3. https://github.com/wdimmy/Automatic-Corpus-Generation
  4. https://github.com/huggingface/transformers
  5. https://github.com/sunnyqiny/Confusionset-guided-Pointer-Networks-for-Chinese-Spelling-Check