SoftMaskedBert-PyTorch

🙈 An unofficial implementation of SoftMaskedBert based on huggingface/transformers.

English | 简体中文

prepare env

  1. install Python 3.6+
  2. run the following command in a terminal (a quick sanity-check sketch follows this list):
pip install -r requirements.txt
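
If you want to confirm the environment is ready before training, a minimal check like the following can help. This is an illustrative sketch, not part of the repo; it assumes requirements.txt pulls in torch and transformers, which the project builds on:

```python
# sanity_check.py -- quick environment check (illustrative, not part of the repo)
import sys

assert sys.version_info >= (3, 6), "Python 3.6+ is required"

import torch          # core deep-learning framework
import transformers   # huggingface/transformers, used for the BERT backbone

print("python        :", sys.version.split()[0])
print("torch         :", torch.__version__)
print("transformers  :", transformers.__version__)
print("cuda available:", torch.cuda.is_available())
```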

prepare data

  1. download the SIGHAN data from http://nlp.ee.ncu.edu.tw/resource/csc.html
  2. unzip the files and copy all the .sgml files to data/
  3. copy SIGHAN15_CSC_TestInput.txt and SIGHAN15_CSC_TestTruth.txt to data/
  4. download https://github.com/wdimmy/Automatic-Corpus-Generation/blob/master/corpus/train.sgml to data/
  5. check that the following files are in data/
    train.sgml
    B1_training.sgml
    C1_training.sgml
    SIGHAN15_CSC_A2_Training.sgml
    SIGHAN15_CSC_B2_Training.sgml
    SIGHAN15_CSC_TestInput.txt
    SIGHAN15_CSC_TestTruth.txt
    
  6. run the following command to preprocess the data (a sketch of the kind of parsing this step performs follows this list):
python main.py --mode preproc
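
Roughly, preprocessing turns the .sgml annotations into (wrong sentence, correct sentence) pairs. The sketch below is illustrative only; the tag names follow the Automatic-Corpus-Generation train.sgml layout and are an assumption here (the SIGHAN .sgml files use a different but similar schema), and main.py --mode preproc does the real work:

```python
# preproc_sketch.py -- illustrative only; main.py --mode preproc is the real pipeline.
import re

def parse_train_sgml(path):
    """Yield (wrong_sentence, correct_sentence) pairs from train.sgml."""
    text = open(path, encoding="utf-8").read()
    for block in re.findall(r"<SENTENCE>(.*?)</SENTENCE>", text, re.S):
        m = re.search(r"<TEXT>(.*?)</TEXT>", block, re.S)
        if m is None:
            continue
        wrong = m.group(1).strip()
        correct = list(wrong)
        # each <MISTAKE> records an error position (assumed 1-based here)
        # together with the character that should replace it
        for loc, fix in re.findall(
            r"<LOCATION>(\d+)</LOCATION>.*?<CORRECTION>(.*?)</CORRECTION>",
            block, re.S,
        ):
            correct[int(loc) - 1] = fix.strip()
        yield wrong, "".join(correct)
```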

prepare bert checkpoint

  1. download the BERT checkpoint (pytorch_model.bin) from https://huggingface.co/bert-base-chinese/tree/main and place it in checkpoint/ (a loading sketch follows)
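
For reference, this is how such a checkpoint is typically loaded with transformers. The repo wires this up itself; the sketch assumes checkpoint/ also holds config.json and vocab.txt from the same Hugging Face page:

```python
# load_bert_sketch.py -- illustrative; assumes checkpoint/ also contains
# config.json and vocab.txt from https://huggingface.co/bert-base-chinese
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("checkpoint/")
bert = BertModel.from_pretrained("checkpoint/")

# recent transformers versions return a model output object
inputs = tokenizer("机器学习真有趣", return_tensors="pt")
outputs = bert(**inputs)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, 768)
```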

run

  1. run the following command to train the model.
python main.py --mode train
  2. run the following command to test the model.
python main.py --mode test
  3. run the following command to get help for all arguments (a sketch of how --loss_weight enters the training loss follows the listing).
python main.py --help
  --hard_device HARD_DEVICE
                        hardware, cpu or cuda
  --gpu_index GPU_INDEX
                        GPU index, one of [0,1,2,3]
  --load_checkpoint [LOAD_CHECKPOINT]
                        whether to load previously saved weights, one of [t,f]
  --bert_checkpoint BERT_CHECKPOINT
  --model_save_path MODEL_SAVE_PATH
  --epochs EPOCHS       number of training epochs
  --batch_size BATCH_SIZE
                        batch size
  --warmup_epochs WARMUP_EPOCHS
                        number of warmup epochs, must be less than the number of training epochs
  --lr LR               learning rate
  --accumulate_grad_batches ACCUMULATE_GRAD_BATCHES
                        number of batches to accumulate gradients over
  --mode MODE           run mode, controlling training, testing, or data preprocessing, one of [train, test, preproc]
  --loss_weight LOSS_WEIGHT
                        lambda in the paper, i.e. the weight of the correction loss
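
For context on --loss_weight: in the Soft-Masked BERT paper (reference 1), a detection network predicts an error probability p_i for each character, the input embedding is softly masked as e'_i = p_i * e_mask + (1 - p_i) * e_i, and the overall loss is L = lambda * L_correction + (1 - lambda) * L_detection. A minimal PyTorch sketch of that combination; the function and tensor names here are invented for illustration and are not the repo's API:

```python
# soft_masked_loss_sketch.py -- illustrative; names are invented, not the repo's API.
import torch
import torch.nn as nn

def soft_mask(embeddings, error_prob, mask_embedding):
    """Softly replace each token embedding with the [MASK] embedding in
    proportion to its predicted error probability:
    e'_i = p_i * e_mask + (1 - p_i) * e_i."""
    p = error_prob.unsqueeze(-1)                  # (batch, seq_len, 1)
    return p * mask_embedding + (1.0 - p) * embeddings

def combined_loss(det_logits, det_labels, cor_logits, cor_labels, loss_weight):
    """L = lambda * L_correction + (1 - lambda) * L_detection."""
    det_loss = nn.functional.binary_cross_entropy_with_logits(
        det_logits, det_labels.float())           # per-character error detection
    cor_loss = nn.functional.cross_entropy(
        cor_logits.view(-1, cor_logits.size(-1)), # (batch * seq_len, vocab)
        cor_labels.view(-1))                      # correct character ids
    return loss_weight * cor_loss + (1.0 - loss_weight) * det_loss
```

A larger loss_weight therefore pushes training toward getting corrections right, at the cost of weaker supervision for the detection network.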

experimental results

char level

| component  | precision | recall | f1     |
|------------|-----------|--------|--------|
| Detection  | 0.8417    | 0.8274 | 0.8345 |
| Correction | 0.9487    | 0.8739 | 0.9106 |

sentence level

| accuracy | precision | recall | f1     |
|----------|-----------|--------|--------|
| 0.8145   | 0.8674    | 0.7361 | 0.7964 |
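
The exact metric definitions live in main.py and may differ, but sentence-level CSC results like the table above are commonly computed as follows (a sketch under that assumption, with invented names):

```python
# metrics_sketch.py -- common sentence-level CSC metrics; the repo's exact
# definitions are in main.py and may differ from this sketch.
def sentence_metrics(srcs, preds, golds):
    """srcs: input sentences; preds: model outputs; golds: references."""
    tp = fp = n_err = correct = 0
    for src, pred, gold in zip(srcs, preds, golds):
        correct += pred == gold
        if gold != src:
            n_err += 1             # the sentence really contains errors
        if pred != src:            # the model proposed corrections
            if pred == gold:
                tp += 1            # ...and fixed the sentence completely
            else:
                fp += 1
    acc = correct / len(golds)
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / n_err if n_err else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return acc, p, r, f1
```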

references

  1. Spelling Error Correction with Soft-Masked BERT (Zhang et al., ACL 2020)
  2. http://ir.itc.ntnu.edu.tw/lre/sighan8csc.html
  3. https://github.com/wdimmy/Automatic-Corpus-Generation
  4. https://github.com/huggingface/transformers
  5. https://github.com/sunnyqiny/Confusionset-guided-Pointer-Networks-for-Chinese-Spelling-Check