
Adversarial Training for Low-Resource Disfluency Correction

This repository contains the code and data used in the above short paper, co-authored by Vineet Bhat, Preethi Jyothi, and Pushpak Bhattacharyya and accepted at ACL 2023 Findings. It provides a Seq-GAN-BERT adversarial training model for disfluency correction that learns from both labeled and unlabeled sentences.

Abstract: Disfluencies commonly occur in conversational speech. Speech with disfluencies can result in noisy Automatic Speech Recognition (ASR) transcripts, which affects downstream tasks like machine translation. In this paper, we propose an adversarially-trained sequence-tagging model for Disfluency Correction (DC) that utilizes a small amount of labeled real disfluent data in conjunction with a large amount of unlabeled data. We show the benefit of our proposed technique, which crucially depends on synthetically generated disfluent data, by evaluating it for DC in three Indian languages: Bengali, Hindi, and Marathi (all from the Indo-Aryan family). Our technique also performs well in removing stuttering disfluencies in ASR transcripts introduced by speech impairments. We achieve an average 6.15 points improvement in F1-score over competitive baselines across all three languages mentioned. To the best of our knowledge, we are the first to utilize adversarial training for DC and use it to correct stuttering disfluencies in English, establishing a new benchmark for this task.
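To make the sequence-tagging formulation concrete, here is a minimal sketch (not taken from the repository) of how token-level labels map a disfluent sentence to its fluent version. The label convention used here (1 = disfluent token to drop, 0 = fluent token to keep) is an assumption for illustration only.

```python
# Minimal illustration of disfluency correction as sequence tagging.
# Label convention (assumed for this sketch): 1 = disfluent token (drop),
# 0 = fluent token (keep).

def apply_tags(tokens, labels):
    """Reconstruct a fluent sentence by dropping tokens tagged disfluent."""
    return " ".join(tok for tok, lab in zip(tokens, labels) if lab == 0)

tokens = "I want to um I want to book a ticket".split()
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
print(apply_tags(tokens, labels))  # -> "I want to book a ticket"
```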

Steps to run the code

  1. Create a new conda environment and install the necessary packages from the requirements file: conda create --name <env-name> --file requirements.txt (substitute your preferred environment name for <env-name>).
  2. Convert your dataset into the format expected by the training notebook; the functions in the ./utils/ folder can be used for this purpose (a hypothetical sketch of this step follows the list).
     a) Make sure your data is a CSV file with the column names "Disfluent Sentence" and "Fluent Sentence" containing the parallel DC data.
     b) Add the path to the above CSV file and run python3 ./utils/LabelsFromPairs.py to clean the data, remove punctuation, and create the corresponding .dis and .labels files.
     c) Add the paths of the .dis and .labels files created in step 2b) and run python3 ./utils/PrepareDataset.py to create TSV files for both labeled and unlabeled data.
  3. Run the trainer.ipynb notebook to train the model with adversarial training. Appropriate comments and instructions are included in the notebook.
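As a rough illustration of step 2, the sketch below shows one way the .dis and .labels files could be derived from the CSV. The actual logic lives in ./utils/LabelsFromPairs.py; the LCS-based alignment heuristic, the input/output file names, and the space-separated .labels layout here are all assumptions for illustration, not the repository's exact implementation.

```python
# Hypothetical sketch of the .dis/.labels preparation in step 2; the real
# logic is in ./utils/LabelsFromPairs.py. File names and the LCS labeling
# heuristic below are assumptions for illustration only.
import csv
import re
from difflib import SequenceMatcher

def clean(text):
    """Lowercase, strip punctuation, and tokenize on whitespace."""
    return re.sub(r"[^\w\s]", "", text.lower()).split()

def token_labels(disfluent, fluent):
    """Label each disfluent-sentence token: 0 = kept in fluent, 1 = removed."""
    labels = [1] * len(disfluent)
    matcher = SequenceMatcher(a=disfluent, b=fluent)
    for block in matcher.get_matching_blocks():
        for i in range(block.a, block.a + block.size):
            labels[i] = 0
    return labels

with open("dc_data.csv", newline="", encoding="utf-8") as f, \
     open("train.dis", "w", encoding="utf-8") as dis_out, \
     open("train.labels", "w", encoding="utf-8") as lab_out:
    for row in csv.DictReader(f):
        dis = clean(row["Disfluent Sentence"])
        flu = clean(row["Fluent Sentence"])
        dis_out.write(" ".join(dis) + "\n")
        lab_out.write(" ".join(map(str, token_labels(dis, flu))) + "\n")
```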

For the best training settings and data usage, refer to the paper here: https://aclanthology.org/2023.findings-acl.514. The ./data/sample_data folder shows the format of the training files expected by trainer.ipynb. The stuttering dataset is shared at ./data/stuttering-dataset.csv.
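For a quick look at the shared stuttering dataset, something like the following should work; whether its columns match the "Disfluent Sentence" / "Fluent Sentence" format used by the preparation scripts is an assumption, so the snippet prints the actual column names rather than relying on them.

```python
# Quick inspection of the shared stuttering dataset. Column names are not
# asserted here, since the file's exact schema is an assumption.
import pandas as pd

df = pd.read_csv("./data/stuttering-dataset.csv")
print(df.columns.tolist())
print(df.head())
```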

Citation

@inproceedings{bhat-etal-2023-adversarial,
    title = "Adversarial Training for Low-Resource Disfluency Correction",
    author = "Bhat, Vineet  and
      Jyothi, Preethi  and
      Bhattacharyya, Pushpak",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2023",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.findings-acl.514",
    pages = "8112--8122",
}

