- Uses the Hugging Face Transformers library for the base transformer architecture.
- PyTorch Lightning handles training and checkpointing.
- Config-based model description for easy experimentation and research.
- Can be exported as a quantized PyTorch model for faster inference on CPU (see the sketch after this list).
- Includes helper functions for data preparation, text normalization, and offline sentence augmentation specific to punctuation and capitalization restoration.
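The repo's export path is its own; purely as a sketch of the underlying technique, PyTorch's dynamic quantization converts a transformer's linear layers to int8 for faster CPU inference (`model` below stands in for any `torch.nn.Module`):

```python
import torch

# Dynamic quantization: weights are stored as int8 and activations are
# quantized on the fly, targeting the nn.Linear layers that dominate a
# transformer's CPU cost. `model` is a placeholder nn.Module.
quantized_model = torch.quantization.quantize_dynamic(
    model,
    {torch.nn.Linear},   # layer types to quantize
    dtype=torch.qint8,
)
```

Dynamic quantization needs no calibration data, which is why it is the usual choice for transformer inference on CPU.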
# Install requirements:
pip install -r requirements.txt
# Download the raw English-language text corpus from Tatoeba
bash download_tatoeba_en_sent.sh
# Preprocess the raw text data. Check the config file for more details
python preprocess_raw_text_data.py --config="example_configs/preprocess_config_en.yaml"
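The exact schema lives in `example_configs/preprocess_config_en.yaml`; purely as an illustration, a preprocessing config for this task typically pairs input/output paths with normalization switches. Every key below is hypothetical, not the repo's actual schema:

```yaml
# Hypothetical illustration only -- see example_configs/preprocess_config_en.yaml
# for the keys this repo actually reads.
input_file: data/tatoeba_en_raw.txt
output_file: data/tatoeba_en_clean.txt
language: en
min_words_per_sentence: 3      # drop fragments too short to train on
normalize_unicode: true        # e.g. fold curly quotes to ASCII
```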
# Merge multiple data files into one, apply sentence augmentation, and tokenize. Check the config file for more details
python merge_and_tokenize_datasets.py --config="example_configs/model_config_en.yaml"
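`example_configs/model_config_en.yaml` drives both this step and training below. As a hedged sketch of what such a config usually covers (every key name here is an assumption, not the repo's schema):

```yaml
# Hypothetical illustration only -- consult example_configs/model_config_en.yaml
pretrained_model: distilbert-base-uncased   # Hugging Face backbone
data:
  files:
    - data/tatoeba_en_clean.txt
  sentence_augmentation:
    concat_probability: 0.5    # offline augmentation: join neighboring sentences
    max_tokens: 64
training:
  batch_size: 128
  max_epochs: 5
  learning_rate: 5.0e-5
```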
# Train the punctuation and capitalization model. Check the config file for more details
python train_punct_and_capit_model.py --config="example_configs/model_config_en.yaml"
from transformer_punct_and_capit.models import TransformerPunctAndCapitModel
model_path="experiments/model.pcm" # pcm_checkpoint path
model = TransformerPunctAndCapitModel.restore_model(model_path, device='cuda')
model.predict("how are you") # Single example
# Output: ["How are you?"]
model.predict_batch(["how are you"], batch_size=64, show_pbar=True) # Batch example
# Output: ["How are you?"]
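Building on the `predict_batch` call above, restoring a whole file of lower-cased, unpunctuated lines might look like this (file names are placeholders):

```python
# One sentence per line in, one restored sentence per line out.
with open("asr_output.txt") as f:
    sentences = [line.strip() for line in f if line.strip()]

restored = model.predict_batch(sentences, batch_size=64, show_pbar=True)

with open("restored.txt", "w") as f:
    f.write("\n".join(restored))
```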