This repository contains code and resources for training a speech-to-text model on the Uzbek voice dataset with NVIDIA NeMo's Automatic Speech Recognition (ASR) toolkit.
Prerequisites:
- A machine with an NVIDIA GPU
- Conda environment manager
- Python 3.10
- PyTorch 1.13.1 or later
Download the dataset from here. You will get a clips.zip file and a voice_dataset.json file: voice_dataset.json contains metadata about the dataset, and clips.zip contains the audio files. Unzip the archive:

```bash
unzip clips.zip
```
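Before preprocessing, you can sanity-check the metadata. A minimal sketch, assuming voice_dataset.json is a top-level JSON array of records (the exact schema is not documented here):

```python
# Inspect the dataset metadata (assumes a top-level JSON array; adjust if
# the file turns out to be structured differently).
import json

with open("voice_dataset.json", encoding="utf-8") as f:
    records = json.load(f)

print(f"{len(records)} records")
print(records[0])  # show the fields of one entry
```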
- Clone the repository:

```bash
git clone https://github.com/KamoliddinS/UzbekvoiceAsrTextToSpeechNemo.git
cd UzbekvoiceAsrTextToSpeechNemo
```
  Then download the pre-trained model from here, unzip it, and place it in the cloned directory.
- Set up a Conda environment:

```bash
conda create --name nemo_asr_uzbek python==3.10.12
conda activate nemo_asr_uzbek
```
- Install prerequisites:

```bash
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
```
- Install NeMo:

```bash
sudo apt-get update && sudo apt-get install -y libsndfile1 ffmpeg
pip install Cython
pip install nemo_toolkit['all']
```
Note: You might need to install additional dependencies based on your specific requirements.
- Install other dependencies:

```bash
pip install -r requirements.txt
```
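To confirm the environment is usable before moving on, a quick sanity check (a minimal sketch; it only assumes the packages above installed cleanly):

```python
# Verify that PyTorch sees the GPU and that NeMo's ASR collection imports.
import torch
import nemo.collections.asr as nemo_asr

print("CUDA available:", torch.cuda.is_available())
print("GPU:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "none")
```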
The following steps are required to train a speech-to-text model using the Uzbek voice dataset.
Script: clean_stage_1.py
- Input: voice_dataset.json
- Output: 1_stage_preprocessed_data.csv

Usage:

```bash
python clean_stage_1.py
```
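You can peek at the resulting CSV to confirm the cleaning ran; a minimal sketch using pandas (no column names are assumed, since the script's output schema isn't documented here):

```python
# Quick look at the first-stage preprocessing output.
import pandas as pd

df = pd.read_csv("1_stage_preprocessed_data.csv")
print(df.shape)             # rows x columns after cleaning
print(df.columns.tolist())  # available fields
print(df.head())            # first few cleaned records
```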
Script: pre_procecessing_auido.py
- Input: Folder path containing the audio files from the uzbekvoice dataset.
- Function: Converts .mp3 files to .wav format.

Usage:

```bash
python pre_procecessing_auido.py --folder_path /path/to/uzbekvoice/dataset
```
Note: Download the uzbekvoice dataset audio files and provide the path to the dataset.
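For reference, the conversion itself boils down to a few lines. A minimal sketch using pydub (an assumption: the actual script may use a different library), resampling to 16 kHz mono, which NeMo models typically expect:

```python
# Convert every .mp3 in a folder to 16 kHz mono .wav.
# pydub shells out to ffmpeg, which was installed earlier.
from pathlib import Path
from pydub import AudioSegment

folder = Path("/path/to/uzbekvoice/dataset")
for mp3_path in folder.glob("*.mp3"):
    audio = AudioSegment.from_mp3(mp3_path)
    audio = audio.set_frame_rate(16000).set_channels(1)
    audio.export(mp3_path.with_suffix(".wav"), format="wav")
```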
Script: levenshtein_clean.py
Usage:

```bash
python levenshtein_clean.py --input_csv 1_stage_preprocessed_data.csv --audio_files_dir /path/to/preprocessed/wav/files --output_csv output.csv --model_path /path/to/pretrained/model
```
Note:
- Download the pre-trained model from the provided link, unzip it, and place it in the repository's cloned directory.
- Provide the path to the folder of preprocessed .wav files.
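Conceptually, this step transcribes each clip with the pre-trained model and drops samples whose transcription is too far from the reference text. A minimal sketch of the distance calculation only (the script's exact logic and thresholds may differ):

```python
# Character error rate via Levenshtein distance -- the metric behind
# this filtering step (illustrative, not the script's exact code).
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    return levenshtein(reference, hypothesis) / max(len(reference), 1)

print(cer("salom dunyo", "salom dunya"))  # 1 edit / 11 chars ~= 0.09
```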
Script: nemo_asr_format.py
Usage:

```bash
python nemo_asr_format.py --csv_filepath output.csv --audio_files_path /path/to/audio/files --cer_threshold 0.18
```
Note: Provide the path to the audio files that were downloaded and preprocessed.
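The output manifests follow NeMo's standard format: one JSON object per line with audio_filepath, duration, and text fields. A minimal sketch of writing one (the rows list here is illustrative):

```python
# Write a NeMo-style JSON-lines manifest.
import json

rows = [  # illustrative; in practice these come from output.csv
    {"audio_filepath": "clips/0001.wav", "duration": 3.2, "text": "salom dunyo"},
]

with open("train.json", "w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row, ensure_ascii=False) + "\n")
```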
Script: train.py
Usage:

```bash
python train.py --train_json_path train.json --test_json_path test.json --model_name model_name --model_save_path /path/to/save/model --checkpoint True --num_epochs 10
```
Note:
- By default, nemo_asr_format.py outputs train.json and test.json.
- Provide the desired model name and the path where you want to save the trained model.
- The --checkpoint flag determines whether to evaluate the model or not.
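For orientation, fine-tuning with NeMo generally follows the pattern below. This is a minimal sketch, not necessarily what train.py does; the model class, config keys, and .nemo filename are assumptions:

```python
# Fine-tune a pre-trained NeMo CTC model on the prepared manifests
# (a sketch of the usual NeMo recipe, not train.py's exact code).
import pytorch_lightning as pl
import nemo.collections.asr as nemo_asr

model = nemo_asr.models.EncDecCTCModel.restore_from("pretrained_model.nemo")

common = dict(sample_rate=16000, labels=model.decoder.vocabulary, batch_size=16)
model.setup_training_data(dict(manifest_filepath="train.json", shuffle=True, **common))
model.setup_validation_data(dict(manifest_filepath="test.json", shuffle=False, **common))

trainer = pl.Trainer(max_epochs=10, accelerator="gpu", devices=1)
trainer.fit(model)
model.save_to("uzbek_asr_finetuned.nemo")
```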
By following the steps above, you can preprocess, clean, and train an Uzbek speech-to-text model using NVIDIA NeMo ASR. Ensure that all required datasets and pre-trained models are downloaded and placed in the appropriate directories before running the scripts.
If you'd like to contribute to this project, please fork the repository and submit a pull request.
This project is licensed under the MIT License. See the LICENSE file for details.