Speaker_Recognition

Recognizing a speaker from their voice by processing audio signals and extracting relevant features.

The Speaker Recognition Dataset from Kaggle was selected.

This dataset contains speeches by five prominent leaders: Benjamin Netanyahu, Jens Stoltenberg, Julia Gillard, Margaret Thatcher and Nelson Mandela. Their names are also the folder names.
Each audio file in a folder is a one-second, PCM-encoded clip with a 16,000 Hz sample rate.
The speeches were collected from American Rhetoric (an online speech bank).
Each label has 1,500 samples.
The size of the entire dataset is 7,500 samples.
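
The folder-per-speaker layout described above can be indexed with the standard library alone. This is a minimal sketch, not the repository's `dataloader.py`: the function names `index_dataset` and `describe_wav` are hypothetical, and it assumes the clips are stored as `.wav` files directly inside each speaker folder.

```python
import wave
from pathlib import Path


def index_dataset(root):
    """Return (class_names, samples), using folder names as labels.

    `samples` is a list of (wav_path, label_index) pairs.
    Assumes one sub-folder per speaker containing .wav clips.
    """
    root = Path(root)
    class_names = sorted(p.name for p in root.iterdir() if p.is_dir())
    samples = []
    for label, name in enumerate(class_names):
        for wav_path in sorted((root / name).glob("*.wav")):
            samples.append((wav_path, label))
    return class_names, samples


def describe_wav(path):
    """Return (sample_rate, num_frames, bytes_per_sample) for a PCM WAV file."""
    with wave.open(str(path), "rb") as wav:
        return wav.getframerate(), wav.getnframes(), wav.getsampwidth()
```

Calling `index_dataset` on the dataset root should yield five class names and 7,500 samples if the layout matches the description, and `describe_wav` on any clip should report a sample rate of 16,000 and 16,000 frames for a one-second clip.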

Classic spectrograms were generated using the torchaudio.transforms.Spectrogram() transform from the PyTorch audio library.
Training the ResNet-18 model on the resulting tensor dataset with the SGD optimizer and regularisation (dropout probability between 0.4 and 0.8, weight decay set to 0.01), a learning rate between 1e-3 and 1e-5, and 200 epochs gave 92.6% testing accuracy.

Mel spectrograms were generated using the torchaudio.transforms.MelSpectrogram() transform from the PyTorch audio library.
Training the ResNet-18 model on the resulting tensor dataset with the SGD optimizer and regularisation (dropout probability between 0.4 and 0.8, weight decay set to 0.01), a learning rate between 1e-3 and 1.5e-5, and 325 epochs gave 93.75% testing accuracy.

MFCC features are among the most commonly used features for speaker recognition.
MFCC features were generated using the torchaudio.transforms.MFCC() transform from the PyTorch audio library.
Once again, training the ResNet-18 model on the resulting tensor dataset with the SGD optimizer and regularisation (dropout probability between 0.4 and 0.8, weight decay set to 0.01), a learning rate between 1e-3 and 1e-4, and 100 epochs gave 95.3% testing accuracy.

