Rapid Voice-Based Biometric Authentication
Lead: @lee-junseok.
- Describe benchmarks and metrics, comparing against the state of the art.
- Tie benchmarks into future work and how the metrics can be improved.
- Collect and/or source an open-source benchmarking dataset (for noisy voice identification).
- Capture video of Membrane in action and pretty-up the repo with better structure and a project icon/logo.
OMIC.ai has a web-based biomedical/AI platform used for open-source COVID-19 research. Users currently log in via magic link, but the company would like to add an option where the user is identified by voice through the browser. Ideally, the company would generate a phrase for the user to speak into a laptop/PC microphone, at which point the user would be rapidly authenticated and granted access.
Background noise has been one of the main challenges for voice authentication systems. With added denoising engineering and training on synthetic data generated using both background-noise augmentation and denoising, Membrane 2.0 is much more robust to ambient sound than current models.
For example: recall improvement on a test dataset with loud background noise (signal-to-noise ratio ~ N(mean = 5, std = 2.5)).
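Background-noise augmentation at a sampled SNR can be sketched as follows. This is a minimal illustration with assumed file names and libraries (numpy, soundfile), not the actual Membrane training code:

```python
# Minimal sketch of background-noise augmentation at a sampled SNR.
# Assumptions: clean.wav and noise.wav are mono files at the same
# sample rate; numpy and soundfile are installed.
import numpy as np
import soundfile as sf

def mix_at_snr(clean, noise, snr_db):
    """Mix `noise` into `clean` so the result has the requested SNR (dB)."""
    if len(noise) < len(clean):  # tile the noise to cover the clean clip
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    noise = noise[: len(clean)]
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale noise so that 10*log10(clean_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise

clean, sr = sf.read("clean.wav")
noise, _ = sf.read("noise.wav")
snr_db = np.random.normal(loc=5.0, scale=2.5)  # SNR ~ N(5, 2.5), as above
sf.write("noisy.wav", mix_at_snr(clean, noise, snr_db), sr)
```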
Slides for the project can be found here: Project Slides
Contributor: Jun Seok Lee.
jun is the default public branch; master will not be updated until the project becomes public.
Deployed testing environment: http://membrane.insightai.me/
For stability, this demo only accepts an existing audio file on your device.
An overview of the Membrane 2.0 pipeline:

Related open-source projects and services:
- https://github.com/CorentinJ/Real-Time-Voice-Cloning
- https://github.com/mozilla/DeepSpeech
- https://github.com/Janghyun1230/Speaker_Verification
- https://github.com/Dheeraj2444/speaker-identification
- https://github.com/wq2012/awesome-diarization
- https://voiceit.io/
- https://github.com/resemble-ai/Resemblyzer
- https://github.com/timsainb/noisereduce
There is a plethora of publicly available voice identification and transcription datasets, including FSDD, VoxCeleb, Common Voice, LibriSpeech, etc. These existing datasets require transfer learning and (optimistically) need to go through various real-world transformations to match production environments, primarily non-linear speech cadence (stuttering, filler words) and static/dynamic environmental background noise. In total, the datasets amount to over 2 TB.
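As one illustration of such transformations, the sketch below applies a tempo perturbation (for non-linear cadence) and a synthetic noise floor to a clean utterance. The file names, parameter ranges, and use of librosa/soundfile are assumptions, not the project's actual augmentation pipeline:

```python
# Sketch of "real-world" transformations on a clean utterance:
# tempo perturbation for non-linear cadence plus an additive noise floor.
import librosa
import numpy as np
import soundfile as sf

y, sr = librosa.load("utterance.wav", sr=16000)     # placeholder file name
rate = np.random.uniform(0.85, 1.15)                # placeholder cadence range
y_stretched = librosa.effects.time_stretch(y, rate=rate)
noise = np.random.randn(len(y_stretched)) * 0.01    # static noise floor
sf.write("utterance_aug.wav", y_stretched + noise, sr)
```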
- An audio-visual dataset consisting of short clips (~5 s) of human speech, extracted from interview videos uploaded to YouTube.
  - VoxCeleb1: 1,251 speakers, 153,516 utterances, ~45 GB
  - VoxCeleb2: 6,112 speakers, 1,128,246 utterances, ~65 GB
Example download URLs:
http://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox1a/vox1_dev_wav_partaa
http://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox1a/vox2_dev_aac_partah

Download each archive with your registration credentials, e.g.:

wget --user=XX --password=YY http://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox1a/vox1_dev_wav_partaa

repeating for each of:
/vox1_dev_wav_partaa
...
/vox1_dev_wav_partah
/vox1_test_wav.zip
/vox2_dev_aac_partaa
...
/vox2_dev_aac_partah
/vox2_test_aac.zip
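A small sketch of automating those downloads and reassembling the split archives in Python. It assumes the URL pattern and credentials shown above, the requests library, and that the split dev parts concatenate into a single zip (the equivalent of cat vox1_dev_wav_parta* > vox1_dev_wav.zip):

```python
# Download all VoxCeleb1 dev parts and reassemble them into one zip.
import shutil
import requests

BASE = "http://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox1a"
AUTH = ("XX", "YY")  # credentials issued at VoxCeleb registration

parts = [f"vox1_dev_wav_parta{c}" for c in "abcdefgh"]  # partaa .. partah

for name in parts:
    r = requests.get(f"{BASE}/{name}", auth=AUTH, stream=True)
    r.raise_for_status()
    with open(name, "wb") as f:
        for chunk in r.iter_content(chunk_size=1 << 20):
            f.write(chunk)

# Concatenate the parts into a single archive, then unzip it.
with open("vox1_dev_wav.zip", "wb") as out:
    for name in parts:
        with open(name, "rb") as f:
            shutil.copyfileobj(f, out)
```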
- A free American English corpus by Surfingtech (www.surfing.ai), containing utterances from 10 speakers; each speaker has about 350 utterances.
Source: https://www.openslr.org/45
Download (351M)
- A large-scale corpus of 16 kHz read English speech, prepared by Vassil Panayotov with the assistance of Daniel Povey. The data is derived from read audiobooks from the LibriVox project and has been carefully segmented and aligned.
Source: http://www.openslr.org/12
train-other-500.tar.gz (30G), test-other.tar.gz (328M)
- Private end-user recordings in various environments (e.g., cafe, street, wind, near a TV).
The data provided for the development of this model is highly protected; access to this data is permissible only after signing our Nondisclosure Agreement, and under no circumstances may it be distributed outside of Omic, Inc.
Software developed on top of this data, however, is openly shareable, so long as it does not pose a risk of violating the above data policy.
tl;dr: don't share the data, and be mindful when sharing how you interface with it.
- model/toy_modelv1/: path for a toy model.
- model/modelv2/: main Membrane 2.0 model modules and its earlier model (VBBA).
- membrane.py: main Membrane 2.0 script to run.
- Final-stage code will not be updated until the project becomes public.
- Final weights for modelv2 (VBBA, Membrane) will not be available until the project becomes public.
We would like a voiceprint authentication model that isolates the user's voice based on the prescribed phrase and then authenticates the user based on a voice match to the training phrase.
As an example, we could provide several training phrases for the user to speak in our application through the laptop microphone. We would record these phrases, then provide a new login phrase for them to read and log in. The person’s voice would need to be isolated from background noise.
The model would score the new login phrase, identifying the enrolled person of closest match and reporting false positive/negative (precision/recall) data for understanding false positive and false negative rates. The goal is for these scores to be at least as good as those of existing, commonly used voice recognition models.
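As a concrete illustration of such scoring, here is a minimal cosine-distance sketch. It assumes each speaker is summarized by a fixed-size embedding vector (e.g., from Resemblyzer, listed above); the function names, threshold, and trial format are illustrative, not Membrane's actual implementation:

```python
# Illustrative threshold-based voiceprint scoring over speaker embeddings.
import numpy as np

def cosine_distance(a, b):
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def identify(login_embedding, enrolled, threshold=0.1):
    """Return the closest enrolled speaker, or None if no one is close enough."""
    best_name, best_dist = None, float("inf")
    for name, emb in enrolled.items():
        d = cosine_distance(login_embedding, emb)
        if d < best_dist:
            best_name, best_dist = name, d
    return (best_name, best_dist) if best_dist <= threshold else (None, best_dist)

def fp_fn_rates(trials, enrolled, threshold=0.1):
    """False positive/negative rates over labeled trials.

    `trials` is a list of (embedding, true_name) pairs; true_name is None
    for impostors who were never enrolled.
    """
    fp = fn = pos = neg = 0
    for emb, true_name in trials:
        pred, _ = identify(emb, enrolled, threshold)
        if true_name is None:
            neg += 1
            fp += pred is not None   # impostor accepted
        else:
            pos += 1
            fn += pred != true_name  # genuine user rejected or misidentified
    return fp / max(neg, 1), fn / max(pos, 1)
```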
While final weights will not be available until the project becomes public, intermediate weights are provided here for model testing:
- toy_modelv1: Download to model/toy_modelv1/data/weights/
- modelv2: Download to model/modelv2/checkpoints/
Note: if you run into an audio library/dependency error while running the model, please also follow the Dependencies section.
model/toy_modelv1/
Model pre-trained on the VoxCeleb1 dataset.
- Install python3.7.7 and the required packages, e.g.
conda create -n membrane_toy python=3.7.7 pip
python -m pip install -r requirements.txt
- To run (in the directory model/toy_modelv1/):
python3 verification.py verify --input [input file] --test [test file] --metric [metric function (default:'cosine')] --threshold [threshold of metric function for verification (default:0.1)]
An example:
python3 verification.py verify --input data/enroll/19-enroll.wav --test data/test/19.wav --metric 'cosine' --threshold 0.1
- Results will be stored in res/results.csv. Each line has the format: [input file name], [test file name], [metric function], [distance], [threshold], [verified?]
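For quick aggregation across runs, the CSV can be loaded as follows; a small sketch assuming the column order documented above and no header row:

```python
# Load res/results.csv, assuming the documented column order and no header.
import pandas as pd

cols = ["input_file", "test_file", "metric", "distance", "threshold", "verified"]
results = pd.read_csv("res/results.csv", names=cols, header=None)
print(results.tail())
print(results["verified"].value_counts())
```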
model/modelv2/
Voice-Based Biometric Authentication (VBBA): a repurposed voice verification model based on the toy model (pre-trained VGGVox1).
- Install python3.7.7 and the required packages
conda create -n membrane python=3.7.7 pip
python -m pip install -r requirements.txt
- Extract training data into /wav_train_subset.
- Move a subset of the test users' data into /wav_test_subset.
- To run (in the directory model/modelv2/):
python3 VBBA.py (optional arguments)
Optional arguments (example invocations follow the list):
-h, --help show this help message and exit
-l, --list-current-users
Show current enrolled users
-e, --enroll Enroll a new user
-v, --verify Verify a user from the ones in the database
-i, --identify Identify a user
-d, --delete Delete user from database
-c, --clear Clear Database
-u, --username USERNAME
-f, --with-file Provide a recording file rather than record
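For example, to enroll a user from a recording and then verify them (the username and file path are placeholders, and it is assumed here that the flags combine as shown):

```
python3 VBBA.py --enroll --username alice --with-file wav_test_subset/alice_01.wav
python3 VBBA.py --verify --username alice
python3 VBBA.py --list-current-users
```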
- Results will be stored in speaker_models.pkl.
- More details in model/modelv2/README.md.
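Since the pickle's internal structure is not documented here, a safe way to peek at the stored speaker models is to inspect only its top level; a minimal sketch:

```python
# Inspect speaker_models.pkl without assuming its internal structure.
import pickle

with open("speaker_models.pkl", "rb") as f:
    models = pickle.load(f)

print(type(models))
if isinstance(models, dict):
    # If it is a dict, it presumably maps usernames to speaker models.
    print("enrolled users:", list(models.keys()))
```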
Command Line Model
Modules in: model/modelv2/
Main Script: membrane.py
Speech-to-Text Identification and Voice-Based Biometric Authentication with Streaming Audio
- Install python3.7.7 and the required packages
conda create -n membrane python=3.7.7 pip
python -m pip install -r requirements.txt
- For hyperparameters and global constants, check utils.py.
- Download model weights:
  - model/modelv2/checkpoints/: Download
  - model/modelv2/deepspeech_data/: Download1, Download2 (please download both)
- To run (in the root directory):
python3 membrane.py
Instructions follow in the user's terminal (see the example session at the end of this section), e.g.
--"Please type 'enroll' or 'e' to enroll a new user, type 'verify' or 'v' to verify an enrolled user:"
- Public mode (shown in the terminal instructions):
  - enroll or e: enroll a user (command-line input follows) -- "Please type your username:"
  - verify or v: verify a user
- Administrator mode (not shown in the terminal instructions):
  - delete or d: delete a user (command-line input follows) -- "Please type username to delete:"
  - clear or c: clear all users
  - file or f: use an existing file instead of the streaming recorder (command-line input follows) -- 'Please input file path:'
- Results will be stored in speaker_models.pkl and speaker_phrases.pkl.
- More details in model/modelv2/README.md.
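An illustrative session, stitched together from the prompts documented above (the username, responses, and elided output are examples, not captured logs):

```
$ python3 membrane.py
Please type 'enroll' or 'e' to enroll a new user, type 'verify' or 'v' to verify an enrolled user: e
Please type your username: alice
...
$ python3 membrane.py
Please type 'enroll' or 'e' to enroll a new user, type 'verify' or 'v' to verify an enrolled user: v
...
```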
If you run into an audio library/dependency error, please try this (Linux):
apt-get update && \
apt-get install -qq -y gcc make \
apt-transport-https ca-certificates build-essential \
libasound-dev portaudio19-dev libportaudio2 libportaudiocpp0 libav-tools alsa-utils
I thank Alok Deshpande for reviewing the code and for useful suggestions.
For questions or comments, feel free to reach out to [email protected]. Thank you!