PyTorch implementation of Learning Filterbanks from Raw Speech for Phone Recognition (ICASSP 2018).
Time-Domain Filterbanks (TD-filterbanks) are neural network layers that operate directly on a raw audio waveform. At initialization, they approximate standard mel-filterbanks by computing first-order scattering coefficients. They can then be fine-tuned jointly with the rest of the architecture. The usual mel-filterbank options can also be enabled: a pre-emphasis layer, log compression of the coefficients, and mean-variance normalization.
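As a rough illustration of the three optional steps (pre-emphasis, log compression, mean-variance normalization), here is a minimal standard-library sketch. The function names, the `eps` offsets, and the handling of the first sample are assumptions made for this example; the layers in this repository may differ in these details.

```python
import math


def preemphasis(signal, alpha=0.97):
    """First-order high-pass filter: y[t] = x[t] - alpha * x[t-1].

    The first sample is passed through unchanged (an assumption of this sketch).
    """
    if not signal:
        return []
    return [signal[0]] + [
        signal[t] - alpha * signal[t - 1] for t in range(1, len(signal))
    ]


def log_compress(coeffs, eps=1.0):
    """Log compression of non-negative coefficients.

    eps is a hypothetical offset that keeps the logarithm finite at zero.
    """
    return [math.log(c + eps) for c in coeffs]


def mean_variance_normalize(coeffs, eps=1e-8):
    """Normalize one channel of one utterance to zero mean and unit variance."""
    n = len(coeffs)
    mean = sum(coeffs) / n
    std = math.sqrt(sum((c - mean) ** 2 for c in coeffs) / n)
    return [(c - mean) / (std + eps) for c in coeffs]
```

On a constant signal, pre-emphasis with alpha=0.97 leaves only 3% of the amplitude after the first sample, which is why it is often described as flattening the spectral tilt of speech.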
There are four different modes for TD-filterbanks:
- Fixed: Initialize the layers to match mel-filterbanks and keep their parameters fixed when training the model
- Learn-all: Initialize the layers and let the filterbank and the averaging be learned jointly with the model
- Learn-filterbank: Start from the mel-filterbank initialization and learn only the filterbank with the model, keeping the averaging fixed to a squared Hanning window
- Randinit: Initialize the layers randomly and learn them with the network
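The four modes differ only in whether the layers start from the mel-filterbank initialization and in which parts are updated during training. The table below is a hypothetical summary written for this README, not code from the repository; the dictionary keys and the `describe_mode` helper are illustrative names.

```python
# Hypothetical summary of the four modes described above.
# init_from_mel:    layers start as an approximation of mel-filterbanks
# learn_filterbank: the complex-valued convolution is trained
# learn_averaging:  the averaging (lowpass) convolution is trained
MODES = {
    "fixed": dict(init_from_mel=True, learn_filterbank=False, learn_averaging=False),
    "learnall": dict(init_from_mel=True, learn_filterbank=True, learn_averaging=True),
    "learnfbanks": dict(init_from_mel=True, learn_filterbank=True, learn_averaging=False),
    "randinit": dict(init_from_mel=False, learn_filterbank=True, learn_averaging=True),
}


def describe_mode(mode):
    """Return the settings for a mode name, or raise ValueError if unknown."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    return MODES[mode]
```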
Time-Domain Filterbanks are a neural architecture composed of a complex-valued convolution, a modulus operator, and a grouped real-valued convolution. This structure follows the computation of first-order scattering coefficients. A layer is created by instantiating the class TDFbanks:
import melfilters
import utils
import model
# Main parameters
layer_params = dict(mode='fixed',       # type of td-fbanks (fixed, learnall, learnfbanks)
                    nfilters=40,        # number of filters
                    samplerate=16000,   # sample rate of the waveform (in Hz)
                    wlen=25,            # length of the window (in milliseconds)
                    wstride=10,         # stride of the window (in milliseconds)
                    compression='log',  # compression of coefficients (log or None)
                    preemp=True,        # add a pre-emphasis layer below the td-fbanks
                    mvn=True)           # per-utterance mean-variance normalization of the coefficients
tdfbanks = model.TDFbanks(**layer_params)
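To make the three stages concrete (complex-valued convolution, modulus, averaging), here is a dependency-free sketch of a single channel. It is not the TDFbanks implementation: the Gabor filter, its envelope width, the use of the squared modulus, and all function names are assumptions of this example. The averaging uses a squared Hanning window, and "grouped" here simply means each channel is averaged independently of the others.

```python
import cmath
import math


def ms_to_samples(duration_ms, samplerate):
    """Convert a duration in milliseconds to a number of samples."""
    return samplerate * duration_ms // 1000


def gabor_filter(center_freq, length, samplerate):
    """Complex sinusoid at center_freq under a Gaussian envelope.

    The envelope width (length / 6) is an arbitrary choice for this sketch.
    """
    sigma = length / 6.0
    half = (length - 1) / 2.0
    return [
        math.exp(-((n - half) ** 2) / (2 * sigma ** 2))
        * cmath.exp(2j * math.pi * center_freq * (n - half) / samplerate)
        for n in range(length)
    ]


def squared_hanning(length):
    """Squared Hanning window used as the averaging (lowpass) filter."""
    return [
        (0.5 - 0.5 * math.cos(2 * math.pi * n / (length - 1))) ** 2
        for n in range(length)
    ]


def td_filterbank_channel(signal, center_freq, samplerate, wlen_ms=25, wstride_ms=10):
    """One output channel: complex filtering, squared modulus, strided averaging."""
    wlen = ms_to_samples(wlen_ms, samplerate)
    stride = ms_to_samples(wstride_ms, samplerate)
    filt = gabor_filter(center_freq, wlen, samplerate)
    # 1) Complex-valued convolution with stride 1 (a cross-correlation,
    #    as in neural network convolution layers).
    filtered = [
        sum(signal[t + k] * filt[k] for k in range(wlen))
        for t in range(len(signal) - wlen + 1)
    ]
    # 2) Squared modulus of each complex response.
    energy = [abs(z) ** 2 for z in filtered]
    # 3) Averaging with the squared Hanning window, decimated by the stride.
    window = squared_hanning(wlen)
    return [
        sum(energy[t + k] * window[k] for k in range(wlen))
        for t in range(0, len(energy) - wlen + 1, stride)
    ]
```

With samplerate=16000, a 25 ms window spans 400 samples and a 10 ms stride spans 160 samples under this conversion. A channel centered on the frequency of a pure tone responds far more strongly than a channel centered elsewhere, which is the behavior a mel-filterbank channel is expected to show.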
When a TDFbanks layer is created, the weights of its convolutional layers are initialized randomly. With mode="learnall" and no further initialization, this corresponds to the randinit type of TD-filterbanks. To initialize the layers so that they match standard mel-filterbanks instead:
# Initialization parameters
init_params = dict(min_freq=0,              # minimum frequency spanned by the filters
                   max_freq=8000,           # maximum frequency spanned by the filters
                   nfft=512,                # number of frequency bins for the mel-filterbanks to replicate
                   window_type='hamming',   # windowing function
                   normalize_energy=False,  # replicate mel-filterbanks normalized in energy (True) or that peak at 1 (False)
                   alpha=0.97)              # pre-emphasis parameter
tdfbanks.initialize(**init_params)
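The min_freq, max_freq and number of filters determine where the mel filters sit on the frequency axis. As a hedged illustration, the sketch below computes mel-spaced center frequencies with the common HTK-style mel formula; the function names are hypothetical, and the melfilters module in this repository may use a different mel formula or a different placement rule.

```python
import math


def hz_to_mel(freq_hz):
    """HTK-style conversion from Hz to mel."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_center_frequencies(nfilters=40, min_freq=0.0, max_freq=8000.0):
    """Center frequencies (in Hz) of nfilters filters equally spaced in mel.

    nfilters + 2 equally spaced mel points are laid out between min_freq and
    max_freq; the interior points are the filter centers.
    """
    low, high = hz_to_mel(min_freq), hz_to_mel(max_freq)
    step = (high - low) / (nfilters + 1)
    return [mel_to_hz(low + i * step) for i in range(1, nfilters + 1)]
```

The resulting centers are densely packed at low frequencies and spread out at high frequencies, which is the resolution pattern the initialization aims to reproduce.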
To get the code, simply clone the repository:
git clone https://github.com/facebookresearch/tdfbanks.git
cd tdfbanks
If you find this code useful, please consider citing:
Learning Filterbanks from Raw Speech for Phone Recognition - N. Zeghidour, N. Usunier, I. Kokkinos, T. Schatz, G. Synnaeve, E. Dupoux
@inproceedings{zeghidour2017learning,
title={Learning Filterbanks from Raw Speech for Phone Recognition},
author={Zeghidour, Neil and Usunier, Nicolas and Kokkinos, Iasonas and Schatz, Thomas and Synnaeve, Gabriel and Dupoux, Emmanuel},
booktitle={Acoustics, Speech and Signal Processing (ICASSP), 2018 IEEE International Conference on},
year={2018},
organization={IEEE}
}
Contact: [email protected]