Factorized-TDNN

PyTorch implementation of the Factorized TDNN (TDNN-F) from "Semi-Orthogonal Low-Rank Matrix Factorization for Deep Neural Networks"[1]. This is also known as TDNN-F in nnet3 of Kaldi.

Taken from [1]

A TDNN-F layer is implemented in the class FTDNNLayer of models.py. To be specific to the description in [1], it is an implementation of the "3-stage splicing" implementation, in which three convolutions are used in sequence, with the first two being constrained to be semi-orthogonal. These convolutions are followed by a ReLU and then BatchNorm layer. The semi-orthogonal constraint is the "floating case" in [1]. (TODO: implement the scaled case like in Kaldi)

Usage

`FTDNNLayer`

This FTDNNLayer of models.py is used as follows:

import torch
from models import FTDNNLayer, SOrthConv

tdnn_f = FTDNNLayer(1280, 512, 256, context_size=2, dilations=[2,2,2], paddings=[1,1,1])
# This is a sequence of three 2x1 convolutions
# dimensions go from 1280 -> 256 -> 256 -> 512
# dilations and paddings handles how much to dilate and pad each convolution
# Having these configurable is to ensure the sequence length stays the same

test_input = torch.rand(5, 100, 1280)
# inputs to the FTDNNLayer must be (batch_size, seq_len, in_dim)

tdnn_f(test_input).shape # returns (5, 100, 512)

tdnn_f.step_semi_orth() # The key method to constrain the first two convolutions, perform after every SGD step

tdnn_f.orth_error() # This returns the orth error of the constrained convs, useful for debugging

`SOrthConv`

The components of FTDNNLayer which have the semi-orthogonal constraint are based around the class SOrthConv, which is essentially a nn.Conv1d with a .step_semi_orth() method to perform the semi-orthogonal update as in [1].

sorth_conv = SOrthConv(in_channels, out_channels, kernel_size, stride=1, padding=0, dilation=1, padding_mode='zeros')

The implementation of the .step_semi_orth() method has been made to be as close to ConstrainOrthonormalInternal from nnet-utils.cc in Kaldi's nnet3 module.

Extras

Also included in this repo in models.py is the following:

FTDNN: Factorized TDNN x-vector architecture (FTDNN) up to the embedding layer seen in "State-of-the-art speaker recognition with neural network embeddings in NIST SRE18 and Speakers in the Wild evaluations"[2]. (This is not EXACTLY the same, but should be close enough).
SharedDimScaleDropout: The shared dimension scaled dropout described in [1] and in Kaldi:
- Instead of randomly setting inputs to 0, use a continuous dropout scale.
- For a dropout 'strength' alpha, multiply inputs inputs by a mask sampled from the uniform distribution on the interval [1 - 2 * alpha, 1 + 2 * alpha].
- Share dropout masks along a dimension, such as time. From [1]: "If, for instance, a dimension is zeroed on a particular frame it will be zeroed on all frames of that sequence".

The FTDNN x-vector architecture description taken from [2]. Up until layer 12 is implemented in FTDNN in models.py.

Demo [WIP]

An demonstration of the FTDNN model being trained can be seen in the following output log (code not included, TODO: basic experiment demo):

exp/sp_ftdnn_bl: Wed Nov 20 14:21:15 2019: [10/120000]   C-Loss:21.9116, AvgLoss:21.6991, lr: 0.2, bs: 400
Orth error: 22.44341427081963
exp/sp_ftdnn_bl: Wed Nov 20 14:21:29 2019: [20/120000]   C-Loss:21.6260, AvgLoss:21.7459, lr: 0.2, bs: 400
Orth error: 8.235212338215206
exp/sp_ftdnn_bl: Wed Nov 20 14:21:43 2019: [30/120000]   C-Loss:21.7663, AvgLoss:21.7525, lr: 0.2, bs: 400
Orth error: 1.2611256236341433
exp/sp_ftdnn_bl: Wed Nov 20 14:21:56 2019: [40/120000]   C-Loss:21.6153, AvgLoss:21.6527, lr: 0.2, bs: 400
Orth error: 0.005309408872562926
exp/sp_ftdnn_bl: Wed Nov 20 14:22:14 2019: [50/120000]   C-Loss:21.0997, AvgLoss:21.5722, lr: 0.2, bs: 400
Orth error: 0.005543942232179688
exp/sp_ftdnn_bl: Wed Nov 20 14:22:26 2019: [60/120000]   C-Loss:21.2629, AvgLoss:21.5222, lr: 0.2, bs: 400
Orth error: 0.004769200691953301
exp/sp_ftdnn_bl: Wed Nov 20 14:22:40 2019: [70/120000]   C-Loss:20.9551, AvgLoss:21.4158, lr: 0.2, bs: 400
Orth error: 0.006055477493646322
exp/sp_ftdnn_bl: Wed Nov 20 14:22:56 2019: [80/120000]   C-Loss:20.4425, AvgLoss:21.3274, lr: 0.2, bs: 400
Orth error: 0.009634702852054033
exp/sp_ftdnn_bl: Wed Nov 20 14:23:09 2019: [90/120000]   C-Loss:21.0025, AvgLoss:21.2727, lr: 0.2, bs: 400
Orth error: 0.00611297079740325
exp/sp_ftdnn_bl: Wed Nov 20 14:23:25 2019: [100/120000]          C-Loss:20.6145, AvgLoss:21.1736, lr: 0.2, bs: 400
Orth error: 0.008151484609697945
exp/sp_ftdnn_bl: Wed Nov 20 14:23:38 2019: [110/120000]          C-Loss:20.1985, AvgLoss:21.0890, lr: 0.2, bs: 400
Orth error: 0.0072971017434610985
exp/sp_ftdnn_bl: Wed Nov 20 14:23:53 2019: [120/120000]          C-Loss:20.5698, AvgLoss:21.0300, lr: 0.2, bs: 400
Orth error: 0.00629939052669215
exp/sp_ftdnn_bl: Wed Nov 20 14:24:08 2019: [130/120000]          C-Loss:20.2024, AvgLoss:20.9425, lr: 0.2, bs: 400
Orth error: 0.008707787481398555
exp/sp_ftdnn_bl: Wed Nov 20 14:24:21 2019: [140/120000]          C-Loss:19.7034, AvgLoss:20.8641, lr: 0.2, bs: 400
Orth error: 0.010941843771433923
exp/sp_ftdnn_bl: Wed Nov 20 14:24:37 2019: [150/120000]          C-Loss:19.9718, AvgLoss:20.8035, lr: 0.2, bs: 400
Orth error: 0.00768740743296803

The FTDNN x-vector architecture seems to train successfully, and most importantly the Orth error is minimized.

TODOs

Implement 'scaled' case of semi-orthogonal constraint
Refactor so that seq_len is final dim (or not?)
Simple experiment/toy demo

References

[1]
@inproceedings{Povey2018,
  author={Daniel Povey and Gaofeng Cheng and Yiming Wang and Ke Li and Hainan Xu and Mahsa Yarmohammadi and Sanjeev Khudanpur},
  title={Semi-Orthogonal Low-Rank Matrix Factorization for Deep Neural Networks},
  year=2018,
  booktitle={Proc. Interspeech 2018},
  pages={3743--3747},
  doi={10.21437/Interspeech.2018-1417},
  url={http://dx.doi.org/10.21437/Interspeech.2018-1417}
}

[2]
@article{VILLALBA2020101026,
    title = "State-of-the-art speaker recognition with neural network embeddings in NIST SRE18 and Speakers in the Wild evaluations",
    journal = "Computer Speech & Language",
    volume = "60",
    pages = "101026",
    year = "2020",
    issn = "0885-2308",
    doi = "https://doi.org/10.1016/j.csl.2019.101026",
    url = "http://www.sciencedirect.com/science/article/pii/S0885230819302700",
    author = "Jesús Villalba and Nanxin Chen and David Snyder and Daniel Garcia-Romero and Alan McCree and Gregory Sell and Jonas Borgstrom and Leibny Paola García-Perera and Fred Richardson and Réda Dehak and Pedro A. Torres-Carrasquillo and Najim Dehak"
}

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
figures		figures
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
models.py		models.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Factorized-TDNN

Usage

`FTDNNLayer`

`SOrthConv`

Extras

Demo [WIP]

TODOs

References

About

Releases

Packages

Languages

License

cvqluu/Factorized-TDNN

Folders and files

Latest commit

History

Repository files navigation

Factorized-TDNN

Usage

FTDNNLayer

SOrthConv

Extras

Demo [WIP]

TODOs

References

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

`FTDNNLayer`

`SOrthConv`

Packages