An ASR (Automatic Speech Recognition) service based on Kaldi. The model used for decoding is trained by CVTE. This service decodes Mandarin Chinese speech into Chinese text.

This service aims to help you build a dockerized ASR app free of trouble: just follow the steps and wait for the downloading and compiling.

The server runs on Ubuntu 16.04 and similar Linux systems, and needs a large memory capacity (>= 64 GB).
Install docker-ce >= 18.06.0. If you already have docker installed, just:

```
cd AsrService
mkdir Kaldi/Bin
chmod +x run.sh
./run.sh deploy
```

The build will start. It pulls some images such as `microsoft/dotnet:aspnetcore-runtime` and `microsoft/dotnet:sdk`.
After the build completes, just quit; `run.sh` will try to run the service and fail, because the remaining steps are not finished yet.

The build compiles kaldi first, which may take a long time and can fail if there is not enough memory. After kaldi echoes `done`, your terminal may lag for a few minutes, but this does not matter; just sit tight. After kaldi is built successfully, it downloads some other, lighter dependencies. Then it compiles the ASP.NET app serving as the web interface, as described below, and finally it builds the decoder.
Download the CVTE model from http://kaldi-asr.org/models/m2.
If you have downloaded the CVTE model, its directory tree looks like this:

```
cvte
└── s5
    ├── conf
    ├── data
    │   ├── fbank
    │   │   └── test
    │   │       └── split1
    │   │           └── 1
    │   └── wav
    │       └── 00030
    ├── exp
    │   └── chain
    │       └── tdnn
    │           ├── decode_test
    │           │   ├── log
    │           │   └── scoring_kaldi
    │           │       ├── log
    │           │       ├── penalty_0.0
    │           │       │   └── log
    │           │       ├── penalty_0.5
    │           │       │   └── log
    │           │       ├── penalty_1.0
    │           │       │   └── log
    │           │       └── wer_details
    │           └── graph
    ├── fbank
    │   └── test
    └── local
```
Now copy or link `cvte` to `AsrService/Tools/cvte`, then everything is done. Note that `Tools` contains `cvte` directly.
Only one step is left:

```
./run.sh deploy
```

Wait with patience. When `Service is ready` is shown, the model is set up. Then you can reach the server remotely with the REST API described below.
If you need to debug, use

```
./run.sh interact
```

and it will let you play with the docker container interactively. To start the service from inside the container, run `/app/init.sh`.
This service takes in speech in *.wav format, which must be exactly 16 kHz and 16-bit.

URL: `POST <host>/Decoding/Decode`

The server accepts POST requests with a field `wave` in the body, carrying a wave file meeting the standard above.
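For illustration, here is a minimal client sketch using Python's `requests` library. The host, port, and file name are placeholders; substitute your deployment's address:

```python
# Minimal client sketch; host/port and file name are placeholders.
import requests

with open("your.wav", "rb") as f:  # must be 16 kHz, 16-bit PCM
    resp = requests.post(
        "http://localhost:5000/Decoding/Decode",
        files={"wave": ("your.wav", f, "audio/x-wav")},
    )
print(resp.json())
```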
The server sends responses in the following format:

```json
{
    "Id": "6b239c12-a5dc-47f8-a17a-c31a6e34bb7e",
    "Text": "关于黄款汇率希望大家不要被误导当然这个火鸡的答案并不对",
    "Message": "Ok"
}
```
Check whether `response['Message'] == 'Ok'`; if not, there was a failure.

The first field, `Id`, is the id generated for the speech (a GUID), with which we can do some tracing in the log. The second field, `Text`, is the Chinese text of the given speech. The third field, `Message`, carries an error description if the decoding failed.
For example:

```json
{
    "Id": "eae8e7cc-55a7-4404-a44f-411452da5dd3",
    "Text": "",
    "Message": "the wav file is broken or not supported\nbad wave\n"
}
```
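Continuing the client sketch above, a caller would branch on the `Message` field:

```python
# `resp` is the response object from the client sketch above.
result = resp.json()
if result["Message"] == "Ok":
    print(result["Text"])  # the transcript
else:
    # `Id` helps locate the failure in the service logs.
    print("decoding failed:", result["Id"], result["Message"])
```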
To test the API, use:

```
./test.sh
```

This runs `APITest/autoapi.py` with different parameters. If it keeps failing, consider modifying some configurations to make the service faster.
An expected result is:
```
python3 APITest/autoapi.py single_normal
{
"Id": "35254669-40ed-44b2-9ef6-9bd333a20c20",
"Text": "关于黄款汇率希望大家不要被误导当然这个火鸡的答案并不对",
"Message": "Ok"
}
python3 APITest/autoapi.py parallel
{
"Id": "7d27495a-7f3e-4db5-9b4b-c42213536cbc",
"Text": "关于与还款汇率希望大家不要被误导当然这个火鸡的答案并不对",
"Message": "Ok"
}
{
"Id": "577c152b-ef50-4c52-87a8-1d03e690b35a",
"Text": "关于还款汇率希望大家不要被误导当然这个火鸡的答案并不对",
"Message": "Ok"
}
{
"Id": "b6d4dfa4-175c-434a-942a-1b6654c2c4d3",
"Text": "关于黄款汇率希望大家不要被误导当然这个火鸡的答案并不对",
"Message": "Ok"
}
{
"Id": "8481b5cc-46e1-41f9-9916-59528181eda1",
"Text": "关于黄款汇率希望大家不要被误导当然这个火鸡的答案并不对",
"Message": "Ok"
}
{
"Id": "78b6de10-b0e1-48b5-9ea7-b205cacb1916",
"Text": "关于黄款汇率希望大家不要被误导当然这个火鸡的答案并不对",
"Message": "Ok"
}
{
"Id": "90fe65fb-7cc1-49ec-8cc1-bd546e1ee287",
"Text": "关于黄款汇率希望大家不要被误导当然这个火鸡的答案并不对",
"Message": "Ok"
}
{
"Id": "b625ba39-c257-404d-b6ef-45f46aad87f0",
"Text": "关于黄款汇率希望大家不要被误导当然这个火鸡的答案并不对",
"Message": "Ok"
}
{
"Id": "352b8e49-b97e-4b12-afdc-88b928fb07e1",
"Text": "关于黄款汇率希望大家不要被误导当然这歌火系的答案并不对",
"Message": "Ok"
}
{
"Id": "2bfa4f8e-c174-43d8-9985-9c7043cada57",
"Text": "关于黄款汇率希望大家不要被误导当然这个火鸡的答案并不对",
"Message": "Ok"
}
{
"Id": "a0175f9d-11fe-425e-a4b2-1766e6c9b50f",
"Text": "关于黄款汇率希望大家不要被误导当然这个火鸡的答案并不对",
"Message": "Ok"
}
avg rtt: 5.822241s
python3 APITest/autoapi.py empty
{
"Id": "eae8e7cc-55a7-4404-a44f-411452da5dd3",
"Text": "",
"Message": "the wav file is broken or not supported\nbad wave\n"
}
python3 APITest/autoapi.py bad_url
{
"Id": "",
"Text": "",
"Message": "Usage: POST <host_name>/Decoding/Decode\nContent-Disposition: form-data; name=\"wave\"; filename=\"your.wav\"\nContent-Type: audio/x-wav\n\n<binary_contents>"
}
python3 APITest/autoapi.py big
{
"Id": "59171039-555d-4dfb-bb1e-f10ea3914356",
"Text": "",
"Message": "The wave file should be smaller than 1000000 bytes"
}
```
This server comprises three main parts: (1) an ASP.NET web interface, (2) a message queue implemented with redis, and (3) a decoder implemented with kaldi. (1) is implemented in `AsrService/Service` and `AsrService/Core` with .NET Core 2.0 in C#. (3) is implemented in `AsrService/Kaldi` in C++, calling a Kaldi API I packed myself in https://github.com/hanayashiki/kaldi.
- The web interface accepts POST requests as described above and sends a `Speech` message to the message queue. In fact the message queue consists of two redis lists. `SpeechQueue` is for the web interface to place `Speech`, and the decoder fetches `Speech` from `SpeechQueue`. Every time the web interface places a `Speech`, it notifies the decoder by publishing a blank message on the redis channel `SpeechSignal`, so the decoder knows to get to work.
- The message queue consists of two queues and two channels. The aforementioned queue, `SpeechQueue`, is for the web interface to place speech work and for the decoder to fetch it. Another queue, `TextQueue`, is for the decoder to provide decoding results and for the web interface to fetch them. Each queue has a corresponding channel, namely `SpeechSignal` and `TextSignal`, used to wake up the web-interface threads waiting for decoding results and the decoder threads waiting for work (see the sketch after this list).
- The decoder consists of a decoding model and a few threads. The threads wait for a `SpeechSignal` and use the shared decoding model to decode the wave. This uses an API implemented in another repository, which is a thin encapsulation of kaldi's online decoding. For more information about decoding, see kaldi's documentation.
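To make the protocol concrete, here is an illustrative sketch in Python with `redis-py`. The actual payload exchanged by the C# web interface and the C++ decoder is internal to this repo; the JSON shape below (an `Id` plus a wave path) is only an assumption for illustration:

```python
# Illustrative sketch of the queue protocol (redis-py).
# The real payload format is internal; a JSON body with an Id and a
# wave path is assumed here purely for illustration.
import json
import redis

r = redis.Redis(host="127.0.0.1", port=6379)

# Web-interface side: enqueue a Speech, then wake the decoder.
speech = {"Id": "6b239c12-a5dc-47f8-a17a-c31a6e34bb7e", "WavPath": "/tmp/your.wav"}
r.lpush("SpeechQueue", json.dumps(speech))
r.publish("SpeechSignal", "")  # blank message, used only as a wake-up

# Decoder side: fetch a Speech, decode it, return the Text, wake the waiters.
raw = r.rpop("SpeechQueue")
if raw is not None:
    work = json.loads(raw)
    result = {"Id": work["Id"], "Text": "<decoded text>"}
    r.lpush("TextQueue", json.dumps(result))
    r.publish("TextSignal", "")
```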
For minimalistic control, the service uses three configuration files.

- `ServiceConf/async_decoder_config.json`

This configuration is for "(1) web interface":

```
{
    "ClearRedis": true,           // Whether we execute `FLUSHALL` on redis
    "MaxRetrials": 3,             // The maximum number of retrials of getting a message from the decoder
    "DecodingTimeoutMs": 300000,  // Longest waiting time after a single `Speech` is put into the queue
    "Redis": {                    // redis connection
        "IP": "127.0.0.1",
        "Port": 6379
    },
    "EnableFakeDecoder": false,   // The fake decoder also listens on `SpeechSignal` for work but provides fast, fake results;
                                  // useful to test only "(1) web interface" and "(2) message queue"
    "WorkerThreads": 300,         // These two affect ThreadPool.SetMinThreads(config["WorkerThreads"],
    "CompletionPortThreads": 300  //                                           config["CompletionPortThreads"]);
}
```

Run the web interface with:

```
cd AsrService
dotnet Service/bin/Debug/netcoreapp2.1/publish/Service.dll --decoder-config=path/to/async_decoder_config.json
```
The following files configure "(3) decoder".

- `KaldiConf/kaldi_config.json`

```
{
    "max_wave_size_byte": 1000000,          // The maximum size of a wave file
    "redis_key": "",                        // Path to the key
    "redis_port": 6379,
    "redis_server": "127.0.0.1",
    "thread_expires_after_decoding": true,  // Whether a thread quits looping for work; useful if you want to
                                            // detect memory leaks with valgrind or the like
    "use_kaldi": true,                      // Whether we use kaldi. If `false`, we save a lot of time by not loading the model
    "workers": 100                          // How many threads are doing the work; parallelism can accommodate more requests
}
```
- `KaldiConf/decoder_config.json`

This configuration file is tricky, and most of the explanation must be found in http://kaldi-asr.org/doc/online_decoding.html, because it essentially transcribes shell arguments into JSON. But you can follow the steps above and just modify some parameters to tune performance.
```
{
"acoustic_scale": 1.0,
"add_pitch": false,
"beam": 3.0, // the bigger, the better result, the slower
"chunk_length_secs": -1.0,
"config": "KaldiConf/online.conf",
"do_endpointing": false,
"feature_type": "fbank",
"frames_per_chunk": 50,
"fst_in": "/usr/local/kaldi/egs/cvte/s5/exp/chain/tdnn/graph/HCLG.fst",
"lattice_beam": 3.0,
"lattice_wspecifier": "ark:/dev/null",
"max_active": 200, // the bigger, the better result, the slower
"nnet3_in": "/usr/local/kaldi/egs/cvte/s5/exp/chain/tdnn/final.mdl",
"num_threads_startup": 4, // maybe set to the core number of your CPU.
"online": false,
"sample_freq": 16000.0,
"word_symbol_table": "/usr/local/kaldi/egs/cvte/s5/exp/chain/tdnn/graph/words.txt"
}
```
To run the decoder with the configurations above, use (inside the container):

```
cd AsrService/Kaldi
./kaldi-service path/to/kaldi_config.json path/to/decoder_config.json
```
Well, until I can dig deeper into Kaldi, the performance is rather poor. See the following (10 parallel requests, localhost, 8 CPUs of Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz):
```
{
"Id": "86e96461-0415-4e6f-8bd6-4739858f2ede",
"Text": "关于黄款汇率希望大家不要被误导当然这个火鸡的答案并不对",
"Message": "Ok"
}
{
"Id": "bc44ac47-2d24-4f89-8812-0bc822591091",
"Text": "关于黄款汇率希望大家不要被误导当然这个火鸡的答案并不对",
"Message": "Ok"
}
{
"Id": "323c02cb-3727-44f6-b77b-8237e7615f3c",
"Text": "关于黄款汇率希望大家不要被误导当然这个火鸡的答案并不对",
"Message": "Ok"
}
{
"Id": "2116eb68-47f8-4fa2-bfa4-2766f93426fc",
"Text": "关于黄款汇率希望大家不要被误导当然这个火鸡的答案并不对",
"Message": "Ok"
}
{
"Id": "b77e6d58-be8d-4b6a-b6dd-866a5d9263ef",
"Text": "关于黄款汇率希望大家不要被误导当然这个火鸡的答案并不对",
"Message": "Ok"
}
{
"Id": "e9b13ce1-7f92-4f49-a5be-fc99c4c9a009",
"Text": "关于黄款汇率希望大家不要被误导当然这个火鸡的答案并不对",
"Message": "Ok"
}
{
"Id": "f6f2173e-2814-4af4-8990-84779a145043",
"Text": "关于黄款汇率希望大家不要被误导当然这个火鸡的答案并不对",
"Message": "Ok"
}
{
"Id": "6b239c12-a5dc-47f8-a17a-c31a6e34bb7e",
"Text": "关于黄款汇率希望大家不要被误导当然这个火鸡的答案并不对",
"Message": "Ok"
}
{
"Id": "13ba7bf3-d8e4-421b-8cb8-75172b5f99b8",
"Text": "关于黄款汇率希望大家不要被误导当然这个火鸡的答案并不对",
"Message": "Ok"
}
{
"Id": "fca8ed3c-e4e9-435d-9930-39119529d370",
"Text": "关于黄款汇率希望大家不要被误导当然这个火鸡的答案并不对",
"Message": "Ok"
}
avg rtt: 5.460889s
```
And the cost is even higher inside a container.
Logs are always kept under `AsrService/Logs`, which is shared with the container as a volume.

- `Logs/kaldi.log` stores logs about the decoder and kaldi.
- `Logs/web.log` stores logs about the web interface.
- `Logs/redis.log` stores logs about redis.