An ASR (Automatic Speech Recognition) Service based on kaldi
The model for decoding is trained by CVTE.

Purpose

This service is intended to decode Mandarin Chinese speech into Chinese text.

How to deploy

Build docker image

This repository helps you build a dockerized ASR app with minimal effort: follow the steps below and wait for the downloads and compilation to finish.

The server runs on Ubuntu 16.04 and similar Linux systems and needs a large amount of memory (>= 64 GB).

Install docker-ce >= 18.06.0.

Once docker is installed, run:

cd AsrService
mkdir Kaldi/Bin
chmod +x run.sh
./run.sh deploy

The build will then start. It pulls some images, such as microsoft/dotnet:aspnetcore-runtime and microsoft/dotnet:sdk. Once the build is complete, just quit: run.sh will try to start the service at this point, and that will fail because the remaining steps are not finished yet.

The build compiles kaldi first, which may take a long time and may fail if there is not enough memory. After kaldi echoes done, your terminal may lag for a few minutes; this is harmless, just wait.

After kaldi is built successfully, the build downloads the other, lighter dependencies.

It then compiles the ASP.NET app that serves as the web interface, as described below, and finally builds the decoder.

Download CVTE model

http://kaldi-asr.org/models/m2

Once you have downloaded the CVTE model, its tree looks like this:

cvte
└──────s5
        ├───conf
        ├───data
        │   ├───fbank
        │   │   └───test
        │   │       └───split1
        │   │           └───1
        │   └───wav
        │       └───00030
        ├───exp
        │   └───chain
        │       └───tdnn
        │           ├───decode_test
        │           │   ├───log
        │           │   └───scoring_kaldi
        │           │       ├───log
        │           │       ├───penalty_0.0
        │           │       │   └───log
        │           │       ├───penalty_0.5
        │           │       │   └───log
        │           │       ├───penalty_1.0
        │           │       │   └───log
        │           │       └───wer_details
        │           └───graph
        ├───fbank
        │   └───test
        └───local

Now copy or link the cvte directory to AsrService/Tools/cvte, so that Tools contains cvte directly. The model setup is then done.

Start service

Only one step is left:

./run.sh deploy

Wait patiently. When

Service is ready

is shown, the model is loaded.

You can then access the server remotely through the REST API described below.

To debug, use

./run.sh interact

which lets you work with the docker container interactively.

To start the service from inside the container, run /app/init.sh.

Rest API

This service takes speech in *.wav format, which must be exactly 16 kHz and 16-bit.
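A quick local check can catch unsupported files before they are uploaded. The sketch below uses Python's standard wave module to verify the two properties named above (16 kHz, 16-bit). The names check_wav and make_wav are illustrative and are not part of this repository.

```python
import io
import wave

# Requirements taken from this README: 16 kHz sample rate, 16-bit samples.
EXPECTED_RATE_HZ = 16000
EXPECTED_SAMPLE_WIDTH_BYTES = 2


def check_wav(data: bytes) -> list:
    """Return a list of problems found in the WAV bytes (empty if none)."""
    problems = []
    try:
        with wave.open(io.BytesIO(data), "rb") as wav:
            if wav.getframerate() != EXPECTED_RATE_HZ:
                problems.append(f"sample rate is {wav.getframerate()} Hz, expected {EXPECTED_RATE_HZ}")
            if wav.getsampwidth() != EXPECTED_SAMPLE_WIDTH_BYTES:
                problems.append(f"sample width is {8 * wav.getsampwidth()} bits, expected 16")
    except (wave.Error, EOFError) as exc:
        problems.append(f"not a readable WAV file: {exc}")
    return problems


def make_wav(rate_hz: int, sample_width: int, n_frames: int = 160) -> bytes:
    """Build a short silent mono WAV in memory, for trying out check_wav."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(sample_width)
        wav.setframerate(rate_hz)
        wav.writeframes(b"\x00" * (sample_width * n_frames))
    return buf.getvalue()
```

An empty list from check_wav means the file passed these two checks only; the server may still reject it for other reasons, such as its size.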

URL: POST <host>/Decoding/Decode

The server accepts a POST request whose body has a field named wave, carrying a wave file that meets the requirements above.

The server sends its response in the following format:

{
  "Id": "6b239c12-a5dc-47f8-a17a-c31a6e34bb7e",
  "Text": "关于黄款汇率希望大家不要被误导当然这个火鸡的答案并不对",
  "Message": "Ok"
}

Check whether response['Message'] == 'Ok'. If not, the decoding failed.

The first field, Id, is the id generated for the speech (a GUID), which can be used to trace the request in the logs.

The second field, Text, is the Chinese text of the given speech.

The third field, Message, carries an error message if the decoding failed.

For example:

{
  "Id": "eae8e7cc-55a7-4404-a44f-411452da5dd3",
  "Text": "",
  "Message": "the wav file is broken or not supported\nbad wave\n"
}
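The request and response handling described above can be sketched with Python's standard library alone. The field name wave, the filename your.wav and the content type audio/x-wav are taken from the usage message the server returns for a bad URL; the function names and the default timeout are illustrative. The sketch assumes the server answers with this JSON body and a successful HTTP status even when decoding fails; if it uses error statuses instead, urlopen raises urllib.error.HTTPError and that would need handling as well.

```python
import json
import urllib.request
import uuid


def build_multipart(wav_bytes: bytes, filename: str = "your.wav"):
    """Encode the WAV as multipart/form-data with a single `wave` field."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="wave"; filename="{filename}"\r\n'
        "Content-Type: audio/x-wav\r\n\r\n"
    ).encode("ascii")
    tail = f"\r\n--{boundary}--\r\n".encode("ascii")
    return head + wav_bytes + tail, f"multipart/form-data; boundary={boundary}"


def parse_reply(raw: bytes) -> str:
    """Return the recognized text, or raise if Message is not 'Ok'."""
    reply = json.loads(raw.decode("utf-8"))
    if reply["Message"] != "Ok":
        raise RuntimeError(f"decoding failed (Id={reply['Id']}): {reply['Message']}")
    return reply["Text"]


def decode(host: str, wav_bytes: bytes, timeout_s: float = 300.0) -> str:
    """POST the WAV to <host>/Decoding/Decode and return the text."""
    body, content_type = build_multipart(wav_bytes)
    request = urllib.request.Request(
        f"{host}/Decoding/Decode",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return parse_reply(response.read())
```

Keeping the Id in the raised error makes it easier to find the matching entry in the logs.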

To test the API, use

./test.sh

This runs APITest/autoapi.py with different parameters. If it keeps failing, consider modifying some configurations to make decoding faster.

An expected result is:

python3 APITest/autoapi.py single_normal
{
  "Id": "35254669-40ed-44b2-9ef6-9bd333a20c20",
  "Text": "关于黄款汇率希望大家不要被误导当然这个火鸡的答案并不对",
  "Message": "Ok"
}
python3 APITest/autoapi.py parallel
{
  "Id": "7d27495a-7f3e-4db5-9b4b-c42213536cbc",
  "Text": "关于与还款汇率希望大家不要被误导当然这个火鸡的答案并不对",
  "Message": "Ok"
}
{
  "Id": "577c152b-ef50-4c52-87a8-1d03e690b35a",
  "Text": "关于还款汇率希望大家不要被误导当然这个火鸡的答案并不对",
  "Message": "Ok"
}
{
  "Id": "b6d4dfa4-175c-434a-942a-1b6654c2c4d3",
  "Text": "关于黄款汇率希望大家不要被误导当然这个火鸡的答案并不对",
  "Message": "Ok"
}
{
  "Id": "8481b5cc-46e1-41f9-9916-59528181eda1",
  "Text": "关于黄款汇率希望大家不要被误导当然这个火鸡的答案并不对",
  "Message": "Ok"
}
{
  "Id": "78b6de10-b0e1-48b5-9ea7-b205cacb1916",
  "Text": "关于黄款汇率希望大家不要被误导当然这个火鸡的答案并不对",
  "Message": "Ok"
}
{
  "Id": "90fe65fb-7cc1-49ec-8cc1-bd546e1ee287",
  "Text": "关于黄款汇率希望大家不要被误导当然这个火鸡的答案并不对",
  "Message": "Ok"
}
{
  "Id": "b625ba39-c257-404d-b6ef-45f46aad87f0",
  "Text": "关于黄款汇率希望大家不要被误导当然这个火鸡的答案并不对",
  "Message": "Ok"
}
{
  "Id": "352b8e49-b97e-4b12-afdc-88b928fb07e1",
  "Text": "关于黄款汇率希望大家不要被误导当然这歌火系的答案并不对",
  "Message": "Ok"
}
{
  "Id": "2bfa4f8e-c174-43d8-9985-9c7043cada57",
  "Text": "关于黄款汇率希望大家不要被误导当然这个火鸡的答案并不对",
  "Message": "Ok"
}
{
  "Id": "a0175f9d-11fe-425e-a4b2-1766e6c9b50f",
  "Text": "关于黄款汇率希望大家不要被误导当然这个火鸡的答案并不对",
  "Message": "Ok"
}
avg rtt: 5.822241s
python3 APITest/autoapi.py empty
{
  "Id": "eae8e7cc-55a7-4404-a44f-411452da5dd3",
  "Text": "",
  "Message": "the wav file is broken or not supported\nbad wave\n"
}
python3 APITest/autoapi.py bad_url
{
  "Id": "",
  "Text": "",
  "Message": "Usage: POST <host_name>/Decoding/Decode\nContent-Disposition: form-data; name=\"wave\"; filename=\"your.wav\"\nContent-Type: audio/x-wav\n\n<binary_contents>"
}
python3 APITest/autoapi.py big
{
  "Id": "59171039-555d-4dfb-bb1e-f10ea3914356",
  "Text": "",
  "Message": "The wave file should be smaller than 1000000 bytes"
}

Structure

This server consists of three main parts: (1) an ASP.NET web interface, (2) a message queue implemented with redis, and (3) a decoder implemented with kaldi.

(1) is implemented in AsrService/Service and AsrService/Core with .NET Core 2.0 in C#. (3) is implemented in AsrService/Kaldi in C++, calling a Kaldi API that I packaged in https://github.com/hanayashiki/kaldi.

  • The web interface accepts POST requests as described above and sends a Speech message to the message queue. The message queue in fact consists of two redis lists. The web interface places each Speech on the list SpeechQueue, and the decoder fetches it from there. Every time the web interface places a Speech, it signals the decoder to start working by publishing a blank message on the redis channel SpeechSignal.

  • The message queue consists of two queues and two channels. SpeechQueue, mentioned above, is where the web interface places speech jobs and the decoder fetches them. TextQueue is where the decoder places decoding results and the web interface fetches them. Each queue has a corresponding channel, SpeechSignal and TextSignal respectively: SpeechSignal wakes decoder threads waiting for work, and TextSignal wakes web interface threads waiting for decoding results.

  • The decoder consists of a decoding model and a few threads. The threads wait for a signal on SpeechSignal and then use the shared decoding model to decode the wave. Decoding goes through an API implemented in another repository, which is a thin wrapper around kaldi's online decoding. For more information about decoding, see kaldi's documentation.
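The flow between the three parts can be illustrated with an in-process analogy. The sketch below does not use redis and is not the service's code: each SignalledQueue stands in for one redis list plus its channel, a threading.Condition plays the role of the published signal, and matching results by Id is an assumption about how the web interface finds its own answer. All names are hypothetical.

```python
import collections
import threading
import uuid

STOP = object()  # sentinel used only to shut the demo worker down


class SignalledQueue:
    """A list plus a wake-up signal, mimicking one redis list and its channel."""

    def __init__(self):
        self._items = collections.deque()
        self._signal = threading.Condition()

    def put(self, item):
        with self._signal:
            self._items.append(item)
            self._signal.notify_all()  # plays the role of the blank published message

    def take(self, match=lambda item: True, timeout_s=5.0):
        """Wait until an item satisfying `match` is present, then remove and return it."""
        def find():
            return next((item for item in self._items if match(item)), None)

        with self._signal:
            if not self._signal.wait_for(lambda: find() is not None, timeout=timeout_s):
                raise TimeoutError("no matching item arrived in time")
            item = find()
            self._items.remove(item)
            return item


speech_queue = SignalledQueue()  # stands in for SpeechQueue + SpeechSignal
text_queue = SignalledQueue()    # stands in for TextQueue + TextSignal


def decoder_worker(decode):
    """Decoder side: fetch a Speech, decode it, place the result on the text queue."""
    while True:
        speech = speech_queue.take(timeout_s=None)
        if speech is STOP:
            return
        text_queue.put({"Id": speech["Id"], "Text": decode(speech["Wave"]), "Message": "Ok"})


def handle_request(wave_bytes, timeout_s=5.0):
    """Web interface side: enqueue a Speech, then wait for the result with the same Id."""
    speech_id = str(uuid.uuid4())
    speech_queue.put({"Id": speech_id, "Wave": wave_bytes})
    return text_queue.take(lambda result: result["Id"] == speech_id, timeout_s)


def run_demo():
    """Run one request through a single worker that uses a fake decoder."""
    worker = threading.Thread(
        target=decoder_worker, args=(lambda wave: f"decoded {len(wave)} bytes",)
    )
    worker.start()
    try:
        return handle_request(b"\x00" * 4)
    finally:
        speech_queue.put(STOP)
        worker.join()
```

The fake decoder passed in run_demo mirrors the idea behind the EnableFakeDecoder option: the queueing path can be exercised without loading a model.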

Configuration

To keep control simple, the service uses three configuration files.

async_decoder_config.json configures "(1) web interface"

  • ServiceConf/async_decoder_config.json
    {
      "ClearRedis": true,                         // Whether to execute `FLUSHALL` on redis
      "MaxRetrials": 3,                           // Maximum number of retries when getting a message from the decoder
      "DecodingTimeoutMs": 300000,                // Longest waiting time, in ms, after a single `Speech` is put into the queue
      "Redis": {                                  // redis connection
        "IP": "127.0.0.1",
        "Port": 6379
      },
      "EnableFakeDecoder": false,                  // Enables a fake decoder that listens on `SpeechSignal` for work and returns fast but fake results. Useful for testing only "(1) web interface" and "(2) message queue".
      "WorkerThreads": 300,                        // This affects ThreadPool.SetMinThreads(config["WorkerThreads"],
      "CompletionPortThreads": 300                 //     config["CompletionPortThreads"]);
    }
    This configuration is for "(1) web interface". Run the web interface with
    cd AsrService
    dotnet Service/bin/Debug/netcoreapp2.1/publish/Service.dll --decoder-config=path/to/async_decoder_config.json
    

The following files configure "(3) decoder":

  • KaldiConf/kaldi_config.json

    {
        "max_wave_size_byte": 1000000,            // The maximum size of a wave file, in bytes
        "redis_key": "",                          // Path to the key
        "redis_port": 6379,                       
        "redis_server": "127.0.0.1",
        "thread_expires_after_decoding": true,    // Whether a thread exits after decoding instead of looping for more work. Useful if you want to detect memory leaks with valgrind or a similar tool.
        "use_kaldi": true,                        // Whether to use kaldi. If `false`, the model is not loaded, which saves a lot of time.
        "workers": 100                            // Number of worker threads. More threads can serve more parallel requests.
    }
  • KaldiConf/decoder_config.json

This configuration file is tricky, and most of the explanation is in http://kaldi-asr.org/doc/online_decoding.html, because the file is essentially a transcription of shell arguments into JSON. You can follow the steps above and modify only a few parameters to tune performance.

  {
      "acoustic_scale": 1.0,
      "add_pitch": false,
      "beam": 3.0,    // larger values give better results but slower decoding
      "chunk_length_secs": -1.0,
      "config": "KaldiConf/online.conf",
      "do_endpointing": false,
      "feature_type": "fbank",
      "frames_per_chunk": 50,
      "fst_in": "/usr/local/kaldi/egs/cvte/s5/exp/chain/tdnn/graph/HCLG.fst",
      "lattice_beam": 3.0,
      "lattice_wspecifier": "ark:/dev/null",
      "max_active": 200,   // larger values give better results but slower decoding
      "nnet3_in": "/usr/local/kaldi/egs/cvte/s5/exp/chain/tdnn/final.mdl",
      "num_threads_startup": 4,  // consider setting this to the number of CPU cores
      "online": false,
      "sample_freq": 16000.0,  
      "word_symbol_table": "/usr/local/kaldi/egs/cvte/s5/exp/chain/tdnn/graph/words.txt"
  }

To run the decoder with the configurations above, use (inside the container):

cd AsrService/Kaldi
./kaldi-service path/to/kaldi_config.json path/to/decoder_config.json
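Adjusting a few decoder parameters, such as beam and max_active, can be scripted rather than edited by hand. The helper below is a minimal sketch with a hypothetical name, tune_decoder_config. It assumes the file on disk is plain JSON without the // comments used for annotation in this README, and it refuses keys that are not already present so that a typo does not silently add a new setting.

```python
import json
from pathlib import Path


def tune_decoder_config(path, **overrides):
    """Rewrite selected keys of a decoder_config.json-style file in place."""
    config_path = Path(path)
    config = json.loads(config_path.read_text(encoding="utf-8"))

    # Only allow keys that already exist in the file.
    unknown = sorted(set(overrides) - set(config))
    if unknown:
        raise KeyError(f"not present in {config_path.name}: {unknown}")

    config.update(overrides)
    config_path.write_text(json.dumps(config, indent=4) + "\n", encoding="utf-8")
    return config
```

For instance, tune_decoder_config("KaldiConf/decoder_config.json", beam=5.0) would raise beam while leaving every other key untouched; the value 5.0 is purely illustrative, and the decoder has to be restarted with the new file for the change to apply.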

Performance

Until I can dig further into Kaldi, the performance is rather poor:

(10 parallel requests, localhost, 8 CPUs of Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz)

{
  "Id": "86e96461-0415-4e6f-8bd6-4739858f2ede",
  "Text": "关于黄款汇率希望大家不要被误导当然这个火鸡的答案并不对",
  "Message": "Ok"
}
{
  "Id": "bc44ac47-2d24-4f89-8812-0bc822591091",
  "Text": "关于黄款汇率希望大家不要被误导当然这个火鸡的答案并不对",
  "Message": "Ok"
}
{
  "Id": "323c02cb-3727-44f6-b77b-8237e7615f3c",
  "Text": "关于黄款汇率希望大家不要被误导当然这个火鸡的答案并不对",
  "Message": "Ok"
}
{
  "Id": "2116eb68-47f8-4fa2-bfa4-2766f93426fc",
  "Text": "关于黄款汇率希望大家不要被误导当然这个火鸡的答案并不对",
  "Message": "Ok"
}
{
  "Id": "b77e6d58-be8d-4b6a-b6dd-866a5d9263ef",
  "Text": "关于黄款汇率希望大家不要被误导当然这个火鸡的答案并不对",
  "Message": "Ok"
}
{
  "Id": "e9b13ce1-7f92-4f49-a5be-fc99c4c9a009",
  "Text": "关于黄款汇率希望大家不要被误导当然这个火鸡的答案并不对",
  "Message": "Ok"
}
{
  "Id": "f6f2173e-2814-4af4-8990-84779a145043",
  "Text": "关于黄款汇率希望大家不要被误导当然这个火鸡的答案并不对",
  "Message": "Ok"
}
{
  "Id": "6b239c12-a5dc-47f8-a17a-c31a6e34bb7e",
  "Text": "关于黄款汇率希望大家不要被误导当然这个火鸡的答案并不对",
  "Message": "Ok"
}
{
  "Id": "13ba7bf3-d8e4-421b-8cb8-75172b5f99b8",
  "Text": "关于黄款汇率希望大家不要被误导当然这个火鸡的答案并不对",
  "Message": "Ok"
}
{
  "Id": "fca8ed3c-e4e9-435d-9930-39119529d370",
  "Text": "关于黄款汇率希望大家不要被误导当然这个火鸡的答案并不对",
  "Message": "Ok"
}
avg rtt: 5.460889s 

Performance is even worse inside a container.
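For reproducing a figure like avg rtt on other hardware, a measurement along the following lines may be enough. This is a sketch of one plausible way to compute it, not necessarily how APITest/autoapi.py does: average_rtt and send_request are hypothetical names, and send_request is any zero-argument callable that performs one request and blocks until the response arrives.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def average_rtt(send_request, n_parallel=10):
    """Call send_request() n_parallel times concurrently; return mean seconds per call."""
    def timed_call(_):
        start = time.perf_counter()
        send_request()
        return time.perf_counter() - start

    # One thread per request, so all requests are in flight at the same time.
    with ThreadPoolExecutor(max_workers=n_parallel) as pool:
        rtts = list(pool.map(timed_call, range(n_parallel)))
    return sum(rtts) / len(rtts)
```

Because every request is timed individually, the result is the mean latency seen by a client under load, not the total wall-clock time for the batch.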

Logging

Logs are always kept under AsrService/Logs, which is shared with the container as a volume.

Logs/kaldi.log stores logs from the decoder and kaldi. Logs/web.log stores logs from the web interface. Logs/redis.log stores logs from redis.