Voicegain Platform Pricing

Last modified: June 17, 2020

Definitions

Before presenting the pricing, let's define the terminology that will be used in the pricing table.

API method types

Voicegain ASR API provides two methods to invoke speech-to-text:

sync method - intended for short audio up to 30 seconds. The result of recognition will be returned in the HTTP response.
async method - suitable for short and long audio, up to several hours. The initial HTTP request returns the details of the speech-to-text session that was started. This session can later be queried for results.

Speech-to-Text mode

Voicegain ASR API supports two modes for speech-to-text:

recognize - where the language model is provided by the grammars (e.g. GRXML) included in the request
transcribe - where the ASR engine uses a large vocabulary NLM (natural language model) which is part of Voicegain Platform

Acoustic Model types

Voicegain ASR uses two Acoustic Model types. The model type is chosen automatically based on the API method (sync vs async), the speech-to-text mode, and the Session Type (see below). The two models are:

high accuracy / high latency - this model ensures the highest accuracy, but the trade-off is longer processing latency (time to first response)
normal accuracy / low latency - this model is optimized for real-time use cases, but it trades off some accuracy in order to achieve low latency.

Standard vs Custom Acoustic Models

standard acoustic model - also called base model - this model is available to all users of the Voicegain platform. It has high accuracy, in particular the high accuracy / high latency version of it. The accuracy is better than that of the standard acoustic models from Google, Microsoft, and Amazon over a wide range of speech types. It is not as good as the Google Video model.
custom acoustic model - this is a model that has been trained on customer data. In our tests we have observed about 5% decrease of WER (word error rate) after training/customization.

GPU Resource

Each API request is assigned to one of two types of GPU resource:

on-demand - this is the default GPU resource with high availability and SLA of 99.9%
preemptible - this GPU resource may occasionally be subject to a restart. The Voicegain platform ensures that request results are not affected by the preemptibility of the underlying resource. The benefit of using this resource is a lower price.

The assignment depends on the type of Session, see definition below.

Session types

When async speech-to-text is launched, the following session types can be chosen:

real-time - delivers fast response, suitable for IVR and live transcription, uses normal accuracy / low latency acoustic model
semi-real-time - delivers results with a delay of about 30 seconds, suitable for live transcription in streaming scenarios where the larger delay is acceptable, uses high accuracy / high latency acoustic model
offline - this session type puts recognition requests into a job queue, and they are then processed on the available resources. The processing time is still generally much shorter than the duration of the audio. Uses high accuracy / high latency acoustic model. Runs on preemptible GPU resource for a lower price, but success is guaranteed, unlike a preemptible (semi-)real-time session.
offline-rt - this is a special session type tailored for bulk recognition with the same model as used for real-time. Main use case for this session type is regression testing for IVRs.

Note: it is possible to combine some of the Session Types in a single Web API request. In such a case each session will be billed separately, e.g. if 10 minutes of audio is processed in an HTTP request that starts both a real-time and a semi-real-time session, the final cost will be (1.25 + 1.25) * 10 = 25.0 cents.
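The arithmetic in the note above can be sketched in a few lines of Python. This is only an illustration: the function name combined_cost_cents and the price dictionary are hypothetical helpers, not part of the Voicegain API, and the prices are the standard-model values from the pricing table. The note does not say which session types may be combined, so the sketch does not check that.

```python
from decimal import Decimal

# Per-minute prices in cents for the standard model, copied from the pricing table.
STANDARD_PRICE_CENTS = {
    "real-time": Decimal("1.25"),
    "semi-real-time": Decimal("1.25"),
    "offline": Decimal("1.00"),
    "offline-rt": Decimal("1.00"),
}


def combined_cost_cents(session_types, audio_minutes):
    """Bill each session started by one request separately and add them up.

    Hypothetical helper: it does not verify that the given session types
    can actually be combined in a single request.
    """
    per_minute = sum(
        (STANDARD_PRICE_CENTS[s] for s in session_types), Decimal("0")
    )
    return per_minute * Decimal(str(audio_minutes))


# 10 minutes of audio with both a real-time and a semi-real-time session.
print(combined_cost_cents(["real-time", "semi-real-time"], 10))  # 25.00
```

Decimal is used instead of float so that fractional cent amounts stay exact.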

Pricing for Speech-to-Text

Given the above definitions, below is the complete pricing list for all possible combinations. Basically, both offline session types have lower pricing, while real-time, semi-real-time, and sync requests are priced 0.25 cents/minute higher (25% higher for the standard model).

API method	Speech-to-Text mode	Session Type	Acoustic Model	GPU resource	Price: standard model	Price: custom model
sync	both	n/a	high accuracy high latency	on-demand	1.25 cents/minute	1.75 cents/minute
async	both	real-time	normal accuracy low latency	on-demand	1.25 cents/minute	1.75 cents/minute
async	both	semi-real-time	high accuracy high latency	on-demand	1.25 cents/minute	1.75 cents/minute
async	both	offline	high accuracy high latency	guaranteed preemptible	1.00 cents/minute	1.50 cents/minute
async	recognize	offline-rt	normal accuracy low latency	guaranteed preemptible	1.00 cents/minute	1.50 cents/minute
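For readers who want to estimate costs programmatically, the table above could be encoded as a simple lookup. The sketch below is an assumption about how one might do this on the client side: the names PriceRow, PRICING, and price_per_minute are hypothetical and do not correspond to anything in the Voicegain API. The values are copied from the table.

```python
from decimal import Decimal
from typing import NamedTuple


class PriceRow(NamedTuple):
    modes: tuple            # speech-to-text modes the row applies to
    acoustic_model: str
    gpu_resource: str
    standard_cents: Decimal  # cents per minute, standard model
    custom_cents: Decimal    # cents per minute, custom model


BOTH = ("recognize", "transcribe")
HIGH = "high accuracy / high latency"
LOW = "normal accuracy / low latency"

# Keyed by (API method, session type); sync requests have no session type.
PRICING = {
    ("sync", None): PriceRow(BOTH, HIGH, "on-demand", Decimal("1.25"), Decimal("1.75")),
    ("async", "real-time"): PriceRow(BOTH, LOW, "on-demand", Decimal("1.25"), Decimal("1.75")),
    ("async", "semi-real-time"): PriceRow(BOTH, HIGH, "on-demand", Decimal("1.25"), Decimal("1.75")),
    ("async", "offline"): PriceRow(BOTH, HIGH, "guaranteed preemptible", Decimal("1.00"), Decimal("1.50")),
    ("async", "offline-rt"): PriceRow(("recognize",), LOW, "guaranteed preemptible", Decimal("1.00"), Decimal("1.50")),
}


def price_per_minute(method, mode, session_type=None, custom_model=False):
    """Return the per-minute price in cents for one table row."""
    row = PRICING[(method, session_type)]
    if mode not in row.modes:
        raise ValueError(f"{session_type} sessions do not support mode {mode!r}")
    return row.custom_cents if custom_model else row.standard_cents


# 60 minutes of offline transcription with a custom model.
print(price_per_minute("async", "transcribe", "offline", custom_model=True) * 60)
```

Keeping the acoustic model and GPU resource in the same row makes it easy to see which combination a given request would use, in addition to its price.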

Pricing for IVR or MRCP Sessions

MRCP and/or IVR sessions are priced at 0.25 cents per minute. Any Speech Recognition during such a session is priced on top of the MRCP/IVR session price. Speech Recognition time is counted from the moment speech is detected (barge-in time) until either MATCH or NO-MATCH terminates the recognition.
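As a rough illustration of how the two charges add up, the sketch below computes the cost of a single IVR call. The function name ivr_call_cost_cents is hypothetical, and the default recognition price of 1.25 cents/minute (real-time session, standard model) is an assumption, since the text above does not state which row of the Speech-to-Text table applies to recognition inside an IVR session. Billing granularity and rounding are also not specified here, so none is applied.

```python
from decimal import Decimal

# MRCP/IVR session price in cents per minute, as stated above.
SESSION_PRICE_CENTS = Decimal("0.25")


def ivr_call_cost_cents(session_minutes, recognition_minutes,
                        recognition_price_cents=Decimal("1.25")):
    """Session charge plus recognition charge for one call.

    recognition_minutes is the total time between speech detection
    (barge-in) and the MATCH or NO-MATCH that ends each recognition.
    The default recognition price is an assumption, not a documented value.
    """
    session = SESSION_PRICE_CENTS * Decimal(str(session_minutes))
    recognition = Decimal(str(recognition_price_cents)) * Decimal(str(recognition_minutes))
    return session + recognition


# Hypothetical call: 5 minutes long, 1.5 minutes of detected speech.
# 0.25 * 5 + 1.25 * 1.5 = 1.25 + 1.875
print(ivr_call_cost_cents(5, "1.5"))  # 3.125
```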
