Real-Time Transcription takes the audio content of a host's media stream and transcribes it into written words in real time. This page shows you how to start and stop Real-Time Transcription in your app, through a business server, then display the text in your app.
To start transcribing the audio in a channel in real-time, you send an HTTP request to the Agora SD-RTN™ through your business server. Real-Time Transcription provides the following modes:
- Transcribe speech in real-time, then stream this data to the channel.
- Transcribe speech in real-time, store the text in the WebVTT format, and upload the file to third-party cloud storage.
Real-Time Transcription transcribes at most three speakers in a channel. When there are more than three speakers, the top three are selected based on volume, and their audio is transcribed.
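The selection rule above can be illustrated with a short sketch. It is not the service's implementation, which runs on the Agora side; the function name and the volume scale are assumptions made for this example.

```python
from typing import Dict, List

# Limit stated above: at most three speakers are transcribed.
MAX_TRANSCRIBED_SPEAKERS = 3


def select_speakers(volumes: Dict[int, float],
                    limit: int = MAX_TRANSCRIBED_SPEAKERS) -> List[int]:
    """Return the uids of the loudest speakers, loudest first.

    `volumes` maps a uid to a volume reading (hypothetical scale).
    """
    ranked = sorted(volumes, key=lambda uid: volumes[uid], reverse=True)
    return ranked[:limit]
```

For example, with four active speakers the quietest one is left out: `select_speakers({101: 0.42, 102: 0.91, 103: 0.10, 104: 0.77})` returns `[102, 104, 101]`.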
The following figure shows the workflow to start, query, and stop a Real-Time Transcription task:
To use the RESTful API to transcribe speech, make the following calls:
- acquire: Request a builderToken that authenticates the user and gives permission to start Real-Time Transcription. You must call start using this builderToken within five minutes.
- start: Begin the transcription task. Once you start a task, builderToken remains valid for the entire session. Use the same builderToken to query and stop the task.
- query: Check the task status.
- stop: Stop the transcription task.
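The call order and the five-minute window can be sketched as a small state tracker that a business server might keep per task. This is a minimal sketch under stated assumptions: `send` is a hypothetical transport function supplied by the caller, and the response field name `builderToken` and the payload layout are assumptions. Take the real endpoints and request bodies from the REST API reference.

```python
import time
from typing import Callable, Optional

# The start call must follow acquire within five minutes.
ACQUIRE_TO_START_WINDOW_S = 5 * 60


class TranscriptionTask:
    """Tracks one acquire -> start -> query -> stop sequence.

    `send` is a hypothetical transport: it takes a call name and a payload
    dict and returns the decoded response as a dict.
    """

    def __init__(self, send: Callable[[str, dict], dict],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._send = send
        self._clock = clock
        self.builder_token: Optional[str] = None
        self._acquired_at: Optional[float] = None
        self.started = False

    def acquire(self, payload: dict) -> dict:
        response = self._send("acquire", payload)
        # Assumed response field name; check the REST API reference.
        self.builder_token = response["builderToken"]
        self._acquired_at = self._clock()
        self.started = False
        return response

    def start(self, payload: dict) -> dict:
        if self.builder_token is None or self._acquired_at is None:
            raise RuntimeError("call acquire before start")
        if self._clock() - self._acquired_at > ACQUIRE_TO_START_WINDOW_S:
            raise RuntimeError("builderToken is older than five minutes; call acquire again")
        response = self._send("start", {**payload, "builderToken": self.builder_token})
        self.started = True
        return response

    def query(self) -> dict:
        return self._call_running("query")

    def stop(self) -> dict:
        response = self._call_running("stop")
        self.started = False
        return response

    def _call_running(self, name: str) -> dict:
        if not self.started:
            raise RuntimeError(f"call start before {name}")
        # The same builderToken is reused for the whole session.
        return self._send(name, {"builderToken": self.builder_token})
```

Injecting `send` and `clock` keeps the sequencing logic separate from the HTTP client, so the order checks can be exercised without network access.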
In order to set up Real-Time Transcription in your app, you must have:
- Implemented Get Started with Video Calling
- Enabled Real-Time Transcription for your project. Contact [email protected]
- Activated a supported cloud storage service to record and store Real-Time Transcription videos and texts
- Installed the Protobuf package to generate code classes for displaying transcription text.
- To run the post-processing script:
  - Python 3.0
  - ffmpeg and ffplay
You create a business server as a bridge between your app and Agora Real-Time Transcription. Implementing a business server to manage Real-Time Transcription provides the following benefits:
- Improved security as your apiKey, apiSecret, builderToken, and taskId are not exposed to the client.
- Token processing is securely handled on the business server.
- Avoid splicing complex request body strings on the client side to reduce the probability of errors.
- Implement additional functionality on the business server. For example, billing for Real-Time Transcription use, or checking a user's privileges and payment status.
- If the REST API is updated, you do not need to update the client.
To import the API collection for testing and to obtain sample code for your business server, see the Postman Collection.
Google Protocol Buffers are an extensible and language-neutral mechanism for serializing transcription data. Protobuffer enables you to generate source code in multiple languages, based on a specified structure. For more information about Google Protocol Buffers, see protobuf.dev.
Agora provides the following protobuffer template for parsing Real-Time Transcription data:
syntax = "proto3";
package Agora.audio2text;
option java_package = "io.Agora.rtc.audio2text";
option java_outer_classname = "Audio2TextProtobuffer";
message Text {
  int32 vendor = 1;
  int32 version = 2;
  int32 seqnum = 3;
  int32 uid = 4;
  int32 flag = 5;
  int64 time = 6;
  int32 lang = 7;
  int32 starttime = 8;
  int32 offtime = 9;
  repeated Word words = 10;
}
message Word {
  string text = 1;
  int32 start_ms = 2;
  int32 duration_ms = 3;
  bool is_final = 4;
  double confidence = 5;
}
To read and display the Real-Time Transcription text in your client:
- Copy the protobuffer template to a local file.
- In your local file, edit the following properties to match your project:
  - package: The source code package namespace.
  - option: The language for which you want to generate the class. For example, Java or JavaScript.
- Invoke the protoc protocol compiler on your local file.
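Once a message is parsed, turning it into a display line is mostly a matter of joining the words. The sketch below uses plain dataclasses as stand-ins for the classes that protoc would generate, so it runs without the generated code. The field names follow the template; joining words with a space and treating a message as final only when every word is marked `is_final` are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Word:
    # Stand-in for the generated Word class; same field names as the template.
    text: str
    start_ms: int = 0
    duration_ms: int = 0
    is_final: bool = False
    confidence: float = 0.0


@dataclass
class Text:
    # Stand-in for the generated Text class; only the fields used below.
    uid: int
    words: List[Word] = field(default_factory=list)


def render_caption(message: Text) -> str:
    """Join the non-empty words of one message into a line prefixed by the uid."""
    line = " ".join(word.text for word in message.words if word.text)
    return f"{message.uid}: {line}"


def is_final(message: Text) -> bool:
    """True when the message has words and every one of them is marked final."""
    return bool(message.words) and all(word.is_final for word in message.words)
```

A client could show non-final lines as provisional text and replace them once `is_final` returns true for the same speaker.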
Agora also provides Protobuf sample code to parse and display transcription text. To obtain the sample code, contact [email protected]
The m3u8+vtt file generated by Real-Time Transcription and the m3u8+ts file generated by Cloud Recording are two independent files. The time stamp references in these media files are different and not synchronized. The cloud recording time stamp starts at 0, while the m3u8+vtt uses the system time stamp. If either process starts abnormally, the media files generated by the two services may be out of sync during playback.
Post-processing ensures synchronization of subtitles and recorded audio. It enables you to associate the m3u8+ts file generated by cloud recording with the m3u8+vtt file generated by Real-Time Transcription.
Agora provides a post-processing script that enables you to synchronize the two files.
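To make the mismatch concrete: a cue stamped with system time has to be shifted by the recording's start time before it lines up with a recording whose clock starts at 0. The sketch below shows only that arithmetic. It is not the algorithm used by the Agora script, and it assumes that both values are available in milliseconds.

```python
def to_recording_offset_ms(cue_system_ms: int, recording_start_system_ms: int) -> int:
    """Convert a system-time cue timestamp to an offset from the recording start."""
    offset = cue_system_ms - recording_start_system_ms
    if offset < 0:
        raise ValueError("cue precedes the start of the recording")
    return offset


def format_vtt_timestamp(offset_ms: int) -> str:
    """Format milliseconds as a WebVTT timestamp, HH:MM:SS.mmm."""
    seconds, millis = divmod(offset_ms, 1000)
    minutes, seconds = divmod(seconds, 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}.{millis:03d}"
```

For example, a cue at system time 1700000065250 ms in a recording that started at 1700000000000 ms maps to an offset of 65250 ms, which formats as `00:01:05.250`.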
To synchronize files generated by Real-Time Transcription, take the following steps:
- Unzip the post-processing script to a local folder.
- Run the script on your Real-Time Transcription files:
  python3 insert_subtitle.py --av audio_dir/audio_ts.m3u8 --subtitle subtitle_dir/subtitle.m3u8 --output output_dir/ --overwrite
  If ffmpeg and ffprobe are not in your PATH, use --ffmpeg_path to specify the path.
- Play the synchronized files:
  - Start the HTTP server by running the following command:
    python3 -m http.server --bind 127.0.0.1 -d output_dir
  - In your browser, enter the following URL:
    http://127.0.0.1:8000/player_demo.html
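If you run the script from your own tooling, assembling the arguments as a list avoids shell quoting problems with directory names. The helper below is a hypothetical wrapper, not part of the Agora script. It uses only the flags shown in the steps above and assumes that `--ffmpeg_path` takes the directory or path as its value.

```python
import shutil
from typing import List, Optional


def build_sync_command(av_playlist: str, subtitle_playlist: str, output_dir: str,
                       ffmpeg_path: Optional[str] = None,
                       overwrite: bool = True) -> List[str]:
    """Build the argument list for insert_subtitle.py."""
    command = ["python3", "insert_subtitle.py",
               "--av", av_playlist,
               "--subtitle", subtitle_playlist,
               "--output", output_dir]
    if overwrite:
        command.append("--overwrite")
    if ffmpeg_path is not None:
        # Assumption: the flag is followed by the path as a separate argument.
        command += ["--ffmpeg_path", ffmpeg_path]
    return command


def ffmpeg_on_path() -> bool:
    """Report whether an ffmpeg executable can be found on PATH."""
    return shutil.which("ffmpeg") is not None
```

The resulting list can be passed to `subprocess.run(command, check=True)`. When `ffmpeg_on_path()` returns false, supply `ffmpeg_path` explicitly.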
This section provides reference information that supplements this page and points you to documentation covering other aspects of this product.
Refer to the Real-Time Transcription REST API documentation for parameter details.
Use the following language codes in the recognizeConfig.language parameter of the start request. The current version supports at most two languages, separated by commas.
Language | Code |
---|---|
Chinese (Cantonese, Traditional) | zh-HK |
Chinese (Mandarin, Simplified) | zh-CN |
Chinese (Taiwanese Putonghua) | zh-TW |
English (India) | en-IN |
English (US) | en-US |
French (France) | fr-FR |
German (Germany) | de-DE |
Hindi (India) | hi-IN |
Indonesian (Indonesia) | id-ID |
Italian (Italy) | it-IT |
Japanese (Japan) | ja-JP |
Korean (South Korea) | ko-KR |
Portuguese (Portugal) | pt-PT |
Spanish (Spain) | es-ES |
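
A business server can check the language selection before sending the start request. The following sketch builds the comma-separated value for recognizeConfig.language from the codes in the table above and enforces the two-language limit. The function and constant names are hypothetical.

```python
from typing import Iterable

# Codes copied from the table above.
SUPPORTED_LANGUAGE_CODES = frozenset({
    "zh-HK", "zh-CN", "zh-TW", "en-IN", "en-US", "fr-FR", "de-DE",
    "hi-IN", "id-ID", "it-IT", "ja-JP", "ko-KR", "pt-PT", "es-ES",
})
# The current version supports at most two languages.
MAX_LANGUAGES = 2


def build_language_value(codes: Iterable[str]) -> str:
    """Validate language codes and join them with commas."""
    selected = list(codes)
    if not 1 <= len(selected) <= MAX_LANGUAGES:
        raise ValueError(f"expected 1 to {MAX_LANGUAGES} language codes, got {len(selected)}")
    unknown = [code for code in selected if code not in SUPPORTED_LANGUAGE_CODES]
    if unknown:
        raise ValueError(f"unsupported language codes: {unknown}")
    return ",".join(selected)
```

For example, `build_language_value(["en-US", "zh-CN"])` returns `"en-US,zh-CN"`, while passing three codes or an unlisted code raises `ValueError`.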
The following third-party cloud storage service providers are supported: