Inference

The pretrained model checkpoints are available at 🤗 Hugging Face and 🤖 Model Scope, and will be downloaded automatically when running the inference scripts.

Additional community-contributed checkpoints, supporting more languages, can be found in SHARED.md.
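
If you prefer to fetch a checkpoint manually, here is a minimal sketch using huggingface_hub; the repo id SWivid/F5-TTS and the F5TTS_Base filename are assumptions based on the official Hugging Face repo and the CLI examples further down, so adjust both to your needs:

# Sketch of a manual checkpoint download; repo_id and filename
# are assumptions, substitute the checkpoint you actually want
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(
    repo_id="SWivid/F5-TTS",
    filename="F5TTS_Base/model_1200000.safetensors",
)
print(ckpt_path)  # local cached path, usable with --ckpt_file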

A single generation currently supports up to 30s, which is the total length including both the prompt and the output audio. However, you can provide infer_cli and infer_gradio with longer text, and they will automatically perform chunked generation. Long reference audio will be clipped to ~15s.

To avoid possible inference failures, make sure you have read through the following instructions (a small preprocessing sketch follows the list).

  • Use a reference audio shorter than 15s and leave some silence (e.g. 1s) at the end. Otherwise there is a risk of truncation in the middle of a word, leading to suboptimal generation.
  • Uppercase letters are uttered letter by letter, so use lowercase letters for normal words.
  • Add spaces (blank: " ") or punctuation marks (e.g. "," ".") to explicitly introduce pauses.
  • Convert numbers to Chinese characters if you want them read in Chinese; otherwise they will be read in English.
  • If the generated output is blank (pure silence), check your ffmpeg installation (various tutorials are available online: blogs, videos, etc.).
  • Try turning off use_ema when using an early-stage finetuned checkpoint (i.e. one trained for only a few updates).
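
A minimal preprocessing sketch following the guidelines above (assuming soundfile and numpy; the clipping and padding here are illustrative, not the exact logic the inference scripts apply):

import numpy as np
import soundfile as sf

def prepare_ref_audio(in_path, out_path, max_s=15.0, pad_s=1.0):
    # Clip the reference to ~15s and append ~1s of silence,
    # reducing the risk of mid-word truncation
    audio, sr = sf.read(in_path)
    audio = audio[: int(max_s * sr)]
    silence = np.zeros((int(pad_s * sr), *audio.shape[1:]), dtype=audio.dtype)
    sf.write(out_path, np.concatenate([audio, silence]), sr)

def prepare_text(text):
    # Lowercase normal words so they are not spelled out letter by letter
    return text.lower()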

Gradio App

Currently supported features:

  • Basic TTS with Chunk Inference
  • Multi-Style / Multi-Speaker Generation
  • Voice Chat powered by Qwen2.5-3B-Instruct

The CLI command f5-tts_infer-gradio is equivalent to python src/f5_tts/infer/infer_gradio.py; it launches a Gradio app (web interface) for inference.

The script loads model checkpoints from Hugging Face. You can also download the files manually and update the path passed to load_model() in infer_gradio.py. Only the TTS models are loaded at startup; the ASR model is loaded to transcribe the reference audio if ref_text is not provided, and the LLM is loaded if you use Voice Chat.

The app can also be used as a component of a larger application:

import gradio as gr
from f5_tts.infer.infer_gradio import app

with gr.Blocks() as main_app:
    gr.Markdown("# This is an example of using F5-TTS within a bigger Gradio app")

    # ... other Gradio components

    app.render()

main_app.launch()

CLI Inference

The CLI command f5-tts_infer-cli is equivalent to python src/f5_tts/infer/infer_cli.py; it is a command-line tool for inference.

The script loads model checkpoints from Hugging Face. You can also download the files manually and use --ckpt_file to specify the model you want to load, or update the path directly in infer_cli.py.

To use a custom vocab.txt, provide it with --vocab_file.
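
For example, pointing both flags at a finetuned checkpoint (the paths here are hypothetical placeholders):

# Hypothetical paths; substitute your own finetuned checkpoint and vocab
f5-tts_infer-cli \
--ckpt_file ckpts/my_finetune/model_last.pt \
--vocab_file data/my_finetune/vocab.txt \
--ref_audio "ref_audio.wav" \
--ref_text "The content of the reference audio." \
--gen_text "Some text to synthesize."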

Basic inference with flags:

# Leaving --ref_text "" will have the ASR model transcribe the reference audio (uses extra GPU memory)
f5-tts_infer-cli \
--model "F5-TTS" \
--ref_audio "ref_audio.wav" \
--ref_text "The content, subtitle or transcription of reference audio." \
--gen_text "Some text you want the TTS model to generate for you."

# Choose Vocoder
f5-tts_infer-cli --vocoder_name bigvgan --load_vocoder_from_local --ckpt_file <YOUR_CKPT_PATH, e.g. ckpts/F5TTS_Base_bigvgan/model_1250000.pt>
f5-tts_infer-cli --vocoder_name vocos --load_vocoder_from_local --ckpt_file <YOUR_CKPT_PATH, e.g. ckpts/F5TTS_Base/model_1200000.safetensors>

# More instructions
f5-tts_infer-cli --help

A .toml file enables more flexible usage:

f5-tts_infer-cli -c custom.toml

For example, you can use a .toml file to pass in variables; refer to src/f5_tts/infer/examples/basic/basic.toml:

# F5-TTS | E2-TTS
model = "F5-TTS"
ref_audio = "infer/examples/basic/basic_ref_en.wav"
# If left as an empty "", the reference audio is transcribed automatically.
ref_text = "Some call me nature, others call me mother nature."
gen_text = "I don't really care what you call me. I've been a silent spectator, watching species evolve, empires rise and fall. But always remember, I am mighty and enduring."
# File with text to generate. Ignores the text above.
gen_file = ""
remove_silence = false
output_dir = "tests"

You can also leverage a .toml file to do multi-style generation; refer to src/f5_tts/infer/examples/multi/story.toml:

# F5-TTS | E2-TTS
model = "F5-TTS"
ref_audio = "infer/examples/multi/main.flac"
# If left as an empty "", the reference audio is transcribed automatically.
ref_text = ""
gen_text = ""
# File with text to generate. Ignores the text above.
gen_file = "infer/examples/multi/story.txt"
remove_silence = true
output_dir = "tests"

[voices.town]
ref_audio = "infer/examples/multi/town.flac"
ref_text = ""

[voices.country]
ref_audio = "infer/examples/multi/country.flac"
ref_text = ""

Mark the voice with [main] [town] [country] wherever you want to change voices; refer to src/f5_tts/infer/examples/multi/story.txt.
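
A hypothetical illustration of the marker format (not the actual contents of story.txt):

[main] Once upon a time, a traveler set out at dawn.
[town] "Welcome, stranger!" the townsfolk called from the square.
[country] "You'll find quieter roads out here," said the farmer.
[main] And so the journey continued.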

Speech Editing

To test speech editing capabilities, use the following command:

python src/f5_tts/infer/speech_edit.py

Socket Realtime Client

To communicate with the socket server, first run

python src/f5_tts/socket_server.py

Then create a client to communicate:

import socket
import numpy as np
import asyncio
import pyaudio

async def listen_to_voice(text, server_ip='localhost', server_port=9999):
    client_socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client_socket.connect((server_ip, server_port))

    async def play_audio_stream():
        buffer = b''
        p = pyaudio.PyAudio()
        stream = p.open(format=pyaudio.paFloat32,
                        channels=1,
                        rate=24000,  # Ensure this matches the server's sampling rate
                        output=True,
                        frames_per_buffer=2048)

        try:
            while True:
                chunk = await asyncio.get_event_loop().run_in_executor(None, client_socket.recv, 1024)
                if not chunk:  # End of stream
                    break
                if b"END_OF_AUDIO" in chunk:
                    buffer += chunk.replace(b"END_OF_AUDIO", b"")
                    if buffer:
                        audio_array = np.frombuffer(buffer, dtype=np.float32).copy()  # Make a writable copy
                        stream.write(audio_array.tobytes())
                    break
                buffer += chunk
                if len(buffer) >= 4096:
                    audio_array = np.frombuffer(buffer[:4096], dtype=np.float32).copy()  # Make a writable copy
                    stream.write(audio_array.tobytes())
                    buffer = buffer[4096:]
        finally:
            stream.stop_stream()
            stream.close()
            p.terminate()

    try:
        # Send only the text to the server
        await asyncio.get_event_loop().run_in_executor(None, client_socket.sendall, text.encode('utf-8'))
        await play_audio_stream()
        print("Audio playback finished.")

    except Exception as e:
        print(f"Error in listen_to_voice: {e}")

    finally:
        client_socket.close()

# Example usage: Replace this with your actual server IP and port
async def main():
    await listen_to_voice("my name is jenny..", server_ip='localhost', server_port=9998)

# Run the main async function
asyncio.run(main())
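
The server streams raw float32 PCM at 24 kHz and marks the end of a stream with an END_OF_AUDIO sentinel, which is why the client opens pyaudio at that sample rate and scans each chunk for the marker. Note that main() passes port 9998 while listen_to_voice defaults to 9999; use whichever port your socket_server.py instance actually binds.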