From 5e5f498aed87a07834ea9c8747f1bfc8aed2f0ca Mon Sep 17 00:00:00 2001
From: Manu Orsini <166398341+manukyutai@users.noreply.github.com>
Date: Wed, 18 Sep 2024 14:24:16 +0200
Subject: [PATCH] Update README.md

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 11f4370..b1dbdb9 100644
--- a/README.md
+++ b/README.md
@@ -10,7 +10,7 @@
 [SpeechTokenizer](https://github.com/ZhangXInFD/SpeechTokenizer) (50 Hz, 4 kbps), or [SemantiCodec](https://github.com/haoheliu/SemantiCodec-inference) (50 Hz, 1kbps).
 
 Moshi models **two streams of audio**: one corresponds to Moshi, and one to the user.
-During inference, the stream from the user is taken from the audio input,
+At inference, the stream from the user is taken from the audio input,
 and the one for Moshi is sampled from. Along that, Moshi predicts text tokens corresponding to its own speech, its **inner monologue**,
 which greatly improves the quality of its generation. A small depth transformer models inter codebook dependencies for a given time step,
 while a large, 7B parameter Transformer models the temporal dependencies. Moshi achieves a theoretical latency