diff --git a/README.md b/README.md
index 6dfb5f2..623cad8 100644
--- a/README.md
+++ b/README.md
@@ -12,7 +12,7 @@ Moshi models **two streams of audio**: one corresponds to Moshi, and the other
 one to the user. At inference, the stream from the user is taken from the audio input,
 and the one for Moshi is sampled from the model's output. Along these two audio
 streams, Moshi predicts text tokens corresponding to its own speech, its **inner monologue**,
-which greatly improves the quality of its generation. A small Depth Transformer models inter codebook dependencies for a given step,
+which greatly improves the quality of its generation. A small Depth Transformer models inter codebook dependencies for a given time step,
 while a large, 7B parameter Temporal Transformer models the temporal dependencies. Moshi achieves a theoretical latency
 of 160ms (80ms for the frame size of Mimi + 80ms of acoustic delay), with a practical overall latency as low as 200ms on an L4 GPU.
 
@@ -117,7 +117,7 @@ maturin dev -r -m rust/mimi-pyo3/Cargo.toml
 ## Python (PyTorch)
 
 The PyTorch based API can be found in the `moshi` directory. It provides a streaming
-version of the audio tokenizer (mimi) and the lm model (moshi).
+version of the audio tokenizer (mimi) and the language model (moshi).
 In order to run in interactive mode, you need to start a server which will run the model,
 you can then use either the web UI or a command line client.
 
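
Note: for the interactive-mode workflow described in the second hunk (start a server, then attach the web UI or a command line client), a minimal usage sketch is given below. The `moshi.server` and `moshi.client` module entry points are assumptions based on the package layout referenced above; see the repository README for the authoritative invocations.

    # start the server that loads the model and also serves the web UI
    python -m moshi.server
    # then, from another terminal, connect with the command line client
    # (or point a browser at the address printed by the server)
    python -m moshi.client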