Commit: Merge branch 'main' of github.com:kyutai-labs/moshi
Showing 19 changed files with 601 additions and 42 deletions.
@@ -1,9 +1,85 @@
# Moshi: a speech-text foundation model for real-time dialogue

![precommit badge](https://github.com/kyutai-labs/moshi/workflows/precommit/badge.svg)
![rust ci badge](https://github.com/kyutai-labs/moshi/workflows/Rust%20CI/badge.svg)

[Moshi][moshi] is a speech-text foundation model and **full-duplex** spoken dialogue framework.
It uses [Mimi][moshi], a state-of-the-art streaming neural audio codec. Mimi operates at 12.5 Hz and compresses
audio down to 1.1 kbps in a fully streaming manner (latency of 80ms, the frame size),
yet performs better than existing non-streaming codecs such as
[SpeechTokenizer](https://github.com/ZhangXInFD/SpeechTokenizer) (50 Hz, 4 kbps) or [SemantiCodec](https://github.com/haoheliu/SemantiCodec-inference) (50 Hz, 1 kbps).
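As a sanity check on these numbers, the 1.1 kbps figure follows directly from Mimi's frame rate and token budget. The sketch below assumes 8 codebooks with 2048 entries each; those two values are not stated in this README and are taken as an assumption here:

```python
import math

FRAME_RATE_HZ = 12.5   # Mimi frame rate, stated above
NUM_CODEBOOKS = 8      # assumption: 8 residual codebooks per frame
CODEBOOK_SIZE = 2048   # assumption: 2048 entries per codebook

bits_per_token = math.log2(CODEBOOK_SIZE)  # 11 bits per token
bitrate_bps = FRAME_RATE_HZ * NUM_CODEBOOKS * bits_per_token

print(f"{bitrate_bps / 1000:.1f} kbps")  # → 1.1 kbps
```

Under these assumptions the bitrate comes out to exactly 1100 bps, matching the 1.1 kbps quoted above.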

Moshi models **two streams of audio**: one corresponds to Moshi, and one to the user.
At inference, the stream from the user is taken from the audio input,
and the one for Moshi is sampled from. Alongside, Moshi predicts text tokens corresponding to its own speech, its **inner monologue**,
which greatly improves the quality of its generation. A small depth transformer models inter-codebook dependencies for a given step,
while a large, 7B-parameter transformer models the temporal dependencies. Moshi achieves a theoretical latency
of 160ms (80ms for the frame size of Mimi + 80ms of acoustic delay), with a practical overall latency as low as 200ms.
[Talk to Moshi](https://moshi.chat) now on our live demo.
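The 160ms theoretical latency decomposes mechanically from the frame rate, using only the two delays named above:

```python
frame_rate_hz = 12.5
frame_size_ms = 1000 / frame_rate_hz  # one Mimi frame = 80 ms of audio
acoustic_delay_ms = 80                # acoustic delay, stated above

theoretical_latency_ms = frame_size_ms + acoustic_delay_ms
print(theoretical_latency_ms)  # → 160.0
```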

<p align="center">
<img src="./moshi.png" alt="Schema representing the structure of Moshi. Moshi models two streams of audio:
one corresponds to Moshi, and one to the user. At inference, the one from the user is taken from the audio input,
and the one for Moshi is sampled from. Alongside, Moshi predicts text tokens corresponding to its own speech
for improved accuracy. A small depth transformer models inter-codebook dependencies for a given step."
width="650px"></p>

Mimi builds on previous neural audio codecs such as [SoundStream](https://arxiv.org/abs/2107.03312)
and [EnCodec](https://github.com/facebookresearch/encodec), adding a Transformer both in the encoder and the decoder,
and adapting the strides to match an overall frame rate of 12.5 Hz. This allows Mimi to get closer to the
average frame rate of text tokens (~3-4 Hz), and limits the number of autoregressive steps in Moshi.
Similarly to SpeechTokenizer, Mimi uses a distillation loss so that the first codebook tokens match
a self-supervised representation from [WavLM](https://arxiv.org/abs/2110.13900). Interestingly, while
Mimi is fully causal and streaming, it learns to match the non-causal representation from WavLM sufficiently well,
without introducing any delays. Finally, and similarly to [EBEN](https://arxiv.org/pdf/2210.14090), Mimi
uses **only an adversarial training loss**, along with feature matching, showing strong improvements in subjective quality
despite its low bitrate.
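To see concretely why the lower frame rate matters for the language model, compare the number of autoregressive steps per stretch of audio at Mimi's 12.5 Hz versus a 50 Hz codec:

```python
def autoregressive_steps(seconds: float, frame_rate_hz: float) -> int:
    # One temporal-transformer step per codec frame.
    return int(seconds * frame_rate_hz)

print(autoregressive_steps(10, 12.5))  # → 125
print(autoregressive_steps(10, 50.0))  # → 500
```

Ten seconds of audio take 125 steps at 12.5 Hz but 500 at 50 Hz, a 4x reduction in sequence length for the temporal transformer.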

<p align="center">
<img src="./mimi.png" alt="Schema representing the structure of Mimi, a streaming neural audio codec
with a Transformer in both the encoder and the decoder, operating at an overall frame rate of 12.5 Hz."
width="800px"></p>
## Organisation of the repository

There are three separate versions of the moshi inference stack in this repo.
- The Python version using PyTorch is in the [`moshi/`](moshi/) directory.
- The Python version using MLX, for M series Macs, is in the [`moshi_mlx/`](moshi_mlx/) directory.
- The Rust version used in production is in the [`rust/`](rust/) directory.

Finally, the code for the live demo is provided in the [`client/`](client/) directory.

## Requirements

You will need at least Python 3.10. For the Rust backend, you will need a recent version of
the [Rust toolchain](https://rustup.rs/). For specific requirements, please check the individual backend
directories. You can install the PyTorch and MLX clients with the following:

```bash
pip install moshi      # moshi PyTorch, from PyPI
pip install moshi_mlx  # moshi MLX, from PyPI
# Or the bleeding-edge versions of moshi and moshi_mlx:
pip install -e "git+https://git@github.com/kyutai-labs/moshi.git#egg=moshi&subdirectory=moshi"
pip install -e "git+https://git@github.com/kyutai-labs/moshi.git#egg=moshi_mlx&subdirectory=moshi_mlx"
```

While we hope that the present codebase will work on Windows, we do not provide official support for it.
We have tested the MLX version on a MacBook Pro M3. At the moment, we do not support quantization
for the PyTorch version, so you will need a GPU with a significant amount of memory (24GB).
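The 24GB figure is consistent with back-of-the-envelope math for an unquantized 7B-parameter model; a rough sketch (the exact overheads for activations and the KV cache vary and are not stated here):

```python
params = 7e9         # 7B-parameter temporal transformer, stated above
bytes_per_param = 2  # assumption: bf16/fp16 weights

weights_gb = params * bytes_per_param / 1e9
print(f"weights alone: {weights_gb:.0f} GB")  # → weights alone: 14 GB
# Activations, the KV cache, and the depth transformer add several
# more GB on top, hence the 24GB recommendation.
```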
## Development

If you wish to install from a clone of this repository, maybe to further develop Moshi, you can do the following:
```bash
# From the root of the clone of the repo
pip install -e 'moshi[dev]'
pip install -e 'moshi_mlx[dev]'
pre-commit install
```
## Python (PyTorch)

@@ -15,38 +91,37 @@ run the model, you can then use either the web UI or a command line client.

Start the server with:
```bash
python -m moshi.server [--gradio_tunnel]
```

And then access the web UI on [localhost:8998](http://localhost:8998). If your GPU is on a distant machine
with no direct access, `--gradio_tunnel` will create a tunnel with a URL accessible from anywhere.
Keep in mind that this tunnel goes through the US and can add significant latency (up to 500ms from Europe).
Alternatively, you might want to use SSH to redirect your connection.

Accessing a server that is not localhost via HTTP may cause issues with using
the microphone in the web UI (in some browsers this is only allowed over
HTTPS).

A local client is also available, as
```bash
python -m moshi.client [--url URL_TO_GRADIO]
```
Note, however, that unlike the web browser, this client is bare-bones: it doesn't do any echo cancellation,
nor does it try to compensate for a growing lag by skipping frames.

## Python (MLX) for local inference on macOS

Once you have installed `moshi_mlx`, you can run
```bash
python -m moshi_mlx.local -q 4   # weights quantized to 4 bits
python -m moshi_mlx.local -q 8   # weights quantized to 8 bits
```

This uses a command line interface, which is bare-bones: it doesn't do any echo cancellation,
nor does it try to compensate for a growing lag by skipping frames.

Alternatively you can use `python -m moshi_mlx.local_web` to use
the web UI; the connection is via HTTP on [localhost:8998](http://localhost:8998).
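The `-q 4` / `-q 8` flags trade quality for memory. Roughly, assuming the 7B parameter count quoted earlier also applies to the MLX weights (an assumption, not stated in this section):

```python
params = 7e9  # assumed parameter count of the main transformer
for bits in (4, 8, 16):
    gb = params * bits / 8 / 1e9  # bits → bytes → GB
    print(f"{bits}-bit weights: {gb:.1f} GB")
# → 4-bit weights: 3.5 GB
# → 8-bit weights: 7.0 GB
# → 16-bit weights: 14.0 GB
```

This is why 4-bit quantization makes local inference feasible on a MacBook-class machine.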

## Rust

@@ -102,3 +177,25 @@ npm run build
```

The web UI can then be found in the `client/dist` directory.

## License

The present code is provided under the MIT license for the Python parts, and the Apache license for the Rust backend.
The web client code is provided under the MIT license.
Note that parts of this code are based on [AudioCraft](https://github.com/facebookresearch/audiocraft), released under
the MIT license.

## Citation

If you use either Mimi or Moshi, please cite the following paper:

```
@article{defossez2024moshi,
    title={Moshi: a speech-text foundation model for real-time dialogue},
    author={Alexandre Défossez and Laurent Mazaré and Manu Orsini and Amélie Royer and Patrick Pérez and Hervé Jégou and Edouard Grave and Neil Zeghidour},
    journal={arXiv:TBC},
    year={2024},
}
```

[moshi]: https://arxiv.org/
@@ -0,0 +1,23 @@
Permission is hereby granted, free of charge, to any
person obtaining a copy of this software and associated
documentation files (the "Software"), to deal in the
Software without restriction, including without
limitation the rights to use, copy, modify, merge,
publish, distribute, sublicense, and/or sell copies of
the Software, and to permit persons to whom the Software
is furnished to do so, subject to the following
conditions:

The above copyright notice and this permission notice
shall be included in all copies or substantial portions
of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF
ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED
TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A
PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT
SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR
IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
DEALINGS IN THE SOFTWARE.
File renamed without changes.
@@ -0,0 +1,5 @@
include LICENSE*
include *.md
include *.cfg
include requirements.txt
include moshi/py.typed
@@ -1 +1,97 @@
# Moshi - PyTorch

See the [top-level README.md][main_repo] for more information on Moshi.

[Moshi][moshi] is a speech-text foundation model and full-duplex spoken dialogue framework.
It uses [Mimi][moshi], a state-of-the-art streaming neural audio codec. Mimi operates at 12.5 Hz and compresses
audio down to 1.1 kbps in a fully streaming manner (latency of 80ms, the frame size), yet performs better than existing non-streaming codecs.

This is the PyTorch implementation for Moshi and Mimi.
## Requirements

You will need at least Python 3.10. We kept a minimal set of dependencies for the current project.
It was tested with PyTorch 2.2 and 2.4. If you need a specific CUDA version, please make sure
to have PyTorch properly installed before installing Moshi.

```bash
pip install moshi  # moshi PyTorch, from PyPI
# Or the bleeding-edge version of moshi:
pip install -e "git+https://git@github.com/kyutai-labs/moshi#egg=moshi&subdirectory=moshi"
```

While we hope that the present codebase will work on Windows, we do not provide official support for it.
At the moment, we do not support quantization for the PyTorch version, so you will need a GPU with a significant amount of memory (24GB).
## Usage

This package provides a streaming version of the audio tokenizer (Mimi) and the language model (Moshi).

In order to run in interactive mode, you need to start a server which will
run the model; you can then use either the web UI or a command line client.
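To illustrate what "streaming" means for the tokenizer, the toy sketch below frames raw audio into Mimi-sized chunks as it arrives. The 24 kHz sample rate is an assumption, and the framing helper is hypothetical, for illustration only, not the actual `moshi` API:

```python
SAMPLE_RATE = 24_000                  # assumption: Mimi's input sample rate
FRAME_SIZE = int(SAMPLE_RATE / 12.5)  # 1920 samples = one 80 ms frame

def frames(samples):
    """Yield fixed-size 80 ms frames; a real streaming tokenizer would
    emit one set of codebook tokens per frame as audio arrives."""
    for start in range(0, len(samples) - FRAME_SIZE + 1, FRAME_SIZE):
        yield samples[start:start + FRAME_SIZE]

one_second = [0.0] * SAMPLE_RATE           # 1 s of silence
print(sum(1 for _ in frames(one_second)))  # → 12
```

One second yields 12 complete frames (12.5 on average), which is why the latency per frame is exactly 80 ms.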

Start the server with:
```bash
python -m moshi.server [--gradio_tunnel]
```

And then access the web UI on [localhost:8998](http://localhost:8998). If your GPU is on a distant machine
with no direct access, `--gradio_tunnel` will create a tunnel with a URL accessible from anywhere.
Keep in mind that this tunnel goes through the US and can add significant latency (up to 500ms from Europe).
Alternatively, you might want to use SSH to redirect your connection.

Accessing a server that is not localhost via HTTP may cause issues with using
the microphone in the web UI (in some browsers this is only allowed over
HTTPS).

A local client is also available, as
```bash
python -m moshi.client [--url URL_TO_GRADIO]
```
Note, however, that unlike the web browser, this client is bare-bones: it doesn't do any echo cancellation,
nor does it try to compensate for a growing lag by skipping frames.
## Development

If you wish to install from a clone of this repository, maybe to further develop Moshi, you can do the following:
```bash
# From the current folder (e.g. `moshi/`)
pip install -e '.[dev]'
pre-commit install
```

Once locally installed, Mimi can be tested with the following command, from **the root** of the repository:
```bash
wget https://github.com/metavoiceio/metavoice-src/raw/main/assets/bria.mp3
python scripts/mimi_test.py
```

Similarly, Moshi can be tested (with a GPU) with
```bash
python scripts/moshi_benchmark.py
```

## License

The present code is provided under the MIT license.
Note that parts of this code are based on [AudioCraft](https://github.com/facebookresearch/audiocraft), released under
the MIT license.
## Citation

If you use either Mimi or Moshi, please cite the following paper:

```
@article{defossez2024moshi,
    title={Moshi: a speech-text foundation model for real-time dialogue},
    author={Alexandre Défossez and Laurent Mazaré and Manu Orsini and Amélie Royer and Patrick Pérez and Hervé Jégou and Edouard Grave and Neil Zeghidour},
    journal={arXiv:TBC},
    year={2024},
}
```

[moshi]: https://arxiv.org/
This file was deleted.
@@ -4,3 +4,7 @@ max-line-length = 120
[flake8]
max-line-length = 120
ignore = E203,E704
exclude =
    dist
    build
@@ -0,0 +1,5 @@
include LICENSE*
include *.md
include *.cfg
include requirements.txt
include moshi_mlx/py.typed