Merge branch 'main' of github.com:kyutai-labs/moshi
adefossez committed Sep 18, 2024
2 parents 0da83c7 + ebd489f commit bcb79be
Showing 19 changed files with 601 additions and 42 deletions.
2 changes: 1 addition & 1 deletion .github/actions/moshi_build/action.yml
@@ -19,9 +19,9 @@ runs:
. env/bin/activate
python -m pip install --upgrade pip
pip install torch==2.4.0 --index-url https://download.pytorch.org/whl/cpu
pip install -e './moshi[dev]'
- name: Setup env
shell: bash
run: |
. env/bin/activate
pre-commit install
pip install -e './moshi[dev]'
8 changes: 6 additions & 2 deletions .pre-commit-config.yaml
@@ -6,19 +6,23 @@ repos:
language: system
entry: bash -c 'cd moshi && flake8'
pass_filenames: false
always_run: true
- id: pyright-moshi
name: pyright on moshi package
language: system
entry: bash -c 'cd moshi && pyright'
entry: scripts/run_ci_when_installed.sh moshi 'cd moshi && pyright'
pass_filenames: false
always_run: true
- id: flake8-moshi_mlx
name: flake8 on moshi_mlx package
language: system
entry: bash -c 'cd moshi_mlx && flake8'
pass_filenames: false
always_run: true
- id: pyright-moshi_mlx
name: pyright on moshi_mlx package
language: system
entry: bash -c 'cd moshi_mlx && pyright'
entry: scripts/run_ci_when_installed.sh moshi_mlx 'cd moshi_mlx && pyright'
pass_filenames: false
always_run: true
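
The `scripts/run_ci_when_installed.sh` wrapper is not shown in this diff. Judging only from its name and its two arguments (a package name and a shell command), one plausible reading is that it runs the check only when the package is installed, so that the hook does not fail on machines that only have one of the two backends. A minimal Python sketch of that idea follows, assuming this behaviour; the function name `run_if_installed` is hypothetical and the actual script may work differently.

```python
import importlib.util
import subprocess


def run_if_installed(package: str, command: str) -> int:
    """Run `command` in a shell only if `package` can be imported.

    Returns the command's exit code, or 0 when the check is skipped.
    This mirrors a guess at what the wrapper script does, not its real code.
    """
    if importlib.util.find_spec(package) is None:
        print(f"{package} is not installed, skipping: {command}")
        return 0
    return subprocess.run(command, shell=True).returncode


if __name__ == "__main__":
    # Same arguments as the pre-commit hook passes to the wrapper script.
    raise SystemExit(run_if_installed("moshi", "cd moshi && pyright"))
```

Returning 0 on a skip keeps the hook green for contributors who only work on one package, which seems to be the intent of the change.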

143 changes: 120 additions & 23 deletions README.md
@@ -1,9 +1,85 @@
# moshi
# Moshi: a speech-text foundation model for real-time dialogue

![precommit badge](https://github.com/kyutai-labs/moshi/workflows/precommit/badge.svg)
![rust ci badge](https://github.com/kyutai-labs/moshi/workflows/Rust%20CI/badge.svg)

[Moshi][moshi] is a speech-text foundation model and **full-duplex** spoken dialogue framework.
It uses [Mimi][moshi], a state-of-the-art streaming neural audio codec. Mimi operates at 12.5 Hz, and compresses
audio down to 1.1 kbps, in a fully streaming manner (latency of 80ms, the frame size),
yet performs better than existing non-streaming codecs such as
[SpeechTokenizer](https://github.com/ZhangXInFD/SpeechTokenizer) (50 Hz, 4 kbps) or [SemantiCodec](https://github.com/haoheliu/SemantiCodec-inference) (50 Hz, 1 kbps).

Moshi models **two streams of audio**: one corresponds to Moshi, and one to the user.
At inference, the stream from the user is taken from the audio input,
and the one for Moshi is sampled from the model's output. Alongside this, Moshi predicts text tokens corresponding to its own speech, its **inner monologue**,
which greatly improves the quality of its generation. A small depth transformer models inter-codebook dependencies for a given step,
while a large, 7B-parameter Transformer models the temporal dependencies. Moshi achieves a theoretical latency
of 160ms (80ms for the frame size of Mimi + 80ms of acoustic delay), with a practical overall latency as low as 200ms.
[Talk to Moshi](https://moshi.chat) now on our live demo.
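
The figures quoted above can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers from the text (12.5 Hz, 1.1 kbps, 80 ms of acoustic delay); the function names are illustrative and not part of the `moshi` package, and the assumption that the acoustic delay equals exactly one frame is ours.

```python
FRAME_RATE_HZ = 12.5        # Mimi frame rate, from the text
BITRATE_BPS = 1100.0        # 1.1 kbps, from the text
ACOUSTIC_DELAY_FRAMES = 1   # assumption: the 80 ms acoustic delay is one frame


def frame_duration_ms(frame_rate_hz: float) -> float:
    """Duration of one codec frame in milliseconds."""
    return 1000.0 / frame_rate_hz


def bits_per_frame(bitrate_bps: float, frame_rate_hz: float) -> float:
    """Number of bits the codec spends on each frame."""
    return bitrate_bps / frame_rate_hz


def theoretical_latency_ms(frame_rate_hz: float, delay_frames: int) -> float:
    """One frame of codec latency plus the acoustic delay, in milliseconds."""
    return frame_duration_ms(frame_rate_hz) * (1 + delay_frames)


if __name__ == "__main__":
    print(frame_duration_ms(FRAME_RATE_HZ))                              # frame size in ms
    print(bits_per_frame(BITRATE_BPS, FRAME_RATE_HZ))                    # bits per frame
    print(theoretical_latency_ms(FRAME_RATE_HZ, ACOUSTIC_DELAY_FRAMES))  # latency in ms
```

With these inputs the frame size comes out at 80 ms, the budget at 88 bits per frame, and the theoretical latency at 160 ms, matching the numbers stated in the paragraph.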

<p align="center">
<img src="./moshi.png" alt="Schema representing the structure of Moshi. Moshi models two streams of audio:
one corresponds to Moshi, and one to the user. At inference, the one from the user is taken from the audio input,
and the one for Moshi is sampled from the model's output. Alongside this, Moshi predicts text tokens corresponding to its own speech
for improved accuracy. A small depth transformer models inter-codebook dependencies for a given step."
width="650px"></p>

Mimi builds on previous neural audio codecs such as [SoundStream](https://arxiv.org/abs/2107.03312)
and [EnCodec](https://github.com/facebookresearch/encodec), adding a Transformer both in the encoder and decoder,
and adapting the strides to match an overall frame rate of 12.5 Hz. This allows Mimi to get closer to the
average frame rate of text tokens (~3-4 Hz), and limit the number of auto-regressive steps in Moshi.
Similarly to SpeechTokenizer, Mimi uses a distillation loss so that the first codebook tokens match
a self-supervised representation from [WavLM](https://arxiv.org/abs/2110.13900). Interestingly, while
Mimi is fully causal and streaming, it learns to match the non-causal representation from WavLM sufficiently well,
without introducing any delays. Finally, and similarly to [EBEN](https://arxiv.org/pdf/2210.14090), Mimi
uses **only an adversarial training loss**, along with feature matching, showing strong improvements in terms of subjective quality
despite its low bitrate.
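
To make the benefit of the lower frame rate concrete, the sketch below counts how many temporal auto-regressive steps are needed for a given duration of audio at 12.5 Hz versus the 50 Hz used by the codecs cited earlier. The 10-second duration and the 3.5 Hz midpoint for text tokens are illustrative choices, and the helper name is hypothetical.

```python
import math


def autoregressive_steps(duration_s: float, frame_rate_hz: float) -> int:
    """Number of frames (temporal steps) needed to cover `duration_s` seconds."""
    return math.ceil(duration_s * frame_rate_hz)


if __name__ == "__main__":
    duration_s = 10.0  # illustrative utterance length
    mimi_steps = autoregressive_steps(duration_s, 12.5)
    codec_50hz_steps = autoregressive_steps(duration_s, 50.0)
    print(mimi_steps, codec_50hz_steps, codec_50hz_steps / mimi_steps)

    # Audio frames per text token, taking 3.5 Hz as a midpoint of the ~3-4 Hz range.
    print(12.5 / 3.5)
```

At 12.5 Hz the model takes a quarter of the steps a 50 Hz codec would require, and each text token spans only a handful of audio frames.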

<p align="center">
<img src="./mimi.png" alt="Schema representing the structure of Moshi. Moshi models two streams of audio:
one corresponds to Moshi, and one to the user. At inference, the one from the user is taken from the audio input,
and the one for Moshi is sampled from the model's output. Alongside this, Moshi predicts text tokens corresponding to its own speech
for improved accuracy. A small depth transformer models inter-codebook dependencies for a given step."
width="800px"></p>


## Organisation of the repository

There are three separate versions of the moshi inference stack in this repo.
- The python version using PyTorch is in the `moshi` directory.
- The python version using MLX is in the `moshi_mlx` directory.
- The rust version used in production is in the `rust` directory.
- The python version using PyTorch is in the [`moshi/`](moshi/) directory.
- The python version using MLX for M series Macs is in the [`moshi_mlx/`](moshi_mlx/) directory.
- The rust version used in production is in the [`rust/`](rust/) directory.

Finally, the code for the live demo is provided in the [`client/`](client/) directory.

## Requirements

You will need at least Python 3.10. To use the Rust backend, you will need a recent version of
the [Rust toolchain](https://rustup.rs/). For specific requirements, please check the individual backend
directories. You can install the PyTorch and MLX clients with the following:

```bash
pip install moshi # moshi PyTorch, from PyPI
pip install moshi_mlx # moshi MLX, from PyPI
# Or the bleeding edge versions for Moshi and Moshi-MLX.
pip install -e "git+https://git@github.com/kyutai-labs/moshi.git#egg=moshi&subdirectory=moshi"
pip install -e "git+https://git@github.com/kyutai-labs/moshi.git#egg=moshi_mlx&subdirectory=moshi_mlx"
```

While we hope that the present codebase will work on Windows, we do not provide official support for it.
We have tested the MLX version on a MacBook Pro M3. At the moment, we do not support quantization
for the PyTorch version, so you will need a GPU with a significant amount of memory (24GB).
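
As a rough illustration of why quantization matters for memory, the sketch below estimates the size of the weights alone for a 7B-parameter model at different precisions. The 16 bits per weight for the unquantized case is an assumption (the text does not state the precision used), the parameter count is rounded, and the estimate ignores activations, caches, the codec, and any quantization metadata, so real usage will be higher.

```python
N_PARAMS = 7e9  # "7B parameter" Transformer, rounded


def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    # 16 bits: assumed unquantized precision; 8 and 4 bits: quantized weights.
    for bits in (16, 8, 4):
        print(f"{bits:>2} bits/weight: ~{weight_memory_gb(N_PARAMS, bits):.1f} GB")
```

Under these assumptions the unquantized weights already take about 14 GB, which is consistent with needing a GPU in the 24GB class once runtime overhead is added.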


## Development

If you wish to install from a clone of this repository, for instance to further develop Moshi, you can do the following:
```
# From the root of the clone of the repo
pip install -e 'moshi[dev]'
pip install -e 'moshi_mlx[dev]'
pre-commit install
```

## Python (PyTorch)

@@ -15,38 +91,37 @@ run the model, you can then use either the web UI or a command line client.

Start the server with:
```bash
PYTHONPATH=moshi python -m moshi.server
python -m moshi.server [--gradio_tunnel]
```

And then access the web UI on [localhost:8998](http://localhost:8998).

If the server is running on a remote box, you may want to forward the 8998 port
via your ssh connection so as to be able to access the web UI locally.
And then access the web UI on [localhost:8998](http://localhost:8998). If your GPU is on a distant machine
with no direct access, `--gradio_tunnel` will create a tunnel with a URL accessible from anywhere.
Keep in mind that this tunnel goes through the US and can add significant latency (up to 500ms from Europe).
Alternatively, you might want to use SSH to redirect your connection.
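
Whether the server runs locally or is reached through a forwarded port, it can help to confirm that something is actually listening on port 8998 before opening the browser. The helper below is a generic TCP check using only the standard library; `port_is_open` is a hypothetical name and not part of the `moshi` package.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    if port_is_open("localhost", 8998):
        print("Something is listening on localhost:8998")
    else:
        print("Nothing is reachable on localhost:8998 yet")
```

This only checks that the port accepts connections; it does not verify that the process behind it is the Moshi server.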

Accessing a server that is not localhost via http may cause issues around using
the microphone in the web UI (in some browsers this is only allowed using
https).

## Python (MLX) for local inference on macOS

You can either compile and install the `rustymimi` extension or install it via
pip.
A local client is also available:
```bash
# Install from pip:
pip install rustymimi==0.1.1
# Alternatively, if you want to compile the package run:
maturin dev -r -m rust/mimi-pyo3/Cargo.toml
python -m moshi.client [--url URL_TO_GRADIO]
```
Note, however, that unlike the web UI, this client is barebones: it doesn't do any echo cancellation,
nor does it try to compensate for a growing lag by skipping frames.

## Python (MLX) for local inference on macOS

Then the model can be run with:
Once you have installed `moshi_mlx`, you can run
```bash
PYTHONPATH=moshi_mlx python -m moshi_mlx.local \
--model ~/tmp/[email protected] \
--mimi ~/tmp/tokenizer-e351c8d8-checkpoint125.safetensors \
--quantized 8
python -m moshi_mlx.local -q 4 # weights quantized to 4 bits
python -m moshi_mlx.local -q 8 # weights quantized to 8 bits
```

This uses a command line interface, alternatively you can use `local_web` to use
This uses a command line interface, which is barebones: it doesn't do any echo cancellation,
nor does it try to compensate for a growing lag by skipping frames.

Alternatively, you can use `python -m moshi_mlx.local_web` to use
the web UI; the connection is via HTTP on [localhost:8998](http://localhost:8998).

## Rust
@@ -102,3 +177,25 @@ npm run build
```

The web UI can then be found in the `client/dist` directory.

## License

The present code is provided under the MIT license for the Python parts, and Apache license for the Rust backend.
The web client code is provided under the MIT license.
Note that parts of this code are based on [AudioCraft](https://github.com/facebookresearch/audiocraft), released under
the MIT license.

## Citation

If you use either Mimi or Moshi, please cite the following paper,

```
@article{defossez2024moshi,
title={Moshi: a speech-text foundation model for real-time dialogue},
author={Alexandre Défossez and Laurent Mazaré and Manu Orsini and Amélie Royer and Patrick Pérez and Hervé Jégou and Edouard Grave and Neil Zeghidour},
journal={arXiv:TBC},
year={2024},
}
```

[moshi]: https://arxiv.org/
23 changes: 23 additions & 0 deletions client/LICENSE
@@ -0,0 +1,23 @@
Permission is hereby granted, free of charge, to any
person obtaining a copy of this software and associated
documentation files (the "Software"), to deal in the
Software without restriction, including without
limitation the rights to use, copy, modify, merge,
publish, distribute, sublicense, and/or sell copies of
the Software, and to permit persons to whom the Software
is furnished to do so, subject to the following
conditions:

The above copyright notice and this permission notice
shall be included in all copies or substantial portions
of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF
ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED
TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A
PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT
SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR
IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
DEALINGS IN THE SOFTWARE.
4 changes: 4 additions & 0 deletions client/README.md
@@ -14,3 +14,7 @@ Frontend for the demo.
## Skipping the queue
To skip the queue for standalone use, once the project is running go to `/?worker_addr={WORKER_ADDR}` where `WORKER_ADDR` is your worker instance address.
For example: `https://localhost:5173/?worker_addr=0.0.0.0:8088`
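
If the frontend is launched from a script, the queue-skipping URL can be assembled programmatically. This is a minimal sketch that reproduces the example above; the function name `skip_queue_url` is hypothetical, and the worker address is inserted as-is, without percent-encoding, to match the format shown.

```python
def skip_queue_url(frontend_url: str, worker_addr: str) -> str:
    """Build the frontend URL that bypasses the queue for a given worker address."""
    return f"{frontend_url.rstrip('/')}/?worker_addr={worker_addr}"


if __name__ == "__main__":
    print(skip_queue_url("https://localhost:5173", "0.0.0.0:8088"))
```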

## License

The present code is provided under the MIT license.
Binary file added mimi.png
Binary file added moshi.png
File renamed without changes.
5 changes: 5 additions & 0 deletions moshi/MANIFEST.in
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
include LICENSE*
include *.md
include *.cfg
include requirements.txt
include moshi/py.typed
98 changes: 97 additions & 1 deletion moshi/README.md
Original file line number Diff line number Diff line change
@@ -1 +1,97 @@
# moshi - pytorch
# Moshi - PyTorch

See the [top-level README.md][main_repo] for more information on Moshi.

[Moshi][moshi] is a speech-text foundation model and full-duplex spoken dialogue framework.
It uses [Mimi][moshi], a state-of-the-art streaming neural audio codec. Mimi operates at 12.5 Hz, and compresses
audio down to 1.1 kbps, in a fully streaming manner (latency of 80ms, the frame size), yet performs better than existing non-streaming codecs.

This is the PyTorch implementation for Moshi and Mimi.


## Requirements

You will need at least Python 3.10. We kept a minimal set of dependencies for the current project.
It was tested with PyTorch 2.2 or 2.4. If you need a specific CUDA version, please make sure
to have PyTorch properly installed before installing Moshi.

```bash
pip install moshi # moshi PyTorch, from PyPI
# Or the bleeding edge versions for Moshi
pip install -e "git+https://git@github.com/kyutai-labs/moshi#egg=moshi&subdirectory=moshi"
```

While we hope that the present codebase will work on Windows, we do not provide official support for it.
At the moment, we do not support quantization for the PyTorch version, so you will need a GPU with a significant amount of memory (24GB).


## Usage

This package provides a streaming version of the audio tokenizer (Mimi) and the language model (Moshi).

In order to run in interactive mode, you need to start a server which will
run the model; you can then use either the web UI or a command line client.

Start the server with:
```bash
python -m moshi.server [--gradio_tunnel]
```

And then access the web UI on [localhost:8998](http://localhost:8998). If your GPU is on a distant machine
with no direct access, `--gradio_tunnel` will create a tunnel with a URL accessible from anywhere.
Keep in mind that this tunnel goes through the US and can add significant latency (up to 500ms from Europe).
Alternatively, you might want to use SSH to redirect your connection.

Accessing a server that is not localhost via http may cause issues around using
the microphone in the web UI (in some browsers this is only allowed using
https).

A local client is also available:
```bash
python -m moshi.client [--url URL_TO_GRADIO]
```
Note, however, that unlike the web UI, this client is barebones: it doesn't do any echo cancellation,
nor does it try to compensate for a growing lag by skipping frames.

## Development

If you wish to install from a clone of this repository, for instance to further develop Moshi, you can do the following:
```bash
# From the current folder (e.g. `moshi/`)
pip install -e '.[dev]'
pre-commit install
```

Once locally installed, Mimi can be tested with the following command, from **the root** of the repository,
```bash
wget https://github.com/metavoiceio/metavoice-src/raw/main/assets/bria.mp3
python scripts/mimi_test.py

```

Similarly, Moshi can be tested (with a GPU) with
```bash
python scripts/moshi_benchmark.py
```


## License

The present code is provided under the MIT license.
Note that parts of this code are based on [AudioCraft](https://github.com/facebookresearch/audiocraft), released under
the MIT license.

## Citation

If you use either Mimi or Moshi, please cite the following paper,

```
@article{defossez2024moshi,
title={Moshi: a speech-text foundation model for real-time dialogue},
author={Alexandre Défossez and Laurent Mazaré and Manu Orsini and Amélie Royer and Patrick Pérez and Hervé Jégou and Edouard Grave and Neil Zeghidour},
journal={arXiv:TBC},
year={2024},
}
```

[moshi]: https://arxiv.org/
15 changes: 0 additions & 15 deletions moshi/moshi/testing.md

This file was deleted.

4 changes: 4 additions & 0 deletions moshi/setup.cfg
@@ -4,3 +4,7 @@ max-line-length = 120
[flake8]
max-line-length = 120
ignore = E203,E704
exclude =
dist
build

5 changes: 5 additions & 0 deletions moshi_mlx/MANIFEST.in
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
include LICENSE*
include *.md
include *.cfg
include requirements.txt
include moshi_mlx/py.typed