Merge branch 'main' of github.com:kyutai-labs/moshi

kyutai-labs · Sep 18, 2024 · bcb79be
2 parents 0da83c7 + ebd489f

Showing 19 changed files with 601 additions and 42 deletions.
diff --git a/.github/actions/moshi_build/action.yml b/.github/actions/moshi_build/action.yml
@@ -19,9 +19,9 @@ runs:
       .  env/bin/activate
       python -m pip install --upgrade pip
       pip install torch==2.4.0 --index-url https://download.pytorch.org/whl/cpu
-      pip install -e './moshi[dev]'
   - name: Setup env
     shell: bash
     run: |
       .  env/bin/activate
       pre-commit install
+      pip install -e './moshi[dev]'
diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
@@ -6,19 +6,23 @@ repos:
       language: system
       entry: bash -c 'cd moshi && flake8'
       pass_filenames: false
+      always_run: true
     - id: pyright-moshi
       name: pyright on moshi package
       language: system
-      entry: bash -c 'cd moshi && pyright'
+      entry: scripts/run_ci_when_installed.sh moshi 'cd moshi && pyright'
       pass_filenames: false
+      always_run: true
     - id: flake8-moshi_mlx
       name: flake8 on moshi_mlx package
       language: system
       entry: bash -c 'cd moshi_mlx && flake8'
       pass_filenames: false
+      always_run: true
     - id: pyright-moshi_mlx
       name: pyright on moshi_mlx package
       language: system
-      entry: bash -c 'cd moshi_mlx && pyright'
+      entry: scripts/run_ci_when_installed.sh moshi_mlx 'cd moshi_mlx && pyright'
       pass_filenames: false
+      always_run: true
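
The two pyright hooks above now go through `scripts/run_ci_when_installed.sh`, whose contents are not shown in this diff. Judging only from its name and arguments (a package name followed by a shell command), it presumably runs the command only when the package is installed. The sketch below illustrates that idea in Python; the function name `run_when_installed` and the skip-with-exit-code-0 behaviour are assumptions, not the repository's actual script.

```python
import importlib.util
import subprocess
import sys


def run_when_installed(package: str, command: str) -> int:
    """Run `command` in a shell only if `package` can be imported.

    Returns the command's exit code, or 0 when the package is absent,
    so the hook counts as skipped rather than failed (an assumption).
    """
    if importlib.util.find_spec(package) is None:
        print(f"Package {package} not installed, skipping: {command}")
        return 0
    # shell=True so that compound commands like 'cd moshi && pyright' work.
    return subprocess.run(command, shell=True).returncode


if __name__ == "__main__" and len(sys.argv) == 3:
    # Hypothetical invocation: python run_when_installed.py moshi 'cd moshi && pyright'
    sys.exit(run_when_installed(sys.argv[1], sys.argv[2]))
```

With a guard of this kind, a contributor who has installed only one of `moshi` or `moshi_mlx` is not blocked by a type check for the other package.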
 
diff --git a/README.md b/README.md
@@ -1,9 +1,85 @@
-# moshi
+# Moshi: a speech-text foundation model for real-time dialogue
+
+![precommit badge](https://github.com/kyutai-labs/moshi/workflows/precommit/badge.svg)
+![rust ci badge](https://github.com/kyutai-labs/moshi/workflows/Rust%20CI/badge.svg)
+
+[Moshi][moshi] is a speech-text foundation model and **full-duplex** spoken dialogue framework.
+It uses [Mimi][moshi], a state-of-the-art streaming neural audio codec. Mimi operates at 12.5 Hz and compresses
+audio down to 1.1 kbps in a fully streaming manner (latency of 80ms, the frame size),
+yet performs better than existing non-streaming codecs such as
+[SpeechTokenizer](https://github.com/ZhangXInFD/SpeechTokenizer) (50 Hz, 4 kbps) or [SemantiCodec](https://github.com/haoheliu/SemantiCodec-inference) (50 Hz, 1 kbps).
+
+Moshi models **two streams of audio**: one corresponds to Moshi, and one to the user.
+At inference, the user's stream is taken from the audio input,
+and Moshi's stream is sampled from the model's output. Alongside these, Moshi predicts text tokens corresponding to its own speech, its **inner monologue**,
+which greatly improves the quality of its generation. A small depth transformer models inter-codebook dependencies for a given step,
+while a large, 7B-parameter Transformer models the temporal dependencies. Moshi achieves a theoretical latency
+of 160ms (80ms for the frame size of Mimi + 80ms of acoustic delay), with a practical overall latency as low as 200ms.
+[Talk to Moshi](https://moshi.chat) now on our live demo.
+
+<p align="center">
+<img src="./moshi.png" alt="Schema representing the structure of Moshi. Moshi models two streams of audio:
+    one corresponds to Moshi, and one to the user. At inference, the one from the user is taken from the audio input,
+    and the one for Moshi is sampled from. Along that, Moshi predicts text tokens corresponding to its own speech
+    for improved accuracy. A small depth transformer models inter codebook dependencies for a given step."
+width="650px"></p>
+
+Mimi builds on previous neural audio codecs such as [SoundStream](https://arxiv.org/abs/2107.03312)
+and [EnCodec](https://github.com/facebookresearch/encodec), adding a Transformer in both the encoder and the decoder,
+and adapting the strides to match an overall frame rate of 12.5 Hz. This brings Mimi closer to the
+average frame rate of text tokens (~3-4 Hz) and limits the number of auto-regressive steps in Moshi.
+Like SpeechTokenizer, Mimi uses a distillation loss so that the first codebook tokens match
+a self-supervised representation from [WavLM](https://arxiv.org/abs/2110.13900). Interestingly, while
+Mimi is fully causal and streaming, it learns to match the non-causal representation from WavLM sufficiently well,
+without introducing any delay. Finally, and similarly to [EBEN](https://arxiv.org/pdf/2210.14090), Mimi
+uses **only an adversarial training loss**, along with feature matching, showing strong improvements in terms of subjective quality
+despite its low bitrate.
+
+<p align="center">
+<img src="./mimi.png" alt="Schema representing the structure Moshi. Moshi models two streams of audio:
+    one corresponds to Moshi, and one to the user. At inference, the one from the user is taken from the audio input,
+    and the one for Moshi is sampled from. Along that, Moshi predicts text tokens corresponding to its own speech
+    for improved accuracy. A small depth transformer models inter codebook dependencies for a given step."
+width="800px"></p>
+
+
+## Organisation of the repository
 
 There are three separate versions of the moshi inference stack in this repo.
-- The python version using PyTorch is in the `moshi` directory.
-- The python version using MLX is in the `moshi_mlx` directory.
-- The rust version used in production is in the `rust` directory.
+- The Python version using PyTorch is in the [`moshi/`](moshi/) directory.
+- The Python version using MLX for M-series Macs is in the [`moshi_mlx/`](moshi_mlx/) directory.
+- The Rust version used in production is in the [`rust/`](rust/) directory.
+
+Finally, the code for the live demo is provided in the [`client/`](client/) directory.
+
+## Requirements
+
+You will need at least Python 3.10. To use the Rust backend, you will need a recent version of
+the [Rust toolchain](https://rustup.rs/). For specific requirements, please check the individual backend
+directories. You can install the PyTorch and MLX clients with the following:
+
+```bash
+pip install moshi      # moshi PyTorch, from PyPI
+pip install moshi_mlx  # moshi MLX, from PyPI
+# Or the bleeding edge versions for Moshi and Moshi-MLX.
+pip install -e "git+https://git@github.com/kyutai-labs/moshi.git#egg=moshi&subdirectory=moshi"
+pip install -e "git+https://git@github.com/kyutai-labs/moshi.git#egg=moshi_mlx&subdirectory=moshi_mlx"
+```
+
+While we hope that the present codebase will work on Windows, we do not provide official support for it.
+We have tested the MLX version on a MacBook Pro M3. At the moment, we do not support quantization
+for the PyTorch version, so you will need a GPU with a significant amount of memory (24GB).
+
+
+## Development
+
+If you wish to install from a clone of this repository, maybe to further develop Moshi, you can do the following:
+```bash
+# From the root of the clone of the repo
+pip install -e 'moshi[dev]'
+pip install -e 'moshi_mlx[dev]'
+pre-commit install
+```
 
 ## Python (PyTorch)
 
@@ -15,38 +91,37 @@ run the model, you can then use either the web UI or a command line client.
 
 Start the server with:
 ```bash
-PYTHONPATH=moshi python -m moshi.server
+python -m moshi.server [--gradio_tunnel]
 ```
 
-And then access the web UI on [localhost:8998](http://localhost:8998).
-
-If the server is running on a remote box, you may want to forward the 8998 port
-via your ssh connection so as to be able to access the web UI locally.
+And then access the web UI on [localhost:8998](http://localhost:8998). If your GPU is on a distant machine
+with no direct access, `--gradio_tunnel` will create a tunnel with a URL accessible from anywhere.
+Keep in mind that this tunnel goes through the US and can add significant latency (up to 500ms from Europe).
+Alternatively, you might want to use SSH to redirect your connection.
 
 Accessing a server that is not localhost via http may cause issues around using
 the microphone in the web UI (in some browsers this is only allowed using
 https).
 
-## Python (MLX) for local inference on macOS
-
-You can either compile and install the `rustymimi` extension or install it via
-pip.
+A local client is also available:
 ```bash
-# Install from pip:
-pip install rustymimi==0.1.1
-# Alternatively, if you want to compile the package run:
-maturin dev -r -m rust/mimi-pyo3/Cargo.toml
+python -m moshi.client [--url URL_TO_GRADIO]
 ```
+Note, however, that unlike the web browser, this client is barebones. It doesn't do any echo cancellation,
+nor does it try to compensate for a growing lag by skipping frames.
+
+## Python (MLX) for local inference on macOS
 
-Then the model can be run with:
+Once you have installed `moshi_mlx`, you can run
 ```bash
-PYTHONPATH=moshi_mlx python -m moshi_mlx.local  \
-    --model ~/tmp/[email protected] \
-    --mimi ~/tmp/tokenizer-e351c8d8-checkpoint125.safetensors \
-    --quantized 8
+python -m moshi_mlx.local -q 4   # weights quantized to 4 bits
+python -m moshi_mlx.local -q 8   # weights quantized to 8 bits
 ```
 
-This uses a command line interface, alternatively you can use `local_web` to use
+This uses a command line interface, which is barebones. It doesn't do any echo cancellation,
+nor does it try to compensate for a growing lag by skipping frames.
+
+Alternatively, you can use `python -m moshi_mlx.local_web` to use
 the web UI, connection is via http on [localhost:8998](http://localhost:8998).
 
 ## Rust
@@ -102,3 +177,25 @@ npm run build
 ```
 
 The web UI can then be found in the `client/dist` directory.
+
+## License
+
+The present code is provided under the MIT license for the Python parts, and Apache license for the Rust backend.
+The web client code is provided under the MIT license.
+Note that parts of this code are based on [AudioCraft](https://github.com/facebookresearch/audiocraft), released under
+the MIT license.
+
+## Citation
+
+If you use either Mimi or Moshi, please cite the following paper,
+
+```
+@article{defossez2024moshi,
+    title={Moshi: a speech-text foundation model for real-time dialogue},
+    author={Alexandre Défossez and Laurent Mazaré and Manu Orsini and Amélie Royer and Patrick Pérez and Hervé Jégou and Edouard Grave and Neil Zeghidour},
+    journal={arXiv:TBC},
+    year={2024},
+}
+```
+
+[moshi]: https://arxiv.org/
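
The README figures above are mutually consistent, which a few lines of arithmetic can confirm. The sketch below derives the 80ms frame duration from the 12.5 Hz frame rate, the bit budget per frame from the 1.1 kbps bitrate, and the 160ms theoretical latency. All inputs come from the README text; the function names are illustrative only.

```python
FRAME_RATE_HZ = 12.5      # Mimi frame rate, from the README
BITRATE_BPS = 1100.0      # 1.1 kbps, from the README
ACOUSTIC_DELAY_MS = 80.0  # acoustic delay, from the README


def frame_duration_ms(frame_rate_hz: float) -> float:
    """Duration of one codec frame in milliseconds."""
    return 1000.0 / frame_rate_hz


def bits_per_frame(bitrate_bps: float, frame_rate_hz: float) -> float:
    """Number of bits available to encode each frame."""
    return bitrate_bps / frame_rate_hz


def theoretical_latency_ms(frame_rate_hz: float, acoustic_delay_ms: float) -> float:
    """One frame of buffering plus the acoustic delay."""
    return frame_duration_ms(frame_rate_hz) + acoustic_delay_ms


if __name__ == "__main__":
    print(frame_duration_ms(FRAME_RATE_HZ))                          # 80.0
    print(bits_per_frame(BITRATE_BPS, FRAME_RATE_HZ))                # 88.0
    print(theoretical_latency_ms(FRAME_RATE_HZ, ACOUSTIC_DELAY_MS))  # 160.0
```

By the same formula, a 50 Hz codec such as those cited for comparison uses 20ms frames, so it needs four times as many steps per second of audio.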
diff --git a/client/LICENSE b/client/LICENSE
@@ -0,0 +1,23 @@
+Permission is hereby granted, free of charge, to any
+person obtaining a copy of this software and associated
+documentation files (the "Software"), to deal in the
+Software without restriction, including without
+limitation the rights to use, copy, modify, merge,
+publish, distribute, sublicense, and/or sell copies of
+the Software, and to permit persons to whom the Software
+is furnished to do so, subject to the following
+conditions:
+
+The above copyright notice and this permission notice
+shall be included in all copies or substantial portions
+of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF
+ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED
+TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A
+PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT
+SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
+CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
+OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR
+IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+DEALINGS IN THE SOFTWARE.
diff --git a/client/README.md b/client/README.md
@@ -14,3 +14,7 @@ Frontend for the demo.
 ## Skipping the queue
 To skip the queue for standalone use, once the project is running go to `/?worker_addr={WORKER_ADDR}` where `WORKER_ADDR` is your worker instance address.
 For example : `https://localhost:5173/?worker_addr=0.0.0.0:8088`
+
+## License
+
+The present code is provided under the MIT license.
diff --git a/mimi.png b/mimi.png
diff --git a/moshi.png b/moshi.png
diff --git a/moshi/moshi/LICENSE.audiocraft → moshi/LICENSE.audiocraft b/moshi/moshi/LICENSE.audiocraft → moshi/LICENSE.audiocraft
diff --git a/moshi/MANIFEST.in b/moshi/MANIFEST.in
@@ -0,0 +1,5 @@
+include LICENSE*
+include *.md
+include *.cfg
+include requirements.txt
+include moshi/py.typed
diff --git a/moshi/README.md b/moshi/README.md
@@ -1 +1,97 @@
-# moshi - pytorch
+# Moshi - PyTorch
+
+See the [top-level README.md][main_repo] for more information on Moshi.
+
+[Moshi][moshi] is a speech-text foundation model and full-duplex spoken dialogue framework.
+It uses [Mimi][moshi], a state-of-the-art streaming neural audio codec. Mimi operates at 12.5 Hz and compresses
+audio down to 1.1 kbps in a fully streaming manner (latency of 80ms, the frame size), yet performs better than existing non-streaming codecs.
+
+This is the PyTorch implementation for Moshi and Mimi.
+
+
+## Requirements
+
+You will need at least Python 3.10. We kept a minimal set of dependencies for the current project.
+It was tested with PyTorch 2.2 and 2.4. If you need a specific CUDA version, please make sure
+to have PyTorch properly installed before installing Moshi.
+
+```bash
+pip install moshi      # moshi PyTorch, from PyPI
+# Or the bleeding edge versions for Moshi
+pip install -e "git+https://git@github.com/kyutai-labs/moshi#egg=moshi&subdirectory=moshi"
+```
+
+While we hope that the present codebase will work on Windows, we do not provide official support for it.
+At the moment, we do not support quantization for the PyTorch version, so you will need a GPU with a significant amount of memory (24GB).
+
+
+## Usage
+
+This package provides a streaming version of the audio tokenizer (Mimi) and the language model (Moshi).
+
+To run in interactive mode, you need to start a server which will
+run the model; you can then use either the web UI or a command line client.
+
+Start the server with:
+```bash
+python -m moshi.server [--gradio_tunnel]
+```
+
+And then access the web UI on [localhost:8998](http://localhost:8998). If your GPU is on a distant machine
+with no direct access, `--gradio_tunnel` will create a tunnel with a URL accessible from anywhere.
+Keep in mind that this tunnel goes through the US and can add significant latency (up to 500ms from Europe).
+Alternatively, you might want to use SSH to redirect your connection.
+
+Accessing a server that is not localhost via http may cause issues around using
+the microphone in the web UI (in some browsers this is only allowed using
+https).
+
+A local client is also available:
+```bash
+python -m moshi.client [--url URL_TO_GRADIO]
+```
+Note, however, that unlike the web browser, this client is barebones. It doesn't do any echo cancellation,
+nor does it try to compensate for a growing lag by skipping frames.
+
+## Development
+
+If you wish to install from a clone of this repository, maybe to further develop Moshi, you can do the following:
+```bash
+# From the current folder (e.g. `moshi/`)
+pip install -e '.[dev]'
+pre-commit install
+```
+
+Once locally installed, Mimi can be tested with the following command, from **the root** of the repository,
+```bash
+wget https://github.com/metavoiceio/metavoice-src/raw/main/assets/bria.mp3
+python scripts/mimi_test.py
+
+```
+
+Similarly, Moshi can be tested (with a GPU) with
+```bash
+python scripts/moshi_benchmark.py
+```
+
+
+## License
+
+The present code is provided under the MIT license.
+Note that parts of this code are based on [AudioCraft](https://github.com/facebookresearch/audiocraft), released under
+the MIT license.
+
+## Citation
+
+If you use either Mimi or Moshi, please cite the following paper,
+
+```
+@article{defossez2024moshi,
+    title={Moshi: a speech-text foundation model for real-time dialogue},
+    author={Alexandre Défossez and Laurent Mazaré and Manu Orsini and Amélie Royer and Patrick Pérez and Hervé Jégou and Edouard Grave and Neil Zeghidour},
+    journal={arXiv:TBC},
+    year={2024},
+}
+```
+
+[moshi]: https://arxiv.org/
diff --git a/moshi/moshi/testing.md b/moshi/moshi/testing.md
diff --git a/moshi/setup.cfg b/moshi/setup.cfg
@@ -4,3 +4,7 @@ max-line-length = 120
 [flake8]
 max-line-length = 120
 ignore = E203,E704
+exclude = 
+    dist
+    build
+
diff --git a/moshi_mlx/MANIFEST.in b/moshi_mlx/MANIFEST.in
@@ -0,0 +1,5 @@
+include LICENSE*
+include *.md
+include *.cfg
+include requirements.txt
+include moshi_mlx/py.typed