Moshi is a speech-text foundation model and full-duplex spoken dialogue framework. This is a BentoML example project, showing you how to serve and deploy Moshi with BentoML. Specifically, it creates a real-time voice chat application by implementing a WebSocket endpoint for bi-directional audio streaming.
Here is the workflow after you start the server:
- You speak into your microphone. The client records the audio and sends it to the server in real-time via a WebSocket connection.
- The server uses the Mimi model to process the audio and the Moshi language model to generate both text and audio responses.
- The server sends the generated text and audio back to the client.
- The client plays the audio through your speakers and displays the text in the terminal.
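Since the exchange above interleaves audio and text messages on one WebSocket, a duplex stream like this typically tags each message with its kind. Here is a minimal, stdlib-only sketch of such tagged framing; the one-byte kind values below are illustrative assumptions, not Moshi's documented wire format (see `bentomoshi/client.py` for the real protocol handling):

```python
# Hypothetical tagged framing for a duplex audio/text stream.
# The kind byte values are illustrative assumptions, not Moshi's
# actual wire format.

KIND_AUDIO = 1  # binary audio chunk (e.g. compressed or PCM bytes)
KIND_TEXT = 2   # UTF-8 text tokens streamed by the model

def frame_message(kind: int, payload: bytes) -> bytes:
    """Prefix a payload with a one-byte message kind."""
    return bytes([kind]) + payload

def parse_message(data: bytes) -> tuple[int, bytes]:
    """Split a framed message back into (kind, payload)."""
    return data[0], data[1:]

# Round-trip an audio chunk through the framing.
kind, payload = parse_message(frame_message(KIND_AUDIO, b"\x00\x01\x02"))
assert (kind, payload) == (KIND_AUDIO, b"\x00\x01\x02")
```

The kind prefix lets the receiver dispatch each message without extra negotiation: binary audio goes to the speaker path, text goes to the terminal.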
Check out the full list of example projects to explore more BentoML use cases.
If you want to test the Service locally, we recommend an NVIDIA GPU with at least 24 GB of VRAM.
- Install uv:

  ```bash
  curl -LsSf https://astral.sh/uv/install.sh | sh
  ```
- Clone the repository:

  ```bash
  git clone https://github.com/bentoml/BentoMoshi.git && cd BentoMoshi
  ```
- Try local serving:

  ```bash
  # option 1: bentoml serve [RECOMMENDED]
  uvx --with-editable . bentoml serve . --debug

  # option 2: script
  uvx --from . server
  ```
- The server will be running at http://localhost:3000. To connect to the WebSocket endpoint, run the client:

  ```bash
  URL=http://localhost:3000 uvx --from . client
  ```
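Internally, a full-duplex client like this runs two concurrent loops over a single connection: one uploads microphone chunks while the other plays back server responses. Below is a minimal asyncio sketch of that structure, with in-memory queues standing in for the WebSocket and an echo task standing in for the Moshi server; the real client in `bentomoshi/client.py` handles the actual audio I/O and connection:

```python
import asyncio

async def send_mic_audio(outgoing: asyncio.Queue, chunks: list[bytes]) -> None:
    # The real client reads from the microphone; here we drain a fixed
    # list of chunks to keep the sketch self-contained.
    for chunk in chunks:
        await outgoing.put(chunk)
    await outgoing.put(None)  # signal end of stream

async def play_responses(incoming: asyncio.Queue, received: list[bytes]) -> None:
    # The real client plays audio through the speakers and prints text.
    while (msg := await incoming.get()) is not None:
        received.append(msg)

async def fake_server(outgoing: asyncio.Queue, incoming: asyncio.Queue) -> None:
    # Stand-in for the Moshi server: echo each chunk back with a prefix.
    while (chunk := await outgoing.get()) is not None:
        await incoming.put(b"resp:" + chunk)
    await incoming.put(None)

async def main() -> list[bytes]:
    outgoing, incoming = asyncio.Queue(), asyncio.Queue()
    received: list[bytes] = []
    # Both directions run concurrently, which is what makes the
    # conversation full-duplex rather than turn-based.
    await asyncio.gather(
        send_mic_audio(outgoing, [b"a", b"b"]),
        fake_server(outgoing, incoming),
        play_responses(incoming, received),
    )
    return received

print(asyncio.run(main()))  # → [b'resp:a', b'resp:b']
```

The key point is that sending and receiving are independent tasks: the client never waits for a full response before sending the next audio chunk.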
You can deploy this project to BentoCloud for better management and scalability. Sign up for a BentoCloud account if you don't have one.
Make sure you have logged in to BentoCloud:

```bash
bentoml cloud login
```
Deploy it to BentoCloud:

```bash
uvx --with-editable . bentoml deploy .
```
After deployment, set `URL` to your Deployment's endpoint on BentoCloud and run the client:

```bash
# option 1: uvx [RECOMMENDED]
URL=<bentocloud-endpoint> uv run --with-editable . bentomoshi/client.py

# option 2: using python
URL=<bentocloud-endpoint> python bentomoshi/client.py
```
Note: For custom deployment in your own infrastructure, you can use BentoML to generate an OCI-compliant image.