
Format documentation; switch to argparse #34

Merged
merged 10 commits on Oct 25, 2024
224 changes: 86 additions & 138 deletions moonshine/demo/README.md
@@ -4,129 +4,100 @@ This directory contains various scripts to demonstrate the capabilities of the
Moonshine ASR models.

- [Moonshine Demos](#moonshine-demos)
- [Demo: Standalone file transcription with ONNX](#demo-standalone-file-transcription-with-onnx)
- [Demo: Live captioning from microphone input](#demo-live-captioning-from-microphone-input)
- [0. Setup environment](#0-setup-environment)
- [1. Clone the repo and install extra dependencies](#1-clone-the-repo-and-install-extra-dependencies)
- [2. Download the ONNX models](#download-the-onnx-models)
- [Running the demo](#running-the-demo)
- [Script notes](#script-notes)
- [Speech truncation and hallucination](#speech-truncation-and-hallucination)
- [Running on a slower processor](#running-on-a-slower-processor)
- [Metrics](#metrics)
- [Citation](#citation)


# Demo: Standalone file transcription with ONNX

The script [`onnx_standalone.py`](/moonshine/demo/onnx_standalone.py)
demonstrates how to run a Moonshine model with the `onnxruntime`
package alone, without depending on `torch` or `tensorflow`. This enables
running on SBCs such as the Raspberry Pi. Follow the instructions below to set up
and run.

1. Install the `onnxruntime` package (or `onnxruntime-gpu` if you want to run on GPUs) and the `tokenizers` package using your Python package manager of choice, such as `pip` (an example command is shown below).
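
For example, with `pip`:

```shell
pip install onnxruntime tokenizers
```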

2. Download the `onnx` files from the Hugging Face Hub to a directory.

```shell
mkdir moonshine_base_onnx
cd moonshine_base_onnx
wget https://huggingface.co/UsefulSensors/moonshine/resolve/main/onnx/base/preprocess.onnx
wget https://huggingface.co/UsefulSensors/moonshine/resolve/main/onnx/base/encode.onnx
wget https://huggingface.co/UsefulSensors/moonshine/resolve/main/onnx/base/uncached_decode.onnx
wget https://huggingface.co/UsefulSensors/moonshine/resolve/main/onnx/base/cached_decode.onnx
cd ..
```

3. Run `onnx_standalone.py` to transcribe a WAV file:

```shell
moonshine/moonshine/demo/onnx_standalone.py --models_dir moonshine_base_onnx --wav_file moonshine/moonshine/assets/beckett.wav
['Ever tried ever failed, no matter try again fail again fail better.']
```
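
If you want to call the ONNX files from your own code rather than through the demo script, a minimal sketch of loading the four graphs with `onnxruntime` might look like the following; the directory name is the one created above, and the exact input and output names are best inspected from the sessions rather than assumed.

```python
# Sketch only: load the four Moonshine ONNX graphs downloaded above and list
# their input names. See onnx_standalone.py for the full transcription pipeline.
import onnxruntime as ort

model_dir = "moonshine_base_onnx"
sessions = {
    name: ort.InferenceSession(f"{model_dir}/{name}.onnx")
    for name in ("preprocess", "encode", "uncached_decode", "cached_decode")
}
for name, session in sessions.items():
    print(name, [inp.name for inp in session.get_inputs()])
```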


# Demo: Live captioning from microphone input

This folder contains a demo of live captioning from microphone input, built on Moonshine. The script [`live_captions.py`](/moonshine/demo/live_captions.py) runs the Moonshine model on segments of speech detected in the microphone signal using a voice activity detector called [`silero-vad`](https://github.com/snakers4/silero-vad). The script prints scrolling text or "live captions" assembled from the model predictions to the console.

https://github.com/user-attachments/assets/aa65ef54-d4ac-4d31-864f-222b0e6ccbd3

The following steps have been tested in a `uv` (v0.4.25) virtual environment on these platforms:

- macOS 14.1 on a MacBook Pro M3
- Ubuntu 22.04 VM on a MacBook Pro M2
- Ubuntu 24.04 VM on a MacBook Pro M2

## Installation

### 0. Setup environment

Steps to set up a virtual environment are available in the [top level README](/README.md) of this repo. Note that this demo is standalone and has no requirement to install the `useful-moonshine` package. Instead, you will clone the repo.
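
For reference, the virtual environment used while testing these steps was created with `uv` roughly as follows; the environment name is only an example, and the top level README remains the canonical guide:

```shell
cd
uv venv env_moonshine_demo
source env_moonshine_demo/bin/activate
```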

### 1. Clone the repo and install extra dependencies

You will need to clone the repo first:
```shell
git clone git@github.com:usefulsensors/moonshine.git
```

Then install the demo's extra requirements:
```shell
uv pip install -r moonshine/moonshine/demo/requirements.txt
```

#### Ubuntu: Install PortAudio

Ubuntu needs PortAudio for the `sounddevice` package to run. The latest version (19.6.0-1.2build3 as of writing) is suitable.

```shell
sudo apt update
sudo apt upgrade -y
sudo apt install -y portaudio19-dev
```

### 2. Download the ONNX models

The script finds ONNX base or tiny models in the
`demo/models/moonshine_base_onnx` and `demo/models/moonshine_tiny_onnx`
sub-folders.

Download the Moonshine `onnx` model files from the Hugging Face Hub.
```shell
cd
mkdir moonshine/moonshine/demo/models
mkdir moonshine/moonshine/demo/models/moonshine_base_onnx
@@ -147,21 +118,21 @@ wget https://huggingface.co/UsefulSensors/moonshine/resolve/main/onnx/tiny/uncac
wget https://huggingface.co/UsefulSensors/moonshine/resolve/main/onnx/tiny/cached_decode.onnx
```

## Running the demo

First, check that your microphone is connected and that the volume setting is not muted in your host OS or system audio drivers. Then, run the script:

```shell
python3 moonshine/moonshine/demo/live_captions.py
```

By default, this will run the demo with the Moonshine base ONNX model. The `--model_size` argument sets the model to use: supported arguments are `moonshine_base_onnx` and `moonshine_tiny_onnx`.

When running, speak English into the microphone and observe live captions in the terminal. Quit the demo with `Ctrl+C` to see a full printout of the captions.

An example run on an Ubuntu 24.04 VM on a MacBook Pro M2 with the Moonshine base ONNX model:

```console
(env_moonshine_demo) parallels@ubuntu-linux-2404:~$ python3 moonshine/moonshine/demo/live_captions.py
Error in cpuinfo: prctl(PR_SVE_GET_VL) failed
@@ -182,9 +153,10 @@ This is an example of the Moonshine base model being used to generate live capti
(env_moonshine_demo) parallels@ubuntu-linux-2404:~$
```

For comparison, this is the `faster-whisper` int8 base model on the same instance.

> **Review comment (Contributor):** nit: the example is float32 precision (not int8); this is a copy of my error. I can fix this in the parent branch.

The value of `MIN_REFRESH_SECS` was increased as the model inference is too slow
for a value of 0.2 seconds.

```console
(env_moonshine_faster_whisper) parallels@ubuntu-linux-2404:~$ python3 moonshine/moonshine/demo/live_captions.py
Error in cpuinfo: prctl(PR_SVE_GET_VL) failed
@@ -205,56 +177,32 @@ This is an example of the Faster Whisper float32 base model being used to genera
(env_moonshine_faster_whisper) parallels@ubuntu-linux-2404:~$
```

## Script notes

You may customize this script to display Moonshine text transcriptions as you wish.

The script `live_captions.py` loads the English language version of the Moonshine base ONNX model. It includes logic to detect speech activity and limit the context window of speech fed to the Moonshine model. The returned transcriptions are displayed as scrolling captions. Speech segments with pauses are cached, and these cached captions are printed on exit. The captions printed on exit will not include the latest displayed caption if there was no pause in the talker's speech before `Ctrl+C` was pressed, so stop speaking and wait a moment before quitting. If you are running on a slow or throttled processor such that model inference is not realtime, wait longer after speaking stops for the speech queue to be processed before pressing `Ctrl+C`.
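
To make that structure concrete, the following is a rough sketch of the VAD-gated loop, not the demo itself: the silent audio source and the `transcribe()` stub are placeholders, and the real logic (including caption refresh, caching, and the sounddevice input stream) lives in `live_captions.py`.

```python
# Hedged sketch of a VAD-gated transcription loop using silero-vad.
import numpy as np
from silero_vad import VADIterator, load_silero_vad

SAMPLING_RATE = 16000
CHUNK_SIZE = 512  # samples per VAD step at 16 kHz


def transcribe(speech: np.ndarray) -> str:
    return "..."  # stand-in for the Moonshine ONNX transcriber


def audio_chunks():
    # Stand-in for the sounddevice InputStream used by the demo.
    audio = np.zeros(CHUNK_SIZE * 32, dtype=np.float32)  # ~1 s of silence
    for i in range(0, len(audio), CHUNK_SIZE):
        yield audio[i : i + CHUNK_SIZE]


vad_iterator = VADIterator(load_silero_vad(onnx=True), sampling_rate=SAMPLING_RATE)
speech = np.empty(0, dtype=np.float32)
recording = False

for chunk in audio_chunks():
    if recording:
        speech = np.concatenate((speech, chunk))
    event = vad_iterator(chunk)
    if not event:
        continue
    if "start" in event and not recording:
        recording = True
        speech = chunk.copy()  # begin buffering a new speech segment
    elif "end" in event and recording:
        recording = False
        print(transcribe(speech))  # hand the finished segment to the model
        speech = np.empty(0, dtype=np.float32)
```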

### Speech truncation and hallucination

Some hallucinations will be seen while the script is running: one reason is that speech gets truncated out of necessity to generate the frequent refresh and timeout transcriptions. Truncated speech contains partial or sliced words, for which the transcriber model's output is unpredictable. See the captions printed on script exit for the best results.

### Running on a slower processor

If you run this script on a slower processor, consider using the `tiny` model.

```shell
python3 ./moonshine/moonshine/demo/live_captions.py --model_size moonshine_tiny_onnx
```

The value of `MIN_REFRESH_SECS` will be ineffective when the model inference time exceeds that value. Conversely, on a faster processor, consider reducing `MIN_REFRESH_SECS` for more frequent caption updates. On a slower processor you might also consider reducing `MAX_SPEECH_SECS` to avoid the slower model inference encountered with longer speech segments.
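
Both values appear as constants in `live_captions.py`; the numbers below are illustrative only (the Moonshine default refresh interval is 0.2 seconds, per the example above), so check the script for the actual defaults before editing.

```python
# Illustrative values only -- check live_captions.py for the actual defaults.
MIN_REFRESH_SECS = 0.2  # minimum interval between caption refreshes; raise on slower CPUs
MAX_SPEECH_SECS = 15    # cap on a single speech segment; lower to bound inference time
```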

### Metrics

The metrics shown on program exit will vary based on the talker's speaking style. If the talker speaks with more frequent pauses, the speech segments are shorter and the mean inference time will be lower. This is a feature of the Moonshine model described in [our paper](https://arxiv.org/abs/2410.15608). When benchmarking, use the same speech, e.g., a recording of someone talking.


# Citation

If you benefit from our work, please cite us:
35 changes: 23 additions & 12 deletions moonshine/demo/live_captions.py
@@ -1,7 +1,9 @@
"""Live captions from microphone using Moonshine and SileroVAD ONNX models."""

import os
import sys
import time
import argparse

from queue import Queue

@@ -61,10 +63,12 @@ def __call__(self, speech):

def create_input_callback(q):
"""Callback method for sounddevice InputStream."""

def input_callback(data, frames, time, status):
if status:
print(status)
q.put((data.copy(), status))

return input_callback


@@ -80,27 +84,34 @@ def end_recording(speech, marker=""):

def print_captions(text, new_cached_caption=False):
"""Prints right justified on same line, prepending cached captions."""
print("\r" + " " * MAX_LINE_LENGTH, end="", flush=True)
if len(text) > MAX_LINE_LENGTH:
text = text[-MAX_LINE_LENGTH:]
elif text != "\n":
for caption in caption_cache[::-1]:
text = (caption[:-MARKER_LENGTH] if not VERBOSE else caption + " ") + text
if len(text) > MAX_LINE_LENGTH:
break
if len(text) > MAX_LINE_LENGTH:
text = text[-MAX_LINE_LENGTH:]
text = " " * (MAX_LINE_LENGTH - len(text)) + text
print("\r" + text, end="", flush=True)


if __name__ == "__main__":
parser = argparse.ArgumentParser(
prog="live_captions",
description="Live captioning demo of Moonshine models",
)
parser.add_argument(
"--model_size",
help="Model to run the demo with",
default="moonshine_base_onnx",
choices=["moonshine_base_onnx", "moonshine_tiny_onnx"],
)
args = parser.parse_args()
model_size = args.model_size
models_dir = os.path.join(os.path.dirname(__file__), "models", f"{model_size}")
print(f"Loading Moonshine model '{models_dir}' ...")
transcribe = Transcriber(models_dir=models_dir, rate=SAMPLING_RATE)

@@ -146,11 +157,11 @@ def print_captions(text, new_cached_caption=False):

speech_dict = vad_iterator(chunk)
if speech_dict:
if "start" in speech_dict and not recording:
recording = True
start_time = time.time()

if "end" in speech_dict and recording:
recording = False
end_recording(speech, "<STOP>")
