Format documentation; switch to argparse #34
Merged
Changes from all 10 commits (author: evmaki):

- 2582569 Format documentation; add example
- bc95613 Merge branch 'guy/live_captions' into guy/evan/live_captions
- 2525278 Update README.md
- 9b69b15 Delete moonshine/demo/example.mp4
- ff04bb5 Merge branch 'guy/live_captions' into guy/evan/live_captions
- 37f406f Adjust formatting
- 682da35 Adjust formatting
- 6c1e62f Switch to argparse for arg parsing
- 6a277b3 Merge branch 'guy/live_captions' into guy/evan/live_captions
- ff4f206 Apply code formatting
This directory contains various scripts to demonstrate the capabilities of the Moonshine ASR models.
- [Moonshine Demos](#moonshine-demos)
  - [Demo: Standalone file transcription with ONNX](#demo-standalone-file-transcription-with-onnx)
  - [Demo: Live captioning from microphone input](#demo-live-captioning-from-microphone-input)
    - [Installation](#installation)
      - [0. Setup environment](#0-setup-environment)
      - [1. Clone the repo and install extra dependencies](#1-clone-the-repo-and-install-extra-dependencies)
      - [2. Download the ONNX models](#download-the-onnx-models)
    - [Running the demo](#running-the-demo)
    - [Script notes](#script-notes)
      - [Speech truncation and hallucination](#speech-truncation-and-hallucination)
      - [Running on a slower processor](#running-on-a-slower-processor)
      - [Metrics](#metrics)
    - [Citation](#citation)
# Demo: Standalone file transcription with ONNX
The script [`onnx_standalone.py`](/moonshine/demo/onnx_standalone.py) demonstrates how to run a Moonshine model with the `onnxruntime` package alone, without depending on `torch` or `tensorflow`. This enables running on SBCs such as the Raspberry Pi. Follow the instructions below to set up and run.
1. Install the `onnxruntime` (or `onnxruntime-gpu`, if you want to run on GPUs) and `tokenizers` packages using your Python package manager of choice, such as `pip`.
2. Download the `onnx` files from the Hugging Face hub to a directory.
```shell
mkdir moonshine_base_onnx
cd moonshine_base_onnx
wget https://huggingface.co/UsefulSensors/moonshine/resolve/main/onnx/base/preprocess.onnx
wget https://huggingface.co/UsefulSensors/moonshine/resolve/main/onnx/base/encode.onnx
wget https://huggingface.co/UsefulSensors/moonshine/resolve/main/onnx/base/uncached_decode.onnx
wget https://huggingface.co/UsefulSensors/moonshine/resolve/main/onnx/base/cached_decode.onnx
cd ..
```
3. Run `onnx_standalone.py` to transcribe a WAV file:
```shell
moonshine/moonshine/demo/onnx_standalone.py --models_dir moonshine_base_onnx --wav_file moonshine/moonshine/assets/beckett.wav
['Ever tried ever failed, no matter try again fail again fail better.']
```
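If you prefer scripting the download instead of running `wget` by hand, the four URLs can be generated in Python. This is a minimal sketch assuming only the Hugging Face URL layout shown in the `wget` commands above; the helper name `model_file_urls` is hypothetical, not part of the repo:

```python
# Build the download URLs for a Moonshine ONNX model, mirroring the wget
# commands above. Only the URL layout shown in this README is assumed.
BASE_URL = "https://huggingface.co/UsefulSensors/moonshine/resolve/main/onnx"
MODEL_FILES = ["preprocess.onnx", "encode.onnx", "uncached_decode.onnx", "cached_decode.onnx"]

def model_file_urls(size: str) -> list:
    """Return the four ONNX file URLs for 'base' or 'tiny'."""
    if size not in ("base", "tiny"):
        raise ValueError("unknown model size: " + size)
    return [BASE_URL + "/" + size + "/" + name for name in MODEL_FILES]
```

Feeding each returned URL to your downloader of choice reproduces the directory contents above.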
# Demo: Live captioning from microphone input
https://github.com/user-attachments/assets/aa65ef54-d4ac-4d31-864f-222b0e6ccbd3

This folder contains a demo of live captioning from microphone input, built on Moonshine. The script [`live_captions.py`](/moonshine/demo/live_captions.py) runs the Moonshine model on segments of speech detected in the microphone signal using a voice activity detector called [`silero-vad`](https://github.com/snakers4/silero-vad), and prints scrolling text or "live captions" assembled from the model predictions to the console.
The following steps have been tested in a `uv` (v0.4.25) virtual environment on these platforms:

- macOS 14.1 on a MacBook Pro M3
- Ubuntu 22.04 VM on a MacBook Pro M2
- Ubuntu 24.04 VM on a MacBook Pro M2
## Installation

### 0. Setup environment

Steps to set up a virtual environment are available in the [top level README](/README.md) of this repo. Note that this demo is standalone and has no requirement to install the `useful-moonshine` package. Instead, you will clone the repo.
### 1. Clone the repo and install extra dependencies

You will need to clone the repo first:

```shell
git clone git@github.com:usefulsensors/moonshine.git
```
||
Then install the demo's extra requirements: | ||
```console | ||
|
||
```shell | ||
uv pip install -r moonshine/moonshine/demo/requirements.txt | ||
``` | ||
|
||
Ubuntu needs PortAudio installing for the package `sounddevice` to run. The | ||
latest version 19.6.0-1.2build3 is suitable. | ||
```console | ||
cd | ||
#### Ubuntu: Install PortAudio | ||
|
||
Ubuntu needs PortAudio for the `sounddevice` package to run. The latest version (19.6.0-1.2build3 as of writing) is suitable. | ||
|
||
```shell | ||
sudo apt update | ||
sudo apt upgrade -y | ||
sudo apt install -y portaudio19-dev | ||
``` | ||
|
||
### 2. Download the ONNX models

The script finds ONNX base or tiny models in the `demo/models/moonshine_base_onnx` and `demo/models/moonshine_tiny_onnx` sub-folders.
Download the Moonshine `onnx` model files from the Hugging Face hub.

```shell
cd
mkdir moonshine/moonshine/demo/models
mkdir moonshine/moonshine/demo/models/moonshine_base_onnx
# ...
wget https://huggingface.co/UsefulSensors/moonshine/resolve/main/onnx/tiny/uncached_decode.onnx
wget https://huggingface.co/UsefulSensors/moonshine/resolve/main/onnx/tiny/cached_decode.onnx
```
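Before running the demo, you can sanity-check a models folder against the four files each model directory is expected to contain. A minimal sketch; the helper name `missing_model_files` is hypothetical, and only the file names from the `wget` commands above are assumed:

```python
from pathlib import Path

# The four ONNX files each model folder should contain, per the wget
# commands in this README.
EXPECTED = ("preprocess.onnx", "encode.onnx", "uncached_decode.onnx", "cached_decode.onnx")

def missing_model_files(models_dir: str) -> list:
    """Return the names of any expected ONNX files absent from models_dir."""
    folder = Path(models_dir)
    return [name for name in EXPECTED if not (folder / name).is_file()]
```

An empty return value means the folder is complete; anything else names the files still to download.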
## Running the demo
First, check that your microphone is connected and that the volume setting is not muted in your host OS or system audio drivers. Then, run the script:

```shell
python3 moonshine/moonshine/demo/live_captions.py
```
By default, this will run the demo with the Moonshine base ONNX model. The `--model_size` argument sets the model to use: supported arguments are `moonshine_base_onnx` and `moonshine_tiny_onnx`.

When running, speak in English to the microphone and observe live captions in the terminal. Quit the demo with `Ctrl+C` to see a full printout of the captions.
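Since this PR switches the script to `argparse`, the flag described above might be declared roughly like this. This is a sketch based only on the flag name, default, and choices documented here, not the script's exact code:

```python
import argparse

# Sketch of the demo's CLI as described above: --model_size selects the
# model folder, defaulting to the Moonshine base ONNX model.
parser = argparse.ArgumentParser(description="Live captions from microphone input.")
parser.add_argument(
    "--model_size",
    default="moonshine_base_onnx",
    choices=["moonshine_base_onnx", "moonshine_tiny_onnx"],
    help="Which ONNX model folder to load.",
)

args = parser.parse_args(["--model_size", "moonshine_tiny_onnx"])
print(args.model_size)  # moonshine_tiny_onnx
```

Passing an unsupported value makes `argparse` exit with a usage error, which is the benefit over hand-rolled positional argument parsing.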
An example run on an Ubuntu 24.04 VM on a MacBook Pro M2 with the Moonshine base ONNX model:

```console
(env_moonshine_demo) parallels@ubuntu-linux-2404:~$ python3 moonshine/moonshine/demo/live_captions.py
Error in cpuinfo: prctl(PR_SVE_GET_VL) failed
...
This is an example of the Moonshine base model being used to generate live captions.
(env_moonshine_demo) parallels@ubuntu-linux-2404:~$
```
For comparison, this is the `faster-whisper` float32 base model on the same instance. The value of `MIN_REFRESH_SECS` was increased because the model inference is too slow for a value of 0.2 seconds.

```console
(env_moonshine_faster_whisper) parallels@ubuntu-linux-2404:~$ python3 moonshine/moonshine/demo/live_captions.py
Error in cpuinfo: prctl(PR_SVE_GET_VL) failed
...
This is an example of the Faster Whisper float32 base model being used to generate live captions.
(env_moonshine_faster_whisper) parallels@ubuntu-linux-2404:~$
```
||
## Script notes. | ||
## Script notes | ||
|
||
You may customize this script to display Moonshine text transcriptions as you wish. | ||
|
||
The script `live_captions.py` loads the English language version of the Moonshine base ONNX model. It includes logic to detect speech activity and limit the context window of speech fed to the Moonshine model. The returned transcriptions are displayed as scrolling captions. Speech segments with pauses are cached, and these cached captions are printed on exit.

The captions printed on exit will not contain the latest displayed caption if there was no pause in the talker's speech before `Ctrl+C` was pressed, so stop speaking and wait a moment before quitting. If you are running on a slow or throttled processor such that model inferences are not realtime, wait longer after speaking stops so the speech queue can be processed before you press `Ctrl+C`.
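The caching behavior described above can be sketched as follows. The class and method names here are hypothetical; this is a simplified illustration of why the last caption can be lost, not the script's actual code:

```python
class CaptionCache:
    """Cache finalized captions (segments ending in a pause); print them on exit."""

    def __init__(self):
        self.finalized = []   # captions for segments that ended with a pause
        self.pending = ""     # latest caption, still being refreshed

    def update(self, text: str, pause_detected: bool) -> None:
        if pause_detected:
            self.finalized.append(text)   # segment complete: cache it
            self.pending = ""
        else:
            self.pending = text           # lost if the user quits mid-speech

    def on_exit(self) -> str:
        # self.pending is deliberately NOT included, mirroring the behavior
        # described above: quit mid-speech and the last caption is dropped.
        return "\n".join(self.finalized)
```

This makes the advice concrete: only pause-terminated segments reach `finalized`, so waiting for a pause before `Ctrl+C` ensures the final caption appears in the exit printout.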
### Speech truncation and hallucination

Some hallucinations will be seen while the script is running: one reason is that speech gets truncated out of necessity to generate the frequent refresh and timeout transcriptions. Truncated speech contains partial or sliced words, for which transcriber model outputs are unpredictable. See the captions printed on script exit for the best results.
### Running on a slower processor

If you run this script on a slower processor, consider using the `tiny` model.

```shell
python3 ./moonshine/moonshine/demo/live_captions.py --model_size moonshine_tiny_onnx
```
The value of `MIN_REFRESH_SECS` will be ineffective when the model inference time exceeds it. Conversely, on a faster processor, consider reducing `MIN_REFRESH_SECS` for more frequent caption updates. On a slower processor you might also reduce `MAX_SPEECH_SECS` to avoid the slower model inference encountered with longer speech segments.
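The interplay of these two knobs can be illustrated with a short sketch. The constant names come from `live_captions.py`, but the logic and the `MAX_SPEECH_SECS` value below are simplified assumptions for illustration, not the script's actual code:

```python
MIN_REFRESH_SECS = 0.2    # minimum interval between caption refreshes
MAX_SPEECH_SECS = 15.0    # illustrative cap on a speech segment's length

def should_refresh(now: float, last_refresh: float) -> bool:
    """A refresh happens only after MIN_REFRESH_SECS has elapsed; if a single
    inference takes longer than this, the knob has no effect."""
    return now - last_refresh >= MIN_REFRESH_SECS

def truncate_segment(samples: list, sample_rate: int = 16000) -> list:
    """Cap a segment at MAX_SPEECH_SECS of audio to bound inference time
    (keeping the most recent audio is an assumption of this sketch)."""
    max_samples = int(MAX_SPEECH_SECS * sample_rate)
    return samples[-max_samples:]
```

Lowering `MAX_SPEECH_SECS` shrinks the worst-case input to the model, which is why it helps on slower processors.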
### Metrics

The metrics shown on program exit will vary based on the talker's speaking style. If the talker speaks with more frequent pauses, the speech segments are shorter and the mean inference time will be lower. This is a feature of the Moonshine model described in [our paper](https://arxiv.org/abs/2410.15608). When benchmarking, use the same speech each time, e.g., a recording of someone talking.
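A mean-inference-time figure of this kind can be computed as in the sketch below. `transcribe` is a hypothetical stand-in for the model call, and the real script's metric code may differ:

```python
import time

def transcribe(segment):
    """Hypothetical stand-in for a Moonshine model call."""
    time.sleep(0.001)
    return "caption"

def mean_inference_secs(segments) -> float:
    """Time each transcription and return the mean, the kind of figure the
    demo reports on exit. Shorter segments mean a lower mean time."""
    times = []
    for seg in segments:
        start = time.perf_counter()
        transcribe(seg)
        times.append(time.perf_counter() - start)
    return sum(times) / len(times)
```

Because the mean is taken per segment, a talker who pauses often (producing many short segments) drives the reported mean down, as noted above.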
# Citation

If you benefit from our work, please cite us:
Review comment: nit: the example is float32 precision (not int8); this is a copy of my error. I can fix this in the parent branch.