Format documentation; switch to argparse #34
Merged
Changes from all 10 commits (author: evmaki):

- 2582569 Format documentation; add example
- bc95613 Merge branch 'guy/live_captions' into guy/evan/live_captions
- 2525278 Update README.md
- 9b69b15 Delete moonshine/demo/example.mp4
- ff04bb5 Merge branch 'guy/live_captions' into guy/evan/live_captions
- 37f406f Adjust formatting
- 682da35 Adjust formatting
- 6c1e62f Switch to argparse for arg parsing
- 6a277b3 Merge branch 'guy/live_captions' into guy/evan/live_captions
- ff4f206 Apply code formatting
This directory contains various scripts to demonstrate the capabilities of the Moonshine ASR models.
- [Moonshine Demos](#moonshine-demos)
  - [Demo: Standalone file transcription with ONNX](#demo-standalone-file-transcription-with-onnx)
  - [Demo: Live captioning from microphone input](#demo-live-captioning-from-microphone-input)
    - [Installation](#installation)
      - [0. Setup environment](#0-setup-environment)
      - [1. Clone the repo and install extra dependencies](#1-clone-the-repo-and-install-extra-dependencies)
      - [2. Download the ONNX models](#download-the-onnx-models)
    - [Running the demo](#running-the-demo)
    - [Script notes](#script-notes)
      - [Speech truncation and hallucination](#speech-truncation-and-hallucination)
      - [Running on a slower processor](#running-on-a-slower-processor)
      - [Metrics](#metrics)
    - [Citation](#citation)
# Demo: Standalone file transcription with ONNX
The script [`onnx_standalone.py`](/moonshine/demo/onnx_standalone.py) demonstrates how to run a Moonshine model with the `onnxruntime` package alone, without depending on `torch` or `tensorflow`. This enables running on SBCs such as the Raspberry Pi. Follow the instructions below to set up and run.
1. Install the `onnxruntime` (or `onnxruntime-gpu`, if you want to run on GPUs) and `tokenizers` packages using your Python package manager of choice, such as `pip`.
2. Download the `onnx` files from the Hugging Face hub to a directory.
```shell
mkdir moonshine_base_onnx
cd moonshine_base_onnx
wget https://huggingface.co/UsefulSensors/moonshine/resolve/main/onnx/base/preprocess.onnx
wget https://huggingface.co/UsefulSensors/moonshine/resolve/main/onnx/base/encode.onnx
wget https://huggingface.co/UsefulSensors/moonshine/resolve/main/onnx/base/uncached_decode.onnx
wget https://huggingface.co/UsefulSensors/moonshine/resolve/main/onnx/base/cached_decode.onnx
cd ..
```
3. Run `onnx_standalone.py` to transcribe a WAV file:
```shell
moonshine/moonshine/demo/onnx_standalone.py --models_dir moonshine_base_onnx --wav_file moonshine/moonshine/assets/beckett.wav
['Ever tried ever failed, no matter try again fail again fail better.']
```
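If you prefer scripting the download instead of running `wget` by hand, the four URLs can be generated in Python. This is a minimal sketch assuming only the Hugging Face URL layout shown in the `wget` commands above; the helper name `model_file_urls` is hypothetical, not part of the repo:

```python
# Build the download URLs for a Moonshine ONNX model, mirroring the wget
# commands above. Only the URL layout shown in this README is assumed.
BASE_URL = "https://huggingface.co/UsefulSensors/moonshine/resolve/main/onnx"
MODEL_FILES = ["preprocess.onnx", "encode.onnx", "uncached_decode.onnx", "cached_decode.onnx"]

def model_file_urls(size: str) -> list:
    """Return the four ONNX file URLs for 'base' or 'tiny'."""
    if size not in ("base", "tiny"):
        raise ValueError("unknown model size: " + size)
    return [BASE_URL + "/" + size + "/" + name for name in MODEL_FILES]
```

Feeding each returned URL to your downloader of choice reproduces the directory contents above.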
# Demo: Live captioning from microphone input
https://github.com/user-attachments/assets/aa65ef54-d4ac-4d31-864f-222b0e6ccbd3

This folder contains a demo of live captioning from microphone input, built on Moonshine. The script [`live_captions.py`](/moonshine/demo/live_captions.py) runs the Moonshine model on segments of speech detected in the microphone signal using a voice activity detector called [`silero-vad`](https://github.com/snakers4/silero-vad), and prints scrolling text or "live captions" assembled from the model predictions to the console.
The following steps have been tested in a `uv` (v0.4.25) virtual environment on these platforms:

- macOS 14.1 on a MacBook Pro M3
- Ubuntu 22.04 VM on a MacBook Pro M2
- Ubuntu 24.04 VM on a MacBook Pro M2
## Installation

### 0. Setup environment

Steps to set up a virtual environment are available in the [top level README](/README.md) of this repo. Note that this demo is standalone and has no requirement to install the `useful-moonshine` package. Instead, you will clone the repo.
### 1. Clone the repo and install extra dependencies

You will need to clone the repo first:

```shell
git clone git@github.com:usefulsensors/moonshine.git
```
||
Then install the demo's extra requirements: | ||
```console | ||
|
||
```shell | ||
uv pip install -r moonshine/moonshine/demo/requirements.txt | ||
``` | ||
|
||
Ubuntu needs PortAudio installing for the package `sounddevice` to run. The | ||
latest version 19.6.0-1.2build3 is suitable. | ||
```console | ||
cd | ||
#### Ubuntu: Install PortAudio | ||
|
||
Ubuntu needs PortAudio for the `sounddevice` package to run. The latest version (19.6.0-1.2build3 as of writing) is suitable. | ||
|
||
```shell | ||
sudo apt update | ||
sudo apt upgrade -y | ||
sudo apt install -y portaudio19-dev | ||
``` | ||
|
||
### 2. Download the ONNX models

The script finds ONNX base or tiny models in the `demo/models/moonshine_base_onnx` and `demo/models/moonshine_tiny_onnx` sub-folders.
Download the Moonshine `onnx` model files from the Hugging Face hub.

```shell
cd
mkdir moonshine/moonshine/demo/models
mkdir moonshine/moonshine/demo/models/moonshine_base_onnx
# ...
wget https://huggingface.co/UsefulSensors/moonshine/resolve/main/onnx/tiny/uncached_decode.onnx
wget https://huggingface.co/UsefulSensors/moonshine/resolve/main/onnx/tiny/cached_decode.onnx
```
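Before running the demo, you can sanity-check a models folder against the four files each model directory is expected to contain. A minimal sketch; the helper name `missing_model_files` is hypothetical, and only the file names from the `wget` commands above are assumed:

```python
from pathlib import Path

# The four ONNX files each model folder should contain, per the wget
# commands in this README.
EXPECTED = ("preprocess.onnx", "encode.onnx", "uncached_decode.onnx", "cached_decode.onnx")

def missing_model_files(models_dir: str) -> list:
    """Return the names of any expected ONNX files absent from models_dir."""
    folder = Path(models_dir)
    return [name for name in EXPECTED if not (folder / name).is_file()]
```

An empty return value means the folder is complete; anything else names the files still to download.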
## Running the demo
First, check that your microphone is connected and that the volume setting is not muted in your host OS or system audio drivers. Then, run the script:

```shell
python3 moonshine/moonshine/demo/live_captions.py
```
By default, this will run the demo with the Moonshine base ONNX model. The `--model_size` argument sets the model to use: supported arguments are `moonshine_base_onnx` and `moonshine_tiny_onnx`.

When running, speak in English to the microphone and observe live captions in the terminal. Quit the demo with `Ctrl+C` to see a full printout of the captions.
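Since this PR switches the script to `argparse`, the flag described above might be declared roughly like this. This is a sketch based only on the flag name, default, and choices documented here, not the script's exact code:

```python
import argparse

# Sketch of the demo's CLI as described above: --model_size selects the
# model folder, defaulting to the Moonshine base ONNX model.
parser = argparse.ArgumentParser(description="Live captions from microphone input.")
parser.add_argument(
    "--model_size",
    default="moonshine_base_onnx",
    choices=["moonshine_base_onnx", "moonshine_tiny_onnx"],
    help="Which ONNX model folder to load.",
)

args = parser.parse_args(["--model_size", "moonshine_tiny_onnx"])
print(args.model_size)  # moonshine_tiny_onnx
```

Passing an unsupported value makes `argparse` exit with a usage error, which is the benefit over hand-rolled positional argument parsing.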
An example run on an Ubuntu 24.04 VM on a MacBook Pro M2 with the Moonshine base ONNX model:

```console
(env_moonshine_demo) parallels@ubuntu-linux-2404:~$ python3 moonshine/moonshine/demo/live_captions.py
Error in cpuinfo: prctl(PR_SVE_GET_VL) failed
...
This is an example of the Moonshine base model being used to generate live captions.
(env_moonshine_demo) parallels@ubuntu-linux-2404:~$
```
For comparison, this is the `faster-whisper` float32 base model on the same instance. The value of `MIN_REFRESH_SECS` was increased because the model inference is too slow for a value of 0.2 seconds.

```console
(env_moonshine_faster_whisper) parallels@ubuntu-linux-2404:~$ python3 moonshine/moonshine/demo/live_captions.py
Error in cpuinfo: prctl(PR_SVE_GET_VL) failed
...
This is an example of the Faster Whisper float32 base model being used to generate live captions.
(env_moonshine_faster_whisper) parallels@ubuntu-linux-2404:~$
```
||
## Script notes. | ||
## Script notes | ||
|
||
You may customize this script to display Moonshine text transcriptions as you wish. | ||
|
||
The script `live_captions.py` loads the English language version of the Moonshine base ONNX model. It includes logic to detect speech activity and limit the context window of speech fed to the Moonshine model. The returned transcriptions are displayed as scrolling captions. Speech segments with pauses are cached, and these cached captions are printed on exit.

The captions printed on exit will not contain the latest displayed caption if there was no pause in the talker's speech before `Ctrl+C` was pressed, so stop speaking and wait a moment before quitting. If you are running on a slow or throttled processor such that model inferences are not realtime, wait longer after speaking stops so the speech queue can be processed before you press `Ctrl+C`.
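The caching behavior described above can be sketched as follows. The class and method names here are hypothetical; this is a simplified illustration of why the last caption can be lost, not the script's actual code:

```python
class CaptionCache:
    """Cache finalized captions (segments ending in a pause); print them on exit."""

    def __init__(self):
        self.finalized = []   # captions for segments that ended with a pause
        self.pending = ""     # latest caption, still being refreshed

    def update(self, text: str, pause_detected: bool) -> None:
        if pause_detected:
            self.finalized.append(text)   # segment complete: cache it
            self.pending = ""
        else:
            self.pending = text           # lost if the user quits mid-speech

    def on_exit(self) -> str:
        # self.pending is deliberately NOT included, mirroring the behavior
        # described above: quit mid-speech and the last caption is dropped.
        return "\n".join(self.finalized)
```

This makes the advice concrete: only pause-terminated segments reach `finalized`, so waiting for a pause before `Ctrl+C` ensures the final caption appears in the exit printout.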
### Speech truncation and hallucination

Some hallucinations will be seen while the script is running: one reason is that speech gets truncated out of necessity to generate the frequent refresh and timeout transcriptions. Truncated speech contains partial or sliced words, for which transcriber model outputs are unpredictable. See the captions printed on script exit for the best results.
### Running on a slower processor

If you run this script on a slower processor, consider using the `tiny` model.

```shell
python3 ./moonshine/moonshine/demo/live_captions.py --model_size moonshine_tiny_onnx
```
The value of `MIN_REFRESH_SECS` will be ineffective when the model inference time exceeds it. Conversely, on a faster processor, consider reducing `MIN_REFRESH_SECS` for more frequent caption updates. On a slower processor you might also reduce `MAX_SPEECH_SECS` to avoid the slower model inference encountered with longer speech segments.
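The interplay of these two knobs can be illustrated with a short sketch. The constant names come from `live_captions.py`, but the logic and the `MAX_SPEECH_SECS` value below are simplified assumptions for illustration, not the script's actual code:

```python
MIN_REFRESH_SECS = 0.2    # minimum interval between caption refreshes
MAX_SPEECH_SECS = 15.0    # illustrative cap on a speech segment's length

def should_refresh(now: float, last_refresh: float) -> bool:
    """A refresh happens only after MIN_REFRESH_SECS has elapsed; if a single
    inference takes longer than this, the knob has no effect."""
    return now - last_refresh >= MIN_REFRESH_SECS

def truncate_segment(samples: list, sample_rate: int = 16000) -> list:
    """Cap a segment at MAX_SPEECH_SECS of audio to bound inference time
    (keeping the most recent audio is an assumption of this sketch)."""
    max_samples = int(MAX_SPEECH_SECS * sample_rate)
    return samples[-max_samples:]
```

Lowering `MAX_SPEECH_SECS` shrinks the worst-case input to the model, which is why it helps on slower processors.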
### Metrics

The metrics shown on program exit will vary based on the talker's speaking style. If the talker speaks with more frequent pauses, the speech segments are shorter and the mean inference time will be lower. This is a feature of the Moonshine model described in [our paper](https://arxiv.org/abs/2410.15608). When benchmarking, use the same speech each time, e.g., a recording of someone talking.
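A mean-inference-time figure of this kind can be computed as in the sketch below. `transcribe` is a hypothetical stand-in for the model call, and the real script's metric code may differ:

```python
import time

def transcribe(segment):
    """Hypothetical stand-in for a Moonshine model call."""
    time.sleep(0.001)
    return "caption"

def mean_inference_secs(segments) -> float:
    """Time each transcription and return the mean, the kind of figure the
    demo reports on exit. Shorter segments mean a lower mean time."""
    times = []
    for seg in segments:
        start = time.perf_counter()
        transcribe(seg)
        times.append(time.perf_counter() - start)
    return sum(times) / len(times)
```

Because the mean is taken per segment, a talker who pauses often (producing many short segments) drives the reported mean down, as noted above.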
# Citation

If you benefit from our work, please cite us:
Review comment: nit: the example is float32 precision (not int8); this is a copy of my error. I can fix this in the parent branch.