diff --git a/moonshine/demo/README.md b/moonshine/demo/README.md
index 900cccc..344f97d 100644
--- a/moonshine/demo/README.md
+++ b/moonshine/demo/README.md
@@ -1,28 +1,193 @@
 # Moonshine Demos
 
-This directory contains various scripts the demonstrate the capabilities of the Moonshine ASR models.
+This directory contains various scripts to demonstrate the capabilities of the
+Moonshine ASR models.
 
-## onnx_standalone.py
+- [Moonshine Demos](#moonshine-demos)
+- [Demo: Standalone file transcription with ONNX](#demo-standalone-file-transcription-with-onnx)
+- [Demo: Live captioning from microphone input](#demo-live-captioning-from-microphone-input)
+  - [Installation](#installation)
+    - [0. Setup environment](#0-setup-environment)
+    - [1. Clone the repo and install extra dependencies](#1-clone-the-repo-and-install-extra-dependencies)
+  - [Running the demo](#running-the-demo)
+  - [Script notes](#script-notes)
+    - [Speech truncation and hallucination](#speech-truncation-and-hallucination)
+    - [Running on a slower processor](#running-on-a-slower-processor)
+    - [Metrics](#metrics)
+- [Citation](#citation)
 
-This script demonstrates how to run a Moonshine model with the `onnxruntime` package alone, without depending on `torch` or `tensorflow`. This enables running on SBCs such as Raspberry Pi. Follow the instructions below to setup and run.
 
-* Install `onnxruntime` (or `onnxruntime-gpu` if you want to run on GPUs) and `tokenizers` packages using your Python package manager of choice, such as `pip`.
+# Demo: Standalone file transcription with ONNX
 
-* Download the `onnx` files from huggingface hub to a directory.
+The script [`onnx_standalone.py`](/moonshine/demo/onnx_standalone.py)
+demonstrates how to run a Moonshine model with the `onnxruntime`
+package alone, without depending on `torch` or `tensorflow`. This enables
+running on SBCs such as Raspberry Pi. Follow the instructions below to set up
+and run.
 
-  ```shell
-  mkdir moonshine_base_onnx
-  cd moonshine_base_onnx
-  wget https://huggingface.co/UsefulSensors/moonshine/resolve/main/onnx/base/preprocess.onnx
-  wget https://huggingface.co/UsefulSensors/moonshine/resolve/main/onnx/base/encode.onnx
-  wget https://huggingface.co/UsefulSensors/moonshine/resolve/main/onnx/base/uncached_decode.onnx
-  wget https://huggingface.co/UsefulSensors/moonshine/resolve/main/onnx/base/cached_decode.onnx
-  cd ..
-  ```
+1. Install the `onnxruntime` (or `onnxruntime-gpu` if you want to run on GPUs) and `tokenizers` packages using your Python package manager of choice, such as `pip`.
 
-* Run `onnx_standalone.py` to transcribe a wav file
+2. Download the `onnx` files from the Hugging Face Hub to a directory.
 
-  ```shell
-  moonshine/moonshine/demo/onnx_standalone.py --models_dir moonshine_base_onnx --wav_file moonshine/moonshine/assets/beckett.wav
-  ['Ever tried ever failed, no matter try again fail again fail better.']
-  ```
+```shell
+mkdir moonshine_base_onnx
+cd moonshine_base_onnx
+wget https://huggingface.co/UsefulSensors/moonshine/resolve/main/onnx/base/preprocess.onnx
+wget https://huggingface.co/UsefulSensors/moonshine/resolve/main/onnx/base/encode.onnx
+wget https://huggingface.co/UsefulSensors/moonshine/resolve/main/onnx/base/uncached_decode.onnx
+wget https://huggingface.co/UsefulSensors/moonshine/resolve/main/onnx/base/cached_decode.onnx
+cd ..
+```
+
+3. Run `onnx_standalone.py` to transcribe a WAV file.
+
+```shell
+moonshine/moonshine/demo/onnx_standalone.py --models_dir moonshine_base_onnx --wav_file moonshine/moonshine/assets/beckett.wav
+['Ever tried ever failed, no matter try again fail again fail better.']
+```
+
+
+# Demo: Live captioning from microphone input
+
+https://github.com/user-attachments/assets/aa65ef54-d4ac-4d31-864f-222b0e6ccbd3
+
+This folder contains a demo of live captioning from microphone input, built on Moonshine. The script runs the Moonshine ONNX model on segments of speech detected in the microphone signal using a voice activity detector called [`silero-vad`](https://github.com/snakers4/silero-vad). The script prints scrolling text or "live captions" assembled from the model predictions to the console.
+
+The following steps have been tested in a `uv` (v0.4.25) virtual environment on these platforms:
+
+- macOS 14.1 on a MacBook Pro M3
+- Ubuntu 22.04 VM on a MacBook Pro M2
+- Ubuntu 24.04 VM on a MacBook Pro M2
+
+## Installation
+
+### 0. Setup environment
+
+Steps to set up a virtual environment are available in the [top-level README](/README.md) of this repo. Note that this demo is standalone and does not require installing the `useful-moonshine` package. Instead, you will clone the repo.
+
+### 1. Clone the repo and install extra dependencies
+
+You will need to clone the repo first:
+
+```shell
+git clone git@github.com:usefulsensors/moonshine.git
+```
+
+Then install the demo's requirements:
+
+```shell
+uv pip install -r moonshine/moonshine/demo/requirements.txt
+```
+
+There is a dependency on `torch` because of the `silero-vad` package. There is no
+dependency on `tensorflow`.
+
+#### Ubuntu: Install PortAudio
+
+Ubuntu needs PortAudio for the `sounddevice` package to run. The latest version (19.6.0-1.2build3 as of writing) is suitable.
+
+```shell
+sudo apt update
+sudo apt upgrade -y
+sudo apt install -y portaudio19-dev
+```
+
+## Running the demo
+
+First, check that your microphone is connected and that the volume setting is not muted in your host OS or system audio drivers. Then, run the script:
+
+```shell
+python3 moonshine/moonshine/demo/live_captions.py
+```
+
+By default, this will run the demo with the Moonshine Base model using the ONNX runtime. The optional `--model_name` argument sets the model to use: supported values are `moonshine/base` and `moonshine/tiny`.
+
+When running, speak in English into the microphone and observe live captions in the terminal. Quit the demo with `Ctrl+C` to see a full printout of the captions.
+
+An example run on an Ubuntu 24.04 VM on a MacBook Pro M2 with the Moonshine base ONNX
+model:
+
+```console
+(env_moonshine_demo) parallels@ubuntu-linux-2404:~$ python3 moonshine/moonshine/demo/live_captions.py
+Error in cpuinfo: prctl(PR_SVE_GET_VL) failed
+Loading Moonshine model 'moonshine/base' (ONNX runtime) ...
+Press Ctrl+C to quit live captions.
+
+hine base model being used to generate live captions while someone is speaking. ^C
+
+    model_name : moonshine/base
+    MIN_REFRESH_SECS : 0.2s
+
+    number inferences : 25
+    mean inference time : 0.14s
+    model realtime factor : 27.82x
+
+Cached captions.
+This is an example of the Moonshine base model being used to generate live captions while someone is speaking.
+(env_moonshine_demo) parallels@ubuntu-linux-2404:~$
+```
+
+For comparison, this is the `faster-whisper` base model on the same instance.
+The value of `MIN_REFRESH_SECS` was increased because the model inference is too slow
+for a value of 0.2 seconds. Our Moonshine base model runs ~7x faster for this
+example.
+
+```console
+(env_moonshine_faster_whisper) parallels@ubuntu-linux-2404:~$ python3 moonshine/moonshine/demo/live_captions.py
+Error in cpuinfo: prctl(PR_SVE_GET_VL) failed
+Loading Faster-Whisper float32 base.en model ...
+Press Ctrl+C to quit live captions.
+
+r float32 base model being used to generate captions while someone is speaking. ^C
+
+    model_name : base.en
+    MIN_REFRESH_SECS : 1.2s
+
+    number inferences : 6
+    mean inference time : 1.02s
+    model realtime factor : 4.82x
+
+Cached captions.
+This is an example of the Faster Whisper float32 base model being used to generate captions while someone is speaking.
+(env_moonshine_faster_whisper) parallels@ubuntu-linux-2404:~$
+```
+
+## Script notes
+
+You may customize this script to display Moonshine text transcriptions as you wish.
+
+The script `live_captions.py` loads the English-language version of the Moonshine base ONNX model. It includes logic to detect speech activity and limit the context window of speech fed to the Moonshine model. The returned transcriptions are displayed as scrolling captions. Speech segments with pauses are cached and these cached captions are printed on exit.
+
+### Speech truncation and hallucination
+
+Some hallucinations will be seen when the script is running: one reason is that speech is necessarily truncated to produce the frequent refresh and timeout transcriptions. Truncated speech contains partial or sliced words, for which the model's transcriptions are unpredictable. See the printed captions on script exit for the best results.
+
+### Running on a slower processor
+
+If you run this script on a slower processor, consider using the `tiny` model.
+
+```shell
+python3 ./moonshine/moonshine/demo/live_captions.py --model_name moonshine/tiny
+```
+
+The value of `MIN_REFRESH_SECS` will be ineffective when the model inference time exceeds that value. Conversely, on a faster processor, consider reducing the value of `MIN_REFRESH_SECS` for more frequent caption updates. On a slower processor, you might also consider reducing the value of `MAX_SPEECH_SECS` to avoid the slower inference encountered with longer speech segments.
+
+### Metrics
+
+The metrics shown on program exit will vary based on the talker's speaking style. If the talker speaks with more frequent pauses, the speech segments are shorter and the mean inference time will be lower. This is a feature of the Moonshine model described in [our paper](https://arxiv.org/abs/2410.15608). When benchmarking, use the same speech, e.g., a recording of someone talking.
+
+
+# Citation
+
+If you benefit from our work, please cite us:
+```
+@misc{jeffries2024moonshinespeechrecognitionlive,
+      title={Moonshine: Speech Recognition for Live Transcription and Voice Commands},
+      author={Nat Jeffries and Evan King and Manjunath Kudlur and Guy Nicholson and James Wang and Pete Warden},
+      year={2024},
+      eprint={2410.15608},
+      archivePrefix={arXiv},
+      primaryClass={cs.SD},
+      url={https://arxiv.org/abs/2410.15608},
+}
+```
diff --git a/moonshine/demo/live_captions.py b/moonshine/demo/live_captions.py
new file mode 100644
index 0000000..e8dcf39
--- /dev/null
+++ b/moonshine/demo/live_captions.py
@@ -0,0 +1,202 @@
+"""Live captions from microphone using Moonshine and SileroVAD ONNX models."""
+
+import argparse
+import os
+import sys
+import time
+
+from queue import Queue
+
+import numpy as np
+
+from silero_vad import load_silero_vad, VADIterator
+from sounddevice import InputStream
+from tokenizers import Tokenizer
+
+# Local import of Moonshine ONNX model.
+MOONSHINE_DEMO_DIR = os.path.dirname(__file__)
+sys.path.append(os.path.join(MOONSHINE_DEMO_DIR, ".."))
+
+from onnx_model import MoonshineOnnxModel
+
+SAMPLING_RATE = 16000
+
+CHUNK_SIZE = 512  # Silero VAD requirement with sampling rate 16000.
+LOOKBACK_CHUNKS = 5
+MARKER_LENGTH = 6
+MAX_LINE_LENGTH = 80
+
+# These affect live caption updating - adjust for your platform speed and model.
+MAX_SPEECH_SECS = 15
+MIN_REFRESH_SECS = 0.2
+
+VERBOSE = False
+
+
+class Transcriber(object):
+    def __init__(self, model_name, rate=16000):
+        if rate != 16000:
+            raise ValueError("Moonshine supports sampling rate 16000 Hz.")
+        self.model = MoonshineOnnxModel(model_name=model_name)
+        self.rate = rate
+        assets_dir = f"{os.path.join(os.path.dirname(__file__), '..', 'assets')}"
+        tokenizer_file = f"{assets_dir}{os.sep}tokenizer.json"
+        self.tokenizer = Tokenizer.from_file(str(tokenizer_file))
+
+        self.inference_secs = 0
+        self.number_inferences = 0
+        self.speech_secs = 0
+        self.__call__(np.zeros(int(rate), dtype=np.float32))  # Warmup.
+
+    def __call__(self, speech):
+        """Returns string containing Moonshine transcription of speech."""
+        self.number_inferences += 1
+        self.speech_secs += len(speech) / self.rate
+        start_time = time.time()
+
+        tokens = self.model.generate(speech[np.newaxis, :].astype(np.float32))
+        text = self.tokenizer.decode_batch(tokens)[0]
+
+        self.inference_secs += time.time() - start_time
+        return text
+
+
+def create_input_callback(q):
+    """Callback method for sounddevice InputStream."""
+
+    def input_callback(data, frames, time, status):
+        if status:
+            print(status)
+        q.put((data.copy().flatten(), status))
+
+    return input_callback
+
+
+def end_recording(speech, marker=""):
+    """Transcribes, caches and prints the caption. Clears speech buffer."""
+    if len(marker) != MARKER_LENGTH:
+        raise ValueError("Unexpected marker length.")
+    text = transcribe(speech)
+    caption_cache.append(text + " " + marker)
+    print_captions(text + (" " + marker) if VERBOSE else "", True)
+    speech *= 0.0
+
+
+def print_captions(text, new_cached_caption=False):
+    """Prints right justified on same line, prepending cached captions."""
+    print("\r" + " " * MAX_LINE_LENGTH, end="", flush=True)
+    if len(text) > MAX_LINE_LENGTH:
+        text = text[-MAX_LINE_LENGTH:]
+    elif text != "\n":
+        for caption in caption_cache[::-1]:
+            text = (caption[:-MARKER_LENGTH] if not VERBOSE else caption + " ") + text
+            if len(text) > MAX_LINE_LENGTH:
+                break
+        if len(text) > MAX_LINE_LENGTH:
+            text = text[-MAX_LINE_LENGTH:]
+    text = " " * (MAX_LINE_LENGTH - len(text)) + text
+    print("\r" + text, end="", flush=True)
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser(
+        prog="live_captions",
+        description="Live captioning demo of Moonshine models",
+    )
+    parser.add_argument(
+        "--model_name",
+        help="Model to run the demo with",
+        default="moonshine/base",
+        choices=["moonshine/base", "moonshine/tiny"],
+    )
+    args = parser.parse_args()
+    model_name = args.model_name
+    print(f"Loading Moonshine model '{model_name}' (using ONNX runtime) ...")
+    transcribe = Transcriber(model_name=model_name, rate=SAMPLING_RATE)
+
+    vad_model = load_silero_vad(onnx=True)
+    vad_iterator = VADIterator(
+        model=vad_model,
+        sampling_rate=SAMPLING_RATE,
+        threshold=0.5,
+        min_silence_duration_ms=300,
+    )
+
+    q = Queue()
+    stream = InputStream(
+        samplerate=SAMPLING_RATE,
+        channels=1,
+        blocksize=CHUNK_SIZE,
+        dtype=np.float32,
+        callback=create_input_callback(q),
+    )
+    stream.start()
+
+    caption_cache = []
+    lookback_size = LOOKBACK_CHUNKS * CHUNK_SIZE
+    speech = np.empty(0, dtype=np.float32)
+
+    recording = False
+
+    print("Press Ctrl+C to quit live captions.\n")
+
+    with stream:
+        print_captions("Ready...")
+        try:
+            while True:
+                chunk, status = q.get()
+                if VERBOSE and status:
+                    print(status)
+
+                speech = np.concatenate((speech, chunk))
+                if not recording:
+                    speech = speech[-lookback_size:]
+
+                speech_dict = vad_iterator(chunk)
+                if speech_dict:
+                    if "start" in speech_dict and not recording:
+                        recording = True
+                        start_time = time.time()
+
+                    if "end" in speech_dict and recording:
+                        recording = False
+                        end_recording(speech, "")
+
+                elif recording:
+                    # Possible speech truncation can cause hallucination.
+
+                    if (len(speech) / SAMPLING_RATE) > MAX_SPEECH_SECS:
+                        recording = False
+                        end_recording(speech, "")
+                        # Soft reset without affecting VAD model state.
+                        vad_iterator.triggered = False
+                        vad_iterator.temp_end = 0
+                        vad_iterator.current_sample = 0
+
+                    if (time.time() - start_time) > MIN_REFRESH_SECS:
+                        print_captions(transcribe(speech))
+                        start_time = time.time()
+
+        except KeyboardInterrupt:
+            stream.close()
+
+            if recording:
+                while not q.empty():
+                    chunk, _ = q.get()
+                    speech = np.concatenate((speech, chunk))
+                end_recording(speech, "")
+
+            print(f"""
+
+    model_name : {model_name}
+    MIN_REFRESH_SECS : {MIN_REFRESH_SECS}s
+
+    number inferences : {transcribe.number_inferences}
+    mean inference time : {(transcribe.inference_secs / transcribe.number_inferences):.2f}s
+    model realtime factor : {(transcribe.speech_secs / transcribe.inference_secs):0.2f}x
+""")
+            if caption_cache:
+                print("Cached captions.")
+                for caption in caption_cache:
+                    print(caption[:-MARKER_LENGTH], end="", flush=True)
+            print("")
diff --git a/moonshine/demo/requirements.txt b/moonshine/demo/requirements.txt
new file mode 100644
index 0000000..95bfbf2
--- /dev/null
+++ b/moonshine/demo/requirements.txt
@@ -0,0 +1,3 @@
+silero_vad
+sounddevice
+tokenizers
\ No newline at end of file
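
As a companion to the Metrics notes in the README and the exit printout in `live_captions.py`, the two reported figures reduce to simple ratios over the counters that `Transcriber` accumulates. The sketch below is illustrative only: `summarize_metrics` is a hypothetical helper that is not part of this change, and the numbers in the usage lines are made up rather than measured.

```python
def summarize_metrics(speech_secs, inference_secs, number_inferences):
    """Return (mean inference time in seconds, realtime factor).

    Hypothetical helper, not part of live_captions.py. It mirrors the two
    ratios printed on exit:
      mean inference time = inference_secs / number_inferences
      realtime factor     = speech_secs / inference_secs
    """
    if number_inferences <= 0 or inference_secs <= 0:
        raise ValueError("Need at least one timed inference.")
    mean_inference_secs = inference_secs / number_inferences
    realtime_factor = speech_secs / inference_secs
    return mean_inference_secs, realtime_factor


if __name__ == "__main__":
    # Made-up counters for illustration, not measurements.
    mean_secs, rtf = summarize_metrics(
        speech_secs=30.0, inference_secs=1.5, number_inferences=10
    )
    print(f"mean inference time   : {mean_secs:.2f}s")
    print(f"model realtime factor : {rtf:.2f}x")
```

Because the realtime factor divides seconds of speech by seconds of inference, a value above 1 means transcription finishes faster than the audio it covers. Both ratios depend on segment length, which is why the README suggests benchmarking with the same recording.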