diff --git a/README.md b/README.md
index 8dc84d3..b289128 100644
--- a/README.md
+++ b/README.md
@@ -51,6 +51,7 @@ This repo hosts the inference code for Moonshine.
 - [Examples](#examples)
   - [Onnx standalone](#onnx-standalone)
   - [Live Captions](#live-captions)
+  - [Long-form transcription](#long-form-transcription)
   - [CTranslate2](#ctranslate2)
 - [TODO](#todo)
 - [Citation](#citation)
@@ -131,17 +132,21 @@ The latest versions of the Onnx Moonshine models are available on HuggingFace at
 
 You can try the Moonshine models with live input from a microphone on many platforms with the [live captions demo](/moonshine/demo/README.md#demo-live-captioning-from-microphone-input).
 
+### Long-form transcription
+
+A common approach to "long-form" transcription is to segment the speech, transcribe each segment with the model, and then assemble the per-segment results into a single transcription. One way to segment the speech is to locate pauses in it. You can try the Moonshine models with this segmentation method on long-form WAV files in the [file transcription demo](/moonshine/demo/README.md#demo-standalone-long-form-file-transcription).
+
 ### CTranslate2
 
 The files for the CTranslate2 versions of Moonshine are available at [huggingface.co/UsefulSensors/moonshine/tree/main/ctranslate2](https://huggingface.co/UsefulSensors/moonshine/tree/main/ctranslate2), but they require [a pull request to be merged](https://github.com/OpenNMT/CTranslate2/pull/1808) before they can be used with the mainline version of the framework. Until then, you should be able to try them with [our branch](https://github.com/njeffrie/CTranslate2/tree/master), with [this example script](https://github.com/OpenNMT/CTranslate2/pull/1808#issuecomment-2439725339).
 
 ## TODO
 
 * [x] Live transcription demo
- 
+
 * [x] ONNX model
- 
+
 * [ ] CTranslate2 support
- 
+
 * [ ] MLX support
 * [ ] Fine-tuning code
@@ -152,12 +157,12 @@ The files for the CTranslate2 versions of Moonshine are available at [huggingfac
 If you benefit from our work, please cite us:
 ```
 @misc{jeffries2024moonshinespeechrecognitionlive,
-      title={Moonshine: Speech Recognition for Live Transcription and Voice Commands}, 
+      title={Moonshine: Speech Recognition for Live Transcription and Voice Commands},
       author={Nat Jeffries and Evan King and Manjunath Kudlur and Guy Nicholson and James Wang and Pete Warden},
       year={2024},
       eprint={2410.15608},
       archivePrefix={arXiv},
       primaryClass={cs.SD},
-      url={https://arxiv.org/abs/2410.15608}, 
+      url={https://arxiv.org/abs/2410.15608},
 }
 ```
diff --git a/moonshine/assets/a_tale_of_two_cities.wav b/moonshine/assets/a_tale_of_two_cities.wav
new file mode 100755
index 0000000..fbc93d9
Binary files /dev/null and b/moonshine/assets/a_tale_of_two_cities.wav differ
diff --git a/moonshine/demo/README.md b/moonshine/demo/README.md
index 344f97d..b18af23 100644
--- a/moonshine/demo/README.md
+++ b/moonshine/demo/README.md
@@ -6,14 +6,19 @@ Moonshine ASR models.
 
 - [Moonshine Demos](#moonshine-demos)
 - [Demo: Standalone file transcription with ONNX](#demo-standalone-file-transcription-with-onnx)
 - [Demo: Live captioning from microphone input](#demo-live-captioning-from-microphone-input)
-  - [Installation.](#installation)
+  - [Installation](#installation)
     - [0. Setup environment](#0-setup-environment)
     - [1. Clone the repo and install extra dependencies](#1-clone-the-repo-and-install-extra-dependencies)
+      - [Ubuntu: Install PortAudio](#ubuntu-install-portaudio)
   - [Running the demo](#running-the-demo)
   - [Script notes](#script-notes)
     - [Speech truncation and hallucination](#speech-truncation-and-hallucination)
     - [Running on a slower processor](#running-on-a-slower-processor)
     - [Metrics](#metrics)
+- [Demo: Standalone long-form file transcription](#demo-standalone-long-form-file-transcription)
+  - [Installation](#installation-1)
+  - [Running the demo](#running-the-demo-1)
+  - [Script notes](#script-notes-1)
 - [Citation](#citation)
 
@@ -176,6 +181,53 @@ The value of `MIN_REFRESH_SECS` will be ineffective when the model inference time exceeds that value.
 
 The metrics shown on program exit will vary based on the talker's speaking
 style. If the talker speaks with more frequent pauses, the speech segments are
 shorter and the mean inference time will be lower. This is a feature of the
 Moonshine model described in [our paper](https://arxiv.org/abs/2410.15608).
 When benchmarking, use the same speech, e.g., a recording of someone talking.
 
+# Demo: Standalone long-form file transcription
+
+The script [`file_transcription.py`](/moonshine/demo/file_transcription.py)
+demonstrates "long-form" transcription of a WAV file with a Moonshine ONNX
+model. By default, the demo loads a 1.5-minute WAV file.
+
+## Installation
+
+Follow the [same installation steps](#installation) used for the live captions
+demo.
+
+## Running the demo
+
+``` shell
+python3 moonshine/moonshine/demo/file_transcription.py
+```
+
+An example run on an Ubuntu 22.04 VM on an x86 MacBook Pro with the Moonshine
+base ONNX model:
+
+```console
+(env_moonshine_demo) parallels@ubuntu-linux-22-04-02-desktop:~$ python3 moonshine/moonshine/demo/file_transcription.py
+
+It was the best of times, it was the worst of times. It was the age of wisdom, it was the age of foolishness. It was the epoch of belief, it was the epoch of incredulity. It was the season of light, it was the season of darkness. It was the spring of hope, it was the winter of despair. We had everything before us, we had nothing before us. We were all going direct to heaven, we were all going direct the other way. In short, the period was so far like the present period that some of its noisiest authorities insisted on its being received for good or for evil in the superlative degree of comparison only. There were a king with a large jaw and a queen with a plain face on the throne of England. There were a king with a large jaw and a queen with a fair face on the throne of France. In both countries it was clearer than crystal to the lords of the state preserves of loaves and fishes that things in general were settled forever. It was the year of our Lord 1775.
+
+ model realtime factor: 10.31x
+
+(env_moonshine_demo) parallels@ubuntu-linux-22-04-02-desktop:~$
+```
+
+You may load other WAV files using the `--wav_path` command line argument.
+
+## Script notes
+
+This demo script uses the
+[`silero-vad`](https://github.com/snakers4/silero-vad) voice activity detector
+to segment the speech based on the talker's pauses. The parameters used in our
+script are the same as those used in faster-whisper's
+[implementation](https://github.com/SYSTRAN/faster-whisper/blob/814472fdbf7faf5d77d65cdb81b1528c0dead02a/faster_whisper/vad.py#L14)
+for silero-vad. We validated these parameters for the Moonshine base model by
+measuring WER on several long-form datasets and saw values similar to those of
+the OpenAI Whisper base.en and faster-whisper base.en models.
+
+For this demo we adopt a simple strategy: the predicted texts of the segments
+are concatenated in order. There are other published methods, such as
+transcribing overlapping segments and matching their common word sequences, so
+we see room for improvement on our approach. For instance, such methods may
+produce more accurate transcriptions for talkers who rarely pause while
+speaking for extended periods.
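+
+As a rough sketch of that overlap-and-merge idea (it is not used by this
+demo), the transcript of each overlapping segment could be joined to the
+running transcript by locating the longest run of words the two share near
+the boundary. The helper below is only illustrative; its name and the
+`max_overlap` window are arbitrary choices.
+
+```python
+# Illustrative sketch only (not part of the demo): merge the word list of an
+# overlapping chunk into the running transcript by dropping the words already
+# covered by the longest shared run near the boundary.
+from difflib import SequenceMatcher
+
+
+def merge_overlapping(prev_words, next_words, max_overlap=20):
+    tail = prev_words[-max_overlap:]
+    head = next_words[:max_overlap]
+    match = SequenceMatcher(a=tail, b=head, autojunk=False).find_longest_match(
+        0, len(tail), 0, len(head)
+    )
+    if match.size == 0:
+        # No shared words near the boundary: fall back to plain concatenation.
+        return prev_words + next_words
+    keep = len(prev_words) - len(tail) + match.a + match.size
+    return prev_words[:keep] + next_words[match.b + match.size :]
+
+
+print(" ".join(merge_overlapping(
+    "the quick brown fox jumps".split(),
+    "brown fox jumps over the lazy dog".split(),
+)))  # the quick brown fox jumps over the lazy dog
+```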
 
 # Citation
diff --git a/moonshine/demo/file_transcription.py b/moonshine/demo/file_transcription.py
new file mode 100644
index 0000000..8f66ebf
--- /dev/null
+++ b/moonshine/demo/file_transcription.py
@@ -0,0 +1,88 @@
+"""WAV file long-form transcription with Moonshine ONNX models."""
+
+import argparse
+import os
+import sys
+import time
+import wave
+
+import numpy as np
+import tokenizers
+
+from silero_vad import get_speech_timestamps, load_silero_vad
+
+MOONSHINE_DEMO_DIR = os.path.dirname(__file__)
+sys.path.append(os.path.join(MOONSHINE_DEMO_DIR, ".."))
+
+from onnx_model import MoonshineOnnxModel
+
+
+def main(model_name, wav_path):
+    model = MoonshineOnnxModel(model_name=model_name)
+
+    tokenizer = tokenizers.Tokenizer.from_file(
+        os.path.join(MOONSHINE_DEMO_DIR, "..", "assets", "tokenizer.json")
+    )
+
+    with wave.open(wav_path) as f:
+        params = f.getparams()
+        assert (
+            params.nchannels == 1
+            and params.framerate == 16000
+            and params.sampwidth == 2
+        ), "WAV file must have 1 channel, 16kHz rate, and int16 precision."
+        audio = f.readframes(params.nframes)
+        audio = np.frombuffer(audio, np.int16) / np.iinfo(np.int16).max
+        audio = audio.astype(np.float32)
+
+    vad_model = load_silero_vad()
+    speech_timestamps = get_speech_timestamps(
+        audio,
+        vad_model,
+        max_speech_duration_s=30,
+        min_silence_duration_ms=2000,
+        min_speech_duration_ms=250,
+        speech_pad_ms=400,
+    )
+    chunks = [audio[ts["start"] : ts["end"]] for ts in speech_timestamps]
+
+    chunks_length = 0
+    transcription = ""
+
+    start_time = time.time()
+
+    for chunk in chunks:
+        tokens = model.generate(chunk[None, ...])
+        transcription += tokenizer.decode_batch(tokens)[0] + " "
+
+        chunks_length += len(chunk)
+
+    time_took = time.time() - start_time
+
+    print(f"""
+{transcription}
+
+ model realtime factor: {((chunks_length / 16000) / time_took):.2f}x
+""")
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser(
+        prog="file_transcription.py",
+        description="Standalone file transcription with Moonshine ONNX models.",
+    )
+    parser.add_argument(
+        "--model_name",
+        help="Model to run the demo with.",
+        default="moonshine/base",
+        choices=["moonshine/base", "moonshine/tiny"],
+    )
+    parser.add_argument(
+        "--wav_path",
+        help="Path to speech WAV file.",
+        default=os.path.join(
+            MOONSHINE_DEMO_DIR, "..", "assets", "a_tale_of_two_cities.wav"
+        ),
+    )
+    args = parser.parse_args()
+    main(**vars(args))