diff --git a/bundles/org.openhab.voice.whisperstt/README.md b/bundles/org.openhab.voice.whisperstt/README.md index b5a88a390bcdc..e75f0dc592b33 100644 --- a/bundles/org.openhab.voice.whisperstt/README.md +++ b/bundles/org.openhab.voice.whisperstt/README.md @@ -5,6 +5,8 @@ It also uses [libfvad](https://github.com/dpirch/libfvad) for voice activity det [Whisper.cpp](https://github.com/ggerganov/whisper.cpp) is a high-optimized lightweight c++ implementation of [whisper](https://github.com/openai/whisper) that allows to easily integrate it in different platforms and applications. +Alternatively, if you do not want to perform speech-to-text on the computer hosting openHAB, this add-on can consume an OpenAI/Whisper compatible transcription API. + Whisper enables speech recognition for multiple languages and dialects: english, chinese, german, spanish, russian, korean, french, japanese, portuguese, turkish, polish, catalan, dutch, arabic, swedish, @@ -15,9 +17,11 @@ marathi, punjabi, sinhala, khmer, shona, yoruba, somali, afrikaans, occitan, geo uzbek, faroese, haitian, pashto, turkmen, nynorsk, maltese, sanskrit, luxembourgish, myanmar, tibetan, tagalog, malagasy, assamese, tatar, lingala, hausa, bashkir, javanese and sundanese. -## Supported platforms +## Local mode (offline) + +### Supported platforms -This add-on uses some native binaries to work. +This add-on uses some native binaries to work when performing offline recognition. You can find here the used [whisper.cpp Java wrapper](https://github.com/GiviMAD/whisper-jni) and [libfvad Java wrapper](https://github.com/GiviMAD/libfvad-jni). The following platforms are supported: @@ -28,7 +32,7 @@ The following platforms are supported: The native binaries for those platforms are included in this add-on provided with the openHAB distribution. -## CPU compatibility +### CPU compatibility To use this binding it's recommended to use a device at least as powerful as the RaspberryPI 5 with a modern CPU. The execution times on Raspberry PI 4 are x2, so just the tiny model can be run on under 5 seconds. @@ -40,18 +44,18 @@ You can check those flags on Windows using a program like `CPU-Z`. If you are going to use the binding in a `arm64` host the CPU should support the flags: `fphp`. You can check those flags on linux using the terminal with `lscpu`. -## Transcription time +### Transcription time On a Raspberry PI 5, the approximate transcription times are: | model | exec time | -| ---------- | --------: | +|------------|----------:| | tiny.bin | 1.5s | | base.bin | 3s | | small.bin | 8.5s | | medium.bin | 17s | -## Configuring the model +### Configuring the model Before you can use this service you should configure your model. @@ -64,7 +68,7 @@ You should place the downloaded .bin model in '\/whisper/' so Remember to check that you have enough RAM to load the model, estimated RAM consumption can be checked on the huggingface link. -## Using alternative whisper.cpp library +### Using alternative whisper.cpp library It's possible to use your own build of the whisper.cpp shared library with this add-on. @@ -76,7 +80,7 @@ In the [Whisper.cpp](https://github.com/ggerganov/whisper.cpp) README you can fi Note: You need to restart openHAB to reload the library. -## Grammar +### Grammar The whisper.cpp library allows to define a grammar to alter the transcription results without fine-tuning the model. @@ -99,6 +103,14 @@ tv_channel ::= ("set ")? "tv channel to " [0-9]+ You can provide the grammar and enable its usage using the binding configuration. 
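For reference, enabling the grammar from the textual configuration could look like the sketch below. This is only an illustration: the keys are the `org.openhab.voice.whisperstt` keys shown in the Configuration section further down, but the inline single-rule grammar value is an assumption, and longer multi-line grammars are easier to manage through the configuration UI.

```ini
org.openhab.voice.whisperstt:useGrammar=true
org.openhab.voice.whisperstt:grammarPenalty=80.0
org.openhab.voice.whisperstt:grammarLines=root ::= " set tv channel to " [0-9]+ "."
```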
+## API mode + +You can also use this add-on with a remote API that is compatible with the 'transcription' API from OpenAI. Online services exposing such an API may require an API key (paid services, such as OpenAI). + +You can host your own compatible service elsewhere on your network, with third-party software such as faster-whisper-server. + +Please note that API mode also uses libfvad for voice activity detection, and that grammar parameters are not available. + ## Configuration Use your favorite configuration UI to edit the Whisper settings: @@ -107,6 +119,7 @@ Use your favorite configuration UI to edit the Whisper settings: General options. +- **Mode: LOCAL or API** - Choose either local computation or remote API use. - **Model Name** - Model name. The 'ggml-' prefix and '.bin' extension are optional here but required on the filename. (ex: tiny.en -> ggml-tiny.en.bin) - **Preload Model** - Keep whisper model loaded. - **Single Utterance Mode** - When enabled recognition stops listening after a single utterance. @@ -139,6 +152,13 @@ Configure whisper options. - **Initial Prompt** - Initial prompt for whisper. - **OpenVINO Device** - Initialize OpenVINO encoder. (built-in binaries do not support OpenVINO, this has no effect) - **Use GPU** - Enables GPU usage. (built-in binaries do not support GPU usage, this has no effect) +- **Language** - If specified, speeds up recognition by avoiding language auto-detection. Defaults to the system locale. + +### API Configuration + +- **API key** - Optional API key, for online services that require one. +- **API URL** - You may use your own service and define its URL here. Defaults to the OpenAI transcription API. +- **API model name** - Your hosted service may offer other models. Defaults to 'whisper-1', the only model available from OpenAI. ### Grammar Configuration @@ -199,7 +219,9 @@ In case you would like to set up the service via a text file, create a new file Its contents should look similar to: ```ini +org.openhab.voice.whisperstt:mode=LOCAL org.openhab.voice.whisperstt:modelName=tiny +org.openhab.voice.whisperstt:language=en org.openhab.voice.whisperstt:initSilenceSeconds=0.3 org.openhab.voice.whisperstt:removeSilence=true org.openhab.voice.whisperstt:stepSeconds=0.3 @@ -229,6 +251,9 @@ org.openhab.voice.whisperstt:useGPU=false org.openhab.voice.whisperstt:useGrammar=false org.openhab.voice.whisperstt:grammarPenalty=80.0 org.openhab.voice.whisperstt:grammarLines= +org.openhab.voice.whisperstt:apiKey=mykeyaaaa +org.openhab.voice.whisperstt:apiUrl=https://api.openai.com/v1/audio/transcriptions +org.openhab.voice.whisperstt:apiModelName=whisper-1 ``` ### Default Speech-to-Text Configuration diff --git a/bundles/org.openhab.voice.whisperstt/src/main/java/org/openhab/voice/whisperstt/internal/WhisperSTTConfiguration.java b/bundles/org.openhab.voice.whisperstt/src/main/java/org/openhab/voice/whisperstt/internal/WhisperSTTConfiguration.java index 0eed735113b4b..57d75afa63e7f 100644 --- a/bundles/org.openhab.voice.whisperstt/src/main/java/org/openhab/voice/whisperstt/internal/WhisperSTTConfiguration.java +++ b/bundles/org.openhab.voice.whisperstt/src/main/java/org/openhab/voice/whisperstt/internal/WhisperSTTConfiguration.java @@ -146,4 +146,29 @@ public class WhisperSTTConfiguration { * Print whisper.cpp library logs as binding debug logs.
*/ public boolean enableWhisperLog; + /** + * LOCAL to use the embedded whisper.cpp library, API to use an external transcription API + */ + public Mode mode = Mode.LOCAL; + /** + * If mode is set to API, use this URL + */ + public String apiUrl = "https://api.openai.com/v1/audio/transcriptions"; + /** + * If mode is set to API, use this API key to access apiUrl + */ + public String apiKey = ""; + /** + * If specified, speeds up recognition by avoiding language auto-detection + */ + public String language = ""; + /** + * Model name (API only) + */ + public String apiModelName = "whisper-1"; + + public static enum Mode { + LOCAL, + API; + } } diff --git a/bundles/org.openhab.voice.whisperstt/src/main/java/org/openhab/voice/whisperstt/internal/WhisperSTTService.java b/bundles/org.openhab.voice.whisperstt/src/main/java/org/openhab/voice/whisperstt/internal/WhisperSTTService.java index 00d55590d9f50..38d3ea06a03ce 100644 --- a/bundles/org.openhab.voice.whisperstt/src/main/java/org/openhab/voice/whisperstt/internal/WhisperSTTService.java +++ b/bundles/org.openhab.voice.whisperstt/src/main/java/org/openhab/voice/whisperstt/internal/WhisperSTTService.java @@ -12,12 +12,10 @@ */ package org.openhab.voice.whisperstt.internal; -import static org.openhab.voice.whisperstt.internal.WhisperSTTConstants.SERVICE_CATEGORY; -import static org.openhab.voice.whisperstt.internal.WhisperSTTConstants.SERVICE_ID; -import static org.openhab.voice.whisperstt.internal.WhisperSTTConstants.SERVICE_NAME; -import static org.openhab.voice.whisperstt.internal.WhisperSTTConstants.SERVICE_PID; +import static org.openhab.voice.whisperstt.internal.WhisperSTTConstants.*; import java.io.ByteArrayInputStream; +import java.io.ByteArrayOutputStream; import java.io.FileOutputStream; import java.io.IOException; import java.nio.ByteBuffer; @@ -32,7 +30,9 @@ import java.util.Locale; import java.util.Map; import java.util.Set; +import java.util.concurrent.ExecutionException; import java.util.concurrent.ScheduledExecutorService; +import java.util.concurrent.TimeoutException; import java.util.concurrent.atomic.AtomicBoolean; import javax.sound.sampled.AudioFileFormat; @@ -41,6 +41,13 @@ import org.eclipse.jdt.annotation.NonNullByDefault; import org.eclipse.jdt.annotation.Nullable; +import org.eclipse.jetty.client.HttpClient; +import org.eclipse.jetty.client.api.ContentResponse; +import org.eclipse.jetty.client.api.Request; +import org.eclipse.jetty.client.util.InputStreamContentProvider; +import org.eclipse.jetty.client.util.MultiPartContentProvider; +import org.eclipse.jetty.client.util.StringContentProvider; +import org.eclipse.jetty.http.HttpMethod; import org.openhab.core.OpenHAB; import org.openhab.core.audio.AudioFormat; import org.openhab.core.audio.AudioStream; @@ -48,6 +55,7 @@ import org.openhab.core.common.ThreadPoolManager; import org.openhab.core.config.core.ConfigurableService; import org.openhab.core.config.core.Configuration; +import org.openhab.core.io.net.http.HttpClientFactory; import org.openhab.core.io.rest.LocaleService; import org.openhab.core.voice.RecognitionStartEvent; import org.openhab.core.voice.RecognitionStopEvent; @@ -57,6 +65,7 @@ import org.openhab.core.voice.STTServiceHandle; import org.openhab.core.voice.SpeechRecognitionErrorEvent; import org.openhab.core.voice.SpeechRecognitionEvent; +import org.openhab.voice.whisperstt.internal.WhisperSTTConfiguration.Mode; import org.openhab.voice.whisperstt.internal.utils.VAD; import org.osgi.framework.Constants; import org.osgi.service.component.annotations.Activate; @@ -96,10
+105,13 @@ public class WhisperSTTService implements STTService { private @Nullable WhisperContext context; private @Nullable WhisperGrammar grammar; private @Nullable WhisperJNI whisper; + private boolean isWhisperLibAlreadyLoaded = false; + private final HttpClientFactory httpClientFactory; @Activate - public WhisperSTTService(@Reference LocaleService localeService) { + public WhisperSTTService(@Reference LocaleService localeService, @Reference HttpClientFactory httpClientFactory) { this.localeService = localeService; + this.httpClientFactory = httpClientFactory; } @Activate @@ -108,7 +120,8 @@ protected void activate(Map config) { if (!Files.exists(WHISPER_FOLDER)) { Files.createDirectory(WHISPER_FOLDER); } - WhisperJNI.loadLibrary(getLoadOptions()); + this.config = new Configuration(config).as(WhisperSTTConfiguration.class); + loadWhisperLibraryIfNeeded(); VoiceActivityDetector.loadLibrary(); whisper = new WhisperJNI(); } catch (IOException | RuntimeException e) { @@ -117,6 +130,13 @@ protected void activate(Map config) { configChange(config); } + private void loadWhisperLibraryIfNeeded() throws IOException { + if (config.mode == Mode.LOCAL && !isWhisperLibAlreadyLoaded) { + WhisperJNI.loadLibrary(getLoadOptions()); + isWhisperLibAlreadyLoaded = true; + } + } + private WhisperJNI.LoadOptions getLoadOptions() { Path libFolder = Paths.get("/usr/local/lib"); Path libFolderWin = Paths.get("/Windows/System32"); @@ -167,14 +187,27 @@ protected void deactivate(Map config) { private void configChange(Map config) { this.config = new Configuration(config).as(WhisperSTTConfiguration.class); - WhisperJNI.setLibraryLogger(this.config.enableWhisperLog ? this::onWhisperLog : null); WhisperGrammar grammar = this.grammar; if (grammar != null) { grammar.close(); this.grammar = null; } + + // API mode + if (this.config.mode == Mode.API) { + try { + unloadContext(); + } catch (IOException e) { + logger.warn("IOException unloading model: {}", e.getMessage()); + } + return; + } + + // Local mode WhisperJNI whisper; try { + loadWhisperLibraryIfNeeded(); + WhisperJNI.setLibraryLogger(this.config.enableWhisperLog ? 
this::onWhisperLog : null); whisper = getWhisper(); } catch (IOException ignored) { logger.warn("library not loaded, the add-on will not work"); @@ -228,9 +261,17 @@ public String getLabel(@Nullable Locale locale) { @Override public Set getSupportedLocales() { - // as it is not possible to determine the language of the model that was downloaded and setup by the user, it is - // assumed the language of the model is matching the locale of the openHAB server - return Set.of(localeService.getLocale(null)); + // Attempt to create a locale from the configured language + String language = config.language; + Locale modelLocale = localeService.getLocale(null); + if (!language.isBlank()) { + try { + modelLocale = Locale.forLanguageTag(language); + } catch (IllegalArgumentException e) { + logger.warn("Invalid language '{}', defaulting to server locale", language); + } + } + return Set.of(modelLocale); } @Override @@ -246,33 +287,18 @@ public Set getSupportedFormats() { public STTServiceHandle recognize(STTListener sttListener, AudioStream audioStream, Locale locale, Set set) throws STTException { AtomicBoolean aborted = new AtomicBoolean(false); - WhisperContext ctx = null; - WhisperState state = null; try { - var whisper = getWhisper(); - ctx = getContext(); - logger.debug("Creating whisper state..."); - state = whisper.initState(ctx); - logger.debug("Whisper state created"); logger.debug("Creating VAD instance..."); - final int nSamplesStep = (int) (config.stepSeconds * (float) WHISPER_SAMPLE_RATE); + final int nSamplesStep = (int) (config.stepSeconds * WHISPER_SAMPLE_RATE); VAD vad = new VAD(VoiceActivityDetector.Mode.valueOf(config.vadMode), WHISPER_SAMPLE_RATE, nSamplesStep, config.vadStep, config.vadSensitivity); logger.debug("VAD instance created"); sttListener.sttEventReceived(new RecognitionStartEvent()); - backgroundRecognize(whisper, ctx, state, nSamplesStep, locale, sttListener, audioStream, vad, aborted); + backgroundRecognize(nSamplesStep, locale, sttListener, audioStream, vad, aborted); } catch (IOException e) { - if (ctx != null && !config.preloadModel) { - ctx.close(); - } - if (state != null) { - state.close(); - } throw new STTException("Exception during initialization", e); } - return () -> { - aborted.set(true); - }; + return () -> aborted.set(true); } private WhisperJNI getWhisper() throws IOException { @@ -339,9 +365,8 @@ private void unloadContext() throws IOException { } } - private void backgroundRecognize(WhisperJNI whisper, WhisperContext ctx, WhisperState state, final int nSamplesStep, - Locale locale, STTListener sttListener, AudioStream audioStream, VAD vad, AtomicBoolean aborted) { - var releaseContext = !config.preloadModel; + private void backgroundRecognize(final int nSamplesStep, Locale locale, STTListener sttListener, + AudioStream audioStream, VAD vad, AtomicBoolean aborted) { final int nSamplesMax = config.maxSeconds * WHISPER_SAMPLE_RATE; final int nSamplesMin = (int) (config.minSeconds * (float) WHISPER_SAMPLE_RATE); final int nInitSilenceSamples = (int) (config.initSilenceSeconds * (float) WHISPER_SAMPLE_RATE); @@ -353,21 +378,17 @@ private void backgroundRecognize(WhisperJNI whisper, WhisperContext ctx, Whisper logger.debug("Max silence samples {}", nMaxSilenceSamples); // used to store the step samples in libfvad wanted format 16-bit int final short[] stepAudioSamples = new short[nSamplesStep]; - // used to store the full samples in whisper wanted format 32-bit float - final float[] audioSamples = new float[nSamplesMax]; + // used to store the full retained 
samples for whisper + final short[] audioSamples = new short[nSamplesMax]; executor.submit(() -> { int audioSamplesOffset = 0; int silenceSamplesCounter = 0; int nProcessedSamples = 0; - int numBytesRead; boolean voiceDetected = false; String transcription = ""; - String tempTranscription = ""; - VAD.@Nullable VADResult lastVADResult; VAD.@Nullable VADResult firstConsecutiveSilenceVADResult = null; try { - try (state; // - audioStream; // + try (audioStream; // vad) { if (AudioFormat.CONTAINER_WAVE.equals(audioStream.getFormat().getContainer())) { AudioWaveUtils.removeFMT(audioStream); @@ -376,10 +397,9 @@ private void backgroundRecognize(WhisperJNI whisper, WhisperContext ctx, Whisper .order(ByteOrder.LITTLE_ENDIAN); // init remaining to full capacity int remaining = captureBuffer.capacity(); - WhisperFullParams params = getWhisperFullParams(ctx, locale); while (!aborted.get()) { // read until no remaining so we get the complete step samples - numBytesRead = audioStream.read(captureBuffer.array(), captureBuffer.capacity() - remaining, + int numBytesRead = audioStream.read(captureBuffer.array(), captureBuffer.capacity() - remaining, remaining); if (aborted.get() || numBytesRead == -1) { break; @@ -395,17 +415,15 @@ private void backgroundRecognize(WhisperJNI whisper, WhisperContext ctx, Whisper while (shortBuffer.hasRemaining()) { var position = shortBuffer.position(); short i16BitSample = shortBuffer.get(); - float f32BitSample = Float.min(1f, - Float.max((float) i16BitSample / ((float) Short.MAX_VALUE), -1f)); stepAudioSamples[position] = i16BitSample; - audioSamples[audioSamplesOffset++] = f32BitSample; + audioSamples[audioSamplesOffset++] = i16BitSample; nProcessedSamples++; } // run vad if (nProcessedSamples + nSamplesStep > nSamplesMax - nSamplesStep) { logger.debug("VAD: Skipping, max length reached"); } else { - lastVADResult = vad.analyze(stepAudioSamples); + VAD.@Nullable VADResult lastVADResult = vad.analyze(stepAudioSamples); if (lastVADResult.isVoice()) { voiceDetected = true; logger.debug("VAD: voice detected"); @@ -484,43 +502,26 @@ private void backgroundRecognize(WhisperJNI whisper, WhisperContext ctx, Whisper } } } - // run whisper - logger.debug("running whisper with {} seconds of audio...", - Math.round((((float) audioSamplesOffset) / (float) WHISPER_SAMPLE_RATE) * 100f) / 100f); - long execStartTime = System.currentTimeMillis(); - var result = whisper.fullWithState(ctx, state, params, audioSamples, audioSamplesOffset); - logger.debug("whisper ended in {}ms with result code {}", - System.currentTimeMillis() - execStartTime, result); - // process result - if (result != 0) { - emitSpeechRecognitionError(sttListener); - break; - } - int nSegments = whisper.fullNSegmentsFromState(state); - logger.debug("Available transcription segments {}", nSegments); - if (nSegments == 1) { - tempTranscription = whisper.fullGetSegmentTextFromState(state, 0); + // run whisper, either locally or by remote API + String tempTranscription = (switch (config.mode) { + case LOCAL -> recognizeLocal(audioSamplesOffset, audioSamples, locale.getLanguage()); + case API -> recognizeAPI(audioSamplesOffset, audioSamples, locale.getLanguage()); + }); + + if (tempTranscription != null && !tempTranscription.isBlank()) { if (config.createWAVRecord) { createAudioFile(audioSamples, audioSamplesOffset, tempTranscription, locale.getLanguage()); } + transcription += tempTranscription; if (config.singleUtteranceMode) { logger.debug("single utterance mode, ending transcription"); - transcription = 
tempTranscription; break; - } else { - // start a new transcription segment - transcription += tempTranscription; - tempTranscription = ""; } - } else if (nSegments == 0 && config.singleUtteranceMode) { - logger.debug("Single utterance mode and no results, ending transcription"); - break; - } else if (nSegments > 1) { - // non reachable - logger.warn("Whisper should be configured in single segment mode {}", nSegments); + } else { break; } + // reset state to start with next segment voiceDetected = false; silenceSamplesCounter = 0; @@ -528,10 +529,6 @@ private void backgroundRecognize(WhisperJNI whisper, WhisperContext ctx, Whisper logger.debug("Partial transcription: {}", tempTranscription); logger.debug("Transcription: {}", transcription); } - } finally { - if (releaseContext) { - ctx.close(); - } } // emit result if (!aborted.get()) { @@ -543,7 +540,7 @@ private void backgroundRecognize(WhisperJNI whisper, WhisperContext ctx, Whisper emitSpeechRecognitionNoResultsError(sttListener); } } - } catch (IOException e) { + } catch (STTException | IOException e) { logger.warn("Error running speech to text: {}", e.getMessage()); emitSpeechRecognitionError(sttListener); } catch (UnsatisfiedLinkError e) { @@ -553,7 +550,119 @@ private void backgroundRecognize(WhisperJNI whisper, WhisperContext ctx, Whisper }); } - private WhisperFullParams getWhisperFullParams(WhisperContext context, Locale locale) throws IOException { + @Nullable + private String recognizeLocal(int audioSamplesOffset, short[] audioSamples, String language) throws STTException { + logger.debug("running whisper with {} seconds of audio...", + Math.round((((float) audioSamplesOffset) / (float) WHISPER_SAMPLE_RATE) * 100f) / 100f); + var releaseContext = !config.preloadModel; + + WhisperJNI whisper = null; + WhisperContext ctx = null; + WhisperState state = null; + try { + whisper = getWhisper(); + ctx = getContext(); + logger.debug("Creating whisper state..."); + state = whisper.initState(ctx); + logger.debug("Whisper state created"); + WhisperFullParams params = getWhisperFullParams(ctx, language); + + // convert to local whisper format (float) + float[] floatArray = new float[audioSamples.length]; + for (int i = 0; i < audioSamples.length; i++) { + floatArray[i] = Float.min(1f, Float.max((float) audioSamples[i] / ((float) Short.MAX_VALUE), -1f)); + } + + long execStartTime = System.currentTimeMillis(); + var result = whisper.fullWithState(ctx, state, params, floatArray, audioSamplesOffset); + logger.debug("whisper ended in {}ms with result code {}", System.currentTimeMillis() - execStartTime, + result); + // process result + if (result != 0) { + throw new STTException("Cannot use whisper locally, result code: " + result); + } + int nSegments = whisper.fullNSegmentsFromState(state); + logger.debug("Available transcription segments {}", nSegments); + if (nSegments == 1) { + return whisper.fullGetSegmentTextFromState(state, 0); + } else if (nSegments == 0 && config.singleUtteranceMode) { + logger.debug("Single utterance mode and no results, ending transcription"); + return null; + } else { + // non reachable + logger.warn("Whisper should be configured in single segment mode {}", nSegments); + return null; + } + } catch (IOException e) { + if (state != null) { + state.close(); + } + throw new STTException("Cannot use whisper locally", e); + } finally { + if (releaseContext && ctx != null) { + ctx.close(); + } + } + } + + private String recognizeAPI(int audioSamplesOffset, short[] audioStream, String language) throws STTException { + // 
convert to byte array, Each short has 2 bytes + int size = audioSamplesOffset * 2; + ByteBuffer byteArrayBuffer = ByteBuffer.allocate(size).order(ByteOrder.LITTLE_ENDIAN); + for (int i = 0; i < audioSamplesOffset; i++) { + byteArrayBuffer.putShort(audioStream[i]); + } + javax.sound.sampled.AudioFormat jAudioFormat = new javax.sound.sampled.AudioFormat( + javax.sound.sampled.AudioFormat.Encoding.PCM_SIGNED, WHISPER_SAMPLE_RATE, 16, 1, 2, WHISPER_SAMPLE_RATE, + false); + byte[] byteArray = byteArrayBuffer.array(); + + try { + AudioInputStream audioInputStream = new AudioInputStream(new ByteArrayInputStream(byteArray), jAudioFormat, + audioSamplesOffset); + + // write stream as a WAV file, in a byte array stream : + ByteArrayInputStream byteArrayInputStream = null; + try (ByteArrayOutputStream baos = new ByteArrayOutputStream()) { + AudioSystem.write(audioInputStream, AudioFileFormat.Type.WAVE, baos); + byteArrayInputStream = new ByteArrayInputStream(baos.toByteArray()); + } + + // prepare HTTP request + HttpClient commonHttpClient = httpClientFactory.getCommonHttpClient(); + MultiPartContentProvider multiPartContentProvider = new MultiPartContentProvider(); + multiPartContentProvider.addFilePart("file", "audio.wav", + new InputStreamContentProvider(byteArrayInputStream), null); + multiPartContentProvider.addFieldPart("model", new StringContentProvider(this.config.apiModelName), null); + multiPartContentProvider.addFieldPart("response_format", new StringContentProvider("text"), null); + multiPartContentProvider.addFieldPart("temperature", + new StringContentProvider(Float.toString(this.config.temperature)), null); + if (!language.isBlank()) { + multiPartContentProvider.addFieldPart("language", new StringContentProvider(language), null); + } + Request request = commonHttpClient.newRequest(config.apiUrl).method(HttpMethod.POST) + .content(multiPartContentProvider); + if (!config.apiKey.isBlank()) { + request = request.header("Authorization", "Bearer " + config.apiKey); + } + // execute the request + ContentResponse response = request.send(); + + // check the HTTP status code from the response + int statusCode = response.getStatus(); + if (statusCode < 200 || statusCode >= 300) { + logger.debug("HTTP error: Received status code {}, full error is {}", statusCode, + response.getContentAsString()); + throw new STTException("Failed to retrieve transcription: HTTP status code " + statusCode); + } + return response.getContentAsString(); + + } catch (InterruptedException | TimeoutException | ExecutionException | IOException e) { + throw new STTException("Exception during attempt to get speech recognition result from api", e); + } + } + + private WhisperFullParams getWhisperFullParams(WhisperContext context, String language) throws IOException { WhisperSamplingStrategy strategy = WhisperSamplingStrategy.valueOf(config.samplingStrategy); var params = new WhisperFullParams(strategy); params.temperature = config.temperature; @@ -570,7 +679,7 @@ private WhisperFullParams getWhisperFullParams(WhisperContext context, Locale lo params.grammarPenalty = config.grammarPenalty; } // there is no single language models other than the english ones - params.language = getWhisper().isMultilingual(context) ? locale.getLanguage() : "en"; + params.language = getWhisper().isMultilingual(context) ? 
language : "en"; // implementation assumes this options params.translate = false; params.detectLanguage = false; @@ -605,7 +714,7 @@ private void createSamplesDir() { } } - private void createAudioFile(float[] samples, int size, String transcription, String language) { + private void createAudioFile(short[] samples, int size, String transcription, String language) { createSamplesDir(); javax.sound.sampled.AudioFormat jAudioFormat; ByteBuffer byteBuffer; @@ -615,7 +724,7 @@ private void createAudioFile(float[] samples, int size, String transcription, St WHISPER_SAMPLE_RATE, 16, 1, 2, WHISPER_SAMPLE_RATE, false); byteBuffer = ByteBuffer.allocate(size * 2).order(ByteOrder.LITTLE_ENDIAN); for (int i = 0; i < size; i++) { - byteBuffer.putShort((short) (samples[i] * (float) Short.MAX_VALUE)); + byteBuffer.putShort(samples[i]); } } else { logger.debug("Saving audio file with sample format f32"); @@ -623,7 +732,7 @@ private void createAudioFile(float[] samples, int size, String transcription, St WHISPER_SAMPLE_RATE, 32, 1, 4, WHISPER_SAMPLE_RATE, false); byteBuffer = ByteBuffer.allocate(size * 4).order(ByteOrder.LITTLE_ENDIAN); for (int i = 0; i < size; i++) { - byteBuffer.putFloat(samples[i]); + byteBuffer.putFloat(Float.min(1f, Float.max((float) samples[i] / ((float) Short.MAX_VALUE), -1f))); } } AudioInputStream audioInputStreamTemp = new AudioInputStream(new ByteArrayInputStream(byteBuffer.array()), diff --git a/bundles/org.openhab.voice.whisperstt/src/main/resources/OH-INF/config/config.xml b/bundles/org.openhab.voice.whisperstt/src/main/resources/OH-INF/config/config.xml index c1f08cfb15c22..e4deb556fd032 100644 --- a/bundles/org.openhab.voice.whisperstt/src/main/resources/OH-INF/config/config.xml +++ b/bundles/org.openhab.voice.whisperstt/src/main/resources/OH-INF/config/config.xml @@ -11,7 +11,7 @@ - Configure the VAD mechanisim used to isolate single phrases to feed whisper with. + Configure the VAD mechanism used to isolate single phrases to feed whisper with. @@ -19,7 +19,7 @@ - Define a grammar to improve transcrptions. + Define a grammar to improve transcriptions. @@ -30,9 +30,27 @@ Options added for developers. true + + + Configure the OpenAI compatible API, if you don't want to use the local model. + + + + Use the local model or the OpenAI compatible API. + LOCAL + + + + + - - Model name without extension. + + Model name without extension. Local mode only. + + + + If specified, speeds up recognition by avoiding language auto-detection. Defaults to the system locale. + @@ -225,5 +243,20 @@ false true + + + Key to access the API + + + + + OpenAI compatible API URL. Defaults to the OpenAI transcription service. + https://api.openai.com/v1/audio/transcriptions + + + + Model name to use (API only). Defaults to whisper-1, the only model available from OpenAI. + whisper-1 + diff --git a/bundles/org.openhab.voice.whisperstt/src/main/resources/OH-INF/i18n/whisperstt.properties b/bundles/org.openhab.voice.whisperstt/src/main/resources/OH-INF/i18n/whisperstt.properties index 0780316715b5c..9051bda8e4b99 100644 --- a/bundles/org.openhab.voice.whisperstt/src/main/resources/OH-INF/i18n/whisperstt.properties +++ b/bundles/org.openhab.voice.whisperstt/src/main/resources/OH-INF/i18n/whisperstt.properties @@ -3,6 +3,12 @@ addon.whisperstt.name = Whisper Speech-to-Text addon.whisperstt.description = Whisper STT Service uses the whisper.cpp library to transcript audio data to text.
+voice.config.whisperstt.apiKey.label = API Key +voice.config.whisperstt.apiKey.description = Key to access the API +voice.config.whisperstt.apiModelName.label = API Model +voice.config.whisperstt.apiModelName.description = Model name to use (API only). Defaults to whisper-1, the only model available from OpenAI. +voice.config.whisperstt.apiUrl.label = API URL +voice.config.whisperstt.apiUrl.description = OpenAI compatible API URL. Defaults to the OpenAI transcription service. voice.config.whisperstt.audioContext.label = Audio Context voice.config.whisperstt.audioContext.description = Overwrite the audio context size. (0 to use whisper default context size) voice.config.whisperstt.beamSize.label = Beam Size @@ -24,27 +30,35 @@ voice.config.whisperstt.greedyBestOf.description = Best Of configuration for sam voice.config.whisperstt.group.developer.label = Developer voice.config.whisperstt.group.developer.description = Options added for developers. voice.config.whisperstt.group.grammar.label = Grammar -voice.config.whisperstt.group.grammar.description = Define a grammar to improve transcrptions. +voice.config.whisperstt.group.grammar.description = Define a grammar to improve transcriptions. voice.config.whisperstt.group.messages.label = Info Messages voice.config.whisperstt.group.messages.description = Configure service information messages. +voice.config.whisperstt.group.openaiapi.label = API Configuration Options +voice.config.whisperstt.group.openaiapi.description = Configure the OpenAI compatible API, if you don't want to use the local model. voice.config.whisperstt.group.stt.label = STT Configuration voice.config.whisperstt.group.stt.description = Configure Speech to Text. voice.config.whisperstt.group.vad.label = Voice Activity Detection -voice.config.whisperstt.group.vad.description = Configure the VAD mechanisim used to isolate single phrases to feed whisper with. +voice.config.whisperstt.group.vad.description = Configure the VAD mechanism used to isolate single phrases to feed whisper with. voice.config.whisperstt.group.whisper.label = Whisper Options voice.config.whisperstt.group.whisper.description = Configure the whisper.cpp transcription options. voice.config.whisperstt.initSilenceSeconds.label = Initial Silence Seconds voice.config.whisperstt.initSilenceSeconds.description = Max initial seconds of silence to discard transcription. voice.config.whisperstt.initialPrompt.label = Initial Prompt voice.config.whisperstt.initialPrompt.description = Initial prompt to feed whisper with. +voice.config.whisperstt.language.label = Language +voice.config.whisperstt.language.description = If specified, speeds up recognition by avoiding language auto-detection. Defaults to the system locale. voice.config.whisperstt.maxSeconds.label = Max Transcription Seconds voice.config.whisperstt.maxSeconds.description = Seconds to force transcription before silence detection. voice.config.whisperstt.maxSilenceSeconds.label = Max Silence Seconds voice.config.whisperstt.maxSilenceSeconds.description = Seconds of silence to trigger transcription. voice.config.whisperstt.minSeconds.label = Min Transcription Seconds voice.config.whisperstt.minSeconds.description = Min transcription seconds passed to whisper. -voice.config.whisperstt.modelName.label = Model Name -voice.config.whisperstt.modelName.description = Model name without extension. +voice.config.whisperstt.mode.label = Local Mode or API +voice.config.whisperstt.mode.description = Use the local model or the OpenAI compatible API.
+voice.config.whisperstt.mode.option.LOCAL = Local +voice.config.whisperstt.mode.option.API = OpenAI API +voice.config.whisperstt.modelName.label = Local Model Name +voice.config.whisperstt.modelName.description = Model name without extension. Local mode only. voice.config.whisperstt.openvinoDevice.label = OpenVINO Device voice.config.whisperstt.openvinoDevice.description = Initialize OpenVINO encoder. (built-in binaries do not support OpenVINO, this has no effect) voice.config.whisperstt.preloadModel.label = Preload Model
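As a closing reference for the API mode described in the README changes above: the add-on simply POSTs a multipart form (file, model, response_format, temperature and, when available, language) to the configured endpoint, with an optional Bearer token. The standalone sketch below reproduces that request with the same Jetty client API the add-on uses; the class name, the local faster-whisper-server URL/port and the `audio.wav` path are illustrative assumptions, not part of the add-on (inside openHAB the client comes from `HttpClientFactory` rather than being created manually).

```java
import java.nio.file.Path;

import org.eclipse.jetty.client.HttpClient;
import org.eclipse.jetty.client.api.ContentResponse;
import org.eclipse.jetty.client.api.Request;
import org.eclipse.jetty.client.util.MultiPartContentProvider;
import org.eclipse.jetty.client.util.PathContentProvider;
import org.eclipse.jetty.client.util.StringContentProvider;
import org.eclipse.jetty.http.HttpMethod;

public class TranscriptionApiCheck {

    public static void main(String[] args) throws Exception {
        // Placeholders: point this at your own OpenAI compatible server and audio sample.
        String apiUrl = "http://localhost:8000/v1/audio/transcriptions";
        String apiKey = ""; // only needed for services that require authentication

        HttpClient client = new HttpClient();
        client.start();
        try {
            // Same multipart fields the add-on sends in API mode.
            MultiPartContentProvider multiPart = new MultiPartContentProvider();
            multiPart.addFilePart("file", "audio.wav", new PathContentProvider(Path.of("audio.wav")), null);
            multiPart.addFieldPart("model", new StringContentProvider("whisper-1"), null);
            multiPart.addFieldPart("response_format", new StringContentProvider("text"), null);
            multiPart.addFieldPart("temperature", new StringContentProvider("0"), null);
            multiPart.addFieldPart("language", new StringContentProvider("en"), null);
            multiPart.close();

            Request request = client.newRequest(apiUrl).method(HttpMethod.POST).content(multiPart);
            if (!apiKey.isBlank()) {
                request = request.header("Authorization", "Bearer " + apiKey);
            }
            ContentResponse response = request.send();
            System.out.println(response.getStatus() + ": " + response.getContentAsString());
        } finally {
            client.stop();
        }
    }
}
```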