Support Web Speech API #661

Open: wants to merge 4 commits into base v2-dev
Changes from 1 commit
4 changes: 2 additions & 2 deletions pages/info/debug.tsx
@@ -20,7 +20,7 @@ import { ROUTE_APP_CHAT, ROUTE_INDEX } from '~/common/app.routes';
import { incrementalNewsVersion, useAppNewsStateStore } from '../../src/apps/news/news.version';

// capabilities access
-import { useCapabilityBrowserSpeechRecognition, useCapabilityElevenLabs, useCapabilityTextToImage } from '~/common/components/useCapabilities';
+import { useCapabilityBrowserSpeechRecognition, useVoiceCapability, useCapabilityTextToImage } from '~/common/components/useCapabilities';

// stores access
import { getLLMsDebugInfo } from '~/common/stores/llms/store-llms';
@@ -96,7 +96,7 @@ function AppDebug() {
const cProduct = {
capabilities: {
mic: useCapabilityBrowserSpeechRecognition(),
-elevenLabs: useCapabilityElevenLabs(),
+elevenLabs: useVoiceCapability(),
textToImage: useCapabilityTextToImage(),
},
models: getLLMsDebugInfo(),
4 changes: 2 additions & 2 deletions src/apps/call/CallWizard.tsx
@@ -12,7 +12,7 @@ import WarningRoundedIcon from '@mui/icons-material/WarningRounded';
import { animationColorRainbow } from '~/common/util/animUtils';
import { navigateBack } from '~/common/app.routes';
import { optimaOpenPreferences } from '~/common/layout/optima/useOptima';
-import { useCapabilityBrowserSpeechRecognition, useCapabilityElevenLabs } from '~/common/components/useCapabilities';
+import { useCapabilityBrowserSpeechRecognition, useVoiceCapability } from '~/common/components/useCapabilities';
import { useChatStore } from '~/common/stores/chat/store-chats';
import { useUICounter } from '~/common/state/store-ui';

@@ -45,7 +45,7 @@ export function CallWizard(props: { strict?: boolean, conversationId: string | n

// external state
const recognition = useCapabilityBrowserSpeechRecognition();
-const synthesis = useCapabilityElevenLabs();
+const synthesis = useVoiceCapability();
const chatIsEmpty = useChatStore(state => {
if (!props.conversationId)
return false;
17 changes: 13 additions & 4 deletions src/apps/call/Telephone.tsx
@@ -13,7 +13,7 @@ import { ScrollToBottom } from '~/common/scroll-to-bottom/ScrollToBottom';
import { ScrollToBottomButton } from '~/common/scroll-to-bottom/ScrollToBottomButton';
import { useChatLLMDropdown } from '../chat/components/layout-bar/useLLMDropdown';

-import { EXPERIMENTAL_speakTextStream } from '~/modules/elevenlabs/elevenlabs.client';
+import { EXPERIMENTAL_speakTextStream } from '~/common/components/useVoiceCapabilities';
import { SystemPurposeId, SystemPurposes } from '../../data';
import { llmStreamingChatGenerate, VChatMessageIn } from '~/modules/llms/llm.client';
import { useElevenLabsVoiceDropdown } from '~/modules/elevenlabs/useElevenLabsVoiceDropdown';
@@ -245,13 +245,22 @@ export function Telephone(props: {
// perform completion
responseAbortController.current = new AbortController();
let finalText = '';
+let currentSentence = '';
let error: any | null = null;
setPersonaTextInterim('💭...');
llmStreamingChatGenerate(chatLLMId, callPrompt, 'call', callMessages[0].id, null, null, responseAbortController.current.signal, ({ textSoFar }) => {
const text = textSoFar?.trim();
if (text) {
-finalText = text;
setPersonaTextInterim(text);
+
+// Maintain and say the current sentence
Owner

I love this.

+if (/[.,!?]$/.test(text)) {
+currentSentence = text.substring(finalText?.length)
+finalText = text
+if (currentSentence?.length >= 1)
+void EXPERIMENTAL_speakTextStream(currentSentence, personaVoiceId);
+}
+currentSentence = text.substring(finalText?.length) // to be added to the final text
}
}).catch((err: DOMException) => {
if (err?.name !== 'AbortError')
@@ -261,8 +270,8 @@ export function Telephone(props: {
if (finalText || error)
setCallMessages(messages => [...messages, createDMessageTextContent('assistant', finalText + (error ? ` (ERROR: ${error.message || error.toString()})` : ''))]); // [state] append assistant:call_response
// fire/forget
-if (finalText?.length >= 1)
-void EXPERIMENTAL_speakTextStream(finalText, personaVoiceId);
+if (currentSentence?.length >= 1)
+void EXPERIMENTAL_speakTextStream(currentSentence, personaVoiceId);
});

return () => {
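
The streaming change in the Telephone.tsx hunks speaks each punctuation-terminated chunk as soon as it arrives and leaves the unpunctuated tail for the end of the stream. Below is a minimal standalone sketch of that bookkeeping. `SentenceChunker`, `push`, and `flush` are hypothetical names that do not appear in the PR. Unlike the diff, the sketch tracks a separate spoken offset instead of reusing `finalText`, which is an assumption made so the full transcript text stays untouched.

```typescript
// Hypothetical helper (not part of the PR): remembers how much of the
// cumulative streamed text has already been handed to the TTS engine.
class SentenceChunker {
  private spokenLength = 0;

  /** Call with the full text so far; returns a newly completed chunk, or null. */
  push(textSoFar: string): string | null {
    const text = textSoFar.trim();
    // Same boundary test as the diff: the text currently ends with punctuation.
    if (!/[.,!?]$/.test(text)) return null;
    return this.take(text);
  }

  /** Call once when the stream ends; returns whatever was never spoken, or null. */
  flush(finalText: string): string | null {
    return this.take(finalText.trim());
  }

  private take(text: string): string | null {
    const chunk = text.substring(this.spokenLength).trim();
    this.spokenLength = text.length;
    return chunk.length >= 1 ? chunk : null;
  }
}

// Usage: feed cumulative snapshots, as a streaming callback would.
const chunker = new SentenceChunker();
const spoken: string[] = [];
for (const snapshot of ['Hello', 'Hello, ', 'Hello, wor', 'Hello, world.', 'Hello, world. Bye']) {
  const chunk = chunker.push(snapshot);
  if (chunk) spoken.push(chunk);
}
const tail = chunker.flush('Hello, world. Bye');
if (tail) spoken.push(tail);
```

Keeping the spoken offset apart from the accumulated text means the assistant message appended at the end of the stream can still use the complete text, while only the unspoken remainder is sent to the speech engine.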
2 changes: 1 addition & 1 deletion src/apps/chat/AppChat.tsx
@@ -10,7 +10,7 @@ import { FlattenerModal } from '~/modules/aifn/flatten/FlattenerModal';
import { TradeConfig, TradeModal } from '~/modules/trade/TradeModal';
import { downloadSingleChat, importConversationsFromFilesAtRest, openConversationsAtRestPicker } from '~/modules/trade/trade.client';
import { imaginePromptFromTextOrThrow } from '~/modules/aifn/imagine/imaginePromptFromText';
-import { speakText } from '~/modules/elevenlabs/elevenlabs.client';
+import { speakText } from '~/common/components/useVoiceCapabilities';
import { useAreBeamsOpen } from '~/modules/beam/store-beam.hooks';
import { useCapabilityTextToImage } from '~/modules/t2i/t2i.client';

4 changes: 2 additions & 2 deletions src/apps/chat/components/ChatMessageList.tsx
@@ -19,7 +19,7 @@ import { getConversation, useChatStore } from '~/common/stores/chat/store-chats'
import { openFileForAttaching } from '~/common/components/ButtonAttachFiles';
import { optimaOpenPreferences } from '~/common/layout/optima/useOptima';
import { useBrowserTranslationWarning } from '~/common/components/useIsBrowserTranslating';
-import { useCapabilityElevenLabs } from '~/common/components/useCapabilities';
+import { useVoiceCapability } from '~/common/components/useCapabilities';
import { useChatOverlayStore } from '~/common/chat-overlay/store-chat-overlay';
import { useScrollToBottom } from '~/common/scroll-to-bottom/useScrollToBottom';

@@ -75,7 +75,7 @@ export function ChatMessageList(props: {
_composerInReferenceToCount: state.inReferenceTo?.length ?? 0,
ephemerals: state.ephemerals?.length ? state.ephemerals : null,
})));
-const { mayWork: isSpeakable } = useCapabilityElevenLabs();
+const { mayWork: isSpeakable } = useVoiceCapability();

// derived state
const { conversationHandler, conversationId, capabilityHasT2I, onConversationBranch, onConversationExecuteHistory, onTextDiagram, onTextImagine, onTextSpeak } = props;
20 changes: 20 additions & 0 deletions src/apps/chat/store-app-chat.ts
@@ -1,6 +1,7 @@
import { create } from 'zustand';
import { persist } from 'zustand/middleware';
import { useShallow } from 'zustand/react/shallow';
+import { ASREngineList, TTSEngineList } from '~/common/components/useVoiceCapabilities';

import type { DLLMId } from '~/common/stores/llms/llms.types';

@@ -51,6 +52,12 @@ interface AppChatStore {
micTimeoutMs: number;
setMicTimeoutMs: (micTimeoutMs: number) => void;

+TTSEngine: string;
Owner

For now this could be TTSEngine: 'elevenlabs' | 'webspeech', to force TypeScript to do its job.

+setTTSEngine: (TTSEngine: string) => void;
+
+ASREngine: string;
+setASREngine: (ASREngine: string) => void;
+
showPersonaIcons: boolean;
setShowPersonaIcons: (showPersonaIcons: boolean) => void;

@@ -114,6 +121,12 @@ const useAppChatStore = create<AppChatStore>()(persist(
micTimeoutMs: 2000,
setMicTimeoutMs: (micTimeoutMs: number) => _set({ micTimeoutMs }),

+TTSEngine: TTSEngineList[0],
Owner

If TTSEngine is typed as 'elevenlabs' | 'webspeech', then this becomes one of the two values (probably 'webspeech' by default). The conversion to a nice string can then be done in the settings UI, and in the code we only match against those IDs.

As an alternative, this could be left undefined, and the UI will decide what to use every time unless the user makes a choice; undefined would default to 'webspeech'.

+setTTSEngine: (TTSEngine: string) => _set({ TTSEngine }),
+
+ASREngine: ASREngineList[0],
Owner

Same here: we could keep undefined as the stored value and hardcode 'webspeech' as the ID, so we can fall back to that as autodetect.

+setASREngine: (ASREngine: string) => _set({ ASREngine }),
+
showPersonaIcons: true,
setShowPersonaIcons: (showPersonaIcons: boolean) => _set({ showPersonaIcons }),
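
The review comments on this hunk suggest storing a typed engine ID, leaving it undefined until the user chooses, and mapping IDs to display labels only in the settings UI. A minimal sketch of that idea follows, kept free of zustand so it stands alone. The names `TTSEngineId`, `FALLBACK_TTS_ENGINE`, `TTS_ENGINE_LABELS`, `VoiceEngineState`, `resolveTTSEngine`, and `ttsEngineLabel` are assumptions for illustration and are not in the PR.

```typescript
// Hypothetical names throughout; only the two IDs come from the review comments.
type TTSEngineId = 'elevenlabs' | 'webspeech';

// Assumed fallback when the user has not chosen an engine yet.
const FALLBACK_TTS_ENGINE: TTSEngineId = 'webspeech';

// Display strings live apart from the IDs, so code only ever compares IDs.
const TTS_ENGINE_LABELS: Record<TTSEngineId, string> = {
  elevenlabs: 'ElevenLabs',
  webspeech: 'Web Speech API',
};

// The persisted slice keeps `undefined` until the user picks an engine.
interface VoiceEngineState {
  ttsEngine?: TTSEngineId;
}

// Resolve the effective engine at read time rather than at store creation.
function resolveTTSEngine(state: VoiceEngineState): TTSEngineId {
  return state.ttsEngine ?? FALLBACK_TTS_ENGINE;
}

// The settings UI converts the ID to a label when rendering.
function ttsEngineLabel(state: VoiceEngineState): string {
  return TTS_ENGINE_LABELS[resolveTTSEngine(state)];
}
```

With a union type, a comparison against a misspelled ID such as 'Elevenlabs' is reported by the compiler instead of silently evaluating to false at runtime.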

@@ -198,6 +211,13 @@ export const useChatMicTimeoutMsValue = (): number =>
export const useChatMicTimeoutMs = (): [number, (micTimeoutMs: number) => void] =>
useAppChatStore(useShallow(state => [state.micTimeoutMs, state.setMicTimeoutMs]));

+export const useTTSEngine = (): [string, (micTimeoutMs: string) => void] =>
+useAppChatStore(useShallow(state => [state.TTSEngine, state.setTTSEngine]));
+export const getTTSEngine = () => useAppChatStore.getState().TTSEngine;
+
+export const useASREngine = (): [string, (micTimeoutMs: string) => void] =>
+useAppChatStore(useShallow(state => [state.ASREngine, state.setASREngine]));
+
export const useChatDrawerFilters = () => {
const values = useAppChatStore(useShallow(state => ({
filterHasDocFragments: state.filterHasDocFragments,
12 changes: 10 additions & 2 deletions src/apps/settings-modal/SettingsModal.tsx
@@ -22,6 +22,9 @@ import { AppChatSettingsAI } from './AppChatSettingsAI';
import { AppChatSettingsUI } from './settings-ui/AppChatSettingsUI';
import { UxLabsSettings } from './UxLabsSettings';
import { VoiceSettings } from './VoiceSettings';
+import { BrowserSpeechSettings } from '~/modules/browser/speech-synthesis/BrowserSpeechSettings';
+
+import { useTTSEngine } from 'src/apps/chat/store-app-chat';


// styled <AccordionGroup variant='plain'> into a Topics component
@@ -122,6 +125,8 @@ export function SettingsModal(props: {
// external state
const isMobile = useIsMobile();

+const [TTSEngine] = useTTSEngine()
+
// handlers

const { setTab } = props;
@@ -193,9 +198,12 @@ export function SettingsModal(props: {
<Topic icon='🎙️' title='Voice settings'>
<VoiceSettings />
</Topic>
-<Topic icon='📢' title='ElevenLabs API'>
+{TTSEngine === 'Elevenlabs' && <Topic icon='📢' title='ElevenLabs API'>
<ElevenlabsSettings />
-</Topic>
+</Topic>}
+{TTSEngine === 'Web Speech API' && <Topic icon='📢' title='Web Speech API'>
+<BrowserSpeechSettings />
+</Topic>}
</Topics>
</TabPanel>

27 changes: 22 additions & 5 deletions src/apps/settings-modal/VoiceSettings.tsx
@@ -2,24 +2,25 @@ import * as React from 'react';

import { FormControl } from '@mui/joy';

import { useChatAutoAI, useChatMicTimeoutMs } from '../chat/store-app-chat';
import { useASREngine, useChatAutoAI, useChatMicTimeoutMs, useTTSEngine } from '../chat/store-app-chat';


import { useElevenLabsVoices } from '~/modules/elevenlabs/useElevenLabsVoiceDropdown';

import { FormLabelStart } from '~/common/components/forms/FormLabelStart';
import { FormRadioControl } from '~/common/components/forms/FormRadioControl';
import { LanguageSelect } from '~/common/components/LanguageSelect';
import { useIsMobile } from '~/common/components/useMatchMedia';

import { hasVoices, ASREngineList, TTSEngineList } from '~/common/components/useVoiceCapabilities';

export function VoiceSettings() {

// external state
const isMobile = useIsMobile();
const { autoSpeak, setAutoSpeak } = useChatAutoAI();
const { hasVoices } = useElevenLabsVoices();
const [chatTimeoutMs, setChatTimeoutMs] = useChatMicTimeoutMs();

const [chatTimeoutMs, setChatTimeoutMs] = useChatMicTimeoutMs();
const [TTSEngine, setTTSEngine ] = useTTSEngine();
const [ASREngine, setASREngine ] = useASREngine();

// this converts from string keys to numbers and vice versa
const chatTimeoutValue: string = '' + chatTimeoutMs;
@@ -59,5 +60,21 @@ export function VoiceSettings() {
value={autoSpeak} onChange={setAutoSpeak}
/>

+<FormRadioControl
+title='TTS engine'
+description='Text to speech'
+tooltip=''
+options={TTSEngineList.map((i) => ({ value: i, label: i }))}
+value={TTSEngine} onChange={setTTSEngine}
+/>
+
+<FormRadioControl
+title='ASR engine'
+description='Automatic Speech Recognition'
+tooltip=''
+options={ASREngineList.map((i) => ({ value: i, label: i }))}
+value={ASREngine} onChange={setASREngine}
+/>
+
</>;
}
6 changes: 3 additions & 3 deletions src/common/components/useCapabilities.ts
@@ -22,15 +22,15 @@ export interface CapabilityBrowserSpeechRecognition {
export { browserSpeechRecognitionCapability as useCapabilityBrowserSpeechRecognition } from './useSpeechRecognition';


-/// Speech Synthesis: ElevenLabs
+/// Speech Synthesis

-export interface CapabilityElevenLabsSpeechSynthesis {
+export interface CapabilitySpeechSynthesis {
mayWork: boolean;
isConfiguredServerSide: boolean;
isConfiguredClientSide: boolean;
}

-export { useCapability as useCapabilityElevenLabs } from '~/modules/elevenlabs/elevenlabs.client';
+export { useCapability as useVoiceCapability } from '~/common/components/useVoiceCapabilities';


/// Image Generation
74 changes: 74 additions & 0 deletions src/common/components/useVoiceCapabilities.ts
@@ -0,0 +1,74 @@
import { getTTSEngine } from 'src/apps/chat/store-app-chat';
import { CapabilitySpeechSynthesis } from '~/common/components/useCapabilities';

import { useCapability as useElevenlabsCapability } from '~/modules/elevenlabs/elevenlabs.client'
import { speakText as elevenlabsSpeakText } from '~/modules/elevenlabs/elevenlabs.client'
import { EXPERIMENTAL_speakTextStream as EXPERIMENTAL_elevenlabsSpeakTextStream } from '~/modules/elevenlabs/elevenlabs.client'

import { useCapability as useBrowserSpeechSynthesisCapability } from '~/modules/browser/speech-synthesis/browser.speechSynthesis.client'
import { speakText as browserSpeechSynthesisSpeakText } from '~/modules/browser/speech-synthesis/browser.speechSynthesis.client'
import { EXPERIMENTAL_speakTextStream as EXPERIMENTAL_browserSpeechSynthesisSpeakTextStream } from '~/modules/browser/speech-synthesis/browser.speechSynthesis.client'

import { useElevenLabsVoices } from '~/modules/elevenlabs/useElevenLabsVoiceDropdown';
import { useBrowserSpeechVoices } from '~/modules/browser/speech-synthesis/useBrowserSpeechVoiceDropdown';

export const TTSEngineList: string[] = [
Owner

See other note on this. I wonder if this would be better served as a map:

export type TTSEngineKey = 'elevenlabs' | 'webspeech';

const TTSEngineList: { [key in TTSEngineKey]: { label: string, priority: number } } = {
  'elevenlabs': {
    label: 'ElevenLabs',
    priority: 2,
  },
  'webspeech': {
    label: 'Web Speech API',
    priority: 1,
  },
};

I've added an attribute called 'priority' to show how we can extend it in the future, for instance with default configurations.

Author

After trying it, I think using a map for extensibility is a good idea, but I suggest a list of entries instead:

export type TTSEngineKey = 'Elevenlabs' | 'Web Speech API';
export type ASREngineKey = 'Web Speech API';

export const TTSEngineList: { key: TTSEngineKey; label: string }[] = [
  {
    key: 'Elevenlabs',
    label: 'ElevenLabs',
  },
  {
    key: 'Web Speech API',
    label: 'Web Speech API',
  },
];

This is because the variable is currently used in only two places: the <FormRadioControl> in VoiceSettings.tsx and store-app-chat.ts. <FormRadioControl> needs to iterate over the engines, and store-app-chat.ts needs a default value. A map does not give us either of these directly.

'Elevenlabs',
'Web Speech API'
]

export const ASREngineList: string[] = [
'Web Speech API'
]
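
For comparison with the two shapes discussed in the review thread above (a keyed map versus a list of entries), here is a small sketch showing how a keyed record could still feed a radio control and yield a default. It is an illustration under assumptions, not the PR's code: `TTS_ENGINES`, `ttsEngineOptions`, and `defaultTTSEngine` are hypothetical names, and treating the lowest `priority` number as the default is a guess at what that attribute is meant to express.

```typescript
// IDs and labels follow the Owner's suggested map; everything else is assumed.
type TTSEngineKey = 'elevenlabs' | 'webspeech';

interface TTSEngineInfo {
  label: string;
  priority: number;
}

const TTS_ENGINES: Record<TTSEngineKey, TTSEngineInfo> = {
  elevenlabs: { label: 'ElevenLabs', priority: 2 },
  webspeech: { label: 'Web Speech API', priority: 1 },
};

// Object.keys returns string[], so the cast narrows it back to the key union.
const ttsEngineKeys = Object.keys(TTS_ENGINES) as TTSEngineKey[];

// Iteration: build { value, label } options for a radio control.
const ttsEngineOptions = ttsEngineKeys.map((key) => ({
  value: key,
  label: TTS_ENGINES[key].label,
}));

// Default: pick the key with the lowest priority number (an assumed convention).
const defaultTTSEngine = ttsEngineKeys.reduce((best, key) =>
  TTS_ENGINES[key].priority < TTS_ENGINES[best].priority ? key : best,
);
```

The list form proposed by the Author gets iteration and a first-element default for free; the record form needs these two small helpers but makes lookups by ID direct.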

export function getConditionalVoices(){
const TTSEngine = getTTSEngine();
if (TTSEngine === 'Elevenlabs') {
return useElevenLabsVoices
}else if (TTSEngine === 'Web Speech API') {
return useBrowserSpeechVoices
}
throw new Error('TTSEngine is not found');
}

export function hasVoices(): boolean {
console.log('getConditionalVoices', getConditionalVoices()().hasVoices)
return getConditionalVoices()().hasVoices;
}

export function getConditionalCapability(): () => CapabilitySpeechSynthesis {
const TTSEngine = getTTSEngine();
if (TTSEngine === 'Elevenlabs') {
return useElevenlabsCapability
}else if (TTSEngine === 'Web Speech API') {
return useBrowserSpeechSynthesisCapability
}
throw new Error('TTSEngine is not found');
}

export function useCapability(): CapabilitySpeechSynthesis {
Owner

The crash has been traced to this hook (the one that gave the black screen in the screenshot). It seems that there's a React out-of-order issue when switching providers; I believe it happens only when switching TTS providers.

Owner

[Image: screenshot]

Owner

It's possible that, to fix this properly, we may have to overhaul the TTS engine reactivity (hooks).

return getConditionalCapability()();
}
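
If the crash discussed in the comments above does come from React's Rules of Hooks (hooks must be called in the same order on every render), one conventional workaround is to call every engine hook unconditionally and select the result afterwards. The sketch below illustrates only that call structure, under that assumption. The `...Stub` functions and `useVoiceCapabilitySketch` are hypothetical stand-ins, not the real hooks, and in the app the engine would presumably need to come from a reactive hook such as `useTTSEngine` rather than from `getTTSEngine()`.

```typescript
interface CapabilitySpeechSynthesis {
  mayWork: boolean;
  isConfiguredServerSide: boolean;
  isConfiguredClientSide: boolean;
}

type TTSEngineName = 'Elevenlabs' | 'Web Speech API';

// Records the order in which the stand-in hooks run, for illustration only.
const hookCallLog: string[] = [];

// Stand-ins for the two engine hooks; the returned values are made up.
function useElevenlabsCapabilityStub(): CapabilitySpeechSynthesis {
  hookCallLog.push('elevenlabs');
  return { mayWork: false, isConfiguredServerSide: false, isConfiguredClientSide: false };
}

function useBrowserSpeechCapabilityStub(): CapabilitySpeechSynthesis {
  hookCallLog.push('webspeech');
  return { mayWork: true, isConfiguredServerSide: false, isConfiguredClientSide: true };
}

function useVoiceCapabilitySketch(engine: TTSEngineName): CapabilitySpeechSynthesis {
  // Call every hook on every render, always in the same order...
  const elevenlabs = useElevenlabsCapabilityStub();
  const browser = useBrowserSpeechCapabilityStub();
  // ...and only then choose which result to return.
  return engine === 'Elevenlabs' ? elevenlabs : browser;
}

// Usage: switching engines changes the result but not the sequence of hook calls.
const capabilityForWebSpeech = useVoiceCapabilitySketch('Web Speech API');
const capabilityForElevenlabs = useVoiceCapabilitySketch('Elevenlabs');
```

The trade-off is that both engines' hooks run even when only one engine is selected, which is usually acceptable when the hooks are cheap.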


export async function speakText(text: string, voiceId?: string) {
const TTSEngine = getTTSEngine();
if (TTSEngine === 'Elevenlabs') {
return await elevenlabsSpeakText(text, voiceId);
Owner

This is one way of doing it. I wonder if we should have a common interface that all the TTS providers implement (e.g. OpenAI has two as well, one via their TTS and one in the new [audio] models; there's also Play.ht and more).

A way to reduce the switch-cases (or ifs) would be to have a common interface, such as ISpeechSynthesis, returned by getTTSEngine (basically, instead of the string, it would return an object that implements the interface).

Just a thought.

}else if (TTSEngine === 'Web Speech API') {
return await browserSpeechSynthesisSpeakText(text, voiceId);
}
throw new Error('TTSEngine is not found');
}
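
The common-interface idea from the review comment inside `speakText` could look roughly like the sketch below. Only the name `ISpeechSynthesis` comes from the comment; its members, the registry `ttsEngines`, `createFakeEngine`, `selectedEngineId`, and `getTTSEngineImpl` are assumptions for illustration, and the fake engines log text instead of producing audio.

```typescript
// Assumed shape of the interface; the review comment only names it.
interface ISpeechSynthesis {
  readonly label: string;
  speakText(text: string, voiceId?: string): Promise<void>;
}

type TTSEngineId = 'elevenlabs' | 'webspeech';

// Fake engines record what they were asked to say instead of playing audio.
const spokenLog: string[] = [];

function createFakeEngine(label: string): ISpeechSynthesis {
  return {
    label,
    async speakText(text, voiceId) {
      spokenLog.push(`${label}:${voiceId ?? 'default'}:${text}`);
    },
  };
}

// One registry entry per provider; adding a provider adds an entry, not a branch.
const ttsEngines: Record<TTSEngineId, ISpeechSynthesis> = {
  elevenlabs: createFakeEngine('ElevenLabs'),
  webspeech: createFakeEngine('Web Speech API'),
};

// Stand-in for the store value; here it is just a module-level variable.
let selectedEngineId: TTSEngineId = 'webspeech';

// Returns the engine object rather than a string, as the comment proposes.
function getTTSEngineImpl(): ISpeechSynthesis {
  return ttsEngines[selectedEngineId];
}

// The dispatcher no longer needs an if/else chain or a "not found" error.
async function speakText(text: string, voiceId?: string): Promise<void> {
  await getTTSEngineImpl().speakText(text, voiceId);
}

// Usage: the same call reaches whichever engine is selected.
void speakText('hi');
selectedEngineId = 'elevenlabs';
void speakText('there', 'v1');
```

Because the registry is typed as `Record<TTSEngineId, ISpeechSynthesis>`, forgetting to register an engine for a new ID becomes a compile-time error rather than a thrown 'TTSEngine is not found'.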

// let liveAudioPlayer: LiveAudioPlayer | undefined = undefined;

export async function EXPERIMENTAL_speakTextStream(text: string, voiceId?: string) {
const TTSEngine = getTTSEngine();
if (TTSEngine === 'Elevenlabs') {
return await EXPERIMENTAL_elevenlabsSpeakTextStream(text, voiceId);
}else if (TTSEngine === 'Web Speech API') {
return await EXPERIMENTAL_browserSpeechSynthesisSpeakTextStream(text, voiceId);
}
throw new Error('TTSEngine is not found');
}