A heavily simplified Vega clone.

Demo Video

Note that it's far from perfect; it is mainly a proof of concept :)

https://drive.google.com/file/d/1kgmrHM6PUAV_y5KIn9hWwr2nyswX4YXk/view?usp=sharing

Abilities:

Only responds after hearing the wake phrase ("Hey Computer...")
Answer questions
Take a picture
Add/remove todos
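
The wake-phrase gate can be sketched as a small function over the transcribed text. This is a minimal sketch, not the repo's actual code: the name `extractCommand`, the regular expression, and the choice to only accept the wake phrase at the start of an utterance are all assumptions made for illustration.

```typescript
// Hypothetical pattern: "hey computer" at the start of the utterance,
// tolerating a comma between the words and trailing punctuation.
const WAKE_PATTERN = /^\s*hey[\s,]+computer\b[\s,.!?:-]*/i;

// Returns the text that follows the wake phrase, or null when the
// utterance does not start with it and should be ignored.
function extractCommand(transcript: string): string | null {
  const match = WAKE_PATTERN.exec(transcript);
  if (match === null) {
    return null;
  }
  return transcript.slice(match[0].length).trim();
}
```

An utterance such as "Hey Computer, take a picture" would yield the command text, while anything without the wake phrase yields `null` and never reaches the reasoning model.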

STT and Reasoning

The "distil-whisper-large-v3-en" model powers speech-to-text (STT) for the application. The incoming audio stream is segmented using the Silero VAD model and web-vad before being sent to Groq for transcription. The transcribed text is then handled by the "llama-3.1-70b-versatile" model on Groq. The model is told which abilities it has (take a picture, etc.) and is instructed to emit specific tokens (e.g. <TAKE_PICTURE>) when the user requests an action.
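
To illustrate the token approach, here is a rough sketch of how a model reply could be split into actions and plain text. Only <TAKE_PICTURE> is named above; the `ADD_TODO` and `REMOVE_TODO` tokens, the `parseReply` function, and the `ParsedReply` shape are hypothetical and may not match what the app actually does.

```typescript
// Hypothetical action tokens; only TAKE_PICTURE is mentioned in this README.
type Action = "TAKE_PICTURE" | "ADD_TODO" | "REMOVE_TODO";

const KNOWN_ACTIONS: readonly string[] = ["TAKE_PICTURE", "ADD_TODO", "REMOVE_TODO"];

interface ParsedReply {
  actions: Action[]; // recognized tokens, in order of appearance
  text: string; // reply with recognized tokens stripped out
}

function parseReply(reply: string): ParsedReply {
  const actions: Action[] = [];
  const stripped = reply.replace(/<([A-Z_]+)>/g, (whole: string, name: string) => {
    if (KNOWN_ACTIONS.indexOf(name) !== -1) {
      actions.push(name as Action);
      return ""; // remove recognized action tokens from the visible text
    }
    return whole; // leave anything unrecognized untouched
  });
  // Collapse the whitespace left behind by removed tokens.
  return { actions, text: stripped.replace(/\s+/g, " ").trim() };
}
```

The caller can then run each action (for example, trigger the camera) and show or speak only the remaining text.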

Note that all calls to the Groq API are compartmentalized into separate functions. This was designed to allow easy swapping between model APIs, and potentially to local models.
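
One way this kind of separation can look is sketched below. The `ChatProvider` interface, `createGroqProvider`, and `answer` are hypothetical names, not the repo's actual functions. The sketch assumes Groq's OpenAI-compatible chat completions endpoint and a runtime with a global `fetch`.

```typescript
// Hypothetical shapes for illustration only.
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface ChatProvider {
  complete(messages: ChatMessage[]): Promise<string>;
}

// Assumption: Groq's OpenAI-compatible chat completions endpoint.
function createGroqProvider(apiKey: string, model: string): ChatProvider {
  return {
    async complete(messages: ChatMessage[]): Promise<string> {
      const res = await fetch("https://api.groq.com/openai/v1/chat/completions", {
        method: "POST",
        headers: {
          Authorization: `Bearer ${apiKey}`,
          "Content-Type": "application/json",
        },
        body: JSON.stringify({ model, messages }),
      });
      if (!res.ok) {
        throw new Error(`Groq request failed with status ${res.status}`);
      }
      const data = (await res.json()) as {
        choices: { message: { content: string } }[];
      };
      return data.choices[0].message.content;
    },
  };
}

// Application code depends only on ChatProvider, so a local model can be
// swapped in by writing another object with the same complete() method.
function answer(provider: ChatProvider, systemPrompt: string, userText: string): Promise<string> {
  return provider.complete([
    { role: "system", content: systemPrompt },
    { role: "user", content: userText },
  ]);
}
```

With this shape, switching to a different API or a local model only means supplying a different `ChatProvider`; the rest of the app stays the same.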

Haar Cascade Classifier

A Haar cascade classifier was trained to implement hand recognition. The HaGRID dataset was used for the positive (hand) images. The "Random Images for Image classification" dataset was used for the negative images.

OpenCV was used for the training process. OpenCV.js was used in the initial implementation but was later removed because of the model's poor accuracy.

Model Accuracy: bad

Post mortem: After training the model and testing it on the live video feed, the accuracy was poor. I believe this was due to an insufficient number of negative images. The ratio of positive to negative images was 4:1, which I believe left the model with a poor understanding of what is not a hand. This showed up during testing as the model frequently classifying random items in the background as hands.

If I were to train this model again, I would greatly increase the number of negative images and also make them more relevant to the camera feed. Many of the negative images were extremely unlikely to appear in the live camera feed (e.g. random nature pictures). A dataset of people sitting in front of their cameras without showing their hands would be a better negative image dataset.
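
As a rough planning aid for a retrain, the arithmetic behind rebalancing the dataset can be sketched like this. The function name `negativesNeeded` is hypothetical, and the target ratio is a parameter to experiment with, not a known-good value. The image counts in the comment are made up and only chosen to match the 4:1 ratio.

```typescript
// How many more negative images are needed to reach a target number of
// negatives per positive image. All names here are illustrative.
function negativesNeeded(
  positives: number,
  currentNegatives: number,
  targetNegativesPerPositive: number
): number {
  const required = Math.ceil(positives * targetNegativesPerPositive);
  return Math.max(0, required - currentNegatives);
}

// Made-up counts: 4000 positives and 1000 negatives is a 4:1 ratio.
// Reaching 2 negatives per positive would take 8000 negatives in total.
const extraNegatives = negativesNeeded(4000, 1000, 2);
```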

Running

Clone the repo
Add a GROQ_API_KEY to a .env file. Keys can be created at https://console.groq.com/keys
npm install
npm run dev
Talk to the computer with "Hey Computer..."
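
If the app fails right after startup, a missing key is a likely cause. The .env file needs a line of the form `GROQ_API_KEY=<your key>`. Below is a minimal sketch of a fail-fast check. `requireEnv` is a hypothetical helper and not part of the repo, and the usage line in the comment assumes Next.js has loaded .env from the project root into `process.env`.

```typescript
// Hypothetical helper: returns the named variable or throws a clear error.
function requireEnv(name: string, env: Record<string, string | undefined>): string {
  const value = env[name];
  if (value === undefined || value.trim() === "") {
    throw new Error(`Missing ${name}. Add it to the .env file in the project root.`);
  }
  return value.trim();
}

// Assumed server-side usage:
// const apiKey = requireEnv("GROQ_API_KEY", process.env);
```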

