Merge pull request #87 from homebrewltd/eckartal-patch-11
Update llama-learns-to-talk.mdx
bachvudinh authored Oct 7, 2024
2 parents 82e9c73 + 3bb5f14 commit 1f8d5f8
Showing 1 changed file with 15 additions and 23 deletions.
38 changes: 15 additions & 23 deletions src/pages/blog/llama-learns-to-talk.mdx
@@ -1,7 +1,7 @@
---
title: "🍓 Ichigo: Llama Learns to Talk"
title: "🍓 Ichigo: Llama learns to talk"
authorURL: https://twitter.com/homebrewltd
description: Homebrew research shares llama3-s, an open & ongoing research experiment to teach llama3 to listen.
description: "With the latest checkpoint, we're teaching AI to talk and recognize when it can't listen."
tags: llama3, multimodal, llm, "ai training", speech
categories: research
ogImage: assets/images/og/llama-evolved.png
@@ -19,15 +19,15 @@ import ResearchCTABlog from '@/components/Blog/ResearchCTABlog'

<BlogAuthors authors={["Alan Dao", "Rex Ha", "Bach Vu"]}/>

<Callout>
Homebrew’s early-fusion speech model has evolved. Meet 🍓 Ichigo  - the latest llama3-s checkpoint.
<Callout emoji="🍓">
Our homebrewed early-fusion speech model has evolved. Meet 🍓 Ichigo - the latest llama3-s checkpoint in our ongoing experiment to teach Llama3 to talk and to recognize when it can't understand speech.
</Callout>

Inspired by the [Chameleon](https://arxiv.org/pdf/2405.09818) and [Llama Herd](https://arxiv.org/pdf/2407.21783) papers, llama3-s is an early-fusion, audio and text, multimodal model. We're conducting this research entirely in the open, with an open-source [codebase](https://github.com/homebrewltd/llama3-s), [open data](https://huggingface.co/datasets/homebrewltd/instruction-speech-v1.5) and [open weights](https://huggingface.co/homebrewltd/llama3-s-2024-07-19).
Inspired by the [Chameleon](https://arxiv.org/pdf/2405.09818) and [Llama Herd](https://arxiv.org/pdf/2407.21783) papers, llama3-s (Ichigo) is an early-fusion, audio and text, multimodal model. We're conducting this research entirely in the open, with an open-source [codebase](https://github.com/homebrewltd/ichigo), [open data](https://huggingface.co/datasets/homebrewltd/instruction-speech-v1.5) and [open weights](https://huggingface.co/homebrewltd/llama3-s-2024-07-19).
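To make the early-fusion idea above concrete: the audio is discretized into "sound tokens" that share one vocabulary and one decoder with the text tokens. Below is a minimal, hypothetical sketch of that interleaving; the vocabulary sizes, special-token names, and helper functions are illustrative assumptions, not the project's actual implementation.

```python
# Hypothetical sketch of early fusion: audio is quantized into discrete
# "sound tokens" that live in the same vocabulary as text tokens, so a
# single decoder-only LLM attends over both modalities in one sequence.

TEXT_VOCAB_SIZE = 128_000      # assumed size of the base text vocabulary
NUM_SOUND_TOKENS = 512         # assumed codebook size of the audio quantizer

# Assumed special tokens marking the audio span (not the real token names).
SOUND_START = TEXT_VOCAB_SIZE + NUM_SOUND_TOKENS
SOUND_END = SOUND_START + 1

def sound_token_id(codebook_index: int) -> int:
    """Map an audio codebook index to an id appended after the text vocabulary."""
    assert 0 <= codebook_index < NUM_SOUND_TOKENS
    return TEXT_VOCAB_SIZE + codebook_index

def build_input_ids(text_ids: list[int], audio_codes: list[int]) -> list[int]:
    """Interleave a text prompt with a quantized audio clip into one sequence."""
    sound_ids = [sound_token_id(c) for c in audio_codes]
    return text_ids + [SOUND_START] + sound_ids + [SOUND_END]
```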

<br></br>

![Llama evolves into 🍓 Ichigo.](./_assets/ichigov0.3/ichigo.png)
![Llama learns to talk](./_assets/ichigov0.3/ichigo.png)
*Image generated by ChatGPT*

## Demo
@@ -39,15 +39,15 @@ Inspired by the [Chameleon](https://arxiv.org/pdf/2405.09818) and [Llama Herd](h
**You can try it for yourself:**

- Via our [self-hosted demo here](https://ichigo.homebrew.ltd/)*
- Via [Hugging Face demo here](https://huggingface.co/spaces/jan-hq/Llama3.1-s-v0.2-checkpoint-2024-08-20)
- Via [Hugging Face demo here](https://huggingface.co/spaces/jan-hq/Ichigo-llama3.1-s-instruct)
- [Build it from scratch](https://github.com/homebrewltd/llama3-s)
- [Download Ichigo family](https://huggingface.co/collections/homebrewltd/ichigo-66ffc7484ef31ec5596ef6d0)

**Inference may slow/queued due to shared compute on a single Nvidia RTX4090*
**Inference may be slow or queued due to shared compute on a single NVIDIA RTX 4090*

This post shares the methodology and results behind this latest checkpoint. As always, this is just the beginning, and we need your ideas to push this research further.

## Change log
## Changelog

From the [llama3-s-v0.2 checkpoint](https://huggingface.co/homebrewltd/llama3-s-instruct-v0.2), we identified several areas for improvement:

@@ -118,10 +118,8 @@ Spoiler alert: We recovered MMLU performance from 0.42 to **0.63**, reducing the
| --- | --- | --- | --- | --- | --- | --- |
| Test 1: Early Pretrain Recovery | 3,000 steps | 500k mixed |||| 0.55 |
| Test 2: Late Pretrain Recovery | Last | 500k mixed |||| 0.515 |
| Test 3: Late Pretrain Recovery with Transcription | Last | 500k mixed | ✅ | ✅ | ✅
(With transcription token) | 0.48 |
| Test 4: Extended Late Pretrain Recovery | Last | 1.89M mixed | ✅ | ✅ | ✅
(With transcription prompts) | 0.61 |
| Test 3: Late Pretrain Recovery with Transcription (With transcription token) | Last | 500k mixed | ✅ | ✅ | ✅ | 0.48 |
| Test 4: Extended Late Pretrain Recovery (With transcription prompts) | Last | 1.89M mixed | ✅ | ✅ | ✅ | 0.63 |

**Mixed training data between modalities:** We determined an optimal interleaving of different data types with 70% speech instruction prompts, 20% speech transcription prompts and 10% text-only prompts.
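As a rough illustration of that 70/20/10 mix, the sketch below samples each example's modality according to those proportions; the dataset names and the sampling scheme are assumptions for illustration, not the actual training pipeline.

```python
import random

# Target proportions from the post: 70% speech instruction prompts,
# 20% speech transcription prompts, 10% text-only prompts.
MIX = {
    "speech_instruction": 0.7,
    "speech_transcription": 0.2,
    "text_only": 0.1,
}

def sample_mixed_batch(datasets: dict[str, list], batch_size: int, seed: int = 0) -> list:
    """Draw a batch whose modality mix follows the target ratios in expectation."""
    rng = random.Random(seed)
    names, weights = list(MIX.keys()), list(MIX.values())
    batch = []
    for _ in range(batch_size):
        source = rng.choices(names, weights=weights, k=1)[0]
        batch.append(rng.choice(datasets[source]))
    return batch
```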

@@ -212,9 +210,7 @@ Beyond randomizing sound tokens for inaudible input, we also performed sequence
**AudioBench Eval**: [AudioBench](https://arxiv.org/abs/2406.16020) is a June 2024 benchmark designed to evaluate audio large language models (AudioLLMs). It measures speech capabilities, in addition to ASR, transcription, etc., through a compilation of many open datasets.
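As a rough sketch of how a 0-5 model-as-judge score like the ones reported below can be produced, the snippet assumes an OpenAI-style client and an illustrative judge prompt; it is not AudioBench's actual evaluation code.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative judge prompt; AudioBench's real prompt wording differs.
JUDGE_PROMPT = (
    "Rate the assistant's answer to the spoken instruction on a 0-5 scale, "
    "where 5 is fully correct and helpful. Reply with a single integer.\n\n"
    "Instruction (transcript): {instruction}\n"
    "Reference answer: {reference}\n"
    "Assistant answer: {answer}"
)

def judge_score(instruction: str, reference: str, answer: str) -> int:
    """Ask a GPT-4o judge for a 0-5 score of one model response."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            instruction=instruction, reference=reference, answer=answer)}],
    )
    return int(response.choices[0].message.content.strip())
```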

| Model Bench | [Open-hermes Instruction Audio](https://huggingface.co/datasets/AudioLLMs/openhermes_instruction_test)
(GPT-4-O judge 0:5) | [Alpaca Instruction Audio](https://huggingface.co/datasets/AudioLLMs/alpaca_audio_test)
(GPT-4-O judge 0:5) |
| Model Bench | [Open-hermes Instruction Audio](https://huggingface.co/datasets/AudioLLMs/openhermes_instruction_test) (GPT-4-O judge 0:5) | [Alpaca Instruction Audio](https://huggingface.co/datasets/AudioLLMs/alpaca_audio_test) (GPT-4-O judge 0:5) |
| --- | --- | --- |
| [Llama3.1-s-v2](https://huggingface.co/homebrewltd/llama3-s-instruct-v0.2) | 3.45 | 3.53 |
| [Ichigo-llama3.1-s v0.3-phase2 -cp7000](https://huggingface.co/homebrewltd/Ichigo-llama3.1-s-instruct-v0.3-phase-2) | 3.42 | 3.62 |
@@ -243,15 +239,11 @@ For now, our next steps are as follows:

| Task Type | v0.2 | v0.3 |
| --- | --- | --- |
| Speech Multi-turn | None | 140K samples: 2 turns
10K samples >= 4 turns |
| Speech Multi-turn | None | 140K samples: 2 turns, 10K samples >= 4 turns |
| Speech QA | 679K samples | 1.33M samples |
| Transcription | 250K samples
(Using a special token) | 400K samples
(6 different prompts) |
| Transcription | 250K samples (Using a special token) | 400K samples (6 different prompts) |
| Noise Audio | None | 8K samples |
| Text-only | None | 100K samples: multi-turn
50K samples: single turn |
| Text-only | None | 100K samples: multi-turn, 50K samples: single turn |

**Prompts used for transcription data**
