Merge pull request #87 from homebrewltd/eckartal-patch-11
Update llama-learns-to-talk.mdx
bachvudinh authored Oct 7, 2024
2 parents 82e9c73 + 3bb5f14 commit 1f8d5f8
Showing 1 changed file with 15 additions and 23 deletions.
38 changes: 15 additions & 23 deletions src/pages/blog/llama-learns-to-talk.mdx
@@ -1,7 +1,7 @@
---
title: "🍓 Ichigo: Llama Learns to Talk"
title: "🍓 Ichigo: Llama learns to talk"
authorURL: https://twitter.com/homebrewltd
description: Homebrew research shares llama3-s, an open & ongoing research experiment to teach llama3 to listen.
description: "With the latest checkpoint, we're teaching AI to talk and recognize when it can't listen."
tags: llama3, multimodal, llm, "ai training", speech
categories: research
ogImage: assets/images/og/llama-evolved.png
@@ -19,15 +19,15 @@ import ResearchCTABlog from '@/components/Blog/ResearchCTABlog'

<BlogAuthors authors={["Alan Dao", "Rex Ha", "Bach Vu"]}/>

<Callout>
Homebrew’s early-fusion speech model has evolved. Meet 🍓 Ichigo  - the latest llama3-s checkpoint.
<Callout emoji="🍓">
Our homebrewed early-fusion speech model has evolved. Meet 🍓 Ichigo - the latest llama3-s checkpoint in our ongoing experiment to teach Llama3 to talk and to recognize when it can't understand speech.
</Callout>

Inspired by the [Chameleon](https://arxiv.org/pdf/2405.09818) and [Llama Herd](https://arxiv.org/pdf/2407.21783) papers, llama3-s is an early-fusion, audio and text, multimodal model. We're conducting this research entirely in the open, with an open-source [codebase](https://github.com/homebrewltd/llama3-s), [open data](https://huggingface.co/datasets/homebrewltd/instruction-speech-v1.5) and [open weights](https://huggingface.co/homebrewltd/llama3-s-2024-07-19).
Inspired by the [Chameleon](https://arxiv.org/pdf/2405.09818) and [Llama Herd](https://arxiv.org/pdf/2407.21783) papers, llama3-s (Ichigo) is an early-fusion, audio and text, multimodal model. We're conducting this research entirely in the open, with an open-source [codebase](https://github.com/homebrewltd/ichigo), [open data](https://huggingface.co/datasets/homebrewltd/instruction-speech-v1.5) and [open weights](https://huggingface.co/homebrewltd/llama3-s-2024-07-19).
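To make the early-fusion idea above concrete: the audio is discretized into "sound tokens" that share one vocabulary and one decoder with the text tokens. Below is a minimal, hypothetical sketch of that interleaving; the vocabulary sizes, special-token names, and helper functions are illustrative assumptions, not the project's actual implementation.

```python
# Hypothetical sketch of early fusion: audio is quantized into discrete
# "sound tokens" that live in the same vocabulary as text tokens, so a
# single decoder-only LLM attends over both modalities in one sequence.

TEXT_VOCAB_SIZE = 128_000      # assumed size of the base text vocabulary
NUM_SOUND_TOKENS = 512         # assumed codebook size of the audio quantizer

# Assumed special tokens marking the audio span (not the real token names).
SOUND_START = TEXT_VOCAB_SIZE + NUM_SOUND_TOKENS
SOUND_END = SOUND_START + 1

def sound_token_id(codebook_index: int) -> int:
    """Map an audio codebook index to an id appended after the text vocabulary."""
    assert 0 <= codebook_index < NUM_SOUND_TOKENS
    return TEXT_VOCAB_SIZE + codebook_index

def build_input_ids(text_ids: list[int], audio_codes: list[int]) -> list[int]:
    """Interleave a text prompt with a quantized audio clip into one sequence."""
    sound_ids = [sound_token_id(c) for c in audio_codes]
    return text_ids + [SOUND_START] + sound_ids + [SOUND_END]
```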

<br></br>

![Llama evolves into 🍓 Ichigo.](./_assets/ichigov0.3/ichigo.png)
![Llama learns to talk](./_assets/ichigov0.3/ichigo.png)
*Image generated by ChatGPT*

## Demo
@@ -39,15 +39,15 @@ Inspired by the [Chameleon](https://arxiv.org/pdf/2405.09818) and [Llama Herd](h
**You can try it for yourself:**

- Via our [self-hosted demo here](https://ichigo.homebrew.ltd/)*
- Via [Hugging Face demo here](https://huggingface.co/spaces/jan-hq/Llama3.1-s-v0.2-checkpoint-2024-08-20)
- Via [Hugging Face demo here](https://huggingface.co/spaces/jan-hq/Ichigo-llama3.1-s-instruct)
- [Build it from scratch](https://github.com/homebrewltd/llama3-s)
- [Download Ichigo family](https://huggingface.co/collections/homebrewltd/ichigo-66ffc7484ef31ec5596ef6d0)

**Inference may slow/queued due to shared compute on a single Nvidia RTX4090*
**Inference may be slow or queued due to shared compute on a single NVIDIA RTX 4090*

This post shares the methodology and results behind this latest checkpoint. As always, this is just the beginning, and we need your ideas to push this research further.

## Change log
## Changelog

From the [llama3-s-v0.2 checkpoint](https://huggingface.co/homebrewltd/llama3-s-instruct-v0.2), we identified several areas for improvement:

@@ -118,10 +118,8 @@ Spoiler alert: We recovered MMLU performance from 0.42 to **0.63**, reducing the
| --- | --- | --- | --- | --- | --- | --- |
| Test 1: Early Pretrain Recovery | 3,000 steps | 500k mixed |||| 0.55 |
| Test 2: Late Pretrain Recovery | Last | 500k mixed |||| 0.515 |
| Test 3: Late Pretrain Recovery with Transcription | Last | 500k mixed | ✅ | ✅ | ✅
(With transcription token) | 0.48 |
| Test 4: Extended Late Pretrain Recovery | Last | 1.89M mixed | ✅ | ✅ | ✅
(With transcription prompts) | 0.61 |
| Test 3: Late Pretrain Recovery with Transcription (With transcription token) | Last | 500k mixed | ✅ | ✅ | ✅ | 0.48 |
| Test 4: Extended Late Pretrain Recovery (With transcription prompts) | Last | 1.89M mixed | ✅ | ✅ | ✅ | 0.63 |

**Mixed training data between modalities:** We determined an optimal interleaving of different data types with 70% speech instruction prompts, 20% speech transcription prompts and 10% text-only prompts.
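As a rough illustration of that 70/20/10 mix, the sketch below samples each example's modality according to those proportions; the dataset names and the sampling scheme are assumptions for illustration, not the actual training pipeline.

```python
import random

# Target proportions from the post: 70% speech instruction prompts,
# 20% speech transcription prompts, 10% text-only prompts.
MIX = {
    "speech_instruction": 0.7,
    "speech_transcription": 0.2,
    "text_only": 0.1,
}

def sample_mixed_batch(datasets: dict[str, list], batch_size: int, seed: int = 0) -> list:
    """Draw a batch whose modality mix follows the target ratios in expectation."""
    rng = random.Random(seed)
    names, weights = list(MIX.keys()), list(MIX.values())
    batch = []
    for _ in range(batch_size):
        source = rng.choices(names, weights=weights, k=1)[0]
        batch.append(rng.choice(datasets[source]))
    return batch
```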

@@ -212,9 +210,7 @@ Beyond randomizing sound tokens for inaudible input, we also performed sequence
**AudioBench Eval**: [AudioBench](https://arxiv.org/abs/2406.16020) is a June 2024 benchmark designed to evaluate audio large language models (AudioLLMs). It measures speech capabilities, in addition to ASR, transcription, etc., through a compilation of many open datasets.
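As a rough sketch of how a 0-5 model-as-judge score like the ones reported below can be produced, the snippet assumes an OpenAI-style client and an illustrative judge prompt; it is not AudioBench's actual evaluation code.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative judge prompt; AudioBench's real prompt wording differs.
JUDGE_PROMPT = (
    "Rate the assistant's answer to the spoken instruction on a 0-5 scale, "
    "where 5 is fully correct and helpful. Reply with a single integer.\n\n"
    "Instruction (transcript): {instruction}\n"
    "Reference answer: {reference}\n"
    "Assistant answer: {answer}"
)

def judge_score(instruction: str, reference: str, answer: str) -> int:
    """Ask a GPT-4o judge for a 0-5 score of one model response."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            instruction=instruction, reference=reference, answer=answer)}],
    )
    return int(response.choices[0].message.content.strip())
```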

| Model Bench | [Open-hermes Instruction Audio](https://huggingface.co/datasets/AudioLLMs/openhermes_instruction_test)
(GPT-4-O judge 0:5) | [Alpaca Instruction Audio](https://huggingface.co/datasets/AudioLLMs/alpaca_audio_test)
(GPT-4-O judge 0:5) |
| Model Bench | [Open-hermes Instruction Audio](https://huggingface.co/datasets/AudioLLMs/openhermes_instruction_test) (GPT-4-O judge 0:5) | [Alpaca Instruction Audio](https://huggingface.co/datasets/AudioLLMs/alpaca_audio_test) (GPT-4-O judge 0:5) |
| --- | --- | --- |
| [Llama3.1-s-v2](https://huggingface.co/homebrewltd/llama3-s-instruct-v0.2) | 3.45 | 3.53 |
| [Ichigo-llama3.1-s v0.3-phase2 -cp7000](https://huggingface.co/homebrewltd/Ichigo-llama3.1-s-instruct-v0.3-phase-2) | 3.42 | 3.62 |
@@ -243,15 +239,11 @@ For now, our next steps are as follows:

| Task Type | v0.2 | v0.3 |
| --- | --- | --- |
| Speech Multi-turn | None | 140K samples: 2 turns
10K samples >= 4 turns |
| Speech Multi-turn | None | 140K samples: 2 turns, 10K samples >= 4 turns |
| Speech QA | 679K samples | 1.33M samples |
| Transcription | 250K samples
(Using a special token) | 400K samples
(6 different prompts) |
| Transcription | 250K samples (Using a special token) | 400K samples (6 different prompts) |
| Noise Audio | None | 8K samples |
| Text-only | None | 100K samples: multi-turn
50K samples: single turn |
| Text-only | None | 100K samples: multi-turn, 50K samples: single turn |

**Prompts used for transcription data**
