Inconsistent OCR results with Idefics2 model in mlx_vlm compared to other environments #30

Closed
civvic opened this issue May 25, 2024 · 7 comments · Fixed by #31

civvic commented May 25, 2024

I am currently evaluating the OCR capabilities of the Idefics2 model, specifically for extracting text from comic book speech balloons.

[Attached image: Strange_Tales_172005_7_Default, grey pad]

The model performs as expected on various platforms including my local Linux environment, HF Playground, and Google Colab across different hardware configurations. However, when using the mlx_vlm implementation, the results are inconsistent and generally nonsensical.

  • mlx_vlm version: "0.0.6", dev install
  • Model Used: mlx-community/idefics2-8b-4bit (also tested with 8bit)
  • Code Snippet:
    from mlx_vlm import load, generate
    from PIL import Image

    model, processor = load("mlx-community/idefics2-8b-4bit")
    image = Image.open("Strange_Tales_172005_7_Default.png")  # the attached test image
    prompt_text_tmpl = "Perform optical character recognition OCR on this image, which contains speech balloons from a comic book. The text is in English."
    resulting_messages = [
        {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": prompt_text_tmpl}]}
    ]
    prompt = processor.apply_chat_template(resulting_messages, add_generation_prompt=True)
    output = generate(model, processor, image, prompt, temp=0.4, max_tokens=512, top_p=0.8, verbose=True)

The expected output should closely match the results from other environments, such as:

["THE ECHO OF THE OLD MAN'S FOOTSTEPS FADES DOWN THE HALL AS ..."]

I can share the code used to generate that, but it closely follows the code from the HF Idefics2 model card.
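For reference, a minimal sketch of what that transformers-side run looks like, following the Idefics2 model card (the checkpoint name, float16 dtype, and CUDA device are assumptions here; the test image is the same one attached above):

    import torch
    from PIL import Image
    from transformers import AutoProcessor, AutoModelForVision2Seq

    # Assumed reference setup: the original fp16 checkpoint on a CUDA device.
    processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b")
    model = AutoModelForVision2Seq.from_pretrained(
        "HuggingFaceM4/idefics2-8b", torch_dtype=torch.float16
    ).to("cuda")

    image = Image.open("Strange_Tales_172005_7_Default.png")  # the attached test image
    prompt_text_tmpl = "Perform optical character recognition OCR on this image, which contains speech balloons from a comic book. The text is in English."
    messages = [
        {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": prompt_text_tmpl}]}
    ]
    prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
    inputs = processor(text=prompt, images=[image], return_tensors="pt").to("cuda")

    generated_ids = model.generate(**inputs, max_new_tokens=512)
    print(processor.batch_decode(generated_ids, skip_special_tokens=True))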

The output from mlx_vlm is significantly different and less accurate:

==========
Image: <PIL.PngImagePlugin.PngImageFile image mode=RGB size=185x145 at 0x537C49490> 

Prompt: User:<image>Perform optical character recognition OCR on this image, which contains speech balloons from a comic book. The text is in English.<end_of_utterance>
Assistant:
The text consists of a single word: "down".<end_of_utterance>
==========
Prompt: 76.531 tokens-per-sec
Generation: 49.216 tokens-per-sec

Additional Information

  • The issue persists across different quantizations.
  • Similar tests with the llava-1.5-7b model in HF, on my Linux rig, and in mlx_vlm show consistent and more accurate results:
==========
Image: <PIL.PngImagePlugin.PngImageFile image mode=RGB size=185x145 at 0x174971490> 

Prompt: <s>[INST] <image>
Perform optical character recognition OCR on this image, which contains speech balloons from a comic book. The text is in English. [/INST]
</s>
The echo of the old man's footsteps fades down the hall.
==========
Prompt: 16.869 tokens-per-sec
Generation: 8.546 tokens-per-sec

Could you please investigate why the mlx_vlm Idefics2 model yields such different results compared to other environments? As far as I can tell, the inputs generated by the processor are the same in transformers and mlx_vlm. The issue might be in the generation loop or detokenization, but I am unsure due to the complexity of the transformers code and my limited familiarity with mlx_vlm.
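One hedged sanity check for that hypothesis (a sketch, not something from the thread): feed the same prompt and image to the stock transformers processor and to the processor returned by mlx_vlm's load(), and compare the outputs. This assumes the processor mlx_vlm hands back is a regular transformers processor, and it reuses prompt, image, and processor from the snippet above:

    import numpy as np
    from transformers import AutoProcessor

    # Reference processor from the original HF checkpoint.
    hf_processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b")
    hf_inputs = hf_processor(text=prompt, images=[image], return_tensors="np")

    # `processor` is the object returned by mlx_vlm's load(); assumed to accept the same call.
    mlx_inputs = processor(text=prompt, images=[image], return_tensors="np")

    # If the token ids and pixel tensors match, the divergence is downstream
    # (generation loop or detokenization) rather than in preprocessing.
    print(np.array_equal(hf_inputs["input_ids"], mlx_inputs["input_ids"]))
    print(hf_inputs["pixel_values"].shape, mlx_inputs["pixel_values"].shape)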


Blaizzy commented May 25, 2024

Hey @civvic,

Thanks for bringing this issue up!

I will look into it.


Blaizzy commented May 25, 2024

@civvic it's fixed ✅

Just update to release v0.0.7:

pip install -U mlx-vlm

Let me know if you face any other issues :)


civvic commented May 25, 2024

Works great, thanks! That was quick!

Now on to Paligemma. I'm also interested in Phi-3 V; maybe now is the right time to try to decipher the transformers spaghetti and get started with mlx.


Blaizzy commented May 25, 2024

Most welcome!

If you want to understand the transformers code better, you can check out my video series on YT.

It's about the Llama-2 architecture, but it generalizes :)

https://youtube.com/playlist?list=PLDn_JsyofyfQp4td_ub6LfIg5vxyu6YJK&si=XgpOFYIC20UDKgHz


Blaizzy commented May 25, 2024

Paligemma works as well:

[Screenshot: Paligemma output, 2024-05-25 at 11:42 PM]
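For anyone trying to reproduce this, a minimal sketch of a Paligemma run with the same load/generate API shown above (the checkpoint name is an assumption; PaliGemma mix checkpoints take short task prefixes such as "ocr" rather than a chat template):

    from mlx_vlm import load, generate
    from PIL import Image

    # Assumed mlx-community conversion of a PaliGemma mix checkpoint.
    model, processor = load("mlx-community/paligemma-3b-mix-448-8bit")
    image = Image.open("Strange_Tales_172005_7_Default.png")  # same test image as above

    # "ocr" is the PaliGemma task prefix for transcribing text in the image.
    output = generate(model, processor, image, "ocr", max_tokens=512, verbose=True)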


Blaizzy commented May 25, 2024

Regarding Phi-3 Vision, check out #28.


civvic commented May 27, 2024

> Most welcome!
>
> If you want to understand the transformers code better, you can check out my video series on YT.
>
> It's about the Llama-2 architecture, but it generalizes :)
>
> https://youtube.com/playlist?list=PLDn_JsyofyfQp4td_ub6LfIg5vxyu6YJK&si=XgpOFYIC20UDKgHz

Ah, I've already been tuning into your channel! 😄 The new Llama 3 series looks very interesting. My struggle, though, isn't so much with the Transformer architecture itself as with navigating the labyrinth of HuggingFace's transformers code. I stumbled upon their blog post explaining the 'benefits' of spaghetti code, and let's just say, it all makes a bit more sense now why things are the way they are! 🍝
