Inconsistent OCR results with Idefics2 model in mlx_vlm compared to other environments #30

Closed
civvic opened this issue May 25, 2024 · 7 comments · Fixed by #31

civvic commented May 25, 2024

I am currently evaluating the OCR capabilities of the Idefics2 model, specifically for extracting text from comic book speech balloons.

[Attached image: Strange_Tales_172005_7_Default, grey pad]

The model performs as expected on various platforms including my local Linux environment, HF Playground, and Google Colab across different hardware configurations. However, when using the mlx_vlm implementation, the results are inconsistent and generally nonsensical.

  • mlx_vlm version: "0.0.6", dev install
  • Model Used: mlx-community/idefics2-8b-4bit (also tested with 8bit)
  • Code Snippet:
    from mlx_vlm import load, generate
    from PIL import Image

    model, processor = load("mlx-community/idefics2-8b-4bit")
    image = Image.open("Strange_Tales_172005_7_Default.png")  # the attached test image
    prompt_text_tmpl = "Perform optical character recognition OCR on this image, which contains speech balloons from a comic book. The text is in English."
    resulting_messages = [
        {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": prompt_text_tmpl}]}
    ]
    prompt = processor.apply_chat_template(resulting_messages, add_generation_prompt=True)
    output = generate(model, processor, image, prompt, temp=0.4, max_tokens=512, top_p=0.8, verbose=True)

The expected output should closely match the results from other environments, such as:

["THE ECHO OF THE OLD MAN'S FOOTSTEPS FADES DOWN THE HALL AS ..."]

I can share the code used to generate that, but it closely follows the code from the HF Idefics2 model card.
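For reference, a minimal sketch of what that transformers-side run looks like, following the Idefics2 model card (the checkpoint name, float16 dtype, and CUDA device are assumptions here; the test image is the same one attached above):

    import torch
    from PIL import Image
    from transformers import AutoProcessor, AutoModelForVision2Seq

    # Assumed reference setup: the original fp16 checkpoint on a CUDA device.
    processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b")
    model = AutoModelForVision2Seq.from_pretrained(
        "HuggingFaceM4/idefics2-8b", torch_dtype=torch.float16
    ).to("cuda")

    image = Image.open("Strange_Tales_172005_7_Default.png")  # the attached test image
    prompt_text_tmpl = "Perform optical character recognition OCR on this image, which contains speech balloons from a comic book. The text is in English."
    messages = [
        {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": prompt_text_tmpl}]}
    ]
    prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
    inputs = processor(text=prompt, images=[image], return_tensors="pt").to("cuda")

    generated_ids = model.generate(**inputs, max_new_tokens=512)
    print(processor.batch_decode(generated_ids, skip_special_tokens=True))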

The output from mlx_vlm is significantly different and less accurate:

==========
Image: <PIL.PngImagePlugin.PngImageFile image mode=RGB size=185x145 at 0x537C49490> 

Prompt: User:<image>Perform optical character recognition OCR on this image, which contains speech balloons from a comic book. The text is in English.<end_of_utterance>
Assistant:
The text consists of a single word: "down".<end_of_utterance>
==========
Prompt: 76.531 tokens-per-sec
Generation: 49.216 tokens-per-sec

Additional Information

  • The issue persists across different quantizations.
  • Similar tests with the llava-1.5-7b model in HF, on my Linux rig, and in mlx_vlm show consistent and more accurate results:
==========
Image: <PIL.PngImagePlugin.PngImageFile image mode=RGB size=185x145 at 0x174971490> 

Prompt: <s>[INST] <image>
Perform optical character recognition OCR on this image, which contains speech balloons from a comic book. The text is in English. [/INST]
</s>
The echo of the old man's footsteps fades down the hall.
==========
Prompt: 16.869 tokens-per-sec
Generation: 8.546 tokens-per-sec

Could you please investigate why the mlx_vlm Idefics2 model yields such different results compared to other environments? As far as I can tell, the inputs generated by the processor are the same in transformers and mlx_vlm. The issue might be in the generation loop or detokenization, but I am unsure due to the complexity of the transformers code and my limited familiarity with mlx_vlm.
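One hedged sanity check for that hypothesis (a sketch, not something from the thread): feed the same prompt and image to the stock transformers processor and to the processor returned by mlx_vlm's load(), and compare the outputs. This assumes the processor mlx_vlm hands back is a regular transformers processor, and it reuses prompt, image, and processor from the snippet above:

    import numpy as np
    from transformers import AutoProcessor

    # Reference processor from the original HF checkpoint.
    hf_processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b")
    hf_inputs = hf_processor(text=prompt, images=[image], return_tensors="np")

    # `processor` is the object returned by mlx_vlm's load(); assumed to accept the same call.
    mlx_inputs = processor(text=prompt, images=[image], return_tensors="np")

    # If the token ids and pixel tensors match, the divergence is downstream
    # (generation loop or detokenization) rather than in preprocessing.
    print(np.array_equal(hf_inputs["input_ids"], mlx_inputs["input_ids"]))
    print(hf_inputs["pixel_values"].shape, mlx_inputs["pixel_values"].shape)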


Blaizzy commented May 25, 2024

Hey @civvic,

Thanks for bringing this issue up!

I will look into it.


Blaizzy commented May 25, 2024

@civvic it's fixed ✅

Just update to release v0.0.7:

pip install -U mlx-vlm

Let me know if you face any other issues :)


civvic commented May 25, 2024

Works great, thanks! That was quick!

Now on to Paligemma. I'm also interested in Phi-3 V; maybe now is the right time to try to decipher the transformers spaghetti and get started with mlx.


Blaizzy commented May 25, 2024

Most welcome!

If you want to understand the transformers code better, you can check out my video series on YT.

It's about the Llama-2 architecture, but it generalizes :)

https://youtube.com/playlist?list=PLDn_JsyofyfQp4td_ub6LfIg5vxyu6YJK&si=XgpOFYIC20UDKgHz


Blaizzy commented May 25, 2024

Paligemma works as well:

[Screenshot: Paligemma output, 2024-05-25 at 11:42 PM]
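For anyone trying to reproduce this, a minimal sketch of a Paligemma run with the same load/generate API shown above (the checkpoint name is an assumption; PaliGemma mix checkpoints take short task prefixes such as "ocr" rather than a chat template):

    from mlx_vlm import load, generate
    from PIL import Image

    # Assumed mlx-community conversion of a PaliGemma mix checkpoint.
    model, processor = load("mlx-community/paligemma-3b-mix-448-8bit")
    image = Image.open("Strange_Tales_172005_7_Default.png")  # same test image as above

    # "ocr" is the PaliGemma task prefix for transcribing text in the image.
    output = generate(model, processor, image, "ocr", max_tokens=512, verbose=True)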


Blaizzy commented May 25, 2024

Regarding Phi-3 Vision, check out #28.


civvic commented May 27, 2024

> Most welcome!
>
> If you want to understand the transformers code better, you can check out my video series on YT.
>
> It's about the Llama-2 architecture, but it generalizes :)
>
> https://youtube.com/playlist?list=PLDn_JsyofyfQp4td_ub6LfIg5vxyu6YJK&si=XgpOFYIC20UDKgHz

Ah, I've already been tuning into your channel! 😄 The new Llama 3 series looks very interesting. My struggle, though, isn't so much with the Transformer architecture itself as with navigating the labyrinth of HuggingFace's transformers code. I stumbled upon their blog post explaining the 'benefits' of spaghetti code, and let's just say, it all makes a bit more sense now why things are the way they are! 🍝
