Inconsistent OCR results with Idefics2 model in mlx_vlm compared to other environments #30
Comments
Hey @civvic, thanks for bringing this issue up! I will look into it.
@civvic it's fixed ✅ Just update your release to
Let me know if you face any other issues :)
Works great, thanks! That was quick! Now on to Paligemma. I'm also interested in Phi-3 V; maybe now is the right time to try to decipher the spaghetti transformers and start with mlx.
Most welcome! If you want to understand transformers code better, you can check my video series on YT. It's about the Llama-2 arch, but it generalizes :) https://youtube.com/playlist?list=PLDn_JsyofyfQp4td_ub6LfIg5vxyu6YJK&si=XgpOFYIC20UDKgHz
Regarding Phi-3 Vision, check out #28
Ah, I've already been tuning into your channel! 😄 The new Llama 3 series looks very interesting. My struggle, though, isn't so much with the Transformers architecture itself; it's more about navigating the labyrinth of HuggingFace's transformers code. I stumbled upon their blog post explaining the 'benefits' of spaghetti code, and let's just say it all makes a bit more sense now why things are the way they are! 🍝
I am currently evaluating the OCR capabilities of the Idefics2 model, specifically for extracting text from comic book speech balloons.
The model performs as expected on various platforms including my local Linux environment, HF Playground, and Google Colab across different hardware configurations. However, when using the mlx_vlm implementation, the results are inconsistent and generally nonsensical.
mlx-community/idefics2-8b-4bit (also tested with 8bit)
The expected output should closely match the results from other environments, such as:
["THE ECHO OF THE OLD MAN'S FOOTSTEPS FADES DOWN THE HALL AS ..."]
I can share the code used to generate that, but it closely follows the code from the HF Idefics2 model card.
The output from mlx_vlm is significantly different and less accurate:
Additional Information
Could you please investigate why the mlx_vlm Idefics2 model yields such different results compared to other environments? As far as I can tell, the inputs generated by the processor with transformers and mlx_vlm are the same. It seems the issue might be related to the generation process or detokenization, but I am unsure due to the complexity of the transformers code and my limited familiarity with mlx_vlm.
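For anyone trying to reproduce this, one way to check whether the processor inputs really match across the two environments is a key-by-key comparison. The sketch below is only an illustration, not code from this issue: `flatten` and `diff_inputs` are hypothetical helpers, the toy dictionaries stand in for real processor outputs, and it assumes each tensor has first been converted to nested Python lists (for example with `.tolist()`).

```python
from typing import Any, Dict, Iterator, List


def flatten(values: Any) -> Iterator[Any]:
    """Yield the scalar leaves of an arbitrarily nested list or tuple."""
    if isinstance(values, (list, tuple)):
        for item in values:
            yield from flatten(item)
    else:
        yield values


def diff_inputs(ref: Dict[str, Any], other: Dict[str, Any], tol: float = 1e-4) -> List[str]:
    """Return human-readable mismatches between two dicts of model inputs."""
    problems = []
    for key in sorted(set(ref) | set(other)):
        if key not in ref or key not in other:
            problems.append(f"{key}: present in only one environment")
            continue
        a, b = list(flatten(ref[key])), list(flatten(other[key]))
        if len(a) != len(b):
            problems.append(f"{key}: {len(a)} vs {len(b)} elements")
            continue
        # Largest element-wise gap; 0.0 when both sides are empty.
        worst = max((abs(x - y) for x, y in zip(a, b)), default=0.0)
        if worst > tol:
            problems.append(f"{key}: max abs difference {worst:.6g}")
    return problems


# Toy data standing in for real processor outputs (hypothetical values).
reference = {"input_ids": [[1, 32000, 5]], "pixel_values": [[0.10, 0.20]]}
candidate = {"input_ids": [[1, 32000, 5]], "pixel_values": [[0.10, 0.25]]}
print(diff_inputs(reference, candidate))
```

An empty list means the inputs agree within the tolerance, which would point toward generation or detokenization rather than preprocessing. Because the comparison flattens each tensor, it reports value and size differences but does not separately flag a shape difference with the same element count.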