optimizations for image embedding with vision #2133

Open
cforce opened this issue Nov 9, 2024 · 1 comment
cforce commented Nov 9, 2024

Could you help clarify why this warning is necessary? It clutters the console, and the purpose isn't entirely clear.

Additionally, I've noticed that when using the Vision API, all pages from documents (like PDFs) are stored as PNGs, even if there isn’t a single image on the page. Is there a reason for this? Couldn’t we apply the Vision API selectively, using it only for pages containing images? This would avoid the extra processing effort and the token usage involved in storing simple text pages as images. Converting text to images for Vision seems to double the runtime, increase blob storage, and create unnecessary index chunks for text-only pages.

Unless I’ve misunderstood, it would make sense to use Vision solely on image-containing pages, achieving the best of both approaches without doubling token consumption. Could you provide some insight into this approach? I’m still exploring the code and would appreciate a better understanding.
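
For illustration, here is a minimal sketch of such a page-level check, assuming PyMuPDF is available; the function name `pages_with_images` is hypothetical, not something from this repo:

```python
import fitz  # PyMuPDF

def pages_with_images(pdf_path: str) -> list[int]:
    """Return the 0-based numbers of pages that embed at least one raster image."""
    with fitz.open(pdf_path) as doc:
        return [i for i, page in enumerate(doc) if page.get_images(full=True)]
```

prepdocs could then rasterize and Vision-embed only those pages, falling back to plain text extraction everywhere else.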

There is also some dead code; I'm not sure whether it's a leftover or an unfinished feature. `has_image_embeddings` is never used after it is assigned:

```python
self.has_image_embeddings = has_image_embeddings
```

so this parameter is also not needed:

```python
search_images: bool = False,
```
I also wonder whether it is OK to enable vision in the app but run prepdocs with vision only for sources with heavy image content. Would the app be able to deal with mixed embeddings? If prepdocs could decide on the fly, per page, whether the page is image-heavy, this could improve speed and cost by using Vision only where it is useful (see the sketch below).
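
A rough sketch of such an on-the-fly, per-page decision, again assuming PyMuPDF; the 20% area threshold is an arbitrary assumption for illustration, not a tuned value:

```python
import fitz  # PyMuPDF

IMAGE_AREA_THRESHOLD = 0.2  # assumed cutoff: fraction of the page covered by images

def is_image_heavy(page: fitz.Page) -> bool:
    """Heuristic: treat a page as image-heavy when its embedded images
    cover at least IMAGE_AREA_THRESHOLD of the page area."""
    image_area = 0.0
    for img in page.get_images(full=True):
        xref = img[0]  # cross-reference number of the image object
        for rect in page.get_image_rects(xref):
            image_area += abs(rect)  # abs(Rect) is the rectangle's area
    return image_area / abs(page.rect) >= IMAGE_AREA_THRESHOLD
```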


cforce commented Nov 10, 2024

A very interesting approach is this notebook: https://github.com/douglasware/ElumenotionSite/blob/master/Projects/PdfToMarkdownAndQaPairs/v4omni-image-plus-docIntelOcr.ipynb?short_path=78bb846

The notebook provides a streamlined approach for processing OCR data from images. The workflow converts each page of a PDF into an OCR-generated markdown file, enriched with image descriptions and MermaidJS diagrams through GPT-4o. A structured prompt directs GPT-4o to transcribe the document's text and recreate its tables while inserting descriptive text for figures; these descriptions may include diagrams generated in valid MermaidJS syntax. The MermaidJS guidelines enforce correct syntax, requiring alphanumeric characters and underscores for node IDs and double quotes around labels containing special characters. The process costs around $0.03 and takes under 10 seconds per page, and it requires the page images to be produced first via PdfToPageImages.ipynb and DocIntelligencePipeline.ipynb before the markdown content is generated.

In Azure AI Search, indexing the markdown rather than the original PDF improves efficiency: the new index reduces storage while retaining most of the document's critical content, resulting in fewer chunks than indexing the entire PDF.
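
For context, a minimal sketch of the kind of GPT-4o call that workflow revolves around, using the OpenAI Python SDK; the prompt wording is paraphrased from the notebook's description, not copied from it, and the file name is a placeholder:

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

with open("page_001.png", "rb") as f:  # one page image, e.g. from PdfToPageImages.ipynb
    image_b64 = base64.b64encode(f.read()).decode()

prompt = (
    "Transcribe all text on this document page as Markdown. Recreate tables "
    "in Markdown table syntax. For each figure, insert a descriptive paragraph; "
    "where the figure is a diagram, also emit a MermaidJS code block, using only "
    "alphanumeric characters and underscores for node IDs and double quotes "
    "around labels that contain special characters."
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/png;base64,{image_b64}"},
            },
        ],
    }],
)
print(response.choices[0].message.content)  # the page as enriched markdown
```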
