Add feature for extracting images from pdf and recognizing text from images. #10653

Merged 16 commits on Oct 6, 2023

Conversation

therontau0054
Contributor

Description

This addresses #10423: extracting images from PDFs and recognizing the text in them would be a useful feature. I have implemented it for PyPDFLoader, PyPDFium2Loader, PyPDFDirectoryLoader, PyMuPDFLoader, PDFMinerLoader, and PDFPlumberLoader. RapidOCR is used to recognize text in the extracted images. Because OCR is time-consuming, a boolean parameter extract_images controls whether images are extracted and recognized. I measured the run time of each parser with unit tests on my own laptop (ThinkBook 14+ with an AMD R7-6800H); the results are:

| extract_images | PyPDFParser | PDFMinerParser | PyMuPDFParser | PyPDFium2Parser | PDFPlumberParser |
|---|---|---|---|---|---|
| False | 0.27s | 0.39s | 0.06s | 0.08s | 1.01s |
| True | 17.01s | 20.67s | 20.32s | 19.75s | 20.55s |
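
To make the cost of OCR easier to compare across parsers, the timings above can be turned into an absolute overhead and a slowdown factor. This is only a small arithmetic sketch: the numbers are copied from the table (reading 19,75s as 19.75 s), and the names `timings` and `ocr_overhead` are made up for illustration.

```python
# Timings in seconds, copied from the table: (extract_images=False, extract_images=True).
timings = {
    "PyPDFParser": (0.27, 17.01),
    "PDFMinerParser": (0.39, 20.67),
    "PyMuPDFParser": (0.06, 20.32),
    "PyPDFium2Parser": (0.08, 19.75),
    "PDFPlumberParser": (1.01, 20.55),
}


def ocr_overhead(without_s: float, with_s: float) -> tuple[float, float]:
    """Return (extra seconds, slowdown factor) caused by enabling OCR."""
    return with_s - without_s, with_s / without_s


for parser, (off, on) in timings.items():
    extra, factor = ocr_overhead(off, on)
    print(f"{parser:<17} +{extra:.2f}s  ({factor:.0f}x slower)")
```

Whichever parser is used, enabling OCR adds roughly 17 to 20 seconds on this test, so the OCR step dominates the run time and the choice of parser matters little once extract_images is on.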

Issue

#10423

Dependencies

rapidocr_onnxruntime (from the RapidOCR project)
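
Since the OCR dependency is only needed when extract_images is enabled, one plausible way to structure the flag is to import RapidOCR lazily and skip images entirely on the default path. The sketch below is not the code from this PR: `load_rapidocr`, `page_content`, and `OcrFn` are hypothetical names, and the `(result, elapsed)` return shape of the RapidOCR engine is an assumption about the rapidocr_onnxruntime package.

```python
from typing import Callable, Iterable, Optional

# Hypothetical type alias: an OCR function maps raw image bytes to text.
OcrFn = Callable[[bytes], str]


def load_rapidocr() -> OcrFn:
    """Import RapidOCR lazily; it is only needed when OCR is requested."""
    try:
        from rapidocr_onnxruntime import RapidOCR
    except ImportError as exc:
        raise ImportError(
            "rapidocr_onnxruntime is required when extract_images=True"
        ) from exc

    engine = RapidOCR()

    def recognize(image: bytes) -> str:
        # Assumption: the engine returns (result, elapsed), where result is
        # None or a list of [box, text, score] entries.
        result, _ = engine(image)
        return "\n".join(item[1] for item in result) if result else ""

    return recognize


def page_content(
    text: str,
    images: Iterable[bytes],
    extract_images: bool = False,
    ocr: Optional[OcrFn] = None,
) -> str:
    """Return page text, appending OCR output only when extract_images is set."""
    if not extract_images:
        return text  # fast path: images are never touched, no OCR import
    recognize = ocr if ocr is not None else load_rapidocr()
    return text + "".join(recognize(image) for image in images)
```

Passing the OCR function as a parameter keeps the sketch testable without the dependency installed; with the default extract_images=False, rapidocr_onnxruntime is never imported.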

@dosubot added the Ɑ: doc loader and 🤖:enhancement labels on Sep 15, 2023
@therontau0054
Contributor Author

@eyurtsev @baskaryan would it be possible to have a look at this? It may be a great feature to add. :)

@baskaryan added the lgtm label on Oct 6, 2023
@baskaryan
Collaborator

This is awesome, thanks @SuperJokerayo!!

@baskaryan baskaryan merged commit 35297ca into langchain-ai:master Oct 6, 2023
31 checks passed