Add feature for extracting images from pdf and recognizing text from images. #10653

therontau0054 · 2023-09-15T18:58:08Z

Description

This addresses #10423: it would be useful to extract images from PDFs and recognize the text in them. I have implemented this for PyPDFLoader, PyPDFium2Loader, PyPDFDirectoryLoader, PyMuPDFLoader, PDFMinerLoader, and PDFPlumberLoader. RapidOCR is used to recognize text in the extracted images. Because OCR is time-consuming, a boolean parameter extract_images controls whether images are extracted and recognized. I measured the run time of each parser with the unit tests on my own laptop (ThinkBook 14+ with an AMD R7-6800H); the results are:

extract_images	PyPDFParser	PDFMinerParser	PyMuPDFParser	PyPDFium2Parser	PDFPlumberParser
False	0.27s	0.39s	0.06s	0.08s	1.01s
True	17.01s	20.67s	20.32s	19.75s	20.55s
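
To make the cost of OCR easier to compare across parsers, the timings above can be turned into an absolute overhead and a slowdown factor. This is only an illustrative sketch: the numbers are copied from the table (the PyPDFium2Parser value with OCR is taken as 19.75 s), and the names `TIMINGS` and `ocr_overhead` are hypothetical, not part of the PR.

```python
from typing import Dict, Tuple

# (seconds with extract_images=False, seconds with extract_images=True),
# copied from the table in the PR description.
TIMINGS: Dict[str, Tuple[float, float]] = {
    "PyPDFParser": (0.27, 17.01),
    "PDFMinerParser": (0.39, 20.67),
    "PyMuPDFParser": (0.06, 20.32),
    "PyPDFium2Parser": (0.08, 19.75),
    "PDFPlumberParser": (1.01, 20.55),
}


def ocr_overhead(timings: Dict[str, Tuple[float, float]]) -> Dict[str, Dict[str, float]]:
    """Return the extra seconds and slowdown factor that OCR adds per parser."""
    result = {}
    for name, (without_ocr, with_ocr) in timings.items():
        result[name] = {
            "extra_seconds": round(with_ocr - without_ocr, 2),
            "slowdown": round(with_ocr / without_ocr, 1),
        }
    return result


if __name__ == "__main__":
    for name, stats in ocr_overhead(TIMINGS).items():
        print(f"{name}: +{stats['extra_seconds']}s, {stats['slowdown']}x slower")
```

On these figures OCR adds roughly 17 to 20 seconds regardless of the parser, which suggests the OCR step dominates the run time once extract_images is enabled.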

Issue

#10423

Dependencies

rapidocr_onnxruntime in RapidOCR
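
For reference, enabling the feature from user code might look like the sketch below. It assumes that the extract_images keyword described above is accepted by the PyPDFLoader constructor, that the loader is importable from `langchain.document_loaders` (other releases may expose it under a different package), and that pypdf and rapidocr_onnxruntime are installed. The helper name `load_pdf_with_ocr` and the file name are hypothetical.

```python
from typing import Any, List


def load_pdf_with_ocr(path: str, extract_images: bool = True) -> List[Any]:
    """Load a PDF page by page, optionally running OCR on embedded images.

    Assumption: PyPDFLoader takes an ``extract_images`` keyword, as the PR
    description states. The import is done lazily so the optional
    dependencies are only required when the function is actually called.
    """
    from langchain.document_loaders import PyPDFLoader  # assumed import path

    loader = PyPDFLoader(path, extract_images=extract_images)
    return loader.load()


# Example usage (hypothetical file name):
#   docs = load_pdf_with_ocr("example.pdf")
#   for doc in docs:
#       print(doc.metadata, len(doc.page_content))
```

Given the timings reported in the description, leaving extract_images off is the cheaper choice whenever the text inside images is not needed.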

therontau0054 · 2023-09-22T06:53:53Z

@eyurtsev @baskaryan would it be possible to have a look at this? It may be a great feature to add. :)

baskaryan · 2023-10-06T01:51:53Z

this is awesome, thanks @SuperJokerayo!!

therontau0054 added 9 commits September 14, 2023 00:03

add feature for exacting images

4515a9a

test assets

7080172

push

816d8b0

Merge branch 'master' of github.com:SuperJokerayo/langchain into pdf_…

aa661fa

…image

delete some test files

811030e

Merge branch 'master' of github.com:SuperJokerayo/langchain into pdf_…

71a895a

…image

add ocr dependency rapidocr_onnxruntime for extracting text from images

b449735

add feature of extracting images from pdf and recognizing text from i…

e41e68a

…mages

add new unit_tests for new feature

2daa5cd

dosubot bot added the Ɑ: doc loader and 🤖:enhancement labels Sep 15, 2023

Merge branch 'master' of github.com:SuperJokerayo/langchain into pdf_…

6fab211

…image

therontau0054 force-pushed the pdf_image branch from 6d31616 to fe8dffc Compare September 16, 2023 05:58

fix some ci/cd bugs

6c10c2a

therontau0054 force-pushed the pdf_image branch from fe8dffc to 6c10c2a Compare September 16, 2023 06:03

therontau0054 and others added 2 commits September 16, 2023 14:07

fix some ci/cd bugs

dd37b44

Merge branch 'master' into pdf_image

7c04d90

vercel bot deployed to Preview September 17, 2023 18:09 View deployment

baskaryan assigned eyurtsev Sep 20, 2023

merge

95ea556

vercel bot deployed to Preview October 5, 2023 23:54 View deployment

baskaryan added 2 commits October 5, 2023 17:59

docs

aace0cc

nit

47b2bb7

baskaryan added the lgtm label Oct 6, 2023

vercel bot deployed to Preview October 6, 2023 01:22 View deployment

baskaryan merged commit 35297ca into langchain-ai:master Oct 6, 2023
31 checks passed
