Added pytesseract method to use OCR on flat pdfs. #5

Open
wants to merge 3 commits into
base: main

Conversation

jokerale

Added pytesseract support to scan flat PDFs (those whose pages are images) and retrieve the text inside them. Also added a small check that triggers the OCR function when zero lines of text are found in a PDF.
Also added the libraries used to the requirements files.
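
For readers without the diff at hand, the fallback described above might look roughly like the sketch below. It is not the code from this PR: the function names `ocr_pdf` and `extract_text`, the injectable `text_extractor` parameter, and the use of `pdf2image` to rasterize pages are all assumptions made for illustration. Only `pytesseract` is named in the description.

```python
from typing import Callable


def ocr_pdf(pdf_path: str) -> str:
    """OCR every page of a PDF (hypothetical helper, not the PR's function).

    Assumes pdf2image (with poppler) and pytesseract (with the Tesseract
    binary) are installed.
    """
    # Imported lazily so the fallback logic below works without the OCR stack.
    from pdf2image import convert_from_path
    import pytesseract

    pages = convert_from_path(pdf_path)  # one PIL image per page
    return "\n".join(pytesseract.image_to_string(page) for page in pages)


def extract_text(
    pdf_path: str,
    text_extractor: Callable[[str], str],
    ocr_extractor: Callable[[str], str] = ocr_pdf,
) -> str:
    """Return embedded text, falling back to OCR when zero text lines are found."""
    text = text_extractor(pdf_path)
    # Count only lines that contain something other than whitespace.
    lines = [line for line in text.splitlines() if line.strip()]
    if lines:
        return text
    # A "flat" PDF: pages are images, so there is no embedded text to read.
    return ocr_extractor(pdf_path)
```

Passing the regular text extractor in as an argument keeps the zero-lines check testable without a real PDF or a Tesseract installation.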

@Wazzabeee
Owner

Buonasera 🇮🇹

Thanks for adding this! Could you rebase your PR onto the latest commits of the repo? I added some code-quality checks and reformatting; they should not cause conflicts with your code. Also, before merging this PR, it would be nice to add a PDF that contains only scanned text, so that the example supports and works with scanned text.

If you know how to, it would be perfect if you could add one or more tests for your changes.

I created this project a long time ago so I know the current code is not tested properly, but I will gradually take the time to add tests for all my functions.

Thanks in advance!

jokerale and others added 2 commits May 8, 2024 16:26
feat: add pre commit to repo

fix: remove init

fix: scripts structure

Bump black from 23.11.0 to 24.3.0

Bumps [black](https://github.com/psf/black) from 23.11.0 to 24.3.0.
- [Release notes](https://github.com/psf/black/releases)
- [Changelog](https://github.com/psf/black/blob/main/CHANGES.md)
- [Commits](psf/black@23.11.0...24.3.0)

---
updated-dependencies:
- dependency-name: black
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>

Bump nltk from 3.6.3 to 3.6.6

Bumps [nltk](https://github.com/nltk/nltk) from 3.6.3 to 3.6.6.
- [Changelog](https://github.com/nltk/nltk/blob/develop/ChangeLog)
- [Commits](nltk/nltk@3.6.3...3.6.6)

---
updated-dependencies:
- dependency-name: nltk
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>

fix: readme & saving path

feat: add setup changelog and version (Wazzabeee#8)

First release

fix: rename package for pypi (Wazzabeee#9)

rename package from plagiarism-checker to plagiarism-detector

fix: rename pypi package (Wazzabeee#10)

fix: rename files with copy-spotter name

feat: add tags and automatic versioning
@jokerale
Author

jokerale commented May 8, 2024

Bonsoir 🇫🇷

I've tried to rebase the PR onto the latest commits.
Please let me know if this is the right way.

I'll add some tests for the OCR function, using the added PDF, in a future PR.

Best

2 participants