GitHub - greydongilmore/ocr-pdf: Convert PDFs to OCR.

OCR PDF

PDF OCR conversion tool

About The Project

This code snippet searches a given directory for non-searchable PDF files and converts them to searchable PDFs using Optical Character Recognition (OCR). Manuscript PDFs obtained from online databases are not always in a searchable format, which means you cannot highlight or search for text within them. This small Python function recursively searches through a directory containing PDF files, determines which of them are non-searchable, and converts those to a searchable format.

OCR is a method for recognizing text within images and documents. PDFs contain vector graphics, which can in turn contain raster objects (.png, .jpg, etc.). The OCR process first rasterizes each page of the PDF file and then creates an OCR "layer".

This Python function wraps the command-line program OCRmyPDF.
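The workflow described above can be sketched roughly as follows. This is a minimal illustration and not the code in `main.py`: the helper names `find_pdfs`, `build_ocr_command`, and `ocr_directory`, as well as the `_ocr` output suffix, are assumptions made for this example. Instead of checking each file for text itself, the sketch relies on OCRmyPDF's `--skip-text` option, which leaves pages that already contain text untouched.

```python
from pathlib import Path
import subprocess


def find_pdfs(root):
    """Recursively collect PDF files under root (case-insensitive extension)."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def build_ocr_command(src, dst):
    """Build the OCRmyPDF command line for one file.

    --skip-text tells OCRmyPDF not to OCR pages that already have text.
    """
    return ["ocrmypdf", "--skip-text", str(src), str(dst)]


def ocr_directory(root, suffix="_ocr"):
    """Run OCRmyPDF on every PDF under root (hypothetical naming scheme).

    Each result is written next to its source as <name>_ocr.pdf.
    """
    for src in find_pdfs(root):
        if src.stem.endswith(suffix):
            continue  # skip outputs produced by an earlier run
        dst = src.with_name(src.stem + suffix + ".pdf")
        result = subprocess.run(build_ocr_command(src, dst), check=False)
        if result.returncode != 0:
            print(f"OCRmyPDF failed for {src} (exit code {result.returncode})")
```

Writing to a separate output file keeps the original PDF intact, which may be preferable when the conversion is run over a large collection.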

Built With

Python version: 3.9

Getting Started

To get a local copy up and running follow these simple steps.

Prerequisites

Install OCRmyPDF
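Because the function shells out to OCRmyPDF, it can help to confirm that the `ocrmypdf` executable is on your `PATH` before running anything. The check below is a small sketch using only the Python standard library; the function name `ocrmypdf_available` is an assumption for this example and is not part of the project.

```python
import shutil


def ocrmypdf_available(executable="ocrmypdf"):
    """Return True if the given executable can be found on PATH."""
    return shutil.which(executable) is not None


if not ocrmypdf_available():
    print("OCRmyPDF was not found on PATH; install it before continuing.")
```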

Installation

In a terminal, clone the repo by running:

git clone https://github.com/greydongilmore/ocr-pdf.git

Change into the project directory (update path to reflect where you stored this project directory):
```
cd /home/user/Documents/Github/ocr-pdf
```

Install the required Python packages:

python -m pip install -r requirements.txt

Usage

In a terminal, move into the project directory:
```
cd /home/user/Documents/Github/ocr-pdf
```

Run the following to execute the script:

python main.py -i "full/path/to/PDF/storage/directory"

-i: full directory path to the PDF storage directory
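For readers curious how a `-i` option like this is typically handled, here is a minimal `argparse` sketch. It is an illustration only: the actual parsing in `main.py` may differ, and the long option `--input-dir` and the attribute name `input_dir` are assumptions introduced for this example.

```python
import argparse


def parse_args(argv=None):
    """Parse the command line; -i is the PDF storage directory."""
    parser = argparse.ArgumentParser(
        description="Convert non-searchable PDFs in a directory to searchable PDFs."
    )
    parser.add_argument(
        "-i",
        "--input-dir",  # hypothetical long form, not taken from main.py
        dest="input_dir",
        required=True,
        help="full directory path to the PDF storage directory",
    )
    return parser.parse_args(argv)
```

Calling `parse_args(["-i", "/path/to/pdfs"])` returns a namespace whose `input_dir` attribute holds the given path; omitting `-i` makes `argparse` print a usage message and exit.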

Contributing

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

1. Fork the Project
2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
3. Commit your Changes (git commit -m 'Add some AmazingFeature')
4. Push to the Branch (git push origin feature/AmazingFeature)
5. Open a Pull Request

License

Distributed under the MIT License. See LICENSE for more information.

Contact

Greydon Gilmore - @GilmoreGreydon - [email protected]

Project Link: https://github.com/greydongilmore/ocr-pdf

Acknowledgements

README format was adapted from Best-README-Template
