⚠️ Archived, moved to Codeberg: https://codeberg.org/DecaTec/OCRmyFiles ⚠️
Thus, this GitHub repository is outdated and no longer maintained on GitHub. Please update your references.
Bash script for adding a text layer to PDF files and converting images into PDFs (with OCR).
Adds an OCR text layer to all PDF files in the given input directory and saves the new PDF files to the output directory.
When the input directory also contains image files (e.g. jpg, png), these are converted to (OCR'ed) PDFs.
All other file types are just copied from the input directory to the output directory.
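The behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the actual contents of OCRmyFiles.sh: the function names `classify_file` and `process_dir`, the list of image extensions, and the `--image-dpi 300` fallback for images without DPI metadata are all assumptions made for this example.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the processing logic; not the real OCRmyFiles.sh.

# Print "pdf", "image" or "other" depending on the file extension.
classify_file() {
  local name ext
  name=$(basename -- "$1")
  ext="${name##*.}"
  # A name without a dot has no extension.
  if [ "$ext" = "$name" ]; then
    echo other
    return
  fi
  ext=$(printf '%s' "$ext" | tr '[:upper:]' '[:lower:]')
  case "$ext" in
    pdf)          echo pdf ;;
    jpg|jpeg|png) echo image ;;   # assumed set of image types
    *)            echo other ;;
  esac
}

# OCR PDFs, convert images to OCR'ed PDFs, copy everything else.
process_dir() {
  local in_dir=$1 out_dir=$2 file base
  mkdir -p -- "$out_dir"
  for file in "$in_dir"/*; do
    [ -f "$file" ] || continue
    base=$(basename -- "$file")
    case "$(classify_file "$file")" in
      pdf)   ocrmypdf "$file" "$out_dir/$base" ;;
      # --image-dpi is an assumed fallback for images lacking DPI metadata.
      image) ocrmypdf --image-dpi 300 "$file" "$out_dir/${base%.*}.pdf" ;;
      other) cp -- "$file" "$out_dir/$base" ;;
    esac
  done
}
```

With these functions loaded, `process_dir /path/to/input /path/to/output` would handle one directory (non-recursively) in the way the description outlines.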
- OCRmyPDF
  - For Debian 9/Ubuntu 16.10: apt-get install ocrmypdf
  - For other distros: https://ocrmypdf.readthedocs.io/en/latest/installation.html
- Tesseract
  - This is installed with OCRmyPDF automatically
- Tesseract language files
  - e.g. apt-get install tesseract-ocr-deu for German language
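To confirm that the requirements listed above are in place before running the script, a check along these lines may help. It is only a sketch: the helper names `has_command` and `has_tesseract_lang` are made up for this example, and `deu` stands in for whichever language you installed.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for checking the requirements; not part of OCRmyFiles.sh.

# Succeeds if the given command is available on the PATH.
has_command() {
  command -v "$1" >/dev/null 2>&1
}

# Succeeds if Tesseract reports the given language code (e.g. deu).
# Some Tesseract versions print the list on stderr, hence 2>&1.
has_tesseract_lang() {
  tesseract --list-langs 2>&1 | grep -qx -- "$1"
}
```

For example, `has_command ocrmypdf && has_tesseract_lang deu || echo "requirements missing"` prints a message if either check fails.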
- Download script or clone repository
- Make script executable
sudo chmod +x OCRmyFiles.sh
- Modify the script to fit your needs:
- Call the script:
  - OCRmyFiles.sh (no parameters): uses the default directories for input/output (as defined in the script itself)
  - OCRmyFiles.sh <inputDir> <outputDir>: uses the specified directories for input/output
- The script might print some warnings/errors from Tesseract. These can be ignored in most cases as the OCR text layer will be created anyway
- You can also call this script with a cronjob for automated processing of PDFs/images:
- As the user that should run the cronjob, call
crontab -e
- Add the following to run the script e.g. every 30 minutes:
*/30 * * * * /path/to/the/script/OCRmyFiles.sh > /dev/null 2>&1
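The two call forms described above (no parameters, or an input and an output directory) are commonly implemented with default values for the positional parameters. The following is a minimal sketch of that pattern, not the script's actual code: the variable names, the placeholder default paths, and the `resolve_dirs` function are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical defaults; the real values are defined inside OCRmyFiles.sh.
DEFAULT_INPUT_DIR="/path/to/input"
DEFAULT_OUTPUT_DIR="/path/to/output"

# Set INPUT_DIR and OUTPUT_DIR from the arguments, falling back to the defaults.
# Accepts either no arguments or exactly two.
resolve_dirs() {
  if [ "$#" -ne 0 ] && [ "$#" -ne 2 ]; then
    echo "Usage: OCRmyFiles.sh [<inputDir> <outputDir>]" >&2
    return 1
  fi
  INPUT_DIR="${1:-$DEFAULT_INPUT_DIR}"
  OUTPUT_DIR="${2:-$DEFAULT_OUTPUT_DIR}"
}
```

Called as `resolve_dirs "$@"` at the top of a script, this leaves `INPUT_DIR` and `OUTPUT_DIR` set either to the defaults or to the two directories given on the command line, which also suits the cronjob case where no parameters are passed.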