Commit

updated README.md and added required npm modules to the package.json file
nielswh committed Mar 10, 2018
1 parent 81af509 commit f706172
Showing 2 changed files with 28 additions and 14 deletions.
28 changes: 15 additions & 13 deletions README.md
@@ -1,29 +1,29 @@
# pdf-ocr
-node module that will do OCR on PDFs that do not contain text
+node module that will do OCR on PDFs that do not contain searchable text.

## Inspired from pdf-extract
[https://www.npmjs.com/package/pdf-extract] by Noah Isaacson. Many of the ideas initial design are from this project.

## Difference from pdf-extract
-- Uses ES6 javascript
-- using Promises instead of callbacks
-- allows for extracting only the first page of the PDF to OCR (primary reason this was written for my own selfish reason!)
-- does not do searchable PDFs. Primary reason was it was next to impossible to get poppler instead on Windows. In the future I may add this
-- if you need searchable PDFs, then I recommend using pdf-extract.
+- Uses ES6 javascript syntax.
+- Uses Promises instead of callbacks
+- Option to OCR just the first page of the PDF. (primary reason this was written for my own selfish reason!)
+- Currently does not OCR searchable PDFs. Plenty of options out there that does this.
+- If you need to OCR searchable PDFs, I recommend using pdf-extract instead.

## Installation

`npm install pdf-ocr --save`

-After the installed, the following binaries need to be on in the path in order for the module to work.
+After installing, the following binaries list below will need to be on your system as well as in the paths in your environment settings.
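
The PATH requirement in the README text above can be checked before the module is used. The following is a minimal sketch and is not part of this commit or of pdf-ocr itself: the helper names `findOnPath` and `missingBinaries` are hypothetical, and the binary names `pdftk`, `gs`, and `tesseract` are assumed from the tools the README lists (with Ghostscript renamed to `gs` on Windows, as the README instructs). It only checks that a file with the expected name exists in a PATH directory, not that it is executable or the right version.

```javascript
const fs = require('fs');
const path = require('path');

// Assumed names of the external tools the README asks you to install.
const REQUIRED = ['pdftk', 'gs', 'tesseract'];

// Return the full path of the first file called `name` (plus any of `exts`)
// found in the PATH-style string `pathEnv`, or null if there is none.
function findOnPath(name, pathEnv = process.env.PATH || '', exts = ['']) {
  const dirs = pathEnv.split(path.delimiter).filter(Boolean);
  for (const dir of dirs) {
    for (const ext of exts) {
      const candidate = path.join(dir, name + ext);
      try {
        if (fs.statSync(candidate).isFile()) return candidate;
      } catch (err) {
        // No such file in this directory; keep looking.
      }
    }
  }
  return null;
}

// Return the subset of `names` that could not be found on the PATH.
function missingBinaries(names = REQUIRED, pathEnv = process.env.PATH || '') {
  // On Windows, also try the usual executable extensions.
  const exts = process.platform === 'win32' ? ['', '.exe', '.cmd', '.bat'] : [''];
  return names.filter((name) => findOnPath(name, pathEnv, exts) === null);
}
```

Calling `missingBinaries()` with no arguments inspects the current environment; an empty array means all three names were found, which may help rule out a PATH problem when the tests hang or fail.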

### OSX

**pdftk**
[http://www.pdflabs.com/docs/install-pdftk/](http://www.pdflabs.com/docs/install-pdftk/)

-- If you are installing on Siera or High Siera, you'll need to make sure you grab pdftk_server-2.02-mac_osx-10.11-setup.pkg
-- Other versions, seemed to hang and not return any results.
+- If you're installing on OSX Sierra or High Sierra, you'll need to make sure you use pdftk_server-2.02-mac_osx-10.11-setup.pkg
+- Other versions, seemed to hang the process. If the tests fail, this could the main reason why.

**ghostscript**
``` bash
@@ -63,7 +63,8 @@ apt-get install ghostscript
apt-get install tesseract-ocr
```

-For the OCR to work, you need to have the tesseract-ocr binaries available on your path. If you only need to handle ASCII characters, the accuracy of the OCR process can be increased by limiting the tesseract output. To do this copy the *alphanumeric* file included with this pdf-extract module into the *tess-data* folder on your system. Also the eng.traineddata included with the standard tesseract-ocr package is out of date. This pdf-extract module provides an up-to-date version which you should copy into the appropriate location on your system
+For the OCR to work, you need to have the tesseract-ocr binaries available on your path. If you only need to handle ASCII characters, the accuracy of the OCR process can be increased by limiting the tesseract output. To do this copy the *alphanumeric* file included with this module into the *tess-data* folder on your system. Also the eng.traineddata included with the standard tesseract-ocr package is out of date. This module provides an up-to-date version which you should copy into the appropriate location on your system.
+
``` bash
cd <root of this module>
cp "./share/eng.traineddata" "/usr/share/tesseract-ocr/tessdata/eng.traineddata"
@@ -81,6 +82,7 @@ cp "./share/configs/alphanumeric" "/usr/share/tesseract-ocr/tessdata/configs/alp

- Rename the *gswin64c* to *gs*, and add the bin folder to your PATH.

-**tesseract** can be build, but you can also download an older version which seems to work fine.
-- Download at: https://sourceforge.net/projects/tesseract-ocr-alt/files/
-- Version I know works is *tesseract-ocr-setup-3.02.02.exe*
+**tesseract**
+- Download at: [https://sourceforge.net/projects/tesseract-ocr-alt/files/](https://sourceforge.net/projects/tesseract-ocr-alt/files/)
+
+- *tesseract-ocr-setup-3.02.02.exe* is a version that I know works.
14 changes: 13 additions & 1 deletion package.json
@@ -4,12 +4,24 @@
"description": "node module that can ocr pdfs that are not searchable",
"main": "index.js",
"scripts": {
-"test": "echo \"Error: no test specified\" && exit 1"
+"test": "node_modules/.bin/mocha --reporter spec"
},
"repository": {
"type": "git",
"url": "git+https://github.com/nielswh/pdf-ocr.git"
},
+"folders": "lib",
+"dependencies": {
+"async": "~0.1.22",
+"temp": "~0.8.3",
+"walk": "~2.2.1",
+"pathhash": "~1.0.0"
+},
+"devDependencies": {
+"eyespect": "~0.1.8",
+"should": "~1.2.1",
+"mocha": "~1.8.1"
+},
"keywords": [
"pdf",
"ocr"