diff --git a/README.md b/README.md index a8bb1e3..e05a2e8 100644 --- a/README.md +++ b/README.md @@ -4,25 +4,27 @@ node module that will do OCR on PDFs that do not contain searchable text. ## Inspired from pdf-extract [https://www.npmjs.com/package/pdf-extract] by Noah Isaacson. Many of the ideas initial design are from this project. -## Difference from pdf-extract -- Uses ES6 javascript syntax. -- Uses Promises instead of callbacks -- Option to OCR just the first page of the PDF. (primary reason this was written for my own selfish reason!) -- Currently does not OCR searchable PDFs. Plenty of options out there that does this. -- If you need to OCR searchable PDFs, I recommend using pdf-extract instead. +## Differences between pdf-ocr and pdf-extract +- pdf-extract instructions were outdated when it came to installing dependant binaries. I ran into a couple of pitfalls and wanted to make sure others did not if they used this version. +- Removed the instructions to update the trained data for tesseract, since version 3.05.01 was newer then the instructions on pdf-extract. +- Updated code to use ES6 javascript syntax. +- I Needed an option to OCR just the first page of the PDF. +- This version currently does not OCR searchable PDFs. Plenty of options out there that does this. +- If you need to OCR searchable PDFs, I recommend using pdf-extract instead. (However, use the instructions below to get the dependant binaries.) ## Installation `npm install pdf-ocr --save` -After installing, the following binaries list below will need to be on your system as well as in the paths in your environment settings. +After installing pdf-ocr, the following binaries listed below will need to be on your system, as well as in the paths in your environment settings. ### OSX **pdftk** -[http://www.pdflabs.com/docs/install-pdftk/](http://www.pdflabs.com/docs/install-pdftk/) +- Grab the package installer at [http://www.pdflabs.com/docs/install-pdftk/](http://www.pdflabs.com/docs/install-pdftk/) + +- If you're installing on OSX Sierra or High Sierra, you'll need to make sure you use the package installer pdftk_server-2.02-mac_osx-10.11-setup.pkg -- If you're installing on OSX Sierra or High Sierra, you'll need to make sure you use pdftk_server-2.02-mac_osx-10.11-setup.pkg - Other versions, seemed to hang the process. If the tests fail, this could the main reason why. **ghostscript** @@ -34,12 +36,10 @@ brew install gs `brew install tesseract` -After tesseract is installed you need to install the alphanumeric config and an updated trained data file +After tesseract is installed you need to install the alphanumeric config ``` bash cd -cp "./share/eng.traineddata" "/usr/local/Cellar/tesseract/3.02.02_3/share/tessdata/eng.traineddata" -cp "./share/dia.traineddata" "/usr/local/Cellar/tesseract/3.02.02_3/share/tessdata/dia.traineddata" -cp "./share/configs/alphanumeric" "/usr/local/Cellar/tesseract/3.02.02_3/share/tessdata/configs/alphanumeric" +cp "./share/configs/alphanumeric" "/usr/local/Cellar/tesseract/3.05.01/share/tessdata/configs/alphanumeric" ``` ### Ubuntu @@ -63,11 +63,10 @@ apt-get install ghostscript apt-get install tesseract-ocr ``` -For the OCR to work, you need to have the tesseract-ocr binaries available on your path. If you only need to handle ASCII characters, the accuracy of the OCR process can be increased by limiting the tesseract output. To do this copy the *alphanumeric* file included with this module into the *tess-data* folder on your system. Also the eng.traineddata included with the standard tesseract-ocr package is out of date. This module provides an up-to-date version which you should copy into the appropriate location on your system. +For the OCR to work, you need to have the tesseract-ocr binaries available on your path. If you only need to handle ASCII characters, the accuracy of the OCR process can be increased by limiting the tesseract output. To do this copy the *alphanumeric* file included with this module into the *tess-data* folder on your system. ``` bash cd -cp "./share/eng.traineddata" "/usr/share/tesseract-ocr/tessdata/eng.traineddata" cp "./share/configs/alphanumeric" "/usr/share/tesseract-ocr/tessdata/configs/alphanumeric" ``` diff --git a/index.js b/index.js new file mode 100644 index 0000000..b7cd4fc --- /dev/null +++ b/index.js @@ -0,0 +1,26 @@ +const path = require('path'); +const fs = require('fs'); +const PdfOcr = require('./lib'); + +module.exports = function (pdfPath, options, cb) { + options = options || {}; + + if (!pdfPath) { + return cb('you must supply a pdf path as the first parameter and options for the second parameter.'); + } + + options.scanFirstPageOnly = options.scanFirstPageOnly || false; + + const processor = new PdfOcr(options); + + fs.exists(pdfPath, (exists) => { + if (!exists) { + return cb('no file exists at the path specified.'); + } + + processor.process(pdfPath, options); + cb(); + }); + + return processor; +} diff --git a/lib/convertToTiff.js b/lib/convertToTiff.js new file mode 100644 index 0000000..f5f747c --- /dev/null +++ b/lib/convertToTiff.js @@ -0,0 +1,37 @@ +const temp = require('temp'); +const path = require('path'); +const exec = require('child_process').exec +const fs = require('fs'); +let pdfConvertQuality = 400; // default to density 400 for the convert command + +module.exports = (inputPath, quality, callback) => { + if (!callback || typeof callback !== 'function') { + callback = quality; // callback must be the second parameter + quality = undefined; // no option passed + } + + fs.exists(inputPath, (exists) => { + if (!exists) { + return callback(`error, no file exists at the path: ${inputPath}`); + } + + const outputPath = temp.path({ prefix: 'tif_output', suffix: '.tif' }); + + if (quality) { + if (typeof (quality) !== 'string' && typeof (quality) !== 'number') { + return callback(`error, pdf quality option must be a string, currently set as: ${typeof (quality)}`); + } + + pdfConvertQuality = quality; + } + + const cmd = 'gs -sDEVICE=tiffgray -r720x720 -g6120x7920 -sCompression=lzw -o "' + outputPath + '" "' + inputPath + '"'; + const child = exec(cmd, (err, stderr, stdout) => { + if (err) { + return callback(err); + } + + return callback(null, outputPath); + }); + }); +} diff --git a/lib/index.js b/lib/index.js new file mode 100644 index 0000000..5e8d6df --- /dev/null +++ b/lib/index.js @@ -0,0 +1,110 @@ +const events = require('events'); +const pathHash = require('pathhash'); +const fs = require('fs'); +const async = require('async'); +const splitPdf = require('./splitPdf.js'); +const convertToTiff = require('./convertToTiff.js'); +const ocr = require('./ocr.js'); + +module.exports = class PdfOcr extends events.EventEmitter { + constructor(options) { + super(); + this.scanOptions = options || this.scanOptions; + } + + process(pdfPath, options) { + const self = this; + const textPages = []; + let splitOutput; + + options = options || { scanFirstPageOnly : false }; + options.clean = options.clean || true; + + fs.exists(pdfPath, (exists) => { + if (!exists) { + var error = `file does not exist at path: ${pdfPath}`; + self.emit('error', { error, pdfPath }); + return; + } + + pathHash(pdfPath, (error, hash) => { + if (error) { + error = `error hashing file at path: ${pdfPath}. ${error}`; + self.emit('error', { error, pdfPath }); + return; + } + + splitPdf(pdfPath, this.scanOptions, (error, output) => { + if (error) { + self.emit('error', { error, pdfPath }); + return; + } + + if (!output) { + error = 'no files returned from split'; + self.emit('error', { error, pdfPath }); + return; + } + + self.emit('log', `finished splitting pages for file at path: ${pdfPath}`); + splitOutput = output; + + const pdfFiles = output.files; + + if (!pdfFiles || !pdfFiles.length) { + error = 'error, no pages where found in pdf document'; + self.emit('error', { error, pdfPath }); + return; + } + + let index = 0; + let numPages = pdfFiles.length; + const singlePagePdfFilePaths = []; + + async.forEachSeries(pdfFiles, (pdfFile, cb) => { + const quality = options.quality ? options.quality : 300; + + convertToTiff(pdfFile.filePath, quality, (err, tiffPath) => { + const zeroBasedNumPages = numPages - 1; + + self.emit('log', `converted page to tiff file, page ${index} of ${zeroBasedNumPages}`); + + if (err) { + return cb(err); + } + + const ocrFlags = options.ocrFlags ? options.ocrFlags : ['-psm 6']; + + ocr(tiffPath, ocrFlags, (err, text) => { + fs.unlink(tiffPath, (tiffError, reply) => { + if (tiffError) { + err += `, error removing tif file: ${tiffError}`; + } + + if (err) { + return cb(err); + } + + self.emit('log', `raw ocr: page ${index} of ${zeroBasedNumPages} complete`); + + singlePagePdfFilePaths.push(pdfFile.filePath); + self.emit('page', { hash, text, index, numPages, pdfPath, singlePagePdfPath: pdfFile.filePath }); + textPages.push(text); + index++; + cb(); + }); + }); + }); + }, (err) => { + if (err) { + self.emit('error', err); + return; + } + + self.emit('complete', { hash, textPages, pdfPath, singlePagePdfFilePaths }); + }); + }); + }); + }); + } +} diff --git a/lib/ocr.js b/lib/ocr.js new file mode 100644 index 0000000..ad73a93 --- /dev/null +++ b/lib/ocr.js @@ -0,0 +1,40 @@ +const temp = require('temp'); +const path = require('path'); +const exec = require('child_process').exec; +const fs = require('fs'); + +module.exports = function (inputPath, options, callback) { + if (!callback || typeof callback !== 'function') { + callback = options; + options = []; + } + + fs.exists(inputPath, (exists) => { + if (!exists) { + return callback(`error, no file exists at the path you specified: ${inputPath}`); + } + + const outputPath = temp.path({ prefix: 'ocr_output' }); + const cmd = 'tesseract "' + inputPath + '" "' + outputPath + '" ' + options.join(' '); + const child = exec(cmd, (err) => { + if (err) { + return callback(err); + } + + const textOutputPath = `${outputPath}.txt`; + fs.readFile(textOutputPath, 'utf8', (err, output) => { + if (err) { + return callback(err); + } + + fs.unlink(textOutputPath, (err) => { + if (err) { + return callback(err); + } + + callback(null, output); + }); + }); + }); + }); +} diff --git a/lib/splitPdf.js b/lib/splitPdf.js new file mode 100644 index 0000000..d9b9cf5 --- /dev/null +++ b/lib/splitPdf.js @@ -0,0 +1,116 @@ +var path = require('path'); +var temp = require('temp'); +var exec = require('child_process').exec; +var fs = require('fs'); +var walk = require('walk'); +var async = require('async'); + +module.exports = (pdfPath, options, callback) => { + + options.scanFirstPageOnly = options.scanFirstPageOnly || false; + + function fileExist(filePath, callback) { + fs.exists(filePath, (exists) => { + if (!exists) { + return callback(`no file at path: ${filePath}`); + } + + return callback(); + }); + } + + function deleteDocData(callback) { + const folder = path.join(__dirname, '..'); + const docDataPath = path.join(folder, 'doc_data.txt'); + + fs.exists(docDataPath, (exists) => { + if (!exists) { + return callback(); + } + + fs.unlink(docDataPath, callback); + }); + } + + function getPdfs(directoryPath, onlyFirstPage, callback) { + const filePaths = []; + const walker = walk.walk(directoryPath, { followLinks: false }); + let files = null; + + walker.on('file', (root, stat, next) => { + if (onlyFirstPage) { + if (stat.name.toLowerCase().endsWith('0001.pdf')) { + const filePath = path.join(directoryPath, stat.name); + filePaths.push({ filePath, fileName: stat.name }); + } + } else { + if (stat.name.match(/\.pdf$/i)) { + const filePath = path.join(directoryPath, stat.name); + filePaths.push({ filePath, fileName: stat.name }); + } + } + + next(); + }); + + walker.on('end', () => { + filePaths.sort((a, b) => { + if (a.fileName < b.fileName) { + return -1; + } + + if (a.fileName == b.fileName) { + return 0; + } + + return 1; + }); + + const output = { + folder: directoryPath, + files: filePaths + } + + callback(null, output); + + return; + }); + } + + fileExist(pdfPath, (err) => { + if (err) { + return callback(err); + } + + var outputDir = temp.path({}, 'pdfPages'); + + fs.mkdir(outputDir, (err) => { + if (err) { + return callback(err, null); + } + + const outputName = 'page%05d.pdf'; + const outputPath = path.join(outputDir, outputName); + const cmd = `pdftk "${pdfPath}" burst output "${outputPath}"`; + + const child = exec(cmd, (err, stdout, stderr) => { + if (err) { + const outputErr = { + message: 'an error occurred while splitting pdf into single pages with the pdftk burst command', + error: err + } + + return callback(outputErr, null); + } + + deleteDocData((err, reply) => { + if (err) { + return callback(err); + } + + return getPdfs(outputDir, options.scanFirstPageOnly, callback); + }); + }); + }); + }); +} diff --git a/share/.DS_Store b/share/.DS_Store new file mode 100644 index 0000000..1bbcd19 Binary files /dev/null and b/share/.DS_Store differ diff --git a/share/configs/alphanumeric b/share/configs/alphanumeric new file mode 100644 index 0000000..2f788d5 --- /dev/null +++ b/share/configs/alphanumeric @@ -0,0 +1 @@ +tessedit_char_whitelist !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~ diff --git a/test/.DS_Store b/test/.DS_Store new file mode 100644 index 0000000..2e20b27 Binary files /dev/null and b/test/.DS_Store differ diff --git a/test/01_command-test.js b/test/01_command-test.js new file mode 100644 index 0000000..57a942e --- /dev/null +++ b/test/01_command-test.js @@ -0,0 +1,28 @@ +const inspect = require('eyespect').inspector({ maxLength: 20000 }); +const should = require('should'); +const exec = require('child_process').exec; + +describe('01 Command Test', () => { + it('should have ghostscript (gs) binary on path', (done) => { + const cmd = 'which gs'; + const child = exec(cmd, (err, stdout, stderr) => { + should.not.exist(err, 'ghostscript not found. You will not be able to perform ocr and extract text from pdfs with scanned image. install GhostScript on your system'); + stderr.length.should.equal(0); + should.exist(stdout); + stdout.length.should.be.above(8); + done(); + }); + }); + + it('should have tesseract binary on path', (done) => { + const cmd = 'which tesseract'; + + const child = exec(cmd, (err, stdout, stderr) => { + should.not.exist(err, 'tesseract not found. You will not be able to perform ocr and extract from pdfs with scanned images.'); + stderr.length.should.equal(0); + should.exist(stdout); + stdout.length.should.be.above(8); + done(); + }); + }); +}); \ No newline at end of file diff --git a/test/02_split-test.js b/test/02_split-test.js new file mode 100644 index 0000000..9e4509c --- /dev/null +++ b/test/02_split-test.js @@ -0,0 +1,76 @@ +const inspect = require('eyespect').inspector({ maxLength: 20000 }); +const path = require('path'); +const should = require('should'); +const assert = require('assert') +const fs = require('fs'); +const async = require('async'); +const split = require('../lib/splitPdf.js'); + +describe('02 Split PDF File Test', () => { + it('should split multi-page pdf in single page pdf files', function (done) { + this.timeout(10 * 1000); + this.slow(2 * 1000); + + const fileName = 'multipage.pdf'; + const relativePath = path.join('testData', fileName); + const pdfPath = path.join(__dirname, relativePath); + const scanOptions = { scanFirstPageOnly: false }; + + split(pdfPath, scanOptions, (err, output) => { + should.not.exist(err); + should.exist(output); + output.should.have.property('folder'); + output.should.have.property('files'); + + const files = output.files; + + files.length.should.equal(2, `wrong number of pages after splitting pdf with name: ${fileName}`); + + async.forEach(files, (file, cb) => { + file.should.have.property('fileName'); + file.should.have.property('filePath'); + fs.exists(file.filePath, (exists) => { + assert.ok(exists, `file does not exist like it should at path: ${file.filePath}`); + cb(); + }); + }, (err) => { + should.not.exist(err); + done(); + }); + }); + }); + + it('should split single page pdf into a new single page pdf files', function (done) { + this.timeout(10 * 1000); + this.slow(2 * 1000); + + const fileName = 'single_page.pdf'; + const relativePath = path.join('testData', fileName); + const pdfPath = path.join(__dirname, relativePath); + const scanOptions = { scanFirstPageOnly: false }; + + split(pdfPath, scanOptions, (err, output) => { + should.not.exist(err); + should.exist(output); + output.should.have.property('folder'); + output.should.have.property('files'); + + const files = output.files; + + files.length.should.equal(1, `wrong number of pages after splitting searchable pdf with name: ${fileName}`); + + async.forEach(files, (file, cb) => { + file.should.have.property('fileName'); + file.should.have.property('filePath'); + + fs.exists(file.filePath, (exists) => { + assert.ok(exists, `file does not exist like it should at path: ${file.filePath}`); + cb(); + }); + }, (err) => { + should.not.exist(err); + done(); + }); + }); + }); +}); \ No newline at end of file diff --git a/test/03_ocr_multi_pages-test.js b/test/03_ocr_multi_pages-test.js new file mode 100644 index 0000000..b71faa9 --- /dev/null +++ b/test/03_ocr_multi_pages-test.js @@ -0,0 +1,68 @@ +const assert = require('assert'); +const inspect = require('eyespect').inspector({ maxLength: 20000 }); +const path = require('path'); +const should = require('should'); +const fs = require('fs'); +const async = require('async'); +const pathHash = require('pathhash'); +const pdf = require('../index.js'); + +describe('03 Multipage all pages OCR test', function () { + const fileName = 'multipage.pdf'; + const relativePath = path.join('testData', fileName); + const pdfPath = path.join(__dirname, relativePath); + + const options = { + clean: false, + scanFirstPageOnly: false + }; + + let hash; + + before((done) => { + pathHash(pdfPath, (err, reply) => { + should.not.exist(err, 'error getting sha1 hash of pdf file at path: ' + pdfPath + '. ' + err); + should.exist(reply, 'error getting sha1 hash of pdf file at path: ' + pdfPath + '. No hash returned from hashDataAtPath'); + hash = reply; + done(); + }); + }); + + it('should extract array of text pages from multipage raw scan pdf', function (done) { + this.timeout(240 * 1000); + this.slow(120 * 1000); + + const processor = pdf(pdfPath, options, (err) => should.not.exist(err)); + + processor.on('complete', (data) => { + data.should.have.property('textPages'); + data.should.have.property('pdfPath'); + data.should.have.property('singlePagePdfFilePaths'); + data.textPages.length.should.eql(2, `wrong number of pages after extracting from mulitpage searchable pdf with name: ${fileName}`); + + assert.ok(pageEventFired, 'page event not fired'); + async.forEach(data.singlePagePdfFilePaths, (filePath, cb) => { + fs.exists(filePath, (exists) => { + assert.ok(exists, `single page pdf file does not exist at the path: ${filePath}`); + cb(); + }); + }, (err) => { + should.not.exist(err, `error in raw processing: ${err}`); + done(); + }); + }); + + processor.on('log', (data) => inspect(data, 'log data')); + + var pageEventFired = false; + + processor.on('page', (data) => { + pageEventFired = true; + data.should.have.property('index'); + data.should.have.property('pdfPath'); + data.should.have.property('text'); + data.pdfPath.should.eql(pdfPath); + data.text.length.should.above(0); + }); + }); +}); diff --git a/test/04_ocr_first_page-test.js b/test/04_ocr_first_page-test.js new file mode 100644 index 0000000..c0ada0c --- /dev/null +++ b/test/04_ocr_first_page-test.js @@ -0,0 +1,70 @@ +const assert = require('assert'); +const inspect = require('eyespect').inspector({ maxLength: 20000 }); +const path = require('path'); +const should = require('should'); +const fs = require('fs'); +const async = require('async'); +const pathHash = require('pathhash'); +const pdf = require('../index.js'); + +describe('04 Multipage scan only 1st Page test', () => { + const fileName = 'multipage.pdf'; + const relativePath = path.join('testData', fileName); + const pdfPath = path.join(__dirname, relativePath); + + const options = { + clean: false, + scanFirstPageOnly: true + }; + + let hash; + + before((done) => { + pathHash(pdfPath, (err, reply) => { + should.not.exist(err, `error getting sha1 hash of pdf file at path: ${pdfPath}. ${err}`); + should.exist(reply, `error getting sha1 hash of pdf file at path: ${pdfPath}. No hash returned from hashDataAtPath`); + hash = reply; + done(); + }); + }); + + it('should extract array of text pages from multipage raw scan pdf', function (done) { + this.timeout(240 * 1000); + this.slow(120 * 1000); + + const processor = pdf(pdfPath, options, (err) => should.not.exist(err)); + + processor.on('complete', (data) => { + data.should.have.property('textPages'); + data.should.have.property('pdfPath'); + data.should.have.property('singlePagePdfFilePaths'); + data.textPages.length.should.eql(1, `wrong # of pages after extracting from mulitpage pdf with name: ${fileName}`); + + assert.ok(pageEventFired, 'no page event fired'); + async.forEach( + data.singlePagePdfFilePaths, (filePath, cb) => { + fs.exists(filePath, (exists) => { + assert.ok(exists, `no single page pdf file exists at the path: ${filePath}`); + cb(); + }); + }, (err) => { + should.not.exist(err, `error in raw processing: ${err}`); + done(); + } + ); + }); + + processor.on('log', (data) => inspect(data, 'log data')); + + let pageEventFired = false; + + processor.on('page', (data) => { + pageEventFired = true; + data.should.have.property('index'); + data.should.have.property('pdfPath'); + data.should.have.property('text'); + data.pdfPath.should.eql(pdfPath); + data.text.length.should.above(0); + }); + }); +}); diff --git a/test/05_convert-to-tiff-test.js b/test/05_convert-to-tiff-test.js new file mode 100644 index 0000000..b9ce249 --- /dev/null +++ b/test/05_convert-to-tiff-test.js @@ -0,0 +1,32 @@ +const inspect = require('eyespect').inspector({ maxLength: 20000 }); +const path = require('path'); +const should = require('should'); +const assert = require('assert'); +const fs = require('fs'); +const async = require('async'); +const convert = require('../lib/convertToTiff.js'); + +describe('05 Convert Test', function () { + + it('should convert raw single page pdf to tif file', function (done) { + this.timeout(10 * 1000); + + const fileName = 'single_page.pdf'; + const relativePath = path.join('testData', fileName); + const pdfPath = path.join(__dirname, relativePath); + + fs.exists(pdfPath, (exists) => { + assert.ok(exists, `file does not exist at path: ${pdfPath}`); + + convert(pdfPath, (err, tiffPath) => { + should.not.exist(err); + should.exist(tiffPath); + + fs.exists(tiffPath, (exists) => { + assert.ok(exists, `tiff file does not exist at path: ${tiffPath}`); + done(); + }); + }); + }); + }); +}); diff --git a/test/06_ocr-tiff-test.js b/test/06_ocr-tiff-test.js new file mode 100644 index 0000000..41f072e --- /dev/null +++ b/test/06_ocr-tiff-test.js @@ -0,0 +1,28 @@ +const inspect = require('eyespect').inspector({ maxLength: 20000 }); +const path = require('path'); +const should = require('should'); +const assert = require('assert'); +const fs = require('fs'); +const ocr = require('../lib/ocr.js'); + +describe('06 OCR Test', function () { + it('should extract text from tiff file via tesseract ocr', function (done) { + this.timeout(100 * 1000); + this.slow(20 * 1000); + + const fileName = 'single_page.tif'; + const relativePath = path.join('testData', fileName); + const tiffPath = path.join(__dirname, relativePath); + + fs.exists(tiffPath, function (exists) { + assert.ok(exists, `tiff file does not exist at path: ${tiffPath}`); + + ocr(tiffPath, (err, extract) => { + should.not.exist(err); + should.exist(extract); + extract.length.should.be.above(20, 'wrong ocr output'); + done(); + }); + }); + }); +}); diff --git a/test/testData/multipage.pdf b/test/testData/multipage.pdf new file mode 100644 index 0000000..dc51046 Binary files /dev/null and b/test/testData/multipage.pdf differ diff --git a/test/testData/single_page.pdf b/test/testData/single_page.pdf new file mode 100644 index 0000000..a0915ac Binary files /dev/null and b/test/testData/single_page.pdf differ diff --git a/test/testData/single_page.tif b/test/testData/single_page.tif new file mode 100644 index 0000000..079fc9d Binary files /dev/null and b/test/testData/single_page.tif differ