Proof of concept for extracting CSV data from image-based PDFs using open source tools

Theni-N-Lingeswaran/pdf-to-csv-ruby


Parsing tables from image-based PDFs with open source tools

To install the dependencies (macOS, using Homebrew)

$ brew install poppler
$ brew install tesseract --HEAD
$ brew install imagemagick --with-fftw
$ brew install gocr --with-lib --with-netpbm

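To run (extract the embedded images from the PDF, copy the one containing the table, then run the OCR script on it)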

$ pdfimages -png aviva_plc_annual_return_2014.pdf /tmp/out
$ cp /tmp/out-037.png .
$ ruby ocr.rb out-037.png
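
The OCR step can be illustrated with a minimal sketch. This is not the repository's `ocr.rb`, whose contents are not shown here; it is one possible approach, assuming a Tesseract version that supports TSV output (`tesseract IMAGE stdout tsv`). The names `tsv_to_rows`, `image_to_csv`, and `DEFAULT_GAP` are hypothetical. Words that Tesseract places on the same text line are sorted left to right and split into cells wherever the horizontal gap between them exceeds a threshold.

```ruby
require "csv"
require "open3"

# Horizontal gap (in pixels) above which two neighbouring words are treated
# as separate table cells. Hypothetical default; tune it for each scan.
DEFAULT_GAP = 40

# Convert Tesseract TSV output into an array of rows (arrays of cell strings).
# Expected columns: level, page_num, block_num, par_num, line_num, word_num,
# left, top, width, height, conf, text.
def tsv_to_rows(tsv, gap: DEFAULT_GAP)
  lines = Hash.new { |hash, key| hash[key] = [] }

  tsv.each_line.drop(1).each do |line|          # drop(1) skips the header row
    fields = line.chomp.split("\t", -1)
    next if fields.length < 12
    text = fields[11].strip
    next if text.empty?                         # page/block/line records have no text

    key = fields[1..4].map(&:to_i)              # page, block, paragraph, line
    lines[key] << { left: fields[6].to_i, width: fields[8].to_i, text: text }
  end

  lines.sort_by { |key, _words| key }.map do |_key, words|
    cells = []
    prev_right = nil
    words.sort_by { |word| word[:left] }.each do |word|
      if prev_right.nil? || word[:left] - prev_right > gap
        cells << word[:text].dup                # large gap: start a new cell
      else
        cells.last << " " << word[:text]        # small gap: same cell
      end
      prev_right = word[:left] + word[:width]
    end
    cells
  end
end

# Run Tesseract on an image and return the recognised table as a CSV string.
def image_to_csv(image_path)
  tsv, status = Open3.capture2("tesseract", image_path, "stdout", "tsv")
  raise "tesseract failed on #{image_path}" unless status.success?

  CSV.generate { |csv| tsv_to_rows(tsv).each { |row| csv << row } }
end

puts image_to_csv(ARGV[0]) if __FILE__ == $PROGRAM_NAME && ARGV[0]
```

Saved under a hypothetical name such as `tsv_to_csv.rb`, this could be run as `ruby tsv_to_csv.rb out-037.png > out-037.csv`. Grouping by Tesseract's own line numbers is a simplification: on some scans Tesseract places each table column in a separate block, in which case grouping words by their `top` coordinate may work better.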
