Tabula helps you liberate data tables trapped inside PDF files.
© 2012-2013 Manuel Aristarán. Available under MIT License. See AUTHORS.md
and LICENSE.md
.
If you’ve ever tried to do anything with data provided to you in PDFs, you know how painful this is — you can’t easily copy-and-paste rows of data out of PDF files. Tabula allows you to extract that data in CSV format, through a simple web interface:
{TODO: screenshot / screencast here}
Caveat: Tabula only works on text-based PDFs, not scanned documents.
An Amazon EC2 AMI image is provided to give you a chance to boot up a quick test server: ami-e895f081
You can find a simple how-to in docs/ami-install.md
.
Note the EC2 instance types and EC2 pricing. We’re not responsible for any costs this may incur.
Also, please note that this image is a development demo image and may not be secure. Using this AMI for mission-critical or sensitive documents is currently not recommended.
(Note: A comprehensive, mostly copy-and-paste set of instructions is available for
OS X users that normally don't do Ruby development but are interested bootstrapping
Tabula on their own computer: docs/osx-simple-bootstrap.md
)
-
Install Ruby and JRuby. Tabula has been tested with Ruby 1.9.3 and JRuby 1.7.3. Use of a Ruby version manager is recommended. Both rbenv and RVM are fine choices. (JRuby is required to interface with
pdfbox
, but native Ruby must also be used sinceruby-opencv
is a natively compiled extension.)If using rbenv:
rbenv install 1.9.3-p392 rbenv install jruby-1.7.3
If using rvm:
rvm install 1.9.3-p392 rvm install jruby-1.7.3
-
(Mac OS X only) Download and install XQuartz: https://xquartz.macosforge.org/landing/
-
Install the rest of the dependencies: (TODO: instructions for non-OSX platforms.)
# Install Python, setuptools, and pip. You can skip this # if you already have them. brew install python curl http://python-distribute.org/distribute_setup.py | python curl https://raw.github.com/pypa/pip/master/contrib/get-pip.py | python # Install numpy (feel free to put it in a virtualenv); opencv dependency pip install numpy # Add the "science" tap to Homebrew so it can find OpenCV (if you haven't already) brew tap homebrew/science brew install opencv --with-tbb --with-opencl --with-qt brew install mupdf redis
-
Download Tabula and install the Ruby dependencies. (Note: ensure that
rbenv
is configured for the standard Ruby interpreter, not JRuby)git clone git://github.com/jazzido/tabula.git cd tabula gem install bundler bundle install
-
Configure Tabula: Copy
local_settings-example.rb
tolocal_settings.rb
. Editlocal_settings.rb
and setJRUBY_PATH
to the path to thejruby
executable.If you are using rbenv, you can find the path to
jruby
by doing:RBENV_VERSION='jruby-1.7.3' rbenv which jruby
Start redis-server
in a separate terminal tab
redis-server /usr/local/etc/redis.conf
Next, you need to start resque
and the actual web server. You can run both
of those using Foreman by running the
following:
bundle exec foreman start
The site instance should now be viewable at http://127.0.0.1:9292/
Interested in helping out? See TODO.md
for ideas.