Tabula

Tabula helps you liberate data tables trapped inside PDF files.

Why Tabula?

If you’ve ever tried to do anything with data provided to you in PDFs, you know how painful this is — you can’t easily copy-and-paste rows of data out of PDF files. Tabula allows you to extract that data in CSV format, through a simple web interface:

{TODO: screenshot / screencast here}

Caveat: Tabula only works on text-based PDFs, not scanned documents.

Amazon EC2 AMI

An Amazon EC2 AMI image is provided to give you a chance to boot up a quick test server: ami-e895f081

You can find a simple how-to in docs/ami-install.md.

Caveats

Note the EC2 instance types and EC2 pricing. We’re not responsible for any costs this may incur.

Also, please note that this image is a development demo image and may not be secure. Using this AMI for mission-critical or sensitive documents is currently not recommended.

Manual Installation (OS X or Linux)

(Note: A comprehensive, mostly copy-and-paste set of instructions is available for OS X users that normally don't do Ruby development but are interested bootstrapping Tabula on their own computer: docs/osx-simple-bootstrap.md)

Install Ruby and JRuby. Tabula has been tested with Ruby 1.9.3 and JRuby 1.7.3. Use of a Ruby version manager is recommended. Both rbenv and RVM are fine choices. (JRuby is required to interface with pdfbox, but native Ruby must also be used since ruby-opencv is a natively compiled extension.)

If using rbenv:
```
rbenv install 1.9.3-p392
rbenv install jruby-1.7.3
```
If using rvm:
```
rvm install 1.9.3-p392
rvm install jruby-1.7.3
```
(Mac OS X only) Download and install XQuartz: https://xquartz.macosforge.org/landing/

Install the rest of the dependencies: (TODO: instructions for non-OSX platforms.)

# Install Python, setuptools, and pip.  You can skip this
# if you already have them.
brew install python
curl http://python-distribute.org/distribute_setup.py | python
curl https://raw.github.com/pypa/pip/master/contrib/get-pip.py | python

# Install numpy (feel free to put it in a virtualenv); opencv dependency
pip install numpy

# Add the "science" tap to Homebrew so it can find OpenCV (if you haven't already)
brew tap homebrew/science

brew install opencv --with-tbb --with-opencl --with-qt
brew install mupdf redis

Download Tabula and install the Ruby dependencies. (Note: ensure that rbenv is configured for the standard Ruby interpreter, not JRuby)
```
git clone git://github.com/jazzido/tabula.git
cd tabula

gem install bundler
bundle install
```
Configure Tabula: Copy local_settings-example.rb to local_settings.rb. Edit local_settings.rb and set JRUBY_PATH to the path to the jruby executable.

If you are using rbenv, you can find the path to jruby by doing:
```
RBENV_VERSION='jruby-1.7.3' rbenv which jruby
```

Starting the Server (Dev)

Start redis-server in a separate terminal tab

redis-server /usr/local/etc/redis.conf

Next, you need to start resque and the actual web server. You can run both of those using Foreman by running the following:

bundle exec foreman start

The site instance should now be viewable at http://127.0.0.1:9292/

Contributing

Interested in helping out? See TODO.md for ideas.

Name		Name	Last commit message	Last commit date
Latest commit History 200 Commits
docs		docs
lib		lib
static		static
tabula_extractor		tabula_extractor
test		test
utils		utils
views		views
.env		.env
.gems		.gems
.gitignore		.gitignore
AUTHORS.md		AUTHORS.md
Gemfile		Gemfile
Gemfile.lock		Gemfile.lock
LICENSE.md		LICENSE.md
NOTICE.txt		NOTICE.txt
Procfile		Procfile
README.md		README.md
Rakefile		Rakefile
TODO.md		TODO.md
config.ru		config.ru
local_settings-example.rb		local_settings-example.rb
tabula_debug.rb		tabula_debug.rb
tabula_web.rb		tabula_web.rb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Tabula

Why Tabula?

Amazon EC2 AMI

Caveats

Manual Installation (OS X or Linux)

Starting the Server (Dev)

Contributing

About

Releases

Packages

Languages

License

jmccarthy/tabula

Folders and files

Latest commit

History

Repository files navigation

Tabula

Why Tabula?

Amazon EC2 AMI

Caveats

Manual Installation (OS X or Linux)

Starting the Server (Dev)

Contributing

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages