- Bug reports and user interface feedback are always appreciated.
- Front-end
- Preview pane: Replace the modal with an always-visible data pane.
- Back-end
- Port ColumnGuesser.java to JRuby, or port jruby_dump_characters to Java. Either change would save significant startup time, because the JVM would no longer have to be started repeatedly. Alternatively, implement Nailgun.
- Ability to apply a lasso to all subsequent pages in a document.
- Modify imgAreaSelect (or replace it with another JS library) to allow multiple selections. (The data from all selections should probably be concatenated, then made available for viewing/download/clipboard.)
- Save a lasso (or set of lassos) for repeated use. Use case: I need to process a document that's published in the same format each month. It'd be quicker for me to set the lasso once and rerun it automatically than to have to set the lasso each month de novo.
- Get rid of XML representation of PDF files.
- Fork Tabula
- Create a topic branch: `git checkout -b my_branch`
- Push to your branch: `git push origin my_branch`
- Create a Pull Request from your branch
We want to be extra careful about changes in the table extractor, `lib/tabula.rb`. It is a highly heuristic process and it can regress easily.
If you're making changes to the table extraction code, please consider adding tests to `test/test_table_analyze.rb`.