What are the changes vs the mainstream version #2
Comments
Please check the 2.4.1-nlm branch here: https://github.com/nlmatics/nlm-tika/tree/2.4.1-nlm. The list of changed files is in the NOTICE and README. The changed files fall into two groups:
The first group adds font and coordinate information to every text element. It also removes watermarks.
The second group adds lines and rectangles that can potentially help with table detection.
To see the impact of these changes, see the first part of the notebook here: https://github.com/nlmatics/nlm-ingestor/blob/main/notebooks/pdf_visual_ingestor_step_by_step.ipynb
Some ideas for future work:
These changes are for PDF only. Tika already gives decent output for other formats like DOCX.
Thanks, that is pretty clear. Maybe there is a way to replace only the PDF parser, either with an external parser (https://tika.apache.org/2.8.0/api/org/apache/tika/parser/external/ExternalParser) or through specific customization.
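As a rough sketch of what the "specific customization" route could look like: Tika's `tika-config.xml` lets you keep `DefaultParser` while excluding the stock `org.apache.tika.parser.pdf.PDFParser`, then register a different parser for `application/pdf`. The Python snippet below only generates such a config file. The class name `com.example.CustomPdfParser` is a hypothetical placeholder, not a class from nlm-tika, and whether the patched parser can be dropped in this way is an assumption that has not been verified against that branch.

```python
import xml.etree.ElementTree as ET

# Hypothetical class name for a replacement PDF parser (placeholder only,
# not taken from nlm-tika).
CUSTOM_PDF_PARSER = "com.example.CustomPdfParser"
STOCK_PDF_PARSER = "org.apache.tika.parser.pdf.PDFParser"
DEFAULT_PARSER = "org.apache.tika.parser.DefaultParser"


def build_tika_config(custom_parser: str = CUSTOM_PDF_PARSER) -> str:
    """Return tika-config.xml text that swaps out only the PDF parser."""
    properties = ET.Element("properties")
    parsers = ET.SubElement(properties, "parsers")

    # Keep every default parser except the stock PDF one.
    default = ET.SubElement(parsers, "parser", {"class": DEFAULT_PARSER})
    ET.SubElement(default, "parser-exclude", {"class": STOCK_PDF_PARSER})

    # Route PDF documents to the replacement parser instead.
    custom = ET.SubElement(parsers, "parser", {"class": custom_parser})
    ET.SubElement(custom, "mime").text = "application/pdf"

    # Pretty-print when available (ET.indent was added in Python 3.9).
    if hasattr(ET, "indent"):
        ET.indent(properties)
    return ET.tostring(properties, encoding="unicode")


if __name__ == "__main__":
    print(build_tika_config())
```

The generated file would then be handed to Tika through its config option, with the jar containing the replacement parser on the classpath. All other formats (DOCX, TXT, and so on) would still go through the vanilla parsers.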
Could you explain why you had to patch Tika and, at a high level, what you changed?
I am already using the vanilla Tika version, and I was wondering whether I could use yours instead (for some of my paperless work) and whether there would be any drawbacks.
Did you apply changes only to certain file types (how TXT, PDF, ... are imported)? The project is quite far behind the main one, and I am wondering why.