What are the changes vs the mainstream version #2

shuther · 2024-01-24T09:06:13Z

Could you explain, at a high level, why you had to patch Tika and what you changed?
I am already using the vanilla Tika version and was wondering whether I could use yours instead (for some of my paperless work), and whether there would be any drawback.
Did you apply changes only for certain file types (how txt, pdf, ... are imported)? The project is quite far behind the main one, and I am wondering why.

ansukla · 2024-01-24T14:01:18Z

Please check the 2.4.1-nlm branch here: https://github.com/nlmatics/nlm-tika/tree/2.4.1-nlm. The list of changed files is in the NOTICE and README.

The following files are changed:

The above adds font and coordinate information to every text element. It also removes watermarks.

https://github.com/nlmatics/nlm-tika/blob/2.4.1-nlm/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/GraphicsStreamProcessor.java

The above adds lines and rectangles, which can potentially help with table detection.
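To illustrate why ruled lines can help with table detection, here is a minimal sketch in Python. It is not the nlm-ingestor algorithm: the `Segment` tuple layout, the `grid_cells` helper, and the tolerance value are all assumptions made for this example, and the sketch ignores line extents (it treats every line as if it spanned the whole table).

```python
from typing import List, Tuple

# Hypothetical representation of a ruled line: (x0, y0, x1, y1) in page units.
Segment = Tuple[float, float, float, float]
Cell = Tuple[float, float, float, float]


def _cluster(values: List[float], tol: float) -> List[float]:
    """Collapse nearly equal coordinates into one sorted representative each."""
    out: List[float] = []
    for v in sorted(values):
        if not out or v - out[-1] > tol:
            out.append(v)
    return out


def grid_cells(segments: List[Segment], tol: float = 1.0) -> List[Cell]:
    """Return the cells (x0, y0, x1, y1) of the grid implied by ruled lines.

    Simplification: only the position of each line is used, not its length.
    """
    # A segment is vertical if its x coordinates match, horizontal if its y's do.
    xs = _cluster([s[0] for s in segments if abs(s[0] - s[2]) <= tol], tol)
    ys = _cluster([s[1] for s in segments if abs(s[1] - s[3]) <= tol], tol)
    # Each pair of neighbouring rows and columns bounds one candidate cell.
    return [
        (x0, y0, x1, y1)
        for y0, y1 in zip(ys, ys[1:])
        for x0, x1 in zip(xs, xs[1:])
    ]
```

Text elements could then be assigned to cells by checking which cell contains their coordinates, which is where the per-element positions mentioned earlier would come in.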

To see the impact of these changes, see the first part of the notebook here: https://github.com/nlmatics/nlm-ingestor/blob/main/notebooks/pdf_visual_ingestor_step_by_step.ipynb

Some ideas for future work:

Make the changes independent of Tika by writing our own wrapper over PDFBox
Upgrade to the latest version of Tika
Clean up the format of the returned HTML to make it more CSS friendly

These changes are for PDF only. Tika already gives decent output for other formats such as DOCX.

shuther · 2024-02-16T15:13:22Z

Thanks, that is pretty clear. Maybe there is a way to replace only the PDF parser, either by using an external parser (https://tika.apache.org/2.8.0/api/org/apache/tika/parser/external/ExternalParser) or through specific customization.
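As a rough sketch of the customization route, the Python snippet below generates a `tika-config.xml` that excludes the stock `PDFParser` from `DefaultParser` and registers another parser class for `application/pdf`. The `<parser-exclude>` and `<mime>` elements are standard Tika configuration; the class name `com.example.CustomPdfParser` and the helper `build_tika_config` are hypothetical, and I have not tried this against nlm-tika.

```python
import xml.etree.ElementTree as ET


def build_tika_config(custom_parser_class: str) -> str:
    """Build a tika-config.xml string that swaps out only the PDF parser."""
    props = ET.Element("properties")
    parsers = ET.SubElement(props, "parsers")

    # Keep all default parsers except the stock PDF parser.
    default = ET.SubElement(
        parsers, "parser", {"class": "org.apache.tika.parser.DefaultParser"}
    )
    ET.SubElement(
        default, "parser-exclude", {"class": "org.apache.tika.parser.pdf.PDFParser"}
    )

    # Route PDFs to the replacement parser (hypothetical class name).
    custom = ET.SubElement(parsers, "parser", {"class": custom_parser_class})
    ET.SubElement(custom, "mime").text = "application/pdf"

    return ET.tostring(props, encoding="unicode")


if __name__ == "__main__":
    print(build_tika_config("com.example.CustomPdfParser"))
```

The replacement class would still have to be on Tika's classpath, so this only helps if the patched PDF code can be packaged separately from the rest of the fork.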
