What are the changes vs the mainstream version #2

Open
shuther opened this issue Jan 24, 2024 · 2 comments

shuther commented Jan 24, 2024

Could you explain, at a high level, why you had to patch Tika and what you changed?
I currently use the vanilla Tika version and was wondering whether I could use yours instead (for some of my paperless work), and whether there would be any drawbacks.
Did you change the handling of only some file types (txt, pdf, ...)? This project is quite far behind the main one, so I am wondering about the reason.

ansukla (Member) commented Jan 24, 2024

Please check the 2.4.1-nlm branch here: https://github.com/nlmatics/nlm-tika/tree/2.4.1-nlm. The list of changed files is in the NOTICE and README.

The following files are changed:

  1. https://github.com/nlmatics/nlm-tika/blob/2.4.1-nlm/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java
  2. https://github.com/nlmatics/nlm-tika/blob/2.4.1-nlm/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/AbstractPDF2XHTML.java

The changes above add font and coordinate information to every text element. They also remove watermarks.
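To illustrate how per-element font and coordinate information might be consumed downstream, here is a minimal Python sketch. The sample markup, the `style` keys (`top`, `left`, `font-family`, `font-size`), and all helper names are assumptions made for illustration only; the actual output of the patched parser may be laid out differently, so check the notebook or the real output before relying on any of these names.

```python
from html.parser import HTMLParser

# Hypothetical sample: the real attribute names and layout emitted by
# the patched parser may differ from this.
SAMPLE = (
    '<div class="page">'
    '<p style="top:72.0px;left:90.5px;font-family:Times-Bold;font-size:14.0px">Introduction</p>'
    '<p style="top:96.0px;left:90.5px;font-family:Times-Roman;font-size:10.0px">Body text.</p>'
    '</div>'
)


def parse_style(style):
    """Turn 'a:b;c:d' into {'a': 'b', 'c': 'd'}."""
    pairs = (item.split(":", 1) for item in style.split(";") if ":" in item)
    return {key.strip(): value.strip() for key, value in pairs}


def px(value):
    """Convert a value such as '14.0px' to a float."""
    return float(value[:-2] if value.endswith("px") else value)


class TextElementCollector(HTMLParser):
    """Collects the text and parsed style of every <p> element."""

    def __init__(self):
        super().__init__()
        self.elements = []
        self._current = None

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            style = dict(attrs).get("style") or ""
            self._current = {"style": parse_style(style), "text": ""}

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "p" and self._current is not None:
            self.elements.append(self._current)
            self._current = None


def collect(xhtml):
    """Return a list of {'style': {...}, 'text': ...} dicts, one per <p>."""
    collector = TextElementCollector()
    collector.feed(xhtml)
    collector.close()
    return collector.elements


elements = collect(SAMPLE)
# Treat text larger than an arbitrary 12px threshold as a heading candidate.
headings = [e["text"] for e in elements if px(e["style"]["font-size"]) > 12]
```

With this kind of information, a consumer can separate headings from body text by font size and order elements by their `top` and `left` positions, which plain text extraction does not allow.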

  1. https://github.com/nlmatics/nlm-tika/blob/2.4.1-nlm/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/GraphicsStreamProcessor.java

The change above adds lines and rectangles, which can potentially help with table detection.
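As a rough illustration of why drawn lines help with table detection, the Python sketch below derives a cell grid from horizontal and vertical line segments. The tuple layouts, the function names, and the tolerance are assumptions for this example and are not the format or the method used by nlm-tika or nlm-ingestor. It also assumes every line spans the full table, which real PDFs often violate.

```python
def cluster(values, tol=2.0):
    """Merge sorted coordinates lying within `tol` of the previous one.

    Returns the mean of each group, so a doubled border collapses to one value.
    """
    groups = []
    for value in sorted(values):
        if groups and value - groups[-1][-1] <= tol:
            groups[-1].append(value)
        else:
            groups.append([value])
    return [sum(group) / len(group) for group in groups]


def grid_cells(h_lines, v_lines, tol=2.0):
    """Build (left, top, right, bottom) cells from line segments.

    h_lines: iterable of (x0, y, x1) horizontal segments (assumed layout).
    v_lines: iterable of (x, y0, y1) vertical segments (assumed layout).
    Line extents are ignored: only the y of horizontal lines and the x of
    vertical lines are used.
    """
    ys = cluster([y for _, y, _ in h_lines], tol)
    xs = cluster([x for x, _, _ in v_lines], tol)
    return [
        (xs[i], ys[j], xs[i + 1], ys[j + 1])
        for j in range(len(ys) - 1)
        for i in range(len(xs) - 1)
    ]


# Two rows and two columns; the top border is drawn twice, one unit apart.
H_LINES = [(50, 100, 250), (50, 101, 250), (50, 120, 250), (50, 140, 250)]
V_LINES = [(50, 100, 140), (150, 100, 140), (250, 100, 140)]
cells = grid_cells(H_LINES, V_LINES)
```

Text elements whose coordinates fall inside a cell rectangle can then be assigned to that cell, which is much harder when only the text stream is available.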

To see the impact of these changes, see the first part of the notebook here: https://github.com/nlmatics/nlm-ingestor/blob/main/notebooks/pdf_visual_ingestor_step_by_step.ipynb

Some ideas for future work:

  1. Make the changes independent of Tika by writing our own wrapper over PDFBox
  2. Upgrade to the latest version of Tika
  3. Clean up the format of the returned HTML to make it more CSS-friendly

These changes are for PDF only. Tika already produces decent output for other formats such as DOCX.

shuther (Author) commented Feb 16, 2024

Thanks, that is pretty clear. Maybe there is a way to replace only the PDF parser, either with an external parser (https://tika.apache.org/2.8.0/api/org/apache/tika/parser/external/ExternalParser) or through specific customization.
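One possible shape for the "specific customization" route is a Tika configuration file that excludes the stock PDF parser from `DefaultParser` and registers a replacement for `application/pdf`. The Python sketch below only generates such a file. The replacement class name `com.example.MyPdfParser` is hypothetical, and whether the patched parser can actually be dropped into a vanilla Tika this way is untested here.

```python
import xml.etree.ElementTree as ET

STOCK_PDF_PARSER = "org.apache.tika.parser.pdf.PDFParser"


def build_tika_config(replacement_class):
    """Return tika-config XML that swaps the PDF parser for `replacement_class`."""
    properties = ET.Element("properties")
    parsers = ET.SubElement(properties, "parsers")

    # Keep every default parser except the stock PDF parser.
    default = ET.SubElement(
        parsers, "parser", {"class": "org.apache.tika.parser.DefaultParser"}
    )
    ET.SubElement(default, "parser-exclude", {"class": STOCK_PDF_PARSER})

    # Route PDFs to the replacement parser instead.
    custom = ET.SubElement(parsers, "parser", {"class": replacement_class})
    ET.SubElement(custom, "mime").text = "application/pdf"

    ET.indent(properties)  # requires Python 3.9+
    return ET.tostring(properties, encoding="unicode")


# "com.example.MyPdfParser" is a placeholder, not a real class.
config_xml = build_tika_config("com.example.MyPdfParser")
```

The replacement class would still need to be on Tika's classpath, so this approach changes how the parser is wired in but does not remove the need to build and ship the modified PDF code.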
