Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add parsing of PDF titles #392

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

Jassob
Copy link

@Jassob Jassob commented Apr 10, 2021

This commit attempts to parse the title metadata in a PDF file if it
exists, otherwise it fallbacks to printing only the media type and
size.

We use the pdf crate to parse all of the response body (for some
reason metadata and table of contents are kept at the end of PDFs) and
then ask the PDF for the title that may be defined in the "info
dictionary
".

@Jassob
Copy link
Author

Jassob commented Apr 10, 2021

First off, thanks for a splendid bot!

We've been using url-bot-rs for a while to save everybody from having to click on links in order to find what the post was about. However, one thing we thought was missing was the option to print the titles of PDF files (especially when the URL to papers and such often are some kind of ID and not something helpful for us mere humans). So that's where this contribution comes in.

Feel free to request changes or simply close the PR if this is not as minimal as you want url-bot-rs to be :)

@Jassob Jassob force-pushed the add-pdf-title-parsing branch from 6446b25 to 6cf9270 Compare April 10, 2021 18:20
@codecov
Copy link

codecov bot commented Apr 10, 2021

Codecov Report

Merging #392 (72555b9) into master (b4175be) will increase coverage by 0.25%.
The diff coverage is 96.77%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #392      +/-   ##
==========================================
+ Coverage   84.37%   84.63%   +0.25%     
==========================================
  Files          10       10              
  Lines        1466     1497      +31     
==========================================
+ Hits         1237     1267      +30     
- Misses        229      230       +1     
Impacted Files Coverage Δ
src/http.rs 97.34% <87.50%> (-0.24%) ⬇️
src/title.rs 99.20% <100.00%> (+0.17%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b4175be...72555b9. Read the comment docs.

@Jassob
Copy link
Author

Jassob commented Apr 10, 2021

Right, I have no tests.. I'll fix that!

@nuxeh
Copy link
Owner

nuxeh commented Apr 12, 2021

Hey :) this is a perfect addition to url-bot-rs

Actually have thought about adding PDF titles for some time, but never got around to it, so thanks for the PR!

I'll wait for the tests for now, but all looks good to me, except it might be good to rework the HTTP side to make use of the chunked download strategy used currently. As it stands, it appears this will download the entire file, which can have some issues, with e.g. very large files, or even malicious attempts to provide never-ending data over HTTP... believe it or not this has been demonstrated already in the past! (#309)

When I get a moment, I'll have a look and do a proper review. Also, if you want to add tests, the way these metadata things are tested at the moment is with locally served files, there is already an test PDF here which could perhaps be used, or modified to have a title usable for testing.

This commit attempts to parse the title metadata in a PDF file if it
exists, otherwise it fallbacks to printing only the media type and
size.

We use the `pdf` crate to parse all of the response body (for some
reason metadata and table of contents are kept at the end of PDFs) and
then ask the PDF for the title that may be defined in the "[info
dictionary]".

[info dictionary]: https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/pdf_reference_archives/PDFReference.pdf
@Jassob Jassob force-pushed the add-pdf-title-parsing branch from 6cf9270 to 72555b9 Compare April 12, 2021 19:18
@Jassob
Copy link
Author

Jassob commented Apr 12, 2021

Hey :) this is a perfect addition to url-bot-rs

Actually have thought about adding PDF titles for some time, but never got around to it, so thanks for the PR!

Glad to hear it and happy to help!

I'll wait for the tests for now, but all looks good to me, except it might be good to rework the HTTP side to make use of the chunked download strategy used currently. As it stands, it appears this will download the entire file, which can have some issues, with e.g. very large files, or even malicious attempts to provide never-ending data over HTTP... believe it or not this has been demonstrated already in the past! (#309)

Aha, I see, I thought the chunk implementation was "only" used to speed up downloads and save bandwidth, but that makes a lot of sense! However, since the metadata and lookup table are located at the end of the file (PDF files are apparently meant to be read from the end, according to the specification), we would probably always need to download the whole file. Still, using the chunked download strategy we could at least put an upper limit in file size and possibly time as well.

When I get a moment, I'll have a look and do a proper review.

No rush, take your time!

Also, if you want to add tests, the way these metadata things are tested at the moment is with locally served files, there is already an test PDF here which could perhaps be used, or modified to have a title usable for testing.

Yes, I've incorporated it in a test that fails to read a title and added a minimal PDF with a title that I made myself (to avoid any pesky copyright issues).

@Jassob
Copy link
Author

Jassob commented Apr 12, 2021

Aside: Yesterday I realized that PDFs typeset by (La)TeX probably won't have the metadata set correctly since it apparently does not set it by default, which means that we'll continue to see a lot of "application/pdf SIZE KB" in our channels..

Oh well, it was a fun exercise at least! 🙂

@nuxeh
Copy link
Owner

nuxeh commented Apr 15, 2021

Haha... Don't know which is more silly, pdflatex not setting title by default, or PDF headers being at EOF...

However, spotted this, which mentions a CiteSeer library/model based on a support vector machine for extracting document metadata, which uses SVM-Light, which in turn, being written in C, could potentially be integrated into Rust efficiently.

An intriguing problem... makes me think that it could be worth some investigation.

Appreciate the LaTeX test file in all its glory, btw :)

I'll fix the clippy error on master and merge shortly, maybe think of refactoring slightly, or adding a fixed upper limit to PDF file sizes.

Thanks again :)

@Jassob
Copy link
Author

Jassob commented Apr 15, 2021

I'll fix the clippy error on master and merge shortly, maybe think of refactoring slightly, or adding a fixed upper limit to PDF file sizes.

Absolutely, feel free to edit the cherry-pick the commit and edit it and so on if you want.

I've been a bit preoccupied with other things lately, but will try to carve out some time for this PR tomorrow if you want me to fix anything (and look into the CiteSeer/SVM-Light thingy)!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants