Add parsing of PDF titles #392

Jassob · 2021-04-10T18:04:18Z

This commit attempts to parse the title metadata in a PDF file if it
exists, otherwise it fallbacks to printing only the media type and
size.

We use the pdf crate to parse all of the response body (for some
reason metadata and table of contents are kept at the end of PDFs) and
then ask the PDF for the title that may be defined in the "info
dictionary".

Jassob · 2021-04-10T18:14:39Z

First off, thanks for a splendid bot!

We've been using url-bot-rs for a while to save everybody from having to click on links in order to find what the post was about. However, one thing we thought was missing was the option to print the titles of PDF files (especially when the URL to papers and such often are some kind of ID and not something helpful for us mere humans). So that's where this contribution comes in.

Feel free to request changes or simply close the PR if this is not as minimal as you want url-bot-rs to be :)

codecov · 2021-04-10T18:31:27Z

Codecov Report

Merging #392 (72555b9) into master (b4175be) will increase coverage by 0.25%.
The diff coverage is 96.77%.

@@            Coverage Diff             @@
##           master     #392      +/-   ##
==========================================
+ Coverage   84.37%   84.63%   +0.25%     
==========================================
  Files          10       10              
  Lines        1466     1497      +31     
==========================================
+ Hits         1237     1267      +30     
- Misses        229      230       +1

Impacted Files	Coverage Δ
src/http.rs	`97.34% <87.50%> (-0.24%)`	⬇️
src/title.rs	`99.20% <100.00%> (+0.17%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b4175be...72555b9. Read the comment docs.

Jassob · 2021-04-10T18:48:12Z

Right, I have no tests.. I'll fix that!

nuxeh · 2021-04-12T10:34:51Z

Hey :) this is a perfect addition to url-bot-rs

Actually have thought about adding PDF titles for some time, but never got around to it, so thanks for the PR!

I'll wait for the tests for now, but all looks good to me, except it might be good to rework the HTTP side to make use of the chunked download strategy used currently. As it stands, it appears this will download the entire file, which can have some issues, with e.g. very large files, or even malicious attempts to provide never-ending data over HTTP... believe it or not this has been demonstrated already in the past! (#309)

When I get a moment, I'll have a look and do a proper review. Also, if you want to add tests, the way these metadata things are tested at the moment is with locally served files, there is already an test PDF here which could perhaps be used, or modified to have a title usable for testing.

This commit attempts to parse the title metadata in a PDF file if it exists, otherwise it fallbacks to printing only the media type and size. We use the `pdf` crate to parse all of the response body (for some reason metadata and table of contents are kept at the end of PDFs) and then ask the PDF for the title that may be defined in the "[info dictionary]". [info dictionary]: https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/pdf_reference_archives/PDFReference.pdf

Jassob · 2021-04-12T19:49:03Z

Hey :) this is a perfect addition to url-bot-rs

Actually have thought about adding PDF titles for some time, but never got around to it, so thanks for the PR!

Glad to hear it and happy to help!

I'll wait for the tests for now, but all looks good to me, except it might be good to rework the HTTP side to make use of the chunked download strategy used currently. As it stands, it appears this will download the entire file, which can have some issues, with e.g. very large files, or even malicious attempts to provide never-ending data over HTTP... believe it or not this has been demonstrated already in the past! (#309)

Aha, I see, I thought the chunk implementation was "only" used to speed up downloads and save bandwidth, but that makes a lot of sense! However, since the metadata and lookup table are located at the end of the file (PDF files are apparently meant to be read from the end, according to the specification), we would probably always need to download the whole file. Still, using the chunked download strategy we could at least put an upper limit in file size and possibly time as well.

When I get a moment, I'll have a look and do a proper review.

No rush, take your time!

Also, if you want to add tests, the way these metadata things are tested at the moment is with locally served files, there is already an test PDF here which could perhaps be used, or modified to have a title usable for testing.

Yes, I've incorporated it in a test that fails to read a title and added a minimal PDF with a title that I made myself (to avoid any pesky copyright issues).

Jassob · 2021-04-12T19:54:55Z

Aside: Yesterday I realized that PDFs typeset by (La)TeX probably won't have the metadata set correctly since it apparently does not set it by default, which means that we'll continue to see a lot of "application/pdf SIZE KB" in our channels..

Oh well, it was a fun exercise at least! 🙂

nuxeh · 2021-04-15T07:14:09Z

Haha... Don't know which is more silly, pdflatex not setting title by default, or PDF headers being at EOF...

However, spotted this, which mentions a CiteSeer library/model based on a support vector machine for extracting document metadata, which uses SVM-Light, which in turn, being written in C, could potentially be integrated into Rust efficiently.

An intriguing problem... makes me think that it could be worth some investigation.

Appreciate the LaTeX test file in all its glory, btw :)

I'll fix the clippy error on master and merge shortly, maybe think of refactoring slightly, or adding a fixed upper limit to PDF file sizes.

Thanks again :)

Jassob · 2021-04-15T21:03:00Z

I'll fix the clippy error on master and merge shortly, maybe think of refactoring slightly, or adding a fixed upper limit to PDF file sizes.

Absolutely, feel free to edit the cherry-pick the commit and edit it and so on if you want.

I've been a bit preoccupied with other things lately, but will try to carve out some time for this PR tomorrow if you want me to fix anything (and look into the CiteSeer/SVM-Light thingy)!

Jassob force-pushed the add-pdf-title-parsing branch from 6446b25 to 6cf9270 Compare April 10, 2021 18:20

Jassob force-pushed the add-pdf-title-parsing branch from 6cf9270 to 72555b9 Compare April 12, 2021 19:18

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add parsing of PDF titles #392

Add parsing of PDF titles #392

Jassob commented Apr 10, 2021

Jassob commented Apr 10, 2021

codecov bot commented Apr 10, 2021 •

edited

Loading

Jassob commented Apr 10, 2021

nuxeh commented Apr 12, 2021

Jassob commented Apr 12, 2021

Jassob commented Apr 12, 2021 •

edited

Loading

nuxeh commented Apr 15, 2021

Jassob commented Apr 15, 2021

Add parsing of PDF titles #392

Are you sure you want to change the base?

Add parsing of PDF titles #392

Conversation

Jassob commented Apr 10, 2021

Jassob commented Apr 10, 2021

codecov bot commented Apr 10, 2021 • edited Loading

Codecov Report

Jassob commented Apr 10, 2021

nuxeh commented Apr 12, 2021

Jassob commented Apr 12, 2021

Jassob commented Apr 12, 2021 • edited Loading

nuxeh commented Apr 15, 2021

Jassob commented Apr 15, 2021

codecov bot commented Apr 10, 2021 •

edited

Loading

Jassob commented Apr 12, 2021 •

edited

Loading