-
Notifications
You must be signed in to change notification settings - Fork 11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add parsing of PDF titles #392
base: master
Are you sure you want to change the base?
Conversation
First off, thanks for a splendid bot! We've been using url-bot-rs for a while to save everybody from having to click on links in order to find what the post was about. However, one thing we thought was missing was the option to print the titles of PDF files (especially when the URL to papers and such often are some kind of ID and not something helpful for us mere humans). So that's where this contribution comes in. Feel free to request changes or simply close the PR if this is not as minimal as you want url-bot-rs to be :) |
6446b25
to
6cf9270
Compare
Codecov Report
@@ Coverage Diff @@
## master #392 +/- ##
==========================================
+ Coverage 84.37% 84.63% +0.25%
==========================================
Files 10 10
Lines 1466 1497 +31
==========================================
+ Hits 1237 1267 +30
- Misses 229 230 +1
Continue to review full report at Codecov.
|
Right, I have no tests.. I'll fix that! |
Hey :) this is a perfect addition to Actually have thought about adding PDF titles for some time, but never got around to it, so thanks for the PR! I'll wait for the tests for now, but all looks good to me, except it might be good to rework the HTTP side to make use of the chunked download strategy used currently. As it stands, it appears this will download the entire file, which can have some issues, with e.g. very large files, or even malicious attempts to provide never-ending data over HTTP... believe it or not this has been demonstrated already in the past! (#309) When I get a moment, I'll have a look and do a proper review. Also, if you want to add tests, the way these metadata things are tested at the moment is with locally served files, there is already an test PDF here which could perhaps be used, or modified to have a title usable for testing. |
This commit attempts to parse the title metadata in a PDF file if it exists, otherwise it fallbacks to printing only the media type and size. We use the `pdf` crate to parse all of the response body (for some reason metadata and table of contents are kept at the end of PDFs) and then ask the PDF for the title that may be defined in the "[info dictionary]". [info dictionary]: https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/pdf_reference_archives/PDFReference.pdf
6cf9270
to
72555b9
Compare
Glad to hear it and happy to help!
Aha, I see, I thought the chunk implementation was "only" used to speed up downloads and save bandwidth, but that makes a lot of sense! However, since the metadata and lookup table are located at the end of the file (PDF files are apparently meant to be read from the end, according to the specification), we would probably always need to download the whole file. Still, using the chunked download strategy we could at least put an upper limit in file size and possibly time as well.
No rush, take your time!
Yes, I've incorporated it in a test that fails to read a title and added a minimal PDF with a title that I made myself (to avoid any pesky copyright issues). |
Aside: Yesterday I realized that PDFs typeset by (La)TeX probably won't have the metadata set correctly since it apparently does not set it by default, which means that we'll continue to see a lot of "application/pdf SIZE KB" in our channels.. Oh well, it was a fun exercise at least! 🙂 |
Haha... Don't know which is more silly, However, spotted this, which mentions a CiteSeer library/model based on a support vector machine for extracting document metadata, which uses SVM-Light, which in turn, being written in C, could potentially be integrated into Rust efficiently. An intriguing problem... makes me think that it could be worth some investigation. Appreciate the LaTeX test file in all its glory, btw :) I'll fix the clippy error on master and merge shortly, maybe think of refactoring slightly, or adding a fixed upper limit to PDF file sizes. Thanks again :) |
Absolutely, feel free to edit the cherry-pick the commit and edit it and so on if you want. I've been a bit preoccupied with other things lately, but will try to carve out some time for this PR tomorrow if you want me to fix anything (and look into the CiteSeer/SVM-Light thingy)! |
This commit attempts to parse the title metadata in a PDF file if it
exists, otherwise it fallbacks to printing only the media type and
size.
We use the
pdf
crate to parse all of the response body (for somereason metadata and table of contents are kept at the end of PDFs) and
then ask the PDF for the title that may be defined in the "info
dictionary".