FEATURE: Extract Paragraphs #524
I deleted an earlier comment about misunderstanding a Sorbet error. All good for review now!
Thanks for the well written submission ❤️
I've been flat out, but I'll try and take a look soon.
It's definitely the case that Page#text isn't very versatile, returning all the text in plain text with no markup. There is Page#runs, which returns a lower-level view of the text of a page, including positioning data. I've flip-flopped over time on how much I want to add to PDF::Reader directly, and how much I want to encourage folks to build their own code on top of Page#runs. I'll take this for a spin and see how it feels though, cheers!
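For readers wondering what "building on top of Page#runs" might look like, here is a minimal sketch, not the algorithm in this PR. It assumes each run exposes x, y, font_size and text (as PDF::Reader::TextRun does); the Run struct, the group_paragraphs helper and the gap_factor threshold are hypothetical names introduced only for illustration. It handles a single column only and ignores the multi-column case.

```ruby
# Hypothetical stand-in for the objects returned by Page#runs, so the
# sketch runs without a PDF file. Assumed fields: x, y, font_size, text.
Run = Struct.new(:x, :y, :font_size, :text)

# Group runs into lines (shared baseline), then start a new paragraph
# whenever the vertical gap to the previous line exceeds
# gap_factor * font_size. The threshold is an illustrative assumption.
def group_paragraphs(runs, gap_factor: 1.5)
  # PDF y coordinates grow upward, so sort top-to-bottom, then left-to-right.
  lines = runs.group_by { |r| r.y.round }
              .sort_by { |y, _| -y }
              .map { |y, rs| [y, rs.sort_by(&:x)] }

  paragraphs = []
  prev_y = nil
  lines.each do |y, rs|
    text = rs.map(&:text).join(" ")
    size = rs.map(&:font_size).max
    if prev_y.nil? || (prev_y - y) > gap_factor * size
      paragraphs << [text]      # large gap: begin a new paragraph
    else
      paragraphs.last << text   # small gap: continue the current one
    end
    prev_y = y
  end
  paragraphs.map { |p| p.join(" ") }
end

runs = [
  Run.new(72, 700, 12, "First paragraph,"),
  Run.new(72, 686, 12, "second line."),
  Run.new(72, 650, 12, "Second paragraph.")
]
puts group_paragraphs(runs)
```

With the real gem, the same helper could presumably be fed page.runs from a PDF::Reader page instead of the hand-built array.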
@yob I hope you're well! Thanks! :)
We're using PDF::Reader at Zipline for parsing content out of PDFs. (I also forked this project on our team repo here.) We have a number of cases where we want to pull out all of the text from multi-column PDF layouts, but PDF::Reader's visually-aligned output via page.text was still difficult for us to programmatically parse.
I saw that borb has a paragraph extraction feature (code here), and this is my attempt to implement something similar in Ruby.
I'm leaving this in Draft until I can finish the remaining todo items below. Any feedback is appreciated.
To-do: