FEATURE: Extract Paragraphs #524

judy · 2023-10-25T20:11:29Z

We're using PDF::Reader at Zipline for parsing content out of PDFs. (I also forked this project on our team repo here.) We have a number of cases where we want to pull out all of the text from multi-column PDF layouts, but PDF::Reader's visually aligned output via page.text is still difficult for us to parse programmatically.

I saw that borb has a paragraph extraction feature (code here), and this is my attempt to implement something similar in Ruby.

I'm leaving this in Draft until I can finish the remaining todo items below. Any feedback is appreciated.

To-do:

Implement type checking via Sorbet
Add tests for DisjointSet
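
For readers unfamiliar with the approach, here is a minimal sketch of how a disjoint set can group text runs into paragraphs. It is not the code from this PR: the `Run` struct, the method names, and the `line_gap` and `size_tolerance` thresholds are all assumptions made for illustration. Two runs are merged when they overlap horizontally, sit within a small vertical distance of each other, and have a similar font size, which keeps columns and headlines apart.

```ruby
# Hypothetical stand-in for a positioned text run (not PDF::Reader's own class).
Run = Struct.new(:x, :y, :width, :font_size, :text) do
  def endx
    x + width
  end
end

# Minimal union-find over the integer indexes 0...size.
class DisjointSet
  def initialize(size)
    @parent = Array.new(size) { |i| i }
  end

  # Returns the representative of i's set, compressing the path as it goes.
  def find(i)
    @parent[i] = find(@parent[i]) unless @parent[i] == i
    @parent[i]
  end

  # Merges the sets containing a and b.
  def union(a, b)
    root_a = find(a)
    root_b = find(b)
    @parent[root_a] = root_b unless root_a == root_b
  end

  # Returns the sets as arrays of indexes.
  def groups
    (0...@parent.size).group_by { |i| find(i) }.values
  end
end

# True when the two runs share some horizontal extent (same column).
def horizontal_overlap?(a, b)
  a.x < b.endx && b.x < a.endx
end

# Assumed heuristic: same column, similar font size, vertically close.
def same_paragraph?(a, b, line_gap: 1.5, size_tolerance: 0.5)
  return false unless horizontal_overlap?(a, b)
  return false if (a.font_size - b.font_size).abs > size_tolerance

  (a.y - b.y).abs <= line_gap * [a.font_size, b.font_size].max
end

# Unions every qualifying pair, then joins each group top to bottom.
def group_into_paragraphs(runs)
  set = DisjointSet.new(runs.size)
  runs.each_index do |i|
    ((i + 1)...runs.size).each do |j|
      set.union(i, j) if same_paragraph?(runs[i], runs[j])
    end
  end
  set.groups.map do |indexes|
    ordered = indexes.map { |i| runs[i] }.sort_by { |r| [-r.y, r.x] }
    ordered.map { |r| r.text.strip }.join(" ")
  end
end

runs = [
  Run.new(50, 700, 200, 18, "Headline"),
  Run.new(50, 670, 200, 10, "Left column line one "),
  Run.new(50, 658, 200, 10, "left column line two"),
  Run.new(300, 670, 200, 10, "Right column line one"),
  Run.new(300, 658, 200, 10, "right column line two")
]
puts group_into_paragraphs(runs)
```

The pairwise comparison is quadratic in the number of runs, which is usually acceptable for a single page; a real implementation would likely need more careful thresholds than the ones assumed here.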

judy · 2023-11-14T15:53:08Z

I deleted an earlier comment about misunderstanding a Sorbet error. All good for review now!

yob

Thanks for the well-written submission ❤️

I've been flat out, but I'll try and take a look soon.

It's definitely the case that Page#text isn't very versatile, returning all the text in plain text with no markup.

There is Page#runs, which returns a lower-level view of the text of a page, including positioning data. I've flip-flopped over time on how much I want to add to PDF::Reader directly, and how much I want to encourage folks to build their own code on top of Page#runs. I'll take this for a spin and see how it feels though, cheers!
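
As a rough illustration of what building on top of positioned runs can look like, the sketch below rebuilds lines of text from a list of runs. It assumes each run exposes `x`, `y`, and `text`; the `PositionedText` struct, the `lines_from_runs` helper, and the `tolerance` value are hypothetical and exist only so the example runs without a PDF.

```ruby
# Hypothetical stand-in for a positioned run; only x, y and text are assumed.
PositionedText = Struct.new(:x, :y, :text)

# Groups runs into lines: runs whose y values differ by at most `tolerance`
# from their neighbour are treated as one line, read left to right.
def lines_from_runs(runs, tolerance: 2.0)
  # PDF coordinates grow upwards, so sort by descending y to read top-down.
  sorted = runs.sort_by { |r| [-r.y, r.x] }
  sorted
    .slice_when { |a, b| (a.y - b.y).abs > tolerance }
    .map { |line| line.sort_by(&:x).map(&:text).join(" ") }
end

sample_runs = [
  PositionedText.new(120, 700, "world"),
  PositionedText.new(50, 700.5, "Hello"),
  PositionedText.new(50, 680, "Second line")
]
puts lines_from_runs(sample_runs)
```

With the gem itself, the same helper could presumably be fed the result of `page.runs` instead of the sample data, provided the run objects respond to the three methods assumed above.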

Cosmo · 2024-02-10T16:10:17Z

@yob I hope you're well!
Any updates on this PR?

Thanks! :)

judy added 16 commits October 25, 2023 16:03

Group text runs into paragraphs

677bf00

Move DisjointSet to file

23e0929

Move paragraph struct up to where it's being used.

8c7cd4b

Refactor

7eb4142

Strip whitespace before joining into paragraphs

53191bc

Use font size difference to keep headlines apart from paragraphs

913715d

Add test for multi-column layout

f209829

Remove unimplemented method

de6266e

First pass at implementing types for DisjointSet

b371679

Move Paragraph to dedicated class

df88380

More type fixes

0071adb

Fix spec and paragraph init

1a0b993

More type fixes

0114883

Typecheck horizontal_overlap

dde1100

Refactor, more type fixes

8e39137

Add disjoint set spec

7d78006

judy marked this pull request as ready for review November 13, 2023 22:06

judy added 2 commits November 14, 2023 10:51

Add missing comment on purpose of Paragraph class

c26942c

Fix Elem type in DisjointSet

9734484

Clean up old comment

8c7e965

yob reviewed Nov 18, 2023
