[BUG] - Stream: Area detection hangs on PDF page #30

kirk-marple · 2024-01-05T01:58:31Z

Describe the bug
When attempting to extract tables from this 250+ page PDF, I found that it hangs on a specific page (98), in the 'Detect' method.

To Reproduce
Using 40927R03.pdf

I've tried 0.1.3 and 0.1.4-alpha001, and both hang in the same spot.

Using .NET 6.0, C#.

    using UglyToad.PdfPig; // PdfDocument, ParsingOptions

    using var pdoc = PdfDocument.Open(content.Stream, new ParsingOptions { SkipMissingFonts = true, UseLenientParsing = true });
    var da = new Tabula.Detectors.SimpleNurminenDetectionAlgorithm();

    var area = Tabula.ObjectExtractor.ExtractPage(pdoc, 98 /* hangs on this page */);
    var regions = da.Detect(area); // <-- this line hangs

Expected behavior
To properly parse all tables.


andyesys · 2024-03-15T08:43:49Z

I have also encountered this hang and had to stop using this library unfortunately.

mikelor · 2024-07-19T22:18:55Z

@kirk-marple, don't have an answer necessarily, but it looks like it's having a problem with the graphs on that page.
This bit of code fails to remove TextRows from the collection, and so it gets into an infinite loop.

In that section the following if condition never resolves to True, meaning that the Table it has found doesn't contain the textRow. This is because the textRow Left and Bottom values extend beyond the Table's bounds.

There's a lot of bounds detection code in this algorithm that needs a closer look, but I don't have the time. The simplest thing to do would be to "skip" page 98 (which has a page number of 87 printed at the bottom of the page in the PDF). It has two graphs on it, and that may be what trips up the algorithm.
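If the problem page isn't known in advance, one way to "skip" it is to put a time limit on detection. This is only a rough sketch, not part of Tabula: `DetectionGuard` and `TryRun` are hypothetical names, and the helper uses nothing beyond the standard `Task` API.

```csharp
using System;
using System.Diagnostics.CodeAnalysis;
using System.Threading.Tasks;

// Hypothetical helper, not part of Tabula or PdfPig.
public static class DetectionGuard
{
    // Runs `work` on a thread-pool thread and waits up to `timeout` for it.
    // Returns true and sets `result` when the work finished in time.
    // Returns false on timeout. The work is NOT cancelled in that case: it keeps
    // running in the background, since .NET 6 cannot abort a running thread.
    // If `work` throws, the exception surfaces here as an AggregateException.
    public static bool TryRun<T>(Func<T> work, TimeSpan timeout, [MaybeNullWhen(false)] out T result)
    {
        Task<T> task = Task.Run(work);
        if (task.Wait(timeout))
        {
            result = task.Result;
            return true;
        }

        result = default;
        return false;
    }
}
```

With the names from the original report, a call could look like `DetectionGuard.TryRun(() => da.Detect(area), TimeSpan.FromSeconds(30), out var regions)`, moving on to the next page when it returns false. The 30-second limit is an arbitrary choice. Because the hung call keeps spinning on its thread, this probably only suits short-lived processes; a long-running service would need to isolate detection in a separate process instead.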

Good luck.

    if (table.Contains(textRow))
    {
        lines.Remove(textRow);
        break;
    }
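To illustrate why a row that no table contains can keep a loop like this alive, here is a simplified model with a progress guard. It is a sketch under my own assumptions and not the Tabula code: `Box`, `ProgressGuard`, `RemoveContainedRows`, `DetectTables`, and the `findTable` callback are all hypothetical stand-ins.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical axis-aligned rectangle; not a Tabula type.
public readonly record struct Box(double Left, double Bottom, double Right, double Top)
{
    // True only when `other` lies entirely inside this box.
    public bool Contains(Box other) =>
        other.Left >= Left && other.Right <= Right &&
        other.Bottom >= Bottom && other.Top <= Top;
}

public static class ProgressGuard
{
    // Removes every row that lies inside some table; returns how many were removed.
    public static int RemoveContainedRows(List<Box> rows, IReadOnlyList<Box> tables) =>
        rows.RemoveAll(row => tables.Any(table => table.Contains(row)));

    // Keeps asking `findTable` for the next table until it returns null,
    // the rows run out, or a pass removes no rows at all.
    // `findTable` stands in for whatever proposes a table from the remaining rows.
    public static List<Box> DetectTables(List<Box> rows, Func<IReadOnlyList<Box>, Box?> findTable)
    {
        var tables = new List<Box>();
        while (rows.Count > 0)
        {
            Box? candidate = findTable(rows);
            if (candidate is null) break;
            tables.Add(candidate.Value);

            // Guard: if the new table swallowed no rows, the next pass would see the
            // same rows and propose the same table again, so stop instead of spinning.
            if (RemoveContainedRows(rows, tables) == 0) break;
        }
        return tables;
    }
}
```

Without the guard, a row whose Left and Bottom fall outside every table is never removed, so each pass starts from the same input. Whether the last table found should be kept or discarded when the guard trips is a judgment call that this sketch does not try to settle; it keeps it.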

LuisM000 · 2024-08-01T07:37:47Z

Same issue here :(

mikelor · 2024-08-01T13:21:25Z

Are you willing to share the PDF that this is occurring on?

In the case of the OP it was on a page that contained a somewhat complicated graph.

LuisM000 · 2024-08-05T06:59:33Z

Yes, here is an example where the issue occurs.
issue.pdf

mikelor · 2024-08-06T16:06:49Z

@LuisM000, thanks. I found the area of code where the Detect logic is hanging. There appears to be a do/while loop whose condition is always true in certain cases, so it never exits. I've drafted Pull Request #33 and created a sample project, mikelor/tabulate, that has a simple fix, but it hasn't been fully tested yet.

I mainly use the spreadsheet detection algorithm, and use the ExtractionAlgorithms more than the Detectors. Do you have some sample code that you can share?

See mikelor/tsathroughput for an example using Extractors vs Detectors.

LuisM000 · 2024-08-07T06:42:16Z

Hi @mikelor! Thanks for helping to try and fix the bug :)

Yes, I noticed that it hangs in the do/while loop, but I haven't had time to investigate further and it's quite complex code. It looks like the change you made in your PR doesn't break any tests, but unfortunately, I don't have a comprehensive test suite to verify if this change affects detection in any cases.

In my case, I use the SimpleNurminenDetectionAlgorithm and utilize both extraction and detection algorithms. I'm not doing anything more sophisticated than what's shown in the example (https://github.com/mikelor/tabulate/blob/da82a3739e3704880bbf90d0d8b2724dc7a480a2/Program.cs#L18).

mikelor added a commit to mikelor/tabula-sharp that referenced this issue Aug 6, 2024

Investigating [BUG] - Stream: Area detection hangs on PDF page BobLd#30

89fb0ba
