System.AggregateException: 'Invalid object ID.' when trying to open some pdfs #73

andresdbv · 2024-01-11T20:15:56Z

I have been using PDFsharp 1.50.5147.0 for years to open PDFs and add some text to them. It has trouble opening some files with PdfReader.Open(pathFilePDF, PdfDocumentOpenMode.Modify). (I've read that I should save them again; printing them to PDF again with Chrome works.) Because this runs as a backend service, it causes us a lot of trouble. Some of the problematic PDFs are scans of physical documents and others are PDFs saved from AutoCAD.

Our solution uses .NET Framework 4.7.2, so when I read that 6.1.0-preview-1 could be used in my project, I decided to give it a try and set up a small project to see whether it could handle the files that give us problems, to no avail.

So, to check whether I was doing something wrong, I set up a .NET 6 project with version 6.0.0 of the library, and it worked.

Expected Behavior

It should open the pdf attached in the Issue Template Project (File.pdf)

Actual Behavior

It gives me this error: System.AggregateException: 'Invalid object ID.'

Steps to Reproduce the Behavior

Here is the template; the code just tries to open the PDF file I attached to the solution.

Issue.zip

(If you can at least help me understand what is wrong with the PDF, that would be awesome.)


packdat · 2024-01-14T17:43:18Z

The PDF has some interesting properties...

There is a stream-object with a /Length entry referring to an indirect object with the ID (12 0).
This indirect object is not stored as a regular object, but as an entry in an object-stream.

This is an issue because of the way the library loads the objects stored in a PDF:

1. It reads all regular objects (objects stored directly in the PDF).
2. It reads all objects stored in object-streams.

The exception is thrown in phase 1:
The mentioned stream-object is a regular object.
When trying to locate the object referred to by the /Length entry, the library cannot find it (because it is stored in an object-stream that is not yet processed) and throws the exception.
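
The load order described above can be illustrated with a small, library-independent Python sketch. It is not PDFsharp code: the dictionaries `regular_objects` and `object_stream_objects`, the `length_ref` key, and the function `load_two_phase` are hypothetical stand-ins for a parsed file in which object (12 0) lives inside an object-stream.

```python
# Hypothetical in-memory model of a parsed PDF; not PDFsharp's data structures.
# Keys are (object number, generation) IDs.
regular_objects = {
    (5, 0): {"type": "stream", "length_ref": (12, 0)},  # /Length 12 0 R
}
object_stream_objects = {
    (12, 0): 1234,  # the integer the /Length entry points to
}


def load_two_phase(regular, in_object_streams):
    """Phase 1 reads regular objects, phase 2 reads object-stream entries."""
    known = {}
    # Phase 1: a stream needs its length right away, but only the objects
    # read so far are known at this point.
    for obj_id, obj in regular.items():
        if obj["type"] == "stream" and obj["length_ref"] not in known:
            raise KeyError(f"Invalid object ID {obj['length_ref']}")
        known[obj_id] = obj
    # Phase 2: for the file discussed here this line is never reached.
    known.update(in_object_streams)
    return known
```

Calling `load_two_phase(regular_objects, object_stream_objects)` raises in phase 1 even though (12 0) is present in the second dictionary, which mirrors the situation described for the attached PDF.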

While the PDF-spec states that objects representing the /Length of object-streams shall not be stored in an object-stream, it says nothing about objects representing the /Length of ordinary stream-objects.
So at least regarding the spec, the PDF seems perfectly fine.

That said, it's the first time I have encountered a PDF doing this, and I've seen a lot of PDFs over the last >10 years!

Because it does not seem to be common practice to store objects like that, a fix handling this specific case should be straightforward.
As there are many PDFs out there with incorrectly specified /Length entries for stream-objects, the case mentioned here should be treated as a special case of an incorrectly specified stream-length.
The library should use a fallback in these cases and should attempt to locate the endstream-keyword manually.
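
A minimal sketch of such a fallback, in Python and independent of PDFsharp, could look like the following. It is not the code from the proposed fix; the function name `read_stream_data` and its parameters are assumptions made for illustration. It trusts the declared length only when the `endstream` keyword actually follows, and otherwise searches for the keyword.

```python
def read_stream_data(data: bytes, start: int, declared_length=None) -> bytes:
    """Return the stream bytes beginning at offset `start`.

    `declared_length` is the value of /Length if it could be resolved,
    or None if it is missing or refers to an object that is not known yet.
    """
    if declared_length is not None:
        end = start + declared_length
        # Trust /Length only if 'endstream' follows, allowing for an EOL.
        if data[end:end + 11].lstrip(b"\r\n").startswith(b"endstream"):
            return data[start:end]
    # Fallback: search for the keyword and drop the EOL that precedes it.
    pos = data.find(b"endstream", start)
    if pos < 0:
        raise ValueError("no endstream keyword found")
    end = pos
    if end - start >= 2 and data[end - 2:end] == b"\r\n":
        end -= 2
    elif end - start >= 1 and data[end - 1:end] in (b"\n", b"\r"):
        end -= 1
    return data[start:end]


raw = b"5 0 obj\n<< /Length 12 0 R >>\nstream\nHello, PDF!\nendstream\nendobj\n"
start = raw.index(b"stream\n") + len(b"stream\n")
print(read_stream_data(raw, start))  # prints b'Hello, PDF!'
```

One limitation of this sketch: binary stream data may itself contain the bytes `endstream`, so a plain keyword search can stop too early. A production implementation would need additional checks.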

@ThomasHoevel I already have code doing this, tested successfully with the attached PDF.
I could open a PR (unless you want to consider incorrect/missing stream-lengths as invalid PDF)

In the long run, the library should be adapted to read objects in a way that can locate referenced objects regardless of their location, i.e. whether they are stored on the file-level or in object-streams.
(think of Font-objects storing their Descriptor inside object-streams...)
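
To illustrate the idea of location-independent lookup, here is a small Python sketch. It is not how PDFsharp works and not the design of any PR; the class `LazyResolver` and its three inputs (`file_level`, `stream_index`, `unpack`) are hypothetical stand-ins for a real parser's cross-reference data.

```python
class LazyResolver:
    """Resolve object IDs on demand, wherever the object is stored.

    file_level:   dict mapping IDs to already-parsed file-level objects
    stream_index: dict mapping IDs to the ID of the object-stream holding them
    unpack:       callable that parses one object-stream into {ID: object}
    """

    def __init__(self, file_level, stream_index, unpack):
        self._file_level = file_level
        self._stream_index = stream_index
        self._unpack = unpack
        self._cache = {}

    def resolve(self, obj_id):
        if obj_id in self._file_level:
            return self._file_level[obj_id]
        if obj_id not in self._cache and obj_id in self._stream_index:
            # Unpack the containing object-stream once; cache all its entries.
            self._cache.update(self._unpack(self._stream_index[obj_id]))
        # IDs defined nowhere resolve to None, standing in for the null object.
        return self._cache.get(obj_id)
```

With a resolver like this, a stream's /Length or a font's descriptor can be requested at the moment it is needed, and the object-stream that contains it is unpacked only then.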

TH-Soft · 2024-01-15T08:02:11Z

@packdat Thanks for your feedback and analysis.
PDFsharp 1.5 did not read object streams at all. Support for object streams was added to an old architecture and there are some issues coming from that.
Stefan has to decide about the PR. We will look at it if you create it.

andresdbv · 2024-01-16T13:15:03Z

Hi @packdat, thanks for your help.

I have had trouble with similar scanned files. Right now I don't have an example of the PDFs converted from .dwg files that also throw errors when I try to open them with PDFsharp.

Is there any tip I can relay to the people who scan these documents, so that we can avoid this problem and get clean, safe PDFs to use with PDFsharp?

packdat · 2024-01-16T20:00:25Z

Hi @andresdbv

The metadata of the PDF states: <pdf:Producer>PDFlib 8.0.0 (Win32)</pdf:Producer>
This seems to be the library used to create the PDF.
I would start by asking them if they could update this library, try different parameters when creating the PDF, or use a different library altogether.

If all they do is convert images obtained from a scanner to PDF, you could write your own little tool based on PDFsharp that does the conversion and let them use that.
That should definitely create compatible PDFs 😉

Audionysos · 2024-03-01T02:49:46Z

@packdat

> There is a stream-object with a /Length entry referring to an indirect object with the ID (12 0). This indirect object is not stored as a regular object, but as an entry in an object-stream.
>
> This is an issue because of the way the library loads the objects stored in a PDF:
>
> 1. It reads all regular objects (objects stored directly in the PDF)
> 2. It reads all objects stored in object-streams
>
> The exception is thrown in phase 1: The mentioned stream-object is a regular object. When trying to locate the object referred to by the /Length entry, the library cannot find it (because it is stored in an object-stream that is not yet processed) and throws the exception.

Hi, I downloaded the recent ISO spec two days ago and I'm slowly reading it. I'm completely new to this, so I may be mistaken, but I believe you shouldn't throw in this case. Section 7.3.10 states:

> An indirect reference to an undefined object shall not be considered an error by a PDF processor; it shall be treated as a reference to the null object.
>
> EXAMPLE 2
> If a file contains the indirect reference 17 0 R but does not contain the corresponding definition then the indirect reference is considered to refer to the null object.

So the System.AggregateException: 'Invalid object ID.' should not happen according to spec.

As for your step 2: I believe the PDF was specifically designed so there is no need to load the whole thing into memory. The version 1.0 reference at https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/pdfreference1.0.pdf mentions several times that the format should allow reading just a single page (from a big document) and should be suitable for devices with low memory...
I believe only top objects should be loaded at the beginning and the streams should be extracted as you go. A PDF writer should write the file in such a way that the indirect object should already be extracted when the reference to it is used, but before that it's just a null reference.

Does this make sense? Again, I haven't even gone through half of the spec, so sorry if I caused some confusion, but I'm also curious about this.

packdat · 2024-03-02T11:09:26Z

> I'm completely new to this so I may be mistaken but I believe you shouldn't throw in this case.

I'm not throwing anything; the PR attempts to avoid that.

> So the System.AggregateException: 'Invalid object ID.' should not happen according to spec.

Correct.
But even if the library returned a null object, that would also be wrong, because the object is not missing; it's just not known yet.
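
The distinction between "missing" and "not known yet" can be made explicit with a three-way classification. The Python sketch below is illustrative only; `Lookup`, `classify`, `loaded` and `xref_ids` are hypothetical names, with `xref_ids` standing for the set of all object IDs the file's cross-reference data declares.

```python
from enum import Enum


class Lookup(Enum):
    FOUND = "found"
    PENDING = "pending"      # declared in the cross-reference data, not parsed yet
    UNDEFINED = "undefined"  # not defined anywhere in the file


def classify(obj_id, loaded, xref_ids):
    """Classify a reference given the objects parsed so far."""
    if obj_id in loaded:
        return Lookup.FOUND
    if obj_id in xref_ids:
        return Lookup.PENDING
    return Lookup.UNDEFINED
```

Under this reading, only the `UNDEFINED` case corresponds to the spec passage about treating a reference as the null object; a `PENDING` object still has to be loaded before its value can be used.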

> I believe the PDF was specifically designed so there is no need to load the whole thing into memory.

Are we talking about PDF in general or about the specific PDF mentioned here?
If we're talking about the latter, I have to disagree.
From an efficiency standpoint, this comes close to "worst case" IMO.
Why bury a stream-length in an object-stream that needs to be located, unpacked, and parsed to extract a single integer value when you could store that value as a direct object in the /Length property?

> I believe only top objects should be loaded at the beginning and the streams should be extracted as you go.

That's exactly what #85 attempts to do.

> A PDF writer should write the file in such a way that the indirect object should already be extracted when the reference to it is used

In a perfect world, that would be the case.
But PDF writers are free to store their objects wherever they please, as long as they obey the spec.

> sorry if I caused some confusion

No worries. PDF (and its "flavors") can be a confusing matter.

packdat added a commit to packdat/PDFsharp-net6 that referenced this issue Jan 15, 2024

Handle incorrect/missing length of stream-objects. Fixes empira#73

1d0b96b

packdat added a commit to packdat/PDFsharp-net6 that referenced this issue Jan 15, 2024

Handle incorrect/missing length of stream-objects. Fixes empira#73

ad5b7df

packdat mentioned this issue Jan 15, 2024

Handle incorrect/missing length of stream-objects. #74

Closed

This was referenced Feb 1, 2024

Lazy-load object-streams and their objects #84

Closed

Use lazy loading for object-streams and their objects #85

Open

packdat added a commit to packdat/PDFsharp-net6 that referenced this issue Feb 4, 2024

Handle incorrect/missing length of stream-objects. Fixes empira#73

cbc451b
