System.AggregateException: 'Invalid object ID.' when trying to open some pdfs #73

Open
andresdbv opened this issue Jan 11, 2024 · 6 comments

Comments

@andresdbv

andresdbv commented Jan 11, 2024

I have been using version 1.50.5147.0 for years to open PDFs and add some text to them. The problem is that it has trouble opening some files with PdfReader.Open(pathFilePDF, PdfDocumentOpenMode.Modify). (I've read that such files should be saved again; when I print them to PDF again with Chrome, it works.) Because this runs as a backend service, it gives us a lot of trouble. Some of the problematic PDFs are scans of physical documents and others are PDFs saved from AutoCAD.

Our solution uses .NET Framework 4.7.2, so when I read that 6.1.0-preview-1 could be used in my project I decided to give it a try and set up a small project to see if it could handle the files that give us problems, to no avail.

To check whether I was doing something wrong, I set up a .NET 6 project with version 6.0.0 of the library, and it worked.

Expected Behavior

It should open the pdf attached in the Issue Template Project (File.pdf)

Actual Behavior

It gives me this error: System.AggregateException: 'Invalid object ID.'

Steps to Reproduce the Behavior

Here is the template. The code just tries to open the PDF file I attached within the solution.

Issue.zip

(If you could at least help me understand what is wrong with the PDF, that would be awesome.)

@packdat

packdat commented Jan 14, 2024

The PDF has some interesting properties...

There is a stream-object with a /Length entry referring to an indirect object with the ID (12 0).
This indirect object is not stored as a regular object, but as an entry in an object-stream.

This is an issue because of the way the library loads the objects stored in a PDF:

  1. It reads all regular objects (objects stored directly in the PDF)
  2. It reads all objects stored in object-streams

The exception is thrown in phase 1:
The mentioned stream-object is a regular object.
When trying to locate the object referred to by the /Length entry, the library cannot find it (because it is stored in an object-stream that is not yet processed) and throws the exception.
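To make the ordering problem concrete, here is a toy model in Python. It is not PDFsharp code: `load_two_phase`, `InvalidObjectId`, and the dictionary layout are hypothetical names chosen for illustration, and the object values are simplified to plain Python data.

```python
class InvalidObjectId(Exception):
    """Stands in for the 'Invalid object ID.' error discussed in this issue."""


def load_two_phase(top_level, in_object_stream):
    """Toy two-phase loader.

    top_level:        {(obj_num, gen): value} for objects stored directly in the file
    in_object_stream: {(obj_num, gen): value} for objects packed in an object stream
    """
    known = {}
    # Phase 1: regular objects. A stream object needs its /Length immediately.
    for obj_id, obj in top_level.items():
        length_ref = obj.get("Length") if isinstance(obj, dict) else None
        # A tuple models an indirect reference such as "12 0 R".
        if isinstance(length_ref, tuple) and length_ref not in top_level:
            # The target lives in an object stream that phase 2 has not read yet.
            raise InvalidObjectId(f"Invalid object ID {length_ref}")
        known[obj_id] = obj
    # Phase 2: only now do the compressed objects become visible.
    known.update(in_object_stream)
    return known
```

With `{(5, 0): {"Length": (12, 0)}}` as the only regular object and `(12, 0)` stored in the object stream, phase 1 fails before phase 2 ever runs, which mirrors the situation described above.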

While the PDF-spec states that objects representing the /Length of object-streams shall not be stored in an object-stream, it says nothing about objects representing the /Length of ordinary stream-objects.
So at least regarding the spec, the PDF seems perfectly fine.

That said, it's the first time I have encountered a PDF doing this, and I've seen a lot of PDFs over the last 10+ years!

Because it does not seem to be common practice to store objects like that, a fix handling this specific case should be straightforward.
As there are many PDFs out there with incorrectly specified /Length entries for stream-objects, the case mentioned here should be treated as a special case of an incorrectly specified stream-length.
The library should use a fallback in these cases and should attempt to locate the endstream-keyword manually.
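A minimal sketch of such a fallback, in Python rather than the library's own code. The function name `read_stream_data` and its parameters are assumptions made for illustration; this is not the code from the PR.

```python
def read_stream_data(buf: bytes, data_start: int, declared_length=None) -> bytes:
    """Return the stream payload that begins at data_start.

    declared_length is the resolved /Length value, or None if it could not
    be resolved (for example because it lives in an unread object stream).
    """
    if declared_length is not None:
        end = data_start + declared_length
        # Trust the declared length only if 'endstream' follows the payload,
        # optionally preceded by end-of-line bytes.
        if buf[end:end + 11].lstrip(b"\r\n").startswith(b"endstream"):
            return buf[data_start:end]

    # Fallback: locate the endstream keyword manually.
    end = buf.find(b"endstream", data_start)
    if end == -1:
        raise ValueError("no endstream keyword found")
    # Drop the single end-of-line marker that precedes the keyword.
    if buf[end - 2:end] == b"\r\n":
        end -= 2
    elif buf[end - 1:end] in (b"\n", b"\r"):
        end -= 1
    return buf[data_start:end]
```

One known limitation of this kind of scan: binary payloads can contain the bytes `endstream` by coincidence, so a real implementation would likely need extra checks (for example, that `endobj` follows).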

@ThomasHoevel I already have code doing this, tested successfully with the attached PDF.
I could open a PR (unless you want to consider incorrect/missing stream-lengths as invalid PDF)

In the long run, the library should be adapted to read objects in a way that can locate referenced objects regardless of their location, i.e. whether they are stored on the file-level or in object-streams.
(think of Font-objects storing their Descriptor inside object-streams...)
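As a rough illustration of that idea, the Python sketch below resolves references on demand and looks in both places. `LazyResolver` and its data layout are hypothetical. In a real file, the cross-reference stream records which object stream contains each compressed object; the `_index` dictionary here stands in for that.

```python
class LazyResolver:
    """Resolve object IDs on demand, wherever the object is stored."""

    def __init__(self, top_level, object_streams):
        # top_level:      {obj_id: value}
        # object_streams: {stream_id: {obj_id: value}}
        self._top_level = top_level
        self._object_streams = object_streams
        # Maps each compressed object to the object stream that contains it.
        self._index = {
            obj_id: stream_id
            for stream_id, members in object_streams.items()
            for obj_id in members
        }
        self._cache = {}
        self.unpacked = []  # object streams that actually had to be opened

    def resolve(self, obj_id):
        if obj_id in self._cache:
            return self._cache[obj_id]
        if obj_id in self._top_level:
            value = self._top_level[obj_id]
        elif obj_id in self._index:
            stream_id = self._index[obj_id]
            if stream_id not in self.unpacked:
                self.unpacked.append(stream_id)  # unpack only when needed
            value = self._object_streams[stream_id][obj_id]
        else:
            value = None  # undefined everywhere: treat as the null object
        self._cache[obj_id] = value
        return value
```

Usage: `resolver.resolve(resolver.resolve((5, 0))["Length"])` finds the length even when `(12, 0)` sits inside an object stream, and an ID that is defined nowhere yields `None` instead of an exception.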

@TH-Soft

TH-Soft commented Jan 15, 2024

@packdat Thanks for your feedback and analysis.
PDFsharp 1.5 did not read object streams at all. Support for object streams was added to an old architecture and there are some issues coming from that.
Stefan has to decide about the PR. We will look at it if you create it.

@andresdbv
Author

Hi @packdat, thanks for your help.

I have had trouble with similar scanned files. Right now I don't have an example of the PDFs converted from .dwg files, which also throw errors when I try to open them with PDFsharp.

Is there any tip I could relay to the people who scan these documents, so that we can avoid this problem and get clean, safe PDFs to use with PDFsharp?

@packdat

packdat commented Jan 16, 2024

Hi @andresdbv

The metadata of the PDF states: <pdf:Producer>PDFlib 8.0.0 (Win32)</pdf:Producer>
This seems to be the library used to create the PDF.
I would start by asking them if they could update this library, try different parameters when creating the PDF, or use a different library altogether.

If all they do is convert images obtained from a scanner to PDF, you could write your own little tool based on PDFsharp that does the conversion and let them use that.
That should definitely create compatible PDFs 😉

@Audionysos

@packdat

There is a stream-object with a /Length entry referring to an indirect object with the ID (12 0). This indirect object is not stored as a regular object, but as an entry in an object-stream.

This is an issue because of the way the library loads the objects stored in a PDF:

  1. It reads all regular objects (objects stored directly in the PDF)
  2. It reads all objects stored in object-streams

The exception is thrown in phase 1.: The mentioned stream-object is a regular object. When trying to locate the object referred to by the /Length entry, the library cannot find it (because it is stored in an object-stream that is not yet processed) and throws the exception.

Hi, I downloaded the recent ISO spec two days ago and I'm slowly reading it. I'm completely new to this so I may be mistaken but I believe you shouldn't throw in this case. In 7.3.10 it's stated:

An indirect reference to an undefined object shall not be considered an error by a PDF processor; it shall be treated as a reference to the null object.
EXAMPLE 2 If a file contains the indirect reference 17 0 R but does not contain the corresponding definition then the indirect reference is considered to refer to the null object.

So the System.AggregateException: 'Invalid object ID.' should not happen according to spec.

As for your step 2: I believe the PDF was specifically designed so there is no need to load the whole thing into memory. The version 1.0 reference at https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/pdfreference1.0.pdf mentions several times that the format should allow reading only a single page (from a big document) and should be suitable for devices with low memory...
I believe only top objects should be loaded at the beginning and the streams should be extracted as you go. A PDF writer should write the file in such a way that the indirect object should be already extracted when the reference to it is used but before that it's just null reference.

Does this make sense? Again, I didn't even go through half of the specs, so sorry if I made some confusion, but I'm also curious about this.

@packdat

packdat commented Mar 2, 2024

I'm completely new to this so I may be mistaken but I believe you shouldn't throw in this case.

I'm not throwing anything; the PR attempts to avoid that.

So the System.AggregateException: 'Invalid object ID.' should not happen according to spec.

Correct.
But even if the library returned a null object, that would also be wrong, because the object is not missing; it's just not known yet.

I believe the PDF was specifically designed so there is no need to load the whole thing into memory.

Are we talking about PDF in general or about the specific PDF mentioned here?
If we're talking about the latter, I have to disagree.
From an efficiency standpoint, this comes close to "worst case" IMO.
Why bury a stream length in an object stream that needs to be located, unpacked, and parsed to extract a single integer value when you could store that value as a direct object in the /Length entry?
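For readers unfamiliar with the two forms, the Python sketch below distinguishes a direct length (`/Length 1234`) from an indirect one (`/Length 12 0 R`). `parse_length` is a hypothetical helper written for this illustration; a regex over raw dictionary bytes is a simplification and not how a full PDF parser would tokenize.

```python
import re

# Matches "/Length <int>" optionally followed by "<gen> R" (an indirect reference).
_LENGTH = re.compile(rb"/Length\s+(\d+)(?:\s+(\d+)\s+R\b)?")


def parse_length(dictionary: bytes):
    """Classify the /Length entry of a stream dictionary.

    Returns ("direct", value), ("indirect", obj_num, gen), or None if absent.
    """
    match = _LENGTH.search(dictionary)
    if match is None:
        return None
    if match.group(2) is not None:
        # The value must be looked up elsewhere before the stream can be read.
        return ("indirect", int(match.group(1)), int(match.group(2)))
    # The value is available immediately, with no further lookup.
    return ("direct", int(match.group(1)))
```

The direct form can be used as soon as the dictionary is parsed. The indirect form costs at least one extra lookup, and in the PDF from this issue that lookup leads into an object stream.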

I believe only top objects should be loaded at the beginning and the streams should be extracted as you go.

That's exactly what #85 attempts to do.

A PDF writer should write the file in such a way that the indirect object should be already extracted when the reference to it is used

In a perfect world, that would be the case.
But PDF writers are free to store their objects wherever they please, as long as they obey the spec.

sorry if I made some confusion

No worries. PDF (and "flavors" thereof) are a sometimes confusing matter.


4 participants