Improve resilience against malformed or corrupt documents #152

packdat · 2024-08-16T21:00:49Z

This PR is intended to allow PDFsharp to read documents that would otherwise throw exceptions when trying to open them.

Notable changes:

Instead of throwing an "Unexpected character"-exception, the offending character is simply ignored.
For example, in one of my test-documents there is a string-value that looks like this: (text))
Note the double closing parenthesis, which can (and should) be just ignored.
When a dictionary with an incorrect stream-length is encountered, the library takes the (wrong) specified length, adds an arbitrary tolerance of 20 bytes and tries to find the end of the stream within that range.
The tolerance of 20 bytes is insufficient in most (if not all) cases.
The method was enhanced to search for the end of the stream until EOF.
In one of my test-documents there are UTF-16LE encoded strings, which triggered a NotImplementedException.
The code that throws was inside an #if true block but the #else block seems to work just fine, so i just switched the condition.
For really corrupt documents (e.g. xref-table cut off at the end, startxref-keyword pointing to something that is not an xref-table, etc.), an attempt is made to rebuild the trailer and the CrossReferenceTable by manually scanning the whole document.

I included a zip-file containing the documents (out of my >1000 test-files), that could not be opened with the original version 6.2.0-preview-1, but opened just fine in Chrome, Edge, Firefox and Acrobat Reader.
With the changes, PDFsharp was able to open them.

DefectFiles.zip

I used the following test-case (in PdfSharp.Tests.IO.ReaderTests):

[Theory]
[InlineData(@"path\to\zip-files\issue #70.PDF", 1)]
[InlineData(@"path\to\zip-files\issue #70 - Copy.PDF", 1)]
[InlineData(@"path\to\zip-files\PdfTrailer_not_null_001(PageCount).pdf", 14)]
[InlineData(@"path\to\zip-files\PdfTrailer_not_null_002(PageCount).pdf", 1)]
[InlineData(@"path\to\zip-files\PdfTrailer_not_null_003(PageCount).pdf", 100)]
[InlineData(@"path\to\zip-files\Unexpected_Token_0x0029(PageCount).pdf", 120)]
[InlineData(@"path\to\zip-files\Unexpected_Token_426(PageCount).pdf", 2)]
[InlineData(@"path\to\zip-files\Unexpected_Token_EmptyChar(PageCount).pdf", 3)]
[InlineData(@"path\to\zip-files\Unexpected_Token_endobj(PageCount).pdf", 6)]
[InlineData(@"path\to\zip-files\Unexpected_Token_SlashE(PageCount).pdf", 7)]
[InlineData(@"path\to\zip-files\Contains UTF16-LE.pdf", 74)]
public void TestSingleFile(string filePath, int expectedPageCount)
{
    File.Exists(filePath).Should().BeTrue("File should exist");
    VerifyPdfCanBeImported(filePath, expectedPageCount);
}

private static void VerifyPdfCanBeImported(string filePath, int expectedPageCount = 0)
{
    var act = () =>
    {
        var document = PdfReader.Open(filePath, PdfDocumentOpenMode.Import);
        var documentCopy = new PdfDocument();
        if (expectedPageCount > 0)
            document.Pages.Count.Should().Be(expectedPageCount);
        foreach (var page in document.Pages)
        {
            documentCopy.AddPage(page);
        }
        documentCopy.Save(Path.Combine(Path.GetTempPath(), "out.pdf"));
    };
    act.Should().NotThrow();
}

Fixes empira#148

Greybird · 2024-08-23T12:36:25Z

@packdat , hi!

Thanks for you work on this PR.

I was looking at an incremental pdf which had the following structure, and which happen to have an incorrect startxref value.

%PDF-1.4
%����
1 0 obj
<</CreationDate(D:20240410120445+00'00')/Creator(Chromium)/ModDate(D:20240410120445+00'00')/Producer(Skia/PDF m123)>>
endobj

trailer
<</Info 1 0 R /Root 11 0 R /Size 36/ID[<F1E95FF43FDCDA25EDF37978CFF88171><F1E95FF43FDCDA25EDF37978CFF88171>]>>
startxref
22142
%%EOF

1 0 obj
<</Length 3339/Subtype/XML/Type/Metadata>>stream
<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 5.2-c001 63.139439, 2010/09/27-13:37:26        ">
   <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
      <rdf:Description rdf:about=""
            xmlns:xmp="http://ns.adobe.com/xap/1.0/">
         <xmp:CreateDate>2018-06-11T09:13:35+02:00</xmp:CreateDate>
         <xmp:CreatorTool>RICOH MP C3003</xmp:CreatorTool>
         <xmp:ModifyDate>2018-06-11T09:12:05+02:00</xmp:ModifyDate>
         <xmp:MetadataDate>2018-06-11T09:12:05+02:00</xmp:MetadataDate>
      </rdf:Description>
      <rdf:Description rdf:about=""
            xmlns:pdf="http://ns.adobe.com/pdf/1.3/">
         <pdf:Producer>RICOH MP C3003</pdf:Producer>
      </rdf:Description>
      <rdf:Description rdf:about=""
            xmlns:dc="http://purl.org/dc/elements/1.1/">
         <dc:format>application/pdf</dc:format>
      </rdf:Description>
      <rdf:Description rdf:about=""
            xmlns:xmpMM="http://ns.adobe.com/xap/1.0/mm/">
         <xmpMM:DocumentID>uuid:78f05a34-0a1a-423e-a16d-ebedbc471291</xmpMM:DocumentID>
         <xmpMM:InstanceID>uuid:e959e1a8-d6bd-4b1b-b67d-bdd5f8532045</xmpMM:InstanceID>
      </rdf:Description>
   </rdf:RDF>
</x:xmpmeta>
<?xpacket end="w"?>
endstream
endobj

Acrobat shows the following information:

Which implies that the trailer is using the initial 1 0 reference for document information.
I confirmed that itext was doing this also.

But the code is currently overwriting the reference used in the trailing with the second occurence of the 1 0 reference, providing me with a dictionary with Length or SubType keys.
If I fix the startxref to point to the appropriate location, then PDFsharp is using the initial 1 0 reference for the document information.
I suppose there is a difference on how this is handled by the code that could benefit from being aligned ?

packdat · 2024-08-28T16:19:42Z

@Greybird
If i understand correctly, the PDF with the wrong startxref value triggered PdfTrailer.Rebuild, which finds object 1 0 more than once and uses the (incorrect) last one ?
This is currently by design; the Rebuild method always uses the last object found (scanning the document from start to end), as that is typically the correct one, especially in the case on incremental updates.

Can you provide the mentioned document or one that shows the same behavior ?
Maybe there are clues that allows the Rebuild to be performed a little bit smarter...

haexyh · 2024-09-24T13:54:13Z

Could I asking you kindly for an update, we would love to have this pr?

packdat added 4 commits August 16, 2024 22:06

Improve resilience against malformed or corrupt documents

fb03e32

Fix comment

6567620

Do not include eol-marker in stream-data

4f1ca61

Handle case where reported stream-length + position exceeds pdf-length.

48c4e6d

Fixes empira#148

packdat mentioned this pull request Aug 17, 2024

System.InvalidOperationException: 'Stream cannot be read. Please send us the PDF file so that we can fix this (issues (at) pdfsharp.net).' #148

Closed

ThomasHoevel added the investigation Under investigation label Dec 12, 2024

packdat mentioned this pull request Dec 12, 2024

6.2.0 Preview 2 cannot read certain files #217

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve resilience against malformed or corrupt documents #152

Improve resilience against malformed or corrupt documents #152

packdat commented Aug 16, 2024

Greybird commented Aug 23, 2024

packdat commented Aug 28, 2024

haexyh commented Sep 24, 2024 •

edited

Loading

Improve resilience against malformed or corrupt documents #152

Are you sure you want to change the base?

Improve resilience against malformed or corrupt documents #152

Conversation

packdat commented Aug 16, 2024

Greybird commented Aug 23, 2024

packdat commented Aug 28, 2024

haexyh commented Sep 24, 2024 • edited Loading

haexyh commented Sep 24, 2024 •

edited

Loading