PDF/A: Integrate PDF/A-3 into GHA checks using veraPDF Docker container #265

Closed
ronaldtse opened this issue Sep 1, 2024 · 14 comments
Labels
enhancement New feature or request

Comments

@ronaldtse
Contributor

ronaldtse commented Sep 1, 2024

From @petervwyatt at:

Will be used by:

I ran ISO 19005-4 through the veraPDF Docker container and it is very very close! I obviously ignored errors about missing metadata. Since MN generates tagging you should select "PDF/A-3u" or, best, "PDF/A-3a" since both of these are better than "PDF/A-3b" (B= basic). Looks to me like a few simple tweaks and you'd be good...

We want to create a GHA workflow that uses the veraPDF container.

Quoted:

To validate your local files you need to add folder with files to the docker container. To run the veraPDF rest image with your local files run docker image with bind mount -v /local/path/of/the/folder:/home/folder. For example, to run the veraPDF rest image from DockerHub with your local files:

$ docker run -d -p 8080:8080 -p 8081:8081 -v /local/path/of/the/folder:/home/folder verapdf/rest:latest

To obtain XML:

curl -F "file=@veraPDF-corpus/PDF_A-1b/6.1 File structure/6.1.12 Implementation limits/veraPDF test suite 6-1-12-t01-fail-a.pdf" localhost:8080/api/validate/1b -H "Accept:application/xml"
@FullStackIndie

Hi, I saw the Upwork job post for this task and am posting here just to get some clarification.

I have read through the docs a bit for the veraPDF Docker image. The profiles available (with the latest Docker image) that seem to match your interests are PDF/A-3U, PDF/A-3A, and PDF/A-3B. I figure PDF/A-3A most closely matches what you want, but I'm just making sure.

I am a little confused about the process, though. Am I uploading (via a Docker volume mount) any random PDF (that I own or find on the internet) straight into veraPDF and expecting to get the XML as the test result? Also, do you want the XML just logged to the console, or exported as a GitHub artifact that can be viewed separately from the GitHub Actions run? I can also send it to a REST endpoint, an S3 bucket, etc.

@alex-sc

alex-sc commented Sep 1, 2024

Hello

I'm assuming you want to validate some PDFs produced by mn2pdfTests.java.

I've added a skeleton here (see the very bottom of the file) https://github.com/alex-sc/mn2pdf/blob/main/.github/workflows/test.yml

      # Generate test PDFs
      - run: mvn test

      - run: |
          docker run -d -p 8080:8080 -v ./target:/home/folder verapdf/rest:latest
          sleep 5
          curl -F "url=file:///home/folder/G.191.pdf" localhost:8080/api/validate/url/A-3A
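
A fixed `sleep 5` can be too short on a slow runner. As a rough sketch (not part of the linked workflow), the wait could instead poll the container until it answers HTTP. The host, port and path below are assumptions taken from the port mapping above, and `wait_until_ready` is a hypothetical helper name.

```python
import http.client
import time


def wait_until_ready(host: str, port: int, path: str = "/",
                     timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Poll host:port until it returns any HTTP response, or give up after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        conn = http.client.HTTPConnection(host, port, timeout=interval)
        try:
            conn.request("GET", path)
            conn.getresponse()  # any status code means the server is answering
            return True
        except (OSError, http.client.HTTPException):
            # Connection refused, reset, or timed out: not ready yet, retry.
            time.sleep(interval)
        finally:
            conn.close()
    return False
```

In a workflow step this could be called as `wait_until_ready("localhost", 8080)` before the first `curl`, failing the step if it returns `False`.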

Looks like there's another solution - integrate the PDF validation right in the Java test by importing the VeraPDF library into the project and doing the validation there, but I didn't check this approach further

@Intelligent2013
Contributor

Test PDF sample: test_attachments.tc1.pdf

curl -F "file=@test_attachments.tc1.pdf" localhost:8080/api/validate/3a -H "Accept:application/xml" > res.xml

Report:

<?xml version='1.0' encoding='utf-8'?>
<report>
  <buildInformation>
    <releaseDetails id="core" version="1.26.1" buildDate="2024-05-16T16:30:00Z"/>
    <releaseDetails id="verapdf-rest" version="1.26.1" buildDate="2024-05-24T15:12:00Z"/>
    <releaseDetails id="validation-model" version="1.26.1" buildDate="2024-05-16T18:12:00Z"/>
  </buildInformation>
  <jobs>
    <job>
      <item size="73771">
        <name>test_attachments.tc1.pdf</name>
      </item>
      <validationReport jobEndStatus="normal" profileName="PDF/A-3A validation profile" statement="PDF file is not compliant with Validation Profile requirements." isCompliant="false">
        <details passedRules="147" failedRules="7" passedChecks="20339" failedChecks="103">
          <rule specification="ISO 19005-3:2012" clause="6.6.2.3.1" testNumber="1" status="failed" failedChecks="1">
            <description>All properties specified in XMP form shall use either the predefined schemas defined in the XMP Specification, ISO 19005-1 or this part of ISO 19005, or any extension schemas that comply with 6.6.2.3.2</description>
            <object>XMPProperty</object>
            <test>isPredefinedInXMP2005 == true || isDefinedInMainPackage == true || isDefinedInCurrentPackage == true</test>
            <check status="failed">
              <context>root/document[0]/metadata[0](8 0 obj PDMetadata)/XMPPackage[0]/Properties[9](http://www.aiim.org/pdfua/ns/id/ - pdfuaid:part)</context>
            </check>
          </rule>
          <rule specification="ISO 19005-3:2012" clause="6.2.4.3" testNumber="4" status="failed" failedChecks="86">
            <description>DeviceGray shall only be used if a device independent DefaultGray colour space has been set when the DeviceGray colour space is used, or if a PDF/A OutputIntent is present</description>
            <object>PDDeviceGray</object>
            <test>gOutputCS != null</test>
            <check status="failed">
              <context>root/document[0]/pages[0](26 0 obj PDPage)/contentStream[0](24 0 obj PDContentStream)/operators[22]/colorSpace[0]</context>
            </check>
          </rule>
          <rule specification="ISO 19005-3:2012" clause="6.5.1" testNumber="1" status="failed" failedChecks="1">
            <description>The Launch, Sound, Movie, ResetForm, ImportData, Hide, SetOCGState, Rendition, Trans, GoTo3DView and JavaScript actions shall not be permitted. Additionally, the deprecated set-state and no-op actions shall not be permitted</description>
            <object>PDAction</object>
            <test>S == "GoTo" || S == "GoToR" || S == "GoToE" || S == "Thread" || S == "URI" || S == "Named" || S == "SubmitForm"</test>
            <check status="failed">
              <context>root/document[0]/pages[4](63 0 obj PDPage)/annots[0](65 0 obj PDLinkAnnot)/A[0](64 0 obj PDAction)</context>
            </check>
          </rule>
          <rule specification="ISO 19005-3:2012" clause="6.2.11.4.2" testNumber="2" status="failed" failedChecks="4">
            <description>If the FontDescriptor dictionary of an embedded CID font contains a CIDSet stream, then it shall identify all CIDs which are present in the font program, regardless of whether a CID in the font is referenced or used by the PDF or not</description>
            <object>PDCIDFont</object>
            <test>fontFile_size == 0 || fontName.search(/[A-Z]{6}\+/) != 0 || containsCIDSet == false || cidSetListsAllGlyphs == true</test>
            <check status="failed">
              <context>root/document[0]/pages[0](26 0 obj PDPage)/contentStream[0](24 0 obj PDContentStream)/operators[254]/font[0](EAAAAB+Inter-Bold)/DescendantFonts[0](EAAAAB+Inter-Bold)</context>
            </check>
          </rule>
          <rule specification="ISO 19005-3:2012" clause="6.6.2.3.1" testNumber="2" status="failed" failedChecks="6">
            <description>All properties specified in XMP form shall use either the predefined schemas defined in the XMP Specification, ISO 19005-1 or this part of ISO 19005, or any extension schemas that comply with 6.6.2.3.2</description>
            <object>XMPProperty</object>
            <test>isValueTypeCorrect == true</test>
            <check status="failed">
              <context>root/document[0]/metadata[0](8 0 obj PDMetadata)/XMPPackage[0]/Properties[4](http://purl.org/dc/elements/1.1/ - dc:title)</context>
            </check>
          </rule>
          <rule specification="ISO 19005-3:2012" clause="6.3.2" testNumber="1" status="failed" failedChecks="4">
            <description>Except for annotation dictionaries whose Subtype value is Popup, all annotation dictionaries shall contain the F key</description>
            <object>PDAnnot</object>
            <test>Subtype == "Popup" || F != null</test>
            <check status="failed">
              <context>root/document[0]/pages[1](48 0 obj PDPage)/annots[0](33 0 obj PDLinkAnnot)</context>
            </check>
          </rule>
          <rule specification="ISO 19005-3:2012" clause="6.6.4" testNumber="1" status="failed" failedChecks="1">
            <description>The PDF/A version and conformance level of a file shall be specified using the PDF/A Identification extension schema</description>
            <object>MainXMPPackage</object>
            <test>Identification_size == 1</test>
            <check status="failed">
              <context>root/document[0]/metadata[0](8 0 obj PDMetadata)/XMPPackage[0]</context>
            </check>
          </rule>
        </details>
      </validationReport>
      <duration start="1725217823013" finish="1725217823924">00:00:00.911</duration>
    </job>
  </jobs>
  <batchSummary totalJobs="1" failedToParse="0" encrypted="0" outOfMemory="0" veraExceptions="0">
    <validationReports compliant="0" nonCompliant="1" failedJobs="0">1</validationReports>
    <featureReports failedJobs="0">0</featureReports>
    <repairReports failedJobs="0">0</repairReports>
    <duration start="1725217822998" finish="1725217823946">00:00:00.948</duration>
  </batchSummary>
</report>
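
For anyone scripting against this report, here is a minimal sketch of pulling the failed rules and the overall verdict out of the XML. It assumes only the element and attribute names visible in the report above (`rule`, `status`, `clause`, `testNumber`, `failedChecks`, `description`, `validationReport`, `isCompliant`); `FailedRule`, `failed_rules` and `is_compliant` are hypothetical names, not part of veraPDF.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class FailedRule:
    clause: str
    test_number: str
    failed_checks: int
    description: str


def failed_rules(report_xml: str) -> list[FailedRule]:
    """Return one entry per <rule status="failed"> element in the report."""
    root = ET.fromstring(report_xml)
    rules = []
    for rule in root.iter("rule"):
        if rule.get("status") != "failed":
            continue
        rules.append(FailedRule(
            clause=rule.get("clause", ""),
            test_number=rule.get("testNumber", ""),
            failed_checks=int(rule.get("failedChecks", "0")),
            description=(rule.findtext("description") or "").strip(),
        ))
    return rules


def is_compliant(report_xml: str) -> bool:
    """True only if the report has at least one validationReport and all are compliant."""
    root = ET.fromstring(report_xml)
    reports = list(root.iter("validationReport"))
    return bool(reports) and all(r.get("isCompliant") == "true" for r in reports)
```

Treating a report with no `validationReport` element as non-compliant is a deliberate choice here, so that a failed parse or an empty job does not pass silently.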

@petervwyatt

petervwyatt commented Sep 2, 2024

VeraPDF does have a full CLI interface too - see the CLI doco if that is easier. You'll need Java.

Note that veraPDF messages can also be a bit technical... and this may be some of what FOP corrects/adds when you enable PDF/A mode (e.g. hopefully it will see the use of DeviceGray and then add an Output Intent profile for you). And the metadata issues should obviously get corrected too.

@petervwyatt

petervwyatt commented Sep 2, 2024

To answer @FullStackIndie's questions:

I have read through the docs a bit for the veraPDF Docker image. The profiles available (with the latest Docker image) that seem to match your interests are PDF/A-3U, PDF/A-3A, and PDF/A-3B. I figure PDF/A-3A most closely matches what you want, but I'm just making sure.

Those with disabilities or needing to use assistive technologies (screen readers, screen magnifiers, etc) require that the PDFs generated are Tagged PDF. This means that also making it PDF/A will exceed PDF/A-3B (B = "basic") so don't even bother with that setting. The choice is then PDF/A-3u ("Unicode") or PDF/A-3a ("accessible"). PDF/A-3a is by far the better choice since it preserves the document’s logical structure and content text stream in reading order which is also what PDF/UA and general accessibility require. So please strive for PDF/A-3a.

I am a little confused about the process, though. Am I uploading (via a Docker volume mount) any random PDF (that I own or find on the internet) straight into veraPDF and expecting to get the XML as the test result? Also, do you want the XML just logged to the console, or exported as a GitHub artifact that can be viewed separately from the GitHub Actions run? I can also send it to a REST endpoint, an S3 bucket, etc.

Yes, veraPDF can check any random PDF but will subsequently generate error messages about missing metadata, since all PDF subsets define their conformance via their metadata. veraPDF's default behaviour ("Auto") is to check the metadata and then check whatever conformance level it finds there (see also this veraPDF issue to support multiple conformance levels). In the case of a random PDF, there will be no conformance-level info in the XMP metadata so you'll need to manually set which PDF-flavour you want and expect errors about missing metadata - but any other failures reported are valid.

As mentioned above, veraPDF also has a comprehensive CLI if that is easier than a Docker container. It needs Java.

@Intelligent2013
Contributor

@ronaldtse do we need to use the veraPDF Docker container, or would it be better to integrate veraPDF into mn2pdf?

I've tried integrating veraPDF directly into the mn2pdf application (not released yet).
I'm studying how to convert the check result from:

ValidationResult [flavour=3a, totalAssertions=20438, assertions=[
  TestAssertion [ruleId=RuleId [specification=ISO 19005-3:2012, clause=6.5.1, testNumber=1],
    status=failed,
    message=The Launch, Sound, Movie, ResetForm, ImportData, Hide, SetOCGState, Rendition, Trans, GoTo3DView and JavaScript actions shall not be permitted. Additionally, the deprecated set-state and no-op actions shall not be permitted,
    location=Location [level=CosDocument, context=root/document[0]/pages[4](64 0 obj PDPage)/annots[0](66 0 obj PDLinkAnnot)/A[0](65 0 obj PDAction)],
    locationContext=null, errorMessage=null],
  TestAssertion [ruleId=RuleId [specification=ISO 19005-3:2012, clause=6.6.2.3.1, testNumber=2],
    status=failed,
    message=All properties specified in XMP form shall use either the predefined schemas defined in the XMP Specification, ISO 19005-1 or this part of ISO 19005, or any extension schemas that comply with 6.6.2.3.2,
    location=Location [level=CosDocument, context=root/document[0]/metadata[0](9 0 obj PDMetadata)/XMPPackage[0]/Properties[4](http://purl.org/dc/elements/1.1/ - dc:title)],
    locationContext=null, errorMessage=null],
  TestAssertion [ruleId=RuleId [specification=ISO 19005-3:2012, clause=6.6.2.3.1, testNumber=2],
    status=failed,
    message=All properties specified in XMP form shall use either the predefined schemas defined in the XMP Specification, ISO 19005-1 or this part of ISO 19005, or any extension schemas that comply with 6.6.2.3.2,
    location=Location [level=CosDocument, context=root/document[0]/metadata[0](9 0 obj PDMetadata)/XMPPackage[0]/Properties[6](http://purl.org/dc/elements/1.1/ - dc:description)],
    locationContext=null, errorMessage=null],
  TestAssertion [ruleId=RuleId [specification=ISO 19005-3:2012, clause=6.6.2.3.1, testNumber=2],
    status=failed,
    message=All properties specified in XMP form shall use either the predefined schemas defined in the XMP Specification, ISO 19005-1 or this part of ISO 19005, or any extension schemas that comply with 6.6.2.3.2,
    location=Location [level=CosDocument, context=root/document[0]/metadata[0](9 0 obj PDMetadata)/XMPPackage[0]/Properties[7](http://purl.org/dc/elements/1.1/ - dc:creator)],
    locationContext=null, errorMessage=null],
  TestAssertion [ruleId=RuleId [specification=ISO 19005-3:2012, clause=6.6.2.3.1, testNumber=2],
    status=failed,
    message=All properties specified in XMP form shall use either the predefined schemas defined in the XMP Specification, ISO 19005-1 or this part of ISO 19005, or any extension schemas that comply with 6.6.2.3.2,
    location=Location [level=CosDocument, context=root/document[0]/metadata[0](9 0 obj PDMetadata)/XMPPackage[0]/Properties[11](http://www.aiim.org/pdfua/ns/id/ - pdfuaid:part)],
    locationContext=null, errorMessage=null],
  TestAssertion [ruleId=RuleId [specification=ISO 19005-3:2012, clause=6.6.2.3.1, testNumber=1],
    status=failed,
    message=All properties specified in XMP form shall use either the predefined schemas defined in the XMP Specification, ISO 19005-1 or this part of ISO 19005, or any extension schemas that comply with 6.6.2.3.2,
    location=Location [level=CosDocument, context=root/document[0]/metadata[0](9 0 obj PDMetadata)/XMPPackage[0]/Properties[11](http://www.aiim.org/pdfua/ns/id/ - pdfuaid:part)],
    locationContext=null, errorMessage=null]],
isCompliant=false]

into a more convenient format like HTML.
The veraPDF GUI application (https://docs.verapdf.org/gui/) has an HTML output feature (implemented as an XML-to-HTML transformation with XSLT); I'll investigate how to integrate it into mn2pdf.
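
To illustrate the kind of output meant here, a rough sketch that turns failed rules into a plain HTML table. This is not the XSLT that ships with veraPDF and it does not use the Java `ValidationResult` object; it assumes the XML report format shown earlier in this thread, and `report_to_html` is a hypothetical name.

```python
import html
import xml.etree.ElementTree as ET

# Column headers for the generated table (chosen for this sketch only).
HEADER = ("Specification", "Clause", "Test", "Failed checks", "Description")


def report_to_html(report_xml: str) -> str:
    """Render every failed <rule> of a veraPDF XML report as one HTML table row."""
    root = ET.fromstring(report_xml)
    rows = ["<tr>" + "".join(f"<th>{h}</th>" for h in HEADER) + "</tr>"]
    for rule in root.iter("rule"):
        if rule.get("status") != "failed":
            continue
        cells = (
            rule.get("specification", ""),
            rule.get("clause", ""),
            rule.get("testNumber", ""),
            rule.get("failedChecks", ""),
            (rule.findtext("description") or "").strip(),
        )
        # Escape every value so rule text containing <, > or & cannot break the markup.
        rows.append("<tr>" + "".join(f"<td>{html.escape(c)}</td>" for c in cells) + "</tr>")
    return "<table>\n" + "\n".join(rows) + "\n</table>"
```

The official XSLT would give a richer report; the point of the sketch is only that the XML already carries everything a readable summary needs.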

@ronaldtse
Contributor Author

@Intelligent2013 isn't it easiest to keep it just as a verification step using the Docker container? It doesn't need to be part of mn2pdf? Or do you prefer integrating it into the mn2pdf local test flow?

@petervwyatt

@Intelligent2013 - if you use veraPDF CLI then you can explicitly set the output format you want to be text, json, raw (i.e. xml) or html. The Docker container is relatively new for veraPDF so I will ask if there is a way to set the CLI via Docker...

@Intelligent2013
Contributor

@Intelligent2013 isn't it easiest to keep it just as a verification step using the Docker container?

@ronaldtse questions:

  1. What result do you expect from the verification step? Just 'success' or 'failed'? If it fails, do we need to output a user-friendly report (HTML), or is JSON or XML enough? (The veraPDF Docker container is based on veraPDF-rest and currently supports JSON and XML output only.) If we need an HTML report, then an additional step should be added to the GH Actions workflow: apply XSLT to the XML to produce HTML (veraPDF-library contains such an XSLT).
  2. If the verification step fails, should we stop any further actions, or just put the report next to the PDF and continue with the further actions?

@ronaldtse
Contributor Author

@Intelligent2013 I think we just want to ensure that the PDF outputs we have comply with PDF/A-3a. You will be the main person looking at it.

I believe a separate GHA workflow that shows individual validation failures in the GHA output would work well.

Using the docker container is preferred because we don't need a local workflow. The output can also contain HTML if it helps you.

If the verification step fails for mn2pdf, the build should be marked "failed". We should generate a set of Metanorma-sourced PDF files to test mn2pdf with. Thoughts welcome!
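
A possible shape for "show individual failures in the GHA output and fail the build", as a sketch only. It assumes the XML report format shown earlier in this thread and uses the GitHub Actions `::error` workflow command; `annotations` and `exit_code` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET


def _escape_data(value: str) -> str:
    # Workflow-command message escaping: percent sign and line breaks.
    return value.replace("%", "%25").replace("\r", "%0D").replace("\n", "%0A")


def _escape_property(value: str) -> str:
    # Property values (such as title) additionally escape ':' and ','.
    return _escape_data(value).replace(":", "%3A").replace(",", "%2C")


def annotations(report_xml: str, pdf_name: str) -> list[str]:
    """Build one `::error` workflow command per failed rule in the report."""
    root = ET.fromstring(report_xml)
    lines = []
    for rule in root.iter("rule"):
        if rule.get("status") != "failed":
            continue
        title = f"{pdf_name} clause {rule.get('clause', '')} test {rule.get('testNumber', '')}"
        # Collapse whitespace so each annotation stays on one line.
        message = " ".join((rule.findtext("description") or "").split())
        lines.append(f"::error title={_escape_property(title)}::{_escape_data(message)}")
    return lines


def exit_code(all_annotations: list[str]) -> int:
    """Non-zero when any failure was reported, so the job is marked failed."""
    return 1 if all_annotations else 0
```

Printing each returned line to stdout makes it appear as an error annotation on the run; exiting with `exit_code(...)` after all PDFs are checked marks the build as failed without hiding the later reports.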

Intelligent2013 added a commit to metanorma/mn-native-pdf that referenced this issue Sep 3, 2024
@Intelligent2013
Contributor

Workflow for PDF checking by veraPDF added in metanorma/mn-native-pdf#743.

How verapdfcheck.yml works:

  • wait until the ubuntu workflow ends (by fountainhead/[email protected])
  • download generated artifacts (by dawidd6/action-download-artifact@v6)
  • install and run the veraPDF Docker container
  • check PDFs and generate reports
  • output reports content with errors
  • output reports filenames
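
The "check PDFs and generate reports" and "output reports" steps can be pictured roughly as the loop below. This is an illustrative sketch, not the contents of verapdfcheck.yml: `check_pdfs` is a hypothetical name, and `validate` stands in for whatever sends a PDF to the veraPDF container and returns the XML report text.

```python
import xml.etree.ElementTree as ET
from pathlib import Path
from typing import Callable


def check_pdfs(pdf_dir: Path, report_dir: Path,
               validate: Callable[[Path], str]) -> list[Path]:
    """Validate each PDF, write one XML report per PDF, return the failing reports."""
    report_dir.mkdir(parents=True, exist_ok=True)
    failing = []
    for pdf in sorted(pdf_dir.rglob("*.pdf")):
        report_xml = validate(pdf)  # hypothetical callback, e.g. an HTTP POST to the container
        # Note: reports are named after the PDF stem, so same-named PDFs in
        # different subfolders would overwrite each other in this sketch.
        report_path = report_dir / f"{pdf.stem}.xml"
        report_path.write_text(report_xml, encoding="utf-8")
        reports = list(ET.fromstring(report_xml).iter("validationReport"))
        compliant = bool(reports) and all(r.get("isCompliant") == "true" for r in reports)
        if not compliant:
            failing.append(report_path)
    return failing
```

The returned paths are the "reports with errors" whose content and filenames the last two steps would print.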

Example output: https://github.com/metanorma/mn-native-pdf/actions/runs/10689964432/job/29633282268?pr=743

@petervwyatt

If the verification step fails, should we stop any further actions, or just put the report next to the PDF and continue with the further actions?

I wouldn't treat veraPDF failures as a complete failure and stop, so I suggest saving the report(s) and continuing. I'm also unsure whether FOP will fail to produce a PDF, for example if it detects an issue when attempting to create PDF/A or PDF/UA. I'm also not sure how much Apache FOP will do automatically vs. needing the author to correct their AsciiDoc content. For Tagged PDF and PDF/UA, it is highly likely the author will need to do something anyway (e.g. fix alt-text, ensure tables are regular, change colors to have better contrast, etc.), but for PDF/A they still might need to do something...

@petervwyatt

I've passed several questions on to the veraPDF team and invited them to contribute to this discussion. @bdoubrov

Intelligent2013 added a commit to metanorma/mn-native-pdf that referenced this issue Sep 12, 2024
@Intelligent2013
Contributor

Intelligent2013 commented Sep 12, 2024

PDF/A-3 checking using the veraPDF Docker container is now integrated into the mn-native-pdf repository.

Projects
Status: Done