PDF Tags (accessibility) #1234

Open
christophersavory opened this issue Mar 16, 2023 · 18 comments

Comments

@christophersavory

I found this discussion regarding iText tags and accessibility in a forum from 2012.
https://forums.opentext.com/forums/developer/discussion/52337/how-do-you-create-an-accessible-pdf-report-with-birt

I couldn't find anything regarding PDF accessibility in the BIRT Docs.
https://eclipse.github.io/birt-website/docs/t_brief-editor-tour

Has PDF Emitter accessibility (tags) been implemented into BIRT? If not, is it on the roadmap?

@hvbtup
Contributor

hvbtup commented Mar 17, 2023

I asked the same question a few years ago.
It is not on the roadmap.
I think a great many people worldwide would benefit from PDF/UA support.

PDFs generated for authorities MUST support this in the EU.

But nobody is willing to pay for it.
And it will be a lot of work, so it cannot be done by an enthusiastic hobby programmer.

Where are the big companies, where are the states, where is the EU?
IMHO this is a topic where it is necessary to actually pay for open source development.

From a technical point of view:

BIRT uses OpenPDF for PDF creation.
AFAIK OpenPDF does not support creating Tagged PDF which is a precondition for PDF/UA.
See LibrePDF/OpenPDF#181

The BIRT community can only start developing PDF/UA support for BIRT after OpenPDF added support for it.

At least one tiny bit of preparation is included in BIRT:
There is a PDF tag type property.
[image]
This was certainly meant to assist creating tagged PDFs.
But AFAIK this property isn't actually used.

I don't know: Did the commercial BIRT product support creating tagged PDFs?

@hvbtup
Contributor

hvbtup commented Mar 23, 2023

It seems like JasperReports supports creating PDF/UA.
Internally, JasperReports uses OpenPDF just like BIRT (a patched version at the moment, see LibrePDF/OpenPDF#765)

So, in contrast to what I said in my previous comment:

It would certainly be possible to create PDF/UA with BIRT.

It's just that the OpenPDF community itself is not focused on this and doesn't provide examples.
But if JasperReports can do it, why shouldn't we?

Still, this is certainly a lot of work.

@christophersavory
Author

It looks like OpenPDF might already support PDF/A-1a and PDF/A-1b

See lines 1738 and 1740 of https://github.com/LibrePDF/OpenPDF/blob/3b38ad8588669d24fd1f772ec10bb516e996e3c1/openpdf/src/main/java/com/lowagie/text/pdf/PdfWriter.java

Do we just need to create a new EMITTER_ID that will set the correct PDFXConformance on the PDFWriter?

@MayurDeore

Any updates on implementing tagged PDF functionality? Has anyone made progress on this?

@hvbargen
Contributor

hvbargen commented Oct 4, 2023

I'm on vacation this week and spent some time investigating yesterday. I was able to create a valid PDF/UA 1 document using the rather low-level functions of OpenPDF (validation done with the PAC 2021 validator).
The idea is that you create a structure tree containing the logical structure of the content (similar to basic HTML) and link every piece of content on the pages to one of these structure elements.
Basically, the structure elements correspond to the item instances.
So I really think it is possible to add this to BIRT.
But there is a lot to think about.
E.g. do we need to distinguish between locale and language? And while a BIRT item typically corresponds to a specific tag in the structure tree, this is not always the case, so this must be configurable.
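
For readers new to tagged PDF, the structure-tree idea described above can be sketched as a small data model. This is a conceptual illustration only: `StructElem`, `linkContent`, and `dump` are hypothetical names and are neither the OpenPDF nor the BIRT API. It assumes that each piece of page content is identified by a page number plus a marked-content ID (MCID) that is unique within that page.

```java
import java.util.ArrayList;
import java.util.List;

/** Conceptual model of a tagged-PDF structure tree; not the OpenPDF or BIRT API. */
final class StructureTreeSketch {

    /** A node of the logical structure, e.g. Document, H1, P, Table. */
    static final class StructElem {
        final String tag;
        final List<StructElem> children = new ArrayList<>();
        /** References to marked content, each stored as {page number, MCID}. */
        final List<int[]> contentRefs = new ArrayList<>();

        StructElem(String tag) {
            this.tag = tag;
        }

        /** Creates a child element with the given tag and returns it. */
        StructElem addChild(String childTag) {
            StructElem child = new StructElem(childTag);
            children.add(child);
            return child;
        }

        /** Links one piece of page content to this structure element. */
        void linkContent(int page, int mcid) {
            contentRefs.add(new int[] { page, mcid });
        }
    }

    /** Renders the tree as indented text, one element per line. */
    static String dump(StructElem elem, int depth) {
        StringBuilder sb = new StringBuilder();
        sb.append("  ".repeat(depth)).append(elem.tag);
        for (int[] ref : elem.contentRefs) {
            sb.append(" [page ").append(ref[0]).append(", mcid ").append(ref[1]).append("]");
        }
        sb.append("\n");
        for (StructElem child : elem.children) {
            sb.append(dump(child, depth + 1));
        }
        return sb.toString();
    }

    /** Builds a tiny document: a heading and a paragraph that spans two pages. */
    static StructElem buildExample() {
        StructElem doc = new StructElem("Document");
        doc.addChild("H1").linkContent(1, 0);
        StructElem paragraph = doc.addChild("P");
        paragraph.linkContent(1, 1);
        paragraph.linkContent(2, 0); // same paragraph, continued on page 2
        return doc;
    }

    public static void main(String[] args) {
        System.out.print(dump(buildExample(), 0));
    }
}
```

The paragraph in the example shows why a one-to-one mapping between report items and structure elements is not enough on its own: a single logical element may need to reference content on more than one page.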

@hvbargen
Contributor

hvbargen commented Oct 4, 2023

Reminder to myself: ATM the example is saved to a private GH repo.

@luzhanov
Contributor

luzhanov commented Oct 4, 2023

@hvbargen may I ask you to share your experimental code, please, if possible? I'm making similar attempts to integrate PDF/A tags into a custom PDF emitter, and this would be really helpful.

@hvbargen
Contributor

hvbargen commented Oct 4, 2023

OK, here it is:

https://github.com/hvbargen/openpdf-ua

As I said, I'm convinced that it is possible to add PDF/UA (and PDF/A) support to BIRT.
But this cannot work as a one-man-show. We need people who can help specifying, programming, testing.
Anyway, this can only start after 4.14.

@luzhanov
Contributor

luzhanov commented Oct 5, 2023

Thank you @hvbtup, this is really helpful! I can see that my approach with the BIRT PDF emitter is similar to yours: I first add the tag structure in a similar way, but the trickiest part is tagging the content itself (every text entry, image, etc.).

Currently we're using BIRT 4.9.0, as it is the latest POM-based version we can integrate into our application. This issue is blocking us from moving to newer versions: #625

@wimjongman
Contributor

Is this solved with the latest PDF patches?

@wimjongman
Contributor

image

@hvbtup
Contributor

hvbtup commented Sep 16, 2024

I started working on a proof of concept during my vacation with my hobby account.

I copied this to our company's fork today, so it is available at https://github.com/triestram-partner/birt/tree/tagged-pdf, at a very, very early stage.

I am able to create a PDF/UA file consisting of labels only (without any background colors or lines etc) which PAC 2024 accepts.

Currently I'm trying to add support for images, but PAC 2024 still reports one error for the resulting file.

The POC is demonstrating that it is in fact possible to generate PDF/UA with a modified version of BIRT and OpenPDF as the backend. But there are lots of open issues.
For example, I expect major challenges when it comes to page breaks and to HTML dynamic text items.
It might turn out that the simple concept of the implementation is not sufficient and we must start from scratch again, but I try to be optimistic.
That said, there are also a lot of tasks which should be relatively easy, but all of this certainly amounts to several weeks or months of work before it can even be beta-tested.

This is definitely not something I (or anyone else) can develop all alone as a hobby project, so I'm asking other developers for help.

For this, I think it would make sense to create a branch tagged-pdf in the official repository. @wimjongman: Any objections?

@wimjongman
Contributor

No objections from me, I'm fine with a shared branch.

@hvbtup
Contributor

hvbtup commented Sep 16, 2024

OK, folks, help is welcome.

The branch is https://github.com/eclipse-birt/birt/tree/tagged-pdf

To test the branch, create a report with the following properties:

Report properties:

  • Set a locale like de-DE or en-US in the advanced properties

  • Specify a title.

  • In the PDF emitter options, select

    • PDF version: 1.7
    • PDF conformance: PDF/A-1A
    • PDF/UA conformance (or version?): 1 (let's start with PDF/UA-1 support, PDF/UA-2 support can be added later)

I'm not sure; there may be more settings necessary (regarding fonts). I'll upload an example report later when I'm at home.

For the layout elements, be sure to add appropriate PDF tags like P or H1.

Choose run report ... as PDF from the menu in the all-in-one-designer, download the PDF from your browser.
Use PAC 2024 or another tool to validate the PDF.

@merks
Contributor

merks commented Sep 16, 2024

This is copied from the commit...


I would very much prefer that you push changes to your own fork and create pull requests to kick off builds rather than push branches into the main repository.

The workflow for how to do that is described here:

https://github.com/orgs/eclipse-simrel/discussions/3

I.e., I don't expect to see a bunch of "personal" branches being built:

image

but rather PR builds:

image

Via the PRs there is a place to review and discuss changes.

You can easily amend your commits and force push to your fork to kick off new builds. That's also described in the discussion linked above.

Your build is failing because it isn't based on master and hence has an older target platform referring to a p2 repository that no longer exists.

Before starting a PR effort, always pull master and create your branch off master...

@hvbtup
Contributor

hvbtup commented Sep 17, 2024

This is an example report which only produces one error when checked with PAC 2024:
text_and_image_example.zip

@hvbtup
Contributor

hvbtup commented Dec 5, 2024

I just want to inform you that I made some good progress in the past few days.

See the branch pdf-tag-page-break in the fork for my hobby account.

I can now create tables and lists that are longer than one page including a valid tag structure.
This even works for tables with captions (I never noticed this property in the advanced settings, and it seems to be quite forgotten generally, for example there is no predefined style for it).

However, there is still very much to test and to code:

  • I did not test splits of table rows (that includes splits of cells).
  • I did not test splits of plain text or HTML text.
  • I did not test grids and I don't know how "forms" should be handled regarding the structure. Many of our reports use a grid to show properties, e.g. for an order, with property name and property value, e.g. "Order no: 24-12345".
  • I am not interested in charts.
  • I am unsure how to handle HTML dynamic text (in particular, headings).
  • I don't have a screen reader, only PAC 2024.
  • While the structure is working for multi-page tables, I am not quite satisfied with the way I have coded this. I needed half a dozen new attributes for ContainerArea instances, which of course adds some bytes to the memory needed per instance and there are lots of such instances. This information is only needed when PDF/UA is created, so I think it would be better to move this into a separate object that is only allocated when actually needed.
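
The refactoring suggested in the last bullet (moving the tagging-only attributes into a separate object that is allocated on demand) could look roughly like the following. This is a minimal sketch under stated assumptions: `Area` is a simplified stand-in, not BIRT's actual `ContainerArea`, and `TagInfo` with its fields is hypothetical and does not reflect the attributes used in the branch.

```java
/** Sketch of lazily allocated tagging state; all names here are hypothetical. */
final class LazyTagInfoSketch {

    /** Holds state that is only needed when a tagged PDF (PDF/UA) is produced. */
    static final class TagInfo {
        String tagType;                     // e.g. "Table", "TR", "P"
        boolean continuedFromPreviousPage;  // true for the later parts of a split element
        Object structureElement;            // placeholder for the PDF structure element
    }

    /** Simplified stand-in for a layout area; not BIRT's ContainerArea. */
    static class Area {
        /** One reference per instance; stays null unless tagging is enabled. */
        private TagInfo tagInfo;

        /** Returns the tagging state, allocating it on first use. */
        TagInfo tagInfo() {
            if (tagInfo == null) {
                tagInfo = new TagInfo();
            }
            return tagInfo;
        }

        /** Lets callers check for tagging state without allocating it. */
        boolean hasTagInfo() {
            return tagInfo != null;
        }
    }

    /** Creates an area and attaches tagging state only when tagging is enabled. */
    static Area layout(String tagType, boolean taggingEnabled) {
        Area area = new Area();
        if (taggingEnabled) {
            area.tagInfo().tagType = tagType;
        }
        return area;
    }

    public static void main(String[] args) {
        Area plain = layout("P", false);
        Area tagged = layout("P", true);
        System.out.println("plain has tag info: " + plain.hasTagInfo());
        System.out.println("tagged has tag info: " + tagged.hasTagInfo());
    }
}
```

With this shape, an untagged render pays for a single null reference per area instead of several unused fields, at the cost of one extra indirection when tagging is switched on.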

@wimjongman
Contributor

Thanks for the update, Henning. It sounds like you are making great progress. Wonderful!
