Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add option for encode-marcXml and encode-xml for escaping unicode characters #529

Open
TobiasNx opened this issue Apr 30, 2024 · 9 comments

Comments

@TobiasNx
Copy link
Contributor

TobiasNx commented Apr 30, 2024

https://metafacture.org/playground/?flux=inputFile%0A%7Copen-file%0A%7Cas-lines%0A%7Cdecode-json%0A%7Cencode-xml%28rootTag%3D%22collection%22%29%0A%7Cprint%0A%3B&data=%7B%22name%22%3A+%22Open+Educational+Resources+-+\u000Beine+kritische+Einf%C3%BChrung%22%7D

When special unicode characters are part of the transformed metadata encode-xml and encode-marcXml might transform the metadata to invalid xml.

There should be the possibility to create valid xml at least by adding an option for escping unicode characters.

@blackwinter
Copy link
Member

The behaviour regarding unicode characters seems to have changed in 844aa74. Previously, StringEscapeUtils.escapeXml() used to escape unicode characters to their numerical \u equivalent.

@Phu2
Copy link

Phu2 commented May 2, 2024

I had a similar issue with Catmandu some years ago.

@Phu2
Copy link

Phu2 commented May 2, 2024

Two things are mixed up here, i think:

  1. escaping characters like & using XML entities
  2. characters that are not allowed at all in XML

The issue pointed out by @TobiasNx belongs to 2.

See also https://stackoverflow.com/questions/730133/what-are-invalid-characters-in-xml

@blackwinter
Copy link
Member

Two things are mixed up here, i think:

What makes you think that? Syntax markers (1) are still escaped. AFAIUI, we're only talking about escaping invalid characters (2) here.

@Phu2
Copy link

Phu2 commented May 2, 2024

What makes you think that?

The docs of StringEscapeUtils.escapeXml() say it does 1. and besides of that:

Note that unicode characters greater than 0x7f are currently escaped to their numerical \u equivalent. This may change in future releases.

That's why i think it does not 2. (remove or replace invalid characters).

As a side note, from the logs of QA catalogue:

Apr 30, 2024 8:55:35 AM de.gwdg.metadataqa.marc.cli.utils.RecordIterator processFile
INFO: processing: oersiMarcDump.xml.tar.gz
[main] INFO org.reflections.Reflections - Reflections took 300 ms to scan 1 urls, producing 3 keys and 453 values
[Fatal Error] :3319174:57: An invalid XML character (Unicode: 0xb) was found in the element content of the document.

0xb is below 0x7f according to https://www.utf8-zeichentabelle.de/unicode-utf8-table.pl?number=1024&names=-&utf8=0x so this control character won't be handled by escapeXml() at all, i think.

@Phu2
Copy link

Phu2 commented May 2, 2024

I think the title of this issue is misleading, too.

@blackwinter
Copy link
Member

Now I see what you mean, thanks. I haven't actually looked at the character values in question ;)

This implies that the behaviour in Metafacture hasn't changed, but rather that it's always been this way (with regard to invalid XML characters below 0x7f).

This also means that simply passing escapeUnicode=true to XmlUtil.escape() won't affect all invalid characters (namely, those below 0x7f).

@Phu2
Copy link

Phu2 commented May 2, 2024

Yep.
See https://www.w3.org/TR/xml/#charsets for all invalid and "discouraged" characters in XML.

The behaviour regarding unicode characters seems to have changed in 844aa74.

Looking at SimpleXmlWriter.java it seems to me that XML is being written "manually" (eg. builder.append("<");) and not by a XML library which would capture those invalid characters.

@blackwinter
Copy link
Member

I would suggest the following:

  1. Keep this issue as feature request for exposing an option to escapeUnicode on SimpleXmlEncoder and MarcXmlEncoder (if there is a use case).
  2. Open a new issue for the bug that SimpleXmlEncoder and MarcXmlEncoder (as well as XmlElementSplitter?) may produce invalid XML. (Could probably be fixed in XmlUtil.escapeCodePoint(), regardless of escapeUnicode.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: No status
Development

No branches or pull requests

3 participants