non-printable characters make it into metadata, but the response breaks the XML parser #582
Comments
I think if you allow for UTF-8 encoding by iRODS internally, we're fine. I set my XML parser to QUASI_XML for this experiment:
...and
If it turns out the UTF-8 thing isn't the issue with the original demo script, I'll look a bit deeper.
I also can't vouch for ElementTree (the default XML parser) doing "the right thing" (at least where the iRODS server is concerned). At this point ... I think QUASI_XML does better.
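To make the failure mode concrete, here is a minimal sketch using only the Python standard library. The `<value>` element is a made-up, simplified stand-in for a server response, not the actual PackStruct message, and `parses_as_xml` is a hypothetical helper. It only shows that ElementTree refuses a control character that XML 1.0 does not allow, while a non-ASCII character such as U+00EF is accepted.

```python
import xml.etree.ElementTree as ET


def parses_as_xml(payload):
    """Return True if the standard ElementTree parser accepts the payload."""
    try:
        ET.fromstring(payload)
        return True
    except ET.ParseError:
        return False


# Hypothetical, simplified stand-ins for a response carrying an AVU value.
clean = "<value>a\u00efb</value>"  # non-ASCII but legal in XML 1.0
dirty = "<value>a\x01b</value>"    # U+0001 is not a legal XML 1.0 character

print(parses_as_xml(clean))
print(parses_as_xml(dirty))
```

A strict parser is behaving correctly here by its own standard; the mismatch is that the server will happily store and return such a value.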
The QUASI_XML setting addresses the breaking of the XML parser (ElementTree was never really meant for handling such characters in the PackStruct XML-ish dialect, and do we change packstruct now?). As to what
I see that
So then, how do we best address the issue under the default parser?
how would we know this / can we? is this a reasonable thing to do? is what's in the database... correct?
this feels... better than a) or... can we catch non-standard / non-utf8 characters on the way into the database for metadata in the first place? and then this issue of parsing a 'bad' response becomes moot?
this is a good backstop/minimal answer if we can't figure out something more 'active'.
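As a rough sketch of what catching bad input "on the way in" could look like: the function names below are hypothetical and not part of the PRC. The check assumes that "bad" means either bytes that are not valid UTF-8, or characters outside the XML 1.0 `Char` production (tab, newline, carriage return, U+0020 to U+D7FF, U+E000 to U+FFFD, U+10000 to U+10FFFF).

```python
import re

# Anything outside the XML 1.0 "Char" production is treated as illegal.
_XML10_ILLEGAL = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def illegal_xml_chars(text):
    """Return the characters in text that XML 1.0 does not allow."""
    return _XML10_ILLEGAL.findall(text)


def validate_avu_field(value):
    """Hypothetical pre-flight check for an AVU name, value, or unit.

    Bytes must decode as UTF-8 (UnicodeDecodeError otherwise); the
    resulting text must contain only XML 1.0 characters (ValueError
    otherwise). Returns the value as text.
    """
    if isinstance(value, bytes):
        value = value.decode("utf-8")
    bad = illegal_xml_chars(value)
    if bad:
        raise ValueError("illegal XML 1.0 characters: %r" % (bad,))
    return value
```

Whether rejecting such input is acceptable is a policy question; a check like this would also refuse data that the server itself is willing to store.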
I'll look at these, and whether the AVU entered under exception was correct. There is a fourth choice, and that is to make QUASI_XML the default parser. This seems to be in line with the server's behavior; if iRODS is entering it into the database, then it is likely also returning a non-error response, which we're then choking on. QUASI_XML was written for the express purpose of being less "standard" XML and more iRODS xml-ish.
Also, there are two pathways for the same binary data. Consider the four-byte binary string expressed either as the bytestring b"a\xC3\xAFb" or, mapped directly via ordinal codes, as a Unicode string (in Python3 this is "a\xC3\xAFb", and it is expressible unambiguously in both Py2 and Py3 as u"a\u00C3\u00AFb"). Thus there are two ways to give this data to the AVU .set (or .add) call, expounded on below.

Before reading on, please be aware that direct mapping via the ordinal values is not the same as the UTF-8 way of doing things. Under UTF-8, binary data and Unicode strings are considered equivalent only if they are related by a call to encode( ) or decode( ). Encoding may change the length of the string, because it replaces every higher-ordinal Unicode character (integer code >= 0x80) with multiple bytes; decode( ) reverses that transformation when going back to the Unicode string.

So here are our two options for feeding in the given binary sequence, as promised:

(1) pass in the unicode value, i.e. transformed via the method used in the issue description script:

(2) pass in the bytes value mentioned previously: b"a\xC3\xAFb". If that bytes value is fed to the AVU set( ) or add( ) call, it will be seen as a bytestring and therefore interpreted as equivalent to the three-character unicode sequence u"a\u00EFb". What you get back from iRODS and the catalog on a metadata read will then be a bytestring in Python2 or a Unicode string in Python3, but the understood equivalency via
I mention all of this because it affects what you would consider as equal (or equivalent) in terms of binary data. And in light of it, I've found, so far, that PRC under Py3 (as also under Py2) does the right thing in both cases.
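For anyone who wants to see the two readings side by side, here is a small standard-library sketch (Python 3, no iRODS involved). It assumes the "direct ordinal mapping" is the same thing as Latin-1 decoding, which maps each byte 0x00 to 0xFF onto the code point with the same number.

```python
raw = b"a\xC3\xAFb"  # the four-byte sequence discussed above

# Pathway 1: direct ordinal mapping, one code point per byte.
direct = raw.decode("latin-1")
# Pathway 2: UTF-8 decoding, where 0xC3 0xAF together form U+00EF.
utf8 = raw.decode("utf-8")

print(len(direct), direct == "a\u00C3\u00AFb")
print(len(utf8), utf8 == "a\u00EFb")

# Each string round-trips to the original bytes only through its own codec.
print(direct.encode("latin-1") == raw)
print(utf8.encode("utf-8") == raw)

# UTF-8 encoding the directly mapped string yields six bytes, not four.
print(direct.encode("utf-8"))
```

So the same four bytes stand for a four-character string under one reading and a three-character string under the other, which is why the notion of "equal" has to be pinned down before comparing what went in with what came back.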
@trel: If we're forced to make a choice, and making QUASI_XML the universal default seems too risky, consider:
I suggest that we provide such a utility function as this They can do so, just by using that function in a with-statement. I think that is the solution I favor.
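The general shape of such a with-statement helper might look like the following. This is only an illustrative sketch: `_current_parser`, `get_parser`, and `temporary_parser` are hypothetical names, and the module-level variable merely stands in for whatever process-wide parser setting the client keeps. It is not the PRC's actual implementation.

```python
from contextlib import contextmanager

# Hypothetical stand-in for the client's process-wide parser selection.
_current_parser = "STANDARD_XML"


def get_parser():
    """Return the name of the parser currently in effect."""
    return _current_parser


@contextmanager
def temporary_parser(name):
    """Switch the parser for the duration of a with-block, then restore it."""
    global _current_parser
    previous = _current_parser
    _current_parser = name
    try:
        yield
    finally:
        # Restore the old setting even if the block raised an exception.
        _current_parser = previous


with temporary_parser("QUASI_XML"):
    print(get_parser())  # parser in effect inside the block
print(get_parser())      # previous parser restored afterwards
```

The appeal of this design is that the default stays untouched for everyone else, and only the code paths known to handle odd characters opt in.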
this fourth option, an I also think that we should consider how we can make add/set possibly only take proper utf-8 strings in the first place.
I like the Have we ever run the full test suite using the QUASI_XML parser? If the QUASI_XML parser is more in line with the iRODS XML encoding, then it sounds like it should be the default.
I kind of agree about making it the default, though I am inherently more scared of it. Which is why I put forth the I don't recall ever running the whole suite under QUASI_XML although I have run
Agreed. It makes sense, but we're not going to switch the default for v2.x. However, it would be good to know whether the test suite passes or fails with the QUASI_XML parser. Please add that to your TODO list as a medium priority item and let's get
xml_mode is now #586 and will be in 2.x
The error is incurred when running
so let's leave this open until we add a test (linked to this issue) to confirm this behavior is working as expected.
see here for the former content of this message
Running the following code...
The AVU makes it in, but the response is not parsed correctly by the PRC.
We should handle this better... but where?