Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

non-printable characters make it into metadata, but the response breaks the XML parser #582

Open
trel opened this issue Jul 10, 2024 · 19 comments
Assignees
Labels
Milestone

Comments

@trel
Copy link
Member

trel commented Jul 10, 2024

Running the following code...

from irods.session import iRODSSession

with iRODSSession(host='localhost', port=1247, user='rods', password='rods', zone='tempZone') as session:

    hexvalue = "53:6b:6f:70:65:43:61:6c:58:c3:af:c2:bf:c2:bd:c3:af:c2:bf:c2:bd:23:01:32:64:31"
    hex_list = [int(byte, 16) for byte in hexvalue.split(":")]
    string_value = "".join([chr(byte) for byte in hex_list])

    obj = session.data_objects.get('/tempZone/home/rods/file.txt')

    attr_str = "awesome"
    value_str = string_value

    obj.metadata.add( attr_str, value_str )

The AVU makes it in, but the response is not parsed correctly by the PRC.

$ python pyclient.py 
Traceback (most recent call last):
  File "pyclient.py", line 14, in <module>
    obj.metadata.add( attr_str, value_str )
  File "/home/tgr/venv3/lib/python3.8/site-packages/irods/data_object.py", line 86, in metadata
    self._meta = iRODSMetaCollection(
  File "/home/tgr/venv3/lib/python3.8/site-packages/irods/meta.py", line 89, in __init__
    self._reset_metadata()
  File "/home/tgr/venv3/lib/python3.8/site-packages/irods/meta.py", line 92, in _reset_metadata
    self._meta = self._manager.get(self._model_cls, self._path)
  File "/home/tgr/venv3/lib/python3.8/site-packages/irods/manager/metadata_manager.py", line 82, in get
    results = self.sess.query(*columns).filter(*conditions).all()
  File "/home/tgr/venv3/lib/python3.8/site-packages/irods/query.py", line 212, in all
    result_set = self.execute()
  File "/home/tgr/venv3/lib/python3.8/site-packages/irods/query.py", line 198, in execute
    results = result_message.get_main_message(GenQueryResponse)
  File "/home/tgr/venv3/lib/python3.8/site-packages/irods/message/__init__.py", line 400, in get_main_message
    msg.unpack(ET().fromstring(self.msg))
  File "/usr/lib/python3.8/xml/etree/ElementTree.py", line 1320, in XML
    parser.feed(text)
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 19, column 29
$ imeta ls -d file.txt
AVUs defined for dataObj /tempZone/home/rods/file.txt:
attribute: awesome
value: SkopeCalX��#2d1
units: 

We should handle this better... but where?

@d-w-moore
Copy link
Collaborator

d-w-moore commented Jul 10, 2024

I think if you allow for UTF 8 encoding by iRODS internally, we're fine.

I set my XML parser to QUASI_XML for this experiment:

$ more ~/.python_irodsclient 
data_objects.auto_close True
connections.xml_parser_default "QUASI_XML"
$ PYTHON_IRODSCLIENT_CONFIGURATION_PATH="" python  unprintable_meta.py 
<class 'str'>
v_bytes=[83, 107, 111, 112, 101, 67, 97, 108, 88, 195, 175, 194, 191, 194, 189, 195, 175, 194, 191, 194, 189, 35, 1, 50, 100, 49]
utf8avu=[83, 107, 111, 112, 101, 67, 97, 108, 88, 195, 175, 194, 191, 194, 189, 195, 175, 194, 191, 194, 189, 35, 1, 50, 100, 49]

...and unprintable_meta.py is the script below:

#!/usr/bin/env python3
from irods.meta import *
import irods.test.helpers as h

v = b'SkopeCalX\xc3\xaf\xc2\xbf\xc2\xbd\xc3\xaf\xc2\xbf\xc2\xbd#\x012d1'
v_bytes = [83, 107, 111, 112, 101, 67, 97, 108, 88, 195, 175, 194, 191, 194, 189, 195, 175, 194, 191, 194, 189, 35, 1, 50, 100, 49]
assert bytearray(v_bytes) == v, "original value not correct"

avu = iRODSMeta ('a',v)

s = h.make_session()
dn = '/tempZone/home/rods/100m'  # _-_-_ use any data object here !!!
d = s.data_objects.get(dn)
d.metadata.set(avu)

# --- retrieve / parse for printing
retrieved_avus = [_ for _ in d.metadata.items() if _.name == 'a']
print(type(retrieved_avus[0].value))

utf8avu = [i for i in bytearray(retrieved_avus[0].value.encode('utf-8'))]
#
# returned string won't match ORIGINAL avu value's v_bytes because it is retrieved as a unicode value
#assert([ord(_) for _ in retrieved_avus[0].value] == v_bytes), "retrieved value not intact (unless you assume iRODS prefers UTF8 to straight (Latin?/ascii-bin?) encoding)"
 
# But, these are equal:
print(f'{v_bytes=}')
print(f'{utf8avu=}')
assert v_bytes == utf8avu

@d-w-moore
Copy link
Collaborator

We should handle this better... but where?

If it turns out the UTF-8 thing isn't the issue with the original demo script, I'll look a bit deeper.

@d-w-moore
Copy link
Collaborator

d-w-moore commented Jul 10, 2024

I also can't vouch for ElementTree (the default XML parser) doing "the right thing" (at least where the iRODS server is concerned).

At this point ... I think QUASI_XML does better.

@d-w-moore
Copy link
Collaborator

d-w-moore commented Jul 10, 2024

The QUASI_XML setting addresses the breaking of the XML parser (ElementTree was never really meant for handling such characters in the PackStruct XML-ish dialect, and do we change packstruct now?).

As to what imeta ls should have dug out of that foiled avu-setting attempt, I'll look into it, as mentioned, but i suspect iRODS trades on UTF-8 rather than ascii binary. So for example, if in the future we would like to store b'\xef' in an AVU field, we won't have very good luck; but we may do better storing Unicode u'\u00ef' in its place - which, one can can see from the printout in my example output, will be encoded as bytes [195,175] by iRODS due to its internal use of UTF-8.

@trel
Copy link
Member Author

trel commented Jul 10, 2024

I see that QUASI_XML DOES prevent the exception on parsing the response...

$ irm -rf file.txt
$ itouch file.txt
$ imeta ls -d file.txt
AVUs defined for dataObj /tempZone/home/rods/file.txt:
None
$ PYTHON_IRODSCLIENT_CONFIGURATION_PATH="" PYTHON_IRODSCLIENT_CONFIG__CONNECTIONS__XML_PARSER_DEFAULT="'QUASI_XML'" python pyclient.py
WARNING:irods.client_configuration:Config file not available at '/home/trel/.python_irodsclient'
$ imeta ls -d file.txt
AVUs defined for dataObj /tempZone/home/rods/file.txt:
attribute: awesome
value: SkopeCalX��#2d1
units: 

@d-w-moore
Copy link
Collaborator

So then, how do we best address the issue under the default parser?

  • artificially suppress the exception when the AVU turns out to have made it into the catalog
  • if the exception happens, catch it, remove the AVU from the catalog, and rethrow
  • print or log a warning that the metadata operation succeeded despite error, and recommend using QUASI_XML

@trel
Copy link
Member Author

trel commented Jul 13, 2024

So then, how do we best address the issue under the default parser?

  • artificially suppress the exception when the AVU turns out to have made it into the catalog

how would we know this / can we? is this a reasonable thing to do? is what's in the database... correct?

  • if the exception happens, catch it, remove the AVU from the catalog, and rethrow

this feels... better than a)

or... can we catch non-standard / non-utf8 characters on the way into the database for metadata in the first place? and then this issue of parsing a 'bad' response becomes moot?

  • print or log a warning that the metadata operation succeeded despite error, and recommend using QUASI_XML

this is a good backstop/minimal answer if we can't figure out something more 'active'.

@d-w-moore
Copy link
Collaborator

I'll look at these, and whether the AVU entered under exception was correct.

There is a fourth choice, and that is to make QUASI_XML the default parser. This seems to be in line with the server's behavior; if iRODS is entering it into the database, then it is likely also returning a non-error response, which we're then choking on. QUASI_XML was written for the express purpose of being less "standard" XML and more iRODS xml-ish.

@d-w-moore
Copy link
Collaborator

d-w-moore commented Jul 13, 2024

Also there are two pathways for the same binary data. Let's consider the four-byte binary string variously expressed as a bytestring b"a\xC3\xAFb", or as the direct mapping via ordinal codes into a Unicode string, (in Python3 this is "a\xC3\xEFb", but it is also expressible in both Py2/Py3 unambiguously as as u"a\u00C3\u00AFb".)

Thus there are two ways to give this data to the AVU .set (or .add) call, expounded on below. Before reading on. please be aware that direct mapping via the ordinal values is not the same as the UTF8 way of doing things. In that encoding, we assume binary data and Unicode strings to be equivalent only if they've passed through a call to encode( ) or decode( ), which may change the length of the string by replacing any higher-ordinal-value Unicode character (integer code >= 0x80) by multiple bytes, that is encoding; and then reversing that transformation in the decode( )call ie. when going back to the Unicode string.

So here are our two options of how to feed the given binary sequence, as promised:

(1) pass in the unicode value, ie transformed via the method used in the issue description script: "".join([chr(i) for i in [0x61,0xc3,0xaf,0x62]]) . (Note the Py3 call chr(i) is equivalent to Py2 unichr(i)). This generates unicode value we mentioned, that is u"a\0u00C3\u00AFb" . Suppose we then hand that Unicode direct-ordinal-value mapping of the binary data to the metadata .set/.add call as the AVU value to be used. What you get back out of the catalog will be an exactly equal, if running in PRC under Python3, because then the PRC prefers to return Unicode. (Under Python2 we'd get the encoded bytes b'a\xc3\x83\xc2\xafb' because that's what PRC under Python2 prefers. Still equivalent though, because the original value we jammed in was the UTF-8 decoding of that string.)

(2) pass in the bytes value mentioned previously: b"a\xC3\xAFb" . If that bytes value is fed to the AVU set( ) or add( ) call, it will be seen as a bytestring and therefore interpreted as equivalent to the three-character unicode sequence: u"a\u00EFb" . What you get back from iRODS and the catalog on a metadata read will then be a bytestring in Python2 or a Unicode string in Python3, but the understood equivalency via str.decode('utf8') and str.encode('utf8') will be preserved.

@d-w-moore
Copy link
Collaborator

d-w-moore commented Jul 13, 2024

I mention all of this because it affects what you would consider as equal (or equivalent) in terms of binary data. And in light of it, I've found -- so far -- that PRC under Py3 (as also under Py2) does the right thing in both cases.

@d-w-moore
Copy link
Collaborator

d-w-moore commented Jul 14, 2024

We should handle this better... but where?

So then, how do we best address the issue under the default parser?

  • artificially suppress the exception when the AVU turns out to have made it into the catalog
  • if the exception happens, catch it, remove the AVU from the catalog, and rethrow
  • print or log a warning that the metadata operation succeeded despite error, and recommend using QUASI_XML

@trel: If we're forced to make a choice, and making QUASI_XML the universal default seems too risky, consider:

import contextlib,irods.test.helpers
from irods.message import *
@contextlib.contextmanager
def xml_mode(s):
    try:
        ET(getattr(XML_Parser_Type,s)) # change 'default choice' for this thread temporarily (under with block)
        yield
    finally:
        ET(None)  # return to global default for this thread temporarily (under with block)

# Application: 
with irods.test.helpers.make_session() as session:
    # ... # all activity in current thread will be done using STANDARD_XML parser

    with xml_mode('QUASI_XML'):
        # ... # all activity in current thread will be done using QUASI_XML parser 

    # ... # all activity in current thread will be done again using STANDARD_XML parser

I suggest that we provide such a utility function as this xml_mode above, making it part of PRC, which users can avail themselves of in temporary and specialized circumstances (for example setting AVU's with unprintable characters) ....

They can do so, just by using that function in a with - statement.

I think that is the solution I favor.

@trel
Copy link
Member Author

trel commented Jul 14, 2024

this fourth option, an xml_mode() function... does seem very flexible and good since we do not know how people may prefer to hold this.

I also think that we should consider how we can make add/set possibly only take proper utf-8 strings in the first place.

@korydraughn
Copy link
Contributor

I like the xml_mode() option.

Have we ever ran the full test suite using the QUASI_XML parser? If the QUASI_XML parser is more in line with the iRODS XML encoding, then it sounds like it should be the default.

@d-w-moore
Copy link
Collaborator

d-w-moore commented Jul 16, 2024

I like the xml_mode() option.

Have we ever ran the full test suite using the QUASI_XML parser? If the QUASI_XML parser is more in line with the iRODS XML encoding, then it sounds like it should be the default.

I kind of agree about making it the default, though I am inherently more scared of it. Which is why I put forth the xml_mode option.

I don't recall ever running the whole suite under QUASI_XML although I have run meta_test and data_obj_test at the bench, and it performed just fine.

@korydraughn
Copy link
Contributor

Agreed. It makes sense, but we're not going to switch the default for v2.x. However, knowing whether the test suite passes/fails with the QUASI_XML parser would be good to know.

Please add that to your TODO list as a medium priority item and let's get xml_mode implemented.

@trel
Copy link
Member Author

trel commented Jul 17, 2024

xml_mode is now #586 and will be in 2.x

@d-w-moore
Copy link
Collaborator

d-w-moore commented Nov 13, 2024

The error is incurred when running _reset_metadata , which is a separate query that effectively transports the entire set of metadata back to the client side. (In other words, it's not in the response to the metadata change or add request that the XML error happens.) Use of irods.helpers.xml_mode('QUASI_XML') cures the error in the code in the description, however, because it overrides the parsing in all xml parsing within the scope of the "add" call. So this issue can be closed without much further ado. I think I do need to add a test around metadata in particular, however, this the issue centers on it.

@trel
Copy link
Member Author

trel commented Nov 13, 2024

so let's leave this open until we add a test (linked to this issue) to confirm this behavior is working as expected.

@d-w-moore
Copy link
Collaborator

d-w-moore commented Nov 13, 2024

see here for the former content of this message

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Development

No branches or pull requests

3 participants