Require DFXML be encoded as UTF-8 #34

Open
ajnelson-nist opened this issue Jul 20, 2018 · 16 comments

@ajnelson-nist
Contributor

ajnelson-nist commented Jul 20, 2018

DFXML can be generated for file systems that do not use UTF-8 encoding, or even that use arbitrary bytes. (For instance, HFS, not HFS+, allows any byte in a file name except the ASCII colon character.) DFXML has three objectives in case these bytes are encountered:

  • The original bytes should be preserved.
  • The original bytes should be decoded into human-readable strings.
  • The decoding process should present UTF-8 character strings (not byte strings) to a DFXML consumer without requiring additional scripting work to recognize transcodings (e.g. no mode should be required when opening a file that presents accented Latin characters originally encoded in macos-roman). When transcodings are done, though, they should be encoded in the DFXML and thus accessible by the DFXML API being used.

Within the DFXML schema, these constructs will be added to any encodable/transcodable string:

  • The string element will have an optional attribute "original_encoding" to indicate a transcoding occurred.
  • The original bytes encountered in the parse will be recorded in the optional attribute "original_bytes_base64".

Some new semantics will result from this, because any of these strings can now be in one of 8 states (each a combination of 3 conditions), based on presence or absence of (A) the original bytes in base64, (B) the original encoding, and (C) the element's child text. Let absence of one of these conditions be represented as an underscore below as we walk through the state space:

  • [___] It could be that a fileobject was not meant to be named, such as with DFXML being used to distribute file hashes (e.g. with some modes of hashdeep).
  • [__C] Absence of the original bytes and original encoding attributes implies the string was recorded in DFXML exactly as it was encountered.
  • [_B_] This state is a bit uninformative, and should be avoided.
  • [_BC] An encoding and text are recorded, but the original bytes are not. This state should be avoided: if the script that performed the transcoding had an error, the original bytes may not be derivable from the DFXML record.
  • [A__] The original bytes could not be rendered as a UTF-8 string. This can occur in an example where control characters are embedded as a file name. (H/t to @dd388 for finding a case where an HFS file system recorded "^C^B^AMove&Rename", special control characters for a Mac OS somewhere around version 7.)
  • [A_C] If original bytes are present and UTF-8 text is recorded, this shall imply the original encoding was UTF-8. This may be desired in cases where a unicode character could be encoded in multiple ways, such as with unicode combining characters.
  • [AB_] This state implies the original bytes are decodable, but do not have a corresponding point in unicode space.
  • [ABC] All t's crossed, all i's dotted.
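
The state space above can be sketched as a small classifier. This is only an illustration, assuming the attribute names "original_bytes_base64" and "original_encoding" proposed in this comment; the function name `transcoding_state` is hypothetical and not part of any DFXML API.

```python
import xml.etree.ElementTree as ET


def transcoding_state(element: ET.Element) -> str:
    """Return a three-character label such as 'ABC' or '__C' for one element."""
    has_bytes = element.get("original_bytes_base64") is not None  # condition A
    has_encoding = element.get("original_encoding") is not None   # condition B
    has_text = bool(element.text)                                  # condition C
    return (
        ("A" if has_bytes else "_")
        + ("B" if has_encoding else "_")
        + ("C" if has_text else "_")
    )


# Hypothetical sample elements, one per state of interest.
full = ET.fromstring(
    '<filename original_bytes_base64="YQ==" original_encoding="ascii">a</filename>'
)
plain = ET.fromstring("<filename>a</filename>")
encoding_only = ET.fromstring('<filename original_encoding="ascii"/>')

for sample in (full, plain, encoding_only):
    print(transcoding_state(sample))
```

A consumer could use such a label to decide whether to trust the child text, fall back to the original bytes, or flag the record for review.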

In short, the preferred states include the original bytes (conditions A**) and include UTF-8 text when it is reachable (conditions **C). If there is nothing more complex than ASCII, __C (omitting original bytes and original encoding) would be fine. If the character data are more complex than ASCII and there is no chance of ambiguity, such as Unicode where every character has only one representation, the original encoding can be omitted (condition A_C). However, this may be unnecessarily difficult to determine on the fly, so Unicode file names may be best represented verbosely (condition ABC).
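
One possible reading of these preferences, as a producer-side sketch. The function name `build_filename_element` is hypothetical, and using `str.isprintable()` to decide whether text is "human-readable" is an assumption made only for this example; it is not something the proposal specifies.

```python
import base64
import xml.etree.ElementTree as ET
from typing import Optional


def build_filename_element(raw: bytes, encoding: Optional[str] = None) -> ET.Element:
    """Build a <filename> element in state __C, A__, or ABC (illustrative policy)."""
    element = ET.Element("filename")

    # Plain printable ASCII: record the text as encountered (state __C).
    if raw.isascii() and raw.decode("ascii").isprintable():
        element.text = raw.decode("ascii")
        return element

    # Anything else: always preserve the original bytes (condition A).
    element.set(
        "original_bytes_base64", base64.standard_b64encode(raw).decode("ascii")
    )
    if encoding is None:
        return element  # state A__

    try:
        text = raw.decode(encoding)
    except (UnicodeDecodeError, LookupError):
        return element  # could not decode: stay in state A__

    # Assumption: only printable text counts as human-readable.
    if text.isprintable():
        element.set("original_encoding", encoding)
        element.text = text  # state ABC
    return element


registered = build_filename_element(b"Fancy Product \xae.exe", "iso-8859-1")
ascii_only = build_filename_element(b"readme.txt")
control = build_filename_element(b"\x03\x02\x01Move&Rename", "mac_roman")
print(ET.tostring(registered, encoding="unicode"))
```

The sketch never emits the discouraged states _B_ and _BC, because the encoding attribute is only set after the original bytes have been recorded.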

Thanks to @dd388 for assistance drafting this description, to @tw4l for raising the matter, and @simsong for discussion and an article on Programming in Unicode. The original structure proposed in this Issue is close to what came from discussion in the DFXML library Issue.

@ajnelson-nist ajnelson-nist added this to the v1.3.0 milestone Jul 20, 2018
@simsong
Contributor

simsong commented Jul 21, 2018 via email

@ajnelson-nist ajnelson-nist modified the milestones: v1.3.0, v1.4.0 May 7, 2021
@ajnelson-nist
Contributor Author

I've pushed this back a release to 1.4.0, because this needs a prototype code implementation, and a graceful-feeling solution has not yet come to mind.

@ajnelson-nist
Contributor Author

To give an illustrative example of this issue: One project I've encountered processed software files that included the "Registered" symbol in their file names. However, the method of producing those files ended up encoding that symbol as the single byte value 174 (b"\xae"), which is not valid UTF-8 on its own (it is the ISO-8859-1 encoding of U+00AE). This Python session transcript shows how that data should transcode to UTF-8, along with demonstrations of decoding stumbles:

>>> import base64
>>> x = b"Fancy Product \xae.exe"
>>> x
b'Fancy Product \xae.exe'
>>> base64.standard_b64encode(x)
b'RmFuY3kgUHJvZHVjdCCuLmV4ZQ=='
>>> x.decode("ascii")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xae in position 14: ordinal not in range(128)
>>> x.decode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xae in position 14: invalid start byte
>>> x.decode("iso-8859-1")
'Fancy Product ®.exe'
>>> y = x.decode("iso-8859-1").encode("utf-8")
>>> y
b'Fancy Product \xc2\xae.exe'
>>> base64.standard_b64encode(y)
b'RmFuY3kgUHJvZHVjdCDCri5leGU='

As DFXML, this fileobject should present like so after implementation of this Issue in the schema:

<fileobject>
  <filename
    original_bytes_base64="RmFuY3kgUHJvZHVjdCCuLmV4ZQ=="
    original_encoding="iso-8859-1">Fancy Product ®.exe</filename>
</fileobject>
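
On the consumer side, a record in state ABC can be checked for self-consistency. The following is a minimal sketch, assuming the attribute name "original_bytes_base64" from the opening comment and the full base64 value from the transcript above; `check_round_trip` is a hypothetical helper, not an existing DFXML function.

```python
import base64
import xml.etree.ElementTree as ET

DFXML_SNIPPET = """<fileobject>
  <filename
    original_bytes_base64="RmFuY3kgUHJvZHVjdCCuLmV4ZQ=="
    original_encoding="iso-8859-1">Fancy Product \u00ae.exe</filename>
</fileobject>"""


def check_round_trip(filename_element: ET.Element) -> bool:
    """Decode the preserved bytes and compare them with the element text."""
    encoded = filename_element.get("original_bytes_base64")
    encoding = filename_element.get("original_encoding")
    if encoded is None or encoding is None or filename_element.text is None:
        raise ValueError("element is not in state ABC")
    original = base64.standard_b64decode(encoded)
    return original.decode(encoding) == filename_element.text


filename = ET.fromstring(DFXML_SNIPPET).find("filename")
original_bytes = base64.standard_b64decode(filename.get("original_bytes_base64"))
print(check_round_trip(filename), original_bytes)
```

A mismatch here would point at an error in the script that performed the transcoding, which is the situation the original bytes are meant to guard against.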

@joachimmetz

@ajnelson-nist one clarifying question: with "UTF-8 character strings", do you mean the 4-byte variant of RFC 3629 or the 6-byte variant of RFC 2279? (I assume the former, but prefer to be specific about it in this context.)

@ajnelson-nist
Contributor Author

@joachimmetz : I had intended, without digging into citations, to use the 4-byte variant of RFC 3629. But you raise a fair question.

The main influences in my understanding are XML and Python, which are directly in the dependencies of most DFXML applications I'm aware of; and RDF, which is not necessarily pertinent to DFXML but I am aware does have a definition somewhere in its standards stack that its strings are UTF-8. I'm unaware of whether C++ has any inherent dependencies on Unicode; my understanding is there is no such dependency due to C++ predating unicode and just generally operating on more elementary data types, but @simsong could probably say better if we need something better said. I currently suspect we won't need better said.

I believe RFC 2279[1] is moot for consideration, because RFC 3629[2] obsoletes 2279; RFC 3629 "implements" (loose terminology) ISO 10646[3]; ISO 10646's Annex D (albeit the 2003 version) provides a "technically equivalent" definition in the Unicode Standard per unicode.org's glossary definition of UTF-8[4]; and unicode.org is cited (by bibliography entry, not URL) as a normative reference of XML 1.0 Fifth Edition[4].

Python 3's documentation cites unicode.org in this highlighted section of the documentation page "Unicode HOWTO". So I'd follow the same reference chain, ending at RFC 3629, answering your question again with "I meant 4 bytes."

I haven't done the same dive recently through RDF, but my recollection is the citation chain goes through RDF Schema following XML Schema Datatypes.

If you're aware of an application that should make DFXML consider 6-byte UTF-8, I'd be curious to hear about it, but it would be a pretty significant conflict with DFXML's foundation on XML to try to support 6-byte UTF-8.
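
For what it's worth, the 4-byte limit can be observed directly in Python. This is a small sketch, not a statement about any DFXML implementation; the helper name `decodes_as_utf8` is made up for the example, and the 6-byte sequence is hand-built in the style RFC 2279 allowed.

```python
def decodes_as_utf8(data: bytes) -> bool:
    """Report whether Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# The highest Unicode code point needs exactly 4 bytes.
highest = chr(0x10FFFF).encode("utf-8")

# U+110000 would be the next 4-byte pattern, but it is outside Unicode.
beyond_unicode = b"\xf4\x90\x80\x80"

# A 6-byte sequence with lead byte 0xFC, as RFC 2279 permitted.
legacy_six_byte = b"\xfc\x84\x80\x80\x80\x80"

print(len(highest), decodes_as_utf8(highest))
print(decodes_as_utf8(beyond_unicode))
print(decodes_as_utf8(legacy_six_byte))
```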

Footnotes

  1. https://datatracker.ietf.org/doc/html/rfc2279

  2. https://datatracker.ietf.org/doc/html/rfc3629

  3. https://www.iso.org/standard/76835.html

  4. https://www.unicode.org/glossary/#UTF_8

@ajnelson-nist
Contributor Author

I should note: It appears DFXML has always invisibly required its string-y content be UTF-8 by accident because of some technological dependencies, especially between XML and Python. This Issue was filed possibly without realizing that, but there is still a real challenge being addressed in this Issue, on how to represent transcoding of non-UTF-8 source data.

@joachimmetz

joachimmetz commented Jul 3, 2023

Sticking with the 4-byte variant makes sense: it is the current version of UTF-8, it is compatible with UTF-16, and it is the one supported by a current Python 3 implementation (e.g. surrogate pair restriction). I'm not sure about XML; I would assume it supports the 4-byte one.

However the 6-byte variant is still used by certain formats and doesn't have the surrogate pair restriction as far as I can tell. So might be good to account for it for compatibility reasons at minimum.
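
The surrogate restriction mentioned here can be seen with Python's built-in codec. A brief sketch: the strict UTF-8 codec refuses a lone surrogate, while the standard "surrogatepass" error handler produces the 3-byte form that older or looser encoders may have written. The helper name `encodes_strictly` is hypothetical.

```python
def encodes_strictly(text: str) -> bool:
    """Report whether the text can be encoded as strict (RFC 3629) UTF-8."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


lone_surrogate = "\ud800"

# Strict UTF-8 rejects the surrogate; "surrogatepass" lets it through.
relaxed = lone_surrogate.encode("utf-8", errors="surrogatepass")
restored = relaxed.decode("utf-8", errors="surrogatepass")

print(encodes_strictly(lone_surrogate), relaxed, restored == lone_surrogate)
```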

@ajnelson-nist
Contributor Author

Are you able to link to the formats that are using the 6-byte variant?

Do you know how "UTF-8 as defined by RFC 2279" should be spelled as an encoding string, e.g. like the strings in the "Standard Encodings" table in Python's codecs module[1]?

Footnotes

  1. https://docs.python.org/3/library/codecs.html#standard-encodings

@joachimmetz

joachimmetz commented Jul 3, 2023

> Do you know how "UTF-8 as defined by RFC 2279" should be spelled as an encoding string, e.g. like the strings in the "Standard Encodings" table in Python's codecs module?

I don't think Python supports it, but I have not looked.

> Are you able to link to the formats that are using the 6-byte variant?

Not off the top of my head, but I assume anything claiming to be UTF-8 before RFC 3629 became a thing.

Looks like some Microsoft formats might be using it https://learn.microsoft.com/en-us/search/?scope=OpenSpecs&terms=RFC2279

@joachimmetz

joachimmetz commented Jul 4, 2023

> The original bytes could not be rendered as a UTF-8 string. This can occur in an example where control characters are embedded as a file name

Any example of these? AFAIK C0 and C1 control characters can be represented in 4-byte UTF-8 (RFC 3629). Also see: https://en.wikipedia.org/wiki/C0_and_C1_control_codes

AFAIK (1) Surrogates such as U+d800, (2) values (currently) not mapped to characters and (3) values beyond U+10FFFF are going to be the ones that need special treatment

For (2) 4-byte UTF-8 should be able to encode these, but might not meet the "human-readable strings" criteria mentioned above

One option could be to use "\U########" and "\u####" string notation for such characters.
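
A rough sketch of that notation, assuming the code points to escape are controls, unassigned values, and surrogates (Unicode general categories Cc, Cn, and Cs); that selection and the function name `escape_unreadable` are assumptions made for illustration. The backslash itself is escaped too, so the output stays unambiguous.

```python
import unicodedata

# Assumption: these general categories are the ones treated as unreadable.
ESCAPED_CATEGORIES = {"Cc", "Cn", "Cs"}


def escape_unreadable(text: str) -> str:
    """Replace unreadable code points with \\u#### or \\U######## notation."""
    parts = []
    for ch in text:
        if ch == "\\" or unicodedata.category(ch) in ESCAPED_CATEGORIES:
            code_point = ord(ch)
            if code_point <= 0xFFFF:
                parts.append(f"\\u{code_point:04x}")
            else:
                parts.append(f"\\U{code_point:08x}")
        else:
            parts.append(ch)
    return "".join(parts)


print(escape_unreadable("\x03\x02\x01Move&Rename"))
print(escape_unreadable("a\ud800b"))
print(escape_unreadable("Fancy Product \u00ae.exe"))
```

The escaped form contains only characters that are legal in XML, at the cost of no longer being the literal file name.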

Based on https://www.w3.org/TR/xml/#dt-charref and https://www.w3.org/TR/xml/#wf-Legalchar I'm not 100% sure whether an XML character reference allows "&#xD800;":

> If the character reference begins with " &#x ", the digits and letters up to the terminating ; provide a hexadecimal representation of the character's code point in ISO/IEC 10646. If it begins just with " &# ", the digits up to the terminating ; provide a decimal representation of the character's code point.

Unfortunately ISO/IEC 10646 has evolved/changed over the years [1].

This write up provides some historical context https://www.cl.cam.ac.uk/~mgk25/unicode.html

@joachimmetz

joachimmetz commented Jul 5, 2023

What I could find is that both XML 1.0 and 1.1 are strict about not allowing such characters https://www.w3.org/TR/2006/REC-xml-20060816/Overview.html#charsets and https://www.w3.org/TR/xml11/#charsets

And if I read the following [1] correctly:

> Well-formedness constraint: Legal Character
>
> Characters referred to using character references must match the production for [Char](https://www.w3.org/TR/2006/REC-xml-20060816/Overview.html#NT-Char).

It could be that &#xD800; is not allowed per the standard.
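
This can be checked empirically with Python's standard-library parser. The sketch below transcribes the XML 1.0 Char production into a hypothetical helper `is_xml10_char` and asks `xml.etree.ElementTree` whether a few character references parse; the helper names are made up for the example.

```python
import xml.etree.ElementTree as ET


def is_xml10_char(ch: str) -> bool:
    """Transcription of the XML 1.0 Char production for a single character."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def parses_as_xml(document: str) -> bool:
    """Report whether ElementTree accepts the document as well-formed."""
    try:
        ET.fromstring(document)
    except ET.ParseError:
        return False
    return True


print(parses_as_xml("<filename>&#xAE;</filename>"))    # registered sign
print(parses_as_xml("<filename>&#xD800;</filename>"))  # lone surrogate
print(parses_as_xml("<filename>&#x3;</filename>"))     # C0 control character
```

So neither a surrogate nor a C0 control such as ^C can be smuggled into XML 1.0 element text through a character reference, which supports keeping such names in the base64 attribute only.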

@joachimmetz

@ajnelson-nist a couple more scenarios to consider:

  • The original path uses a specific codepage (encoding), which is converted to Unicode; however, that Unicode string can be encoded back into multiple variations of the original encoding, e.g. encoding U+2252 to cp932. Also see https://metacpan.org/dist/ShiftJIS-CP932-MapUTF/view/MapUTF.pod#Transcoding-from-Unicode-to-CP-932. What if there are 2 (or more) paths that decode to the same string? How should the original path be best preserved?
  • The filename contains a path segment separator (e.g. \ or /). If not escaped, this leads to ambiguity, e.g. if / is a path segment separator, is 'test/1234' a single file name or a path?
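
Both scenarios can be illustrated with short checks. This is a sketch only: the helper names are hypothetical, and the two cp932 byte sequences are the ones commonly listed for U+2252 (0x81E0 and 0x8790).

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def find_decoding_collisions(
    raw_names: Iterable[bytes], encoding: str
) -> Dict[str, List[bytes]]:
    """Group raw names by decoded text; keep groups with more than one member."""
    by_text: Dict[str, List[bytes]] = defaultdict(list)
    for raw in raw_names:
        by_text[raw.decode(encoding)].append(raw)
    return {text: raws for text, raws in by_text.items() if len(raws) > 1}


def contains_separator(name: str, separators: str = "/\\") -> bool:
    """Report whether a single name segment contains a path separator."""
    return any(sep in name for sep in separators)


names = [b"\x81\xe0.txt", b"\x87\x90.txt", b"plain.txt"]
collisions = find_decoding_collisions(names, "cp932")
print(collisions)
print(contains_separator("test/1234"))
```

When a collision is found, only the preserved original bytes distinguish the two entries, which is an argument for always recording condition A on transcoded names.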

@joachimmetz

Looks like there is WTF-8 https://en.wikipedia.org/wiki/UTF-8#WTF-8

@simsong
Contributor

simsong commented Jul 10, 2023 via email

@joachimmetz

joachimmetz commented Jul 10, 2023

One of the software engineers raised a good point: Python has pathlib for this as well, which might also help cover the cp932 edge cases I mentioned.
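
One related mechanism, offered as an assumption about what is meant here rather than as pathlib's own API: on POSIX systems Python's `os.fsdecode` uses the "surrogateescape" error handler, which maps undecodable bytes to lone surrogates so that the original bytes can be recovered. The sketch below applies that handler explicitly so it behaves the same on any platform.

```python
raw = b"Fancy Product \xae.exe"

# Undecodable byte 0xAE becomes the lone surrogate U+DCAE.
text = raw.decode("utf-8", errors="surrogateescape")

# The same handler restores the original bytes exactly.
recovered = text.encode("utf-8", errors="surrogateescape")

# The intermediate string is not valid strict UTF-8, so it could not be
# written into XML text as-is.
try:
    text.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

print(repr(text), recovered == raw, strict_ok)
```

This preserves bytes losslessly inside a Python process, but it does not by itself resolve the cp932 case, where two distinct byte sequences decode successfully to the same character.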

sldouglas-nist added a commit to sldouglas-nist/INDXParse that referenced this issue Jul 20, 2023
The patch provides `unpack_wstring` with a return type annotation and
revisits the referenced, committed patch. This is a prototype and may
be addressed further in a future patch. Our understanding of type
annotations for the fields the `declare_field` function creates is
limited by its dynamic assignment of these variables. A type safe
solution will be difficult to attain until `Block` and its subclasses
are re-implemented into a `@property` getter/setter approach to provide
static type review.

References:
* dfxml-working-group/dfxml_schema#34
* williballenthin@e96e04b

Disclaimer: Participation by NIST in the creation of the documentation
of mentioned software is not intended to imply a recommendation or
endorsement by the National Institute of Standards and Technology, nor
is it intended to imply that any specific software is necessarily the
best available for the purpose.

Licensing:
Portions of this patch contributed by NIST are governed by the NIST
Software Licensing Statement:

NIST-developed software is provided by NIST as a public service. You may
use, copy, and distribute copies of the software in any medium, provided
that you keep intact this entire notice. You may improve, modify, and
create derivative works of the software or any portion of the software,
and you may copy and distribute such modifications or works. Modified
works should carry a notice stating that you changed the software and
should note the date and nature of any such change. Please explicitly
acknowledge the National Institute of Standards and Technology as the
source of the software.

NIST-developed software is expressly provided "AS IS." NIST MAKES NO
WARRANTY OF ANY KIND, EXPRESS, IMPLIED, IN FACT, OR ARISING BY OPERATION
OF LAW, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTY OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, NON-INFRINGEMENT, AND
DATA ACCURACY. NIST NEITHER REPRESENTS NOR WARRANTS THAT THE OPERATION
OF THE SOFTWARE WILL BE UNINTERRUPTED OR ERROR-FREE, OR THAT ANY DEFECTS
WILL BE CORRECTED. NIST DOES NOT WARRANT OR MAKE ANY REPRESENTATIONS
REGARDING THE USE OF THE SOFTWARE OR THE RESULTS THEREOF, INCLUDING BUT
NOT LIMITED TO THE CORRECTNESS, ACCURACY, RELIABILITY, OR USEFULNESS OF
THE SOFTWARE.

You are solely responsible for determining the appropriateness of using
and distributing the software and you assume all risks associated with
its use, including but not limited to the risks and costs of program
errors, compliance with applicable laws, damage to or loss of data,
programs or equipment, and the unavailability or interruption of
operation. This software is not intended to be used in any situation
where a failure could cause risk of injury or damage to property. The
software developed by NIST employees is not subject to copyright
protection within the United States.

Co-authored-by: Alex Nelson <[email protected]>
Signed-off-by: Sheldon Douglas <[email protected]>