Require DFXML be encoded as UTF-8 #34
Comments
Good work. Thanks for the kind mention of my Usenix Unicode article; it’s one of my favorites.
… On Jul 20, 2018, at 4:11 PM, Alex Nelson ***@***.***> wrote:
DFXML can be generated for file systems that do not use UTF-8 encoding, or even that use arbitrary bytes. (For instance, HFS (not HFS+), allows any byte in a file name except the ASCII colon character.) DFXML has three objectives in case these bytes are encountered:
The original bytes should be preserved.
The original bytes should be decoded into human-readable strings.
The decoding process should present UTF-8 character strings (not byte strings) to a DFXML consumer without requiring additional scripting work to recognize transcodings (e.g. no mode should be required when opening a file that presents accented Latin characters originally encoded in macos-roman). When transcodings are done, though, they should be encoded in the DFXML and thus accessible by the DFXML API being used.
Within the DFXML schema, these constructs will be added to any encodable/transcodable string:
The string element will have an optional attribute "original_encoding" to indicate a transcoding occurred.
The original bytes encountered in the parse will be recorded in the optional attribute "original_bytes_base64".
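As a minimal sketch of how a producer might emit these two attributes, the following uses Python's standard `xml.etree.ElementTree`. The helper name `make_filename_element` and the sample bytes are hypothetical, and the attribute names are taken from the proposal above rather than from any released schema.

```python
import base64
import xml.etree.ElementTree as ET


def make_filename_element(raw: bytes, encoding: str) -> ET.Element:
    """Build a <filename> element that records both the transcoding and the original bytes.

    Hypothetical helper; attribute names follow the proposal in this Issue.
    """
    elem = ET.Element("filename")
    # The child text is the decoded Unicode string; the XML serializer handles the UTF-8 output.
    elem.text = raw.decode(encoding)
    elem.set("original_encoding", encoding)
    elem.set("original_bytes_base64", base64.standard_b64encode(raw).decode("ascii"))
    return elem


# Sample input (an assumption for illustration): "café" as encoded in Mac OS Roman.
elem = make_filename_element(b"caf\x8e", "mac_roman")
print(ET.tostring(elem, encoding="unicode"))
```

A consumer that only reads the element text never needs to know a transcoding happened, while one that needs the exact on-disk name can decode the base64 attribute.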
Some new semantics result from this, because any such string can now be in one of 8 states, determined by the presence or absence of three conditions: (A) the original bytes in base64, (B) the original encoding (indicating a transcoding), and (C) the element's child text. In the walk through the state space below, an underscore marks an absent condition:
[___] It could be that a fileobject was not meant to be named, such as with DFXML being used to distribute file hashes (e.g. with some modes of hashdeep).
[__C] Absence of the original bytes and original encoding attributes implies the string was recorded in DFXML exactly as it was encountered.
[_B_] This state is a bit uninformative, and should be avoided.
[_BC] This implies original bytes were not recorded. This state should be avoided in case of an error with the script that performed the transcoding. If such an error occurred, the original bytes may not be derivable from the DFXML record.
[A__] The original bytes could not be rendered as a UTF-8 string. This can occur in an example where control characters are embedded as a file name. (H/t to @dd388 for finding a case where an HFS file system recorded "^C^B^AMove&Rename", special control characters for a Mac OS somewhere around version 7.)
[A_C] If original bytes are present and UTF-8 text is recorded, this shall imply the original encoding was UTF-8. This may be desired in cases where a unicode character could be encoded in multiple ways, such as with unicode combining characters.
[AB_] This state implies the original bytes are decodable, but do not have a corresponding point in unicode space.
[ABC] All t's crossed, all i's dotted.
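The state labels above can be computed mechanically from a parsed element. This is only a sketch under the assumption that the attributes are named as proposed in this Issue; `transcoding_state` and `DISCOURAGED_STATES` are hypothetical names.

```python
import xml.etree.ElementTree as ET

# States the description above says should be avoided.
DISCOURAGED_STATES = {"_B_", "_BC"}


def transcoding_state(elem: ET.Element) -> str:
    """Return a three-character label such as 'A_C' for the given element."""
    has_bytes = elem.get("original_bytes_base64") is not None
    has_encoding = elem.get("original_encoding") is not None
    has_text = bool(elem.text)
    return (
        ("A" if has_bytes else "_")
        + ("B" if has_encoding else "_")
        + ("C" if has_text else "_")
    )


samples = [
    "<filename/>",
    "<filename>a.txt</filename>",
    '<filename original_encoding="iso-8859-1">x</filename>',
    '<filename original_bytes_base64="YQ==" original_encoding="ascii">a</filename>',
]
for xml_text in samples:
    state = transcoding_state(ET.fromstring(xml_text))
    note = " (discouraged)" if state in DISCOURAGED_STATES else ""
    print(f"[{state}]{note}")
```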
In short, the preferred states include the original bytes (conditions A**) and include UTF-8 text whenever it is reachable (conditions **C). If there is nothing more complex than ASCII, __C (omitting original bytes and original encoding) would be fine. If the character data are more complex than ASCII and there is no chance of ambiguity, that is, the data are Unicode in which every character has only one representation, the original encoding can be omitted (condition A_C). However, this may be unnecessarily difficult to determine on the fly, so Unicode filenames may be best represented verbosely (condition ABC).
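One way a producer could apply these preferences is sketched below. It is an assumption-laden illustration, not a reference implementation: `describe_name` is a hypothetical helper, it never emits `A_C` (it takes the verbose `ABC` route instead), and it treats any name containing control characters that XML 1.0 cannot carry as bytes-only (`A__`).

```python
import base64
from typing import Optional


def _xml_safe(text: str) -> bool:
    """True if the text has no C0 control characters other than tab, LF, and CR."""
    return all(ch in "\t\n\r" or ord(ch) >= 0x20 for ch in text)


def describe_name(raw: bytes, encoding: Optional[str] = None) -> dict:
    """Pick a state for a raw name and return the text and attributes to record."""
    b64 = base64.standard_b64encode(raw).decode("ascii")
    try:
        # With no declared encoding, only plain ASCII is accepted.
        text = raw.decode(encoding or "ascii")
    except UnicodeDecodeError:
        text = None
    if text is None or not _xml_safe(text):
        # Undecodable, or not representable as XML text: keep only the bytes.
        return {"state": "A__", "text": None, "attrs": {"original_bytes_base64": b64}}
    if raw.isascii() and text == raw.decode("ascii"):
        # Plain ASCII recorded exactly as encountered.
        return {"state": "__C", "text": text, "attrs": {}}
    # Anything more complex than ASCII is recorded verbosely.
    attrs = {"original_bytes_base64": b64, "original_encoding": encoding}
    return {"state": "ABC", "text": text, "attrs": attrs}


print(describe_name(b"a.txt")["state"])
print(describe_name(b"Fancy Product \xae.exe", "iso-8859-1")["state"])
print(describe_name(b"\x03\x02\x01Move&Rename")["state"])
```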
Thanks to @dd388 for assistance drafting this description, to @timothyryanwalsh for raising the matter, and @simsong for discussion and an article on Programming in Unicode. The original structure proposed in this Issue is close to what came from discussion in the DFXML library Issue.
|
I've pushed this back a release to 1.4.0, because this needs a prototype code implementation, and a graceful-feeling solution has not yet come to mind. |
To give an illustrative example of this issue: One project I've encountered processed software files that included the "Registered" symbol in their file names. However, the method of producing those files ended up encoding that symbol as the single byte value 174 (0xAE):
>>> import base64
>>> x = b"Fancy Product \xae.exe"
>>> x
b'Fancy Product \xae.exe'
>>> base64.standard_b64encode(x)
b'RmFuY3kgUHJvZHVjdCCuLmV4ZQ=='
>>> x.decode("ascii")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xae in position 14: ordinal not in range(128)
>>> x.decode("utf-8")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xae in position 14: invalid start byte
>>> x.decode("iso-8859-1")
'Fancy Product ®.exe'
>>> y = x.decode("iso-8859-1").encode("utf-8")
>>> y
b'Fancy Product \xc2\xae.exe'
>>> base64.standard_b64encode(y)
b'RmFuY3kgUHJvZHVjdCDCri5leGU='
As DFXML, this could be recorded as:
<fileobject>
<filename
original_bytes_base64="RmFuY3kgUHJvZHVjdCCuLmV4ZQ=="
original_encoding="iso-8859-1">Fancy Product ®.exe</filename>
</fileobject> |
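On the consumer side, a record like this can be checked for internal consistency by decoding the base64 bytes with the declared encoding and comparing the result against the element text. The sketch below assumes the attribute names from the proposal (`original_bytes_base64`, `original_encoding`); `check_filename` and `SAMPLE` are hypothetical names.

```python
import base64
import xml.etree.ElementTree as ET

# Sample record for illustration; \u00ae is the "Registered" symbol.
SAMPLE = (
    "<fileobject>"
    '<filename original_bytes_base64="RmFuY3kgUHJvZHVjdCCuLmV4ZQ==" '
    'original_encoding="iso-8859-1">Fancy Product \u00ae.exe</filename>'
    "</fileobject>"
)


def check_filename(elem: ET.Element) -> bool:
    """Return True if the recorded bytes, decoded as declared, match the element text."""
    raw = base64.standard_b64decode(elem.get("original_bytes_base64"))
    return raw.decode(elem.get("original_encoding")) == elem.text


filename = ET.fromstring(SAMPLE).find("filename")
print(check_filename(filename))
```

A mismatch would suggest an error in the script that performed the transcoding, which is the failure mode the `_BC` state cannot detect.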
@ajnelson-nist one clarifying question: with "UTF-8 character strings", do you mean the 4-byte variant of RFC 3629 or the 6-byte variant of RFC 2279? (I assume the former, but prefer to be specific about it in this context.) |
@joachimmetz : I had intended, without digging into citations, to use the 4-byte variant of RFC 3629. But you raise a fair question.
The main influences in my understanding are XML and Python, which are directly in the dependencies of most DFXML applications I'm aware of; and RDF, which is not necessarily pertinent to DFXML but which I am aware does have a definition somewhere in its standards stack that its strings are UTF-8. I'm unaware of whether C++ has any inherent dependencies on Unicode; my understanding is there is no such dependency, due to C++ predating Unicode and just generally operating on more elementary data types, but @simsong could probably say better if we need something better said. I currently suspect we won't need better said.
I believe RFC 2279 is moot for consideration, because RFC 3629 obsoletes RFC 2279; RFC 3629 "implements" (loose terminology) ISO 10646; ISO 10646's Annex D (albeit the 2003 version) provides a "technically equivalent" definition in the Unicode Standard per unicode.org's glossary definition of UTF-8; and unicode.org is cited (by bibliography entry, not URL) as a normative reference of XML 1.0 Fifth Edition. Python 3's documentation cites unicode.org in its "Unicode HOWTO" page. So I'd follow the same reference chain, ending at RFC 3629, answering your question again with "I meant 4 bytes."
I haven't done the same dive recently through RDF, but my recollection is the citation chain goes through RDF Schema following XML Schema Datatypes.
If you're aware of an application that should make DFXML consider 6-byte UTF-8, I'd be curious to hear about it, but it would be a pretty significant conflict with DFXML's foundation on XML to try to support 6-byte UTF-8. |
I should note: It appears DFXML has always invisibly required its string-y content be UTF-8 by accident, because of some technological dependencies, especially between XML and Python. This Issue was filed possibly without realizing that, but there is still a real challenge being addressed in this Issue: how to represent transcoding of non-UTF-8 source data. |
Sticking with the 4-byte variant makes sense: it is the current version of UTF-8, is compatible with UTF-16, and is the one supported by current Python 3 implementations (e.g. the surrogate pair restriction). I'm not sure about XML; I would assume it supports the 4-byte one. However, the 6-byte variant is still used by certain formats and doesn't have the surrogate pair restriction as far as I can tell, so it might be good to account for it, at minimum for compatibility reasons. |
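The Python 3 behaviour mentioned here can be observed directly with the built-in `utf-8` codec. In this small sketch, `decodes_as_utf8` is a hypothetical helper and the byte strings are hand-built samples: a 4-byte sequence, 5-byte and 6-byte sequences in the older RFC 2279 layout, and a surrogate encoded as if it were an ordinary code point.

```python
def decodes_as_utf8(data: bytes) -> bool:
    """Return True if Python's strict 'utf-8' codec accepts the bytes."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


four_byte = "\U0001F600".encode("utf-8")   # a code point above U+FFFF
five_byte = b"\xf8\x88\x80\x80\x80"        # 5-byte form, only defined in RFC 2279
six_byte = b"\xfc\x84\x80\x80\x80\x80"     # 6-byte form, only defined in RFC 2279
surrogate = b"\xed\xa0\x80"                # U+D800 written in 3-byte form

for label, data in [
    ("4-byte", four_byte),
    ("5-byte", five_byte),
    ("6-byte", six_byte),
    ("surrogate", surrogate),
]:
    print(label, decodes_as_utf8(data))
```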
Are you able to link to the formats that are using the 6-byte variant? Do you know how "UTF-8 as defined by RFC 2279" should be spelled as an encoding string, e.g. like the strings in the "Standard Encodings" table in Python's documentation?
don't think it supports it, but have not looked
Not off the top of my head, but I assume anything claiming to be UTF-8 before RFC 3629 became a thing. Looks like some Microsoft formats might be using it: https://learn.microsoft.com/en-us/search/?scope=OpenSpecs&terms=RFC2279 |
Any example of these? AFAIK C0 and C1 control characters can be represented in 4-byte UTF-8 (RFC 3629). Also see: https://en.wikipedia.org/wiki/C0_and_C1_control_codes
AFAIK (1) surrogates such as U+D800, (2) values (currently) not mapped to characters, and (3) values beyond U+10FFFF are going to be the ones that need special treatment.
For (2), 4-byte UTF-8 should be able to encode these, but the result might not meet the "human-readable strings" criterion mentioned above.
One option could be to use "\U########" and "\u####" string notation for such characters.
Based on https://www.w3.org/TR/xml/#dt-charref and https://www.w3.org/TR/xml/#wf-Legalchar I'm not 100% sure if XML character escape allows "&#xD800;".
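To make the escaping option concrete, here is a rough sketch that tests code points against the `Char` production of XML 1.0 and rewrites anything outside it in "\u####" notation. Both function names are hypothetical, and the choice to double literal backslashes (so that the escaping stays reversible) is an assumption of this sketch, not something proposed in the thread.

```python
def is_xml10_char(cp: int) -> bool:
    """True if the code point matches the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def escape_for_xml(text: str) -> str:
    """Replace characters XML 1.0 cannot carry with backslash-u notation."""
    out = []
    for ch in text:
        if ch == "\\":
            out.append("\\\\")  # keep the escaping unambiguous
        elif is_xml10_char(ord(ch)):
            out.append(ch)
        else:
            # Every code point outside the Char production is at most U+FFFF.
            out.append("\\u%04x" % ord(ch))
    return "".join(out)


print(escape_for_xml("\x03\x02\x01Move&Rename"))
print(escape_for_xml("a\ud800"))
```

Note that C1 control characters (U+0080 to U+009F) pass this check, since the XML 1.0 production permits them, while C0 controls other than tab, LF, and CR do not.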
Unfortunately ISO/IEC 10646 has evolved/changed over the years [1]. This write up provides some historical context https://www.cl.cam.ac.uk/~mgk25/unicode.html |
What I could find is that both XML 1.0 and 1.1 are strict about not allowing such characters https://www.w3.org/TR/2006/REC-xml-20060816/Overview.html#charsets and https://www.w3.org/TR/xml11/#charsets And if I read the following [1] correctly:
It could be that |
@ajnelson-nist a couple of more scenarios to consider
|
Looks like there is WTF-8 https://en.wikipedia.org/wiki/UTF-8#WTF-8 |
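As far as I know, Python does not ship a codec named WTF-8, but the `surrogatepass` error handler of the built-in `utf-8` codec gives a feel for the idea: a lone surrogate is written out in the ordinary 3-byte form instead of being rejected. The helper names below are hypothetical, and this is a sketch of the concept rather than a conforming WTF-8 implementation.

```python
def encode_allowing_surrogates(text: str) -> bytes:
    """Encode to UTF-8, writing lone surrogates in 3-byte form instead of raising."""
    return text.encode("utf-8", "surrogatepass")


def decode_allowing_surrogates(data: bytes) -> str:
    """Inverse of encode_allowing_surrogates."""
    return data.decode("utf-8", "surrogatepass")


lone = "name-\ud800"

# The strict codec refuses the lone surrogate.
try:
    lone.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

encoded = encode_allowing_surrogates(lone)
print(strict_ok, encoded, decode_allowing_surrogates(encoded) == lone)
```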
Nice.
|
One of the software engineers raised a good point: Python has pathlib for this as well, which might also help cover the cp932 edge cases I mentioned. |
The patch provides `unpack_wstring` with a return type annotation and revisits the referenced, committed patch. This is a prototype and may be addressed further in a future patch.
Our understanding of type annotations for the fields the `declare_field` function creates is limited by its dynamic assignment of these variables. A type safe solution will be difficult to attain until `Block` and its subclasses are re-implemented into a `@property` getter/setter approach to provide static type review.
References:
* dfxml-working-group/dfxml_schema#34
* williballenthin@e96e04b
Disclaimer: Participation by NIST in the creation of the documentation of mentioned software is not intended to imply a recommendation or endorsement by the National Institute of Standards and Technology, nor is it intended to imply that any specific software is necessarily the best available for the purpose.
Licensing: Portions of this patch contributed by NIST are governed by the NIST Software Licensing Statement:
NIST-developed software is provided by NIST as a public service. You may use, copy, and distribute copies of the software in any medium, provided that you keep intact this entire notice. You may improve, modify, and create derivative works of the software or any portion of the software, and you may copy and distribute such modifications or works. Modified works should carry a notice stating that you changed the software and should note the date and nature of any such change. Please explicitly acknowledge the National Institute of Standards and Technology as the source of the software.
NIST-developed software is expressly provided "AS IS." NIST MAKES NO WARRANTY OF ANY KIND, EXPRESS, IMPLIED, IN FACT, OR ARISING BY OPERATION OF LAW, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTY OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, NON-INFRINGEMENT, AND DATA ACCURACY. NIST NEITHER REPRESENTS NOR WARRANTS THAT THE OPERATION OF THE SOFTWARE WILL BE UNINTERRUPTED OR ERROR-FREE, OR THAT ANY DEFECTS WILL BE CORRECTED. NIST DOES NOT WARRANT OR MAKE ANY REPRESENTATIONS REGARDING THE USE OF THE SOFTWARE OR THE RESULTS THEREOF, INCLUDING BUT NOT LIMITED TO THE CORRECTNESS, ACCURACY, RELIABILITY, OR USEFULNESS OF THE SOFTWARE.
You are solely responsible for determining the appropriateness of using and distributing the software and you assume all risks associated with its use, including but not limited to the risks and costs of program errors, compliance with applicable laws, damage to or loss of data, programs or equipment, and the unavailability or interruption of operation. This software is not intended to be used in any situation where a failure could cause risk of injury or damage to property. The software developed by NIST employees is not subject to copyright protection within the United States.
Co-authored-by: Alex Nelson <[email protected]>
Signed-off-by: Sheldon Douglas <[email protected]>