If the value of a mapping metadata slot happens to contain a tab character (0x09), how should that value be represented in the SSSOM/TSV format?
The spec currently says nothing about that, and we can’t rely on the description of the underlying TSV format, since the only somewhat “official” description of the TSV format (as registered to IANA) basically washes its hands of the issue by simply forbidding tab characters inside the fields of a TSV file (“fields that contain tabs are not allowable in this encoding”).
It may be unlikely that a metadata slot will ever contain tabs, but that could happen anyway and in that case it’d be nice if all implementations behave similarly.
Possible options:
A) We strictly follow the IANA description of the TSV format: tab characters are not allowed anywhere within the fields of a SSSOM/TSV file. A SSSOM/TSV writer should refuse to write a mapping set where mapping metadata values contain tab characters.
B) C-style escaping: A SSSOM/TSV writer must write tab characters within a field as \t. A SSSOM/TSV parser must recognise such escaped sequences when parsing and convert them back to a normal tab character.
C) CSV-style quoting: We apply to TSV the same quoting rules as defined for the CSV format (RFC 4180). If the value of a field contains tab characters, line breaks, and/or double quotes, then the entire field must be written between double quotes; internal double quotes must be escaped by being preceded by another double quote. A SSSOM/TSV parser must recognise a quoted field and remove the quotes as needed to pass to the application the real value of the field.
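To make option C concrete, here is a minimal sketch using Python's standard `csv` module with a tab delimiter. The helper names (`write_tsv`, `read_tsv`) and the sample rows are made up for illustration; this is not how sssom-py or sssom-java actually implement it, just a demonstration of RFC 4180-style quoting applied to TSV.

```python
import csv
import io


def write_tsv(rows):
    """Serialise rows as TSV, quoting only the fields that need it."""
    buf = io.StringIO()
    writer = csv.writer(
        buf,
        delimiter="\t",
        quoting=csv.QUOTE_MINIMAL,  # quote fields containing tabs, quotes or line breaks
        lineterminator="\n",
    )
    writer.writerows(rows)
    return buf.getvalue()


def read_tsv(text):
    """Parse TSV text, removing quotes and un-doubling internal double quotes."""
    return list(csv.reader(io.StringIO(text, newline=""), delimiter="\t"))


# Hypothetical sample data: one value contains a tab, another contains double quotes.
rows = [
    ["subject_id", "comment"],
    ["EX:1", "left\tright"],
    ["EX:2", 'a "quoted" word'],
]

text = write_tsv(rows)
print(text)
print(read_tsv(text) == rows)
```

The field containing a tab is written between double quotes, the internal double quotes are doubled, and reading the text back gives the original values.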
In effect, sssom-py (or rather the underlying Pandas library) is already supporting CSV-style quoting both for reading and writing. sssom-java supports CSV-style quoting when reading (not yet when writing). So I believe the spec should formalise option C as the expected behaviour.
Not sure if it is the “most widely understood” (I didn’t test all the available TSV parsing libraries out there! :D ), but that’s for sure the one understood by both pandas (used under the hood by SSSOM-Py as far as I know) and FasterXML’s Jackson (used under the hood by SSSOM-Java).
Both libraries also support C-style escaping (option B), but they both default to CSV-style quoting. Developers have to explicitly select C-style escaping if that’s what they want to use.
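For comparison, a hand-rolled sketch of what option B could look like. The function names are hypothetical, and escaping the backslash itself (plus `\n` for line breaks) is an assumption on my part: the proposal above only mentions `\t`, but without escaping backslashes a literal `\t` in a value could not be told apart from an escaped tab.

```python
# Assumed escape table: the option B text only specifies "\t";
# the backslash and newline entries are additions for this sketch.
_UNESCAPE = {"t": "\t", "n": "\n", "\\": "\\"}


def escape_field(value: str) -> str:
    """Escape backslashes first, then tabs and line breaks."""
    return (
        value.replace("\\", "\\\\")
        .replace("\t", "\\t")
        .replace("\n", "\\n")
    )


def unescape_field(text: str) -> str:
    """Reverse escape_field by scanning left to right."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text) and text[i + 1] in _UNESCAPE:
            out.append(_UNESCAPE[text[i + 1]])
            i += 2  # consume the backslash and the escape letter
        else:
            out.append(ch)
            i += 1
    return "".join(out)


original = "left\tright and a literal \\t"
escaped = escape_field(original)
print(escaped)
print(unescape_field(escaped) == original)
```

Unlike option C, the escaped field never contains a raw tab, so a naive split on tabs still works, but every reader has to know about the escaping rules.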
Given that both SSSOM-Py and SSSOM-Java have been implicitly using CSV-style quoting from the beginning, I see no reason to change that now. So if we opt for (C) then the only action item needed here is a spec/documentation update to officially “ratify” what is the de facto standard behaviour.