The SSSOM/TSV spec should specify how to deal with tab characters in the TSV section #342

gouttegd · 2024-01-01T12:47:04Z

If the value of a mapping metadata slot happens to contain a tab character (0x09), how should that value be represented in the SSSOM/TSV format?

The spec currently says nothing about that, and we can’t rely on the description of the underlying TSV format. The only somewhat “official” description of TSV (the one registered with IANA) basically washes its hands of the issue by simply forbidding tab characters inside the fields of a TSV file (“fields that contain tabs are not allowable in this encoding”).

It may be unlikely that a metadata slot will ever contain tabs, but it could still happen, and in that case it’d be nice if all implementations behaved the same way.

Possible options:

A) We strictly follow the IANA description of the TSV format: tab characters are not allowed anywhere within the fields of a SSSOM/TSV file. A SSSOM/TSV writer should refuse to write a mapping set where mapping metadata values contain tab characters.

B) C-style escaping: A SSSOM/TSV writer must write tab characters within a field as \t. A SSSOM/TSV parser must recognise such escaped sequences when parsing and convert them back to a normal tab character.

C) CSV-style quoting: We apply to TSV the same quoting rules as defined for the CSV format (RFC 4180). If the value of a field contains tab characters, line breaks, and/or double quotes, then the entire field must be written between double quotes; internal double quotes must be escaped by being preceded by another double quote. A SSSOM/TSV parser must recognise a quoted field and remove the quotes as needed to pass to the application the real value of the field.
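To make the difference between options B and C concrete, here is a minimal Python sketch of how a writer might encode a single field under each option. The function names are hypothetical. Escaping the backslash itself in option B is an assumption that is not stated above; it is added only so that decoding would be unambiguous.

```python
def encode_option_b(value: str) -> str:
    """Option B: C-style escaping of tab characters.

    Assumption: backslashes are escaped as well, so that a literal
    backslash followed by "t" cannot be mistaken for an escaped tab.
    """
    return value.replace("\\", "\\\\").replace("\t", "\\t")


def encode_option_c(value: str) -> str:
    """Option C: CSV-style quoting (RFC 4180 rules, tab as separator)."""
    if any(ch in value for ch in ("\t", "\n", "\r", '"')):
        # Wrap the whole field in double quotes and double any inner quote.
        return '"' + value.replace('"', '""') + '"'
    return value
```

Under option C a field without special characters is written unchanged, so most existing SSSOM/TSV files would not be affected.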

In practice, sssom-py (or rather the underlying Pandas library) already supports CSV-style quoting for both reading and writing. sssom-java supports CSV-style quoting when reading (not yet when writing). So I believe the spec should formalise option C as the expected behaviour.
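As an illustration of the CSV-style behaviour, here is a small round trip using Python’s standard `csv` module with a tab delimiter. This is only a sketch: it does not go through sssom-py or Pandas, and the `EX:` identifiers and comment values are made up.

```python
import csv
import io

rows = [
    ["subject_id", "comment"],
    ["EX:1", "contains\ta tab"],
    ["EX:2", 'contains "quotes"'],
]

# Write with tab as the delimiter. The default quoting mode (QUOTE_MINIMAL)
# only quotes fields that contain the delimiter, a quote, or a line break.
buf = io.StringIO()
csv.writer(buf, delimiter="\t", lineterminator="\n").writerows(rows)
text = buf.getvalue()

# Read the text back; quoted fields are unquoted by the reader.
parsed = list(csv.reader(io.StringIO(text), delimiter="\t"))
```

After the round trip, `parsed` holds the same values as `rows`, and in `text` the field containing the tab is written between double quotes.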


matentzn · 2024-01-03T14:52:24Z

If you think C is the most widely understood option, I give my support as well.

gouttegd · 2024-01-03T17:09:09Z

Not sure if it is the “most widely understood” (I didn’t test all the available TSV parsing libraries out there! :D ), but that’s for sure the one understood by both pandas (used under the hood by SSSOM-Py as far as I know) and FasterXML’s Jackson (used under the hood by SSSOM-Java).

Both libraries also support C-style escaping (option B), but they both default to CSV-style quoting. Developers have to explicitly select C-style escaping if that’s what they want to use.

Given that both SSSOM-Py and SSSOM-Java have been implicitly using CSV-style quoting from the beginning, I see no reason to change that now. So if we opt for (C) then the only action item needed here is a spec/documentation update to officially “ratify” what is the de facto standard behaviour.

matentzn · 2024-01-03T18:55:15Z

Ok, sounds good!

gouttegd · 2024-07-11T10:16:37Z

Done as part of #368.

gouttegd self-assigned this Jan 1, 2024

gouttegd closed this as completed Jul 11, 2024
