The SSSOM/TSV spec should specify how to deal with tab characters in the TSV section #342

Closed
gouttegd opened this issue Jan 1, 2024 · 4 comments
@gouttegd
Contributor

gouttegd commented Jan 1, 2024

If the value of a mapping metadata slot happens to contain a tab character (0x09), how should that value be represented in the SSSOM/TSV format?

The spec currently says nothing about that, and we can’t rely on the description of the underlying TSV format: the only somewhat “official” description of TSV (as registered with IANA) basically washes its hands of the issue by simply forbidding tab characters inside the fields of a TSV file (“fields that contain tabs are not allowable in this encoding”).

It may be unlikely that a metadata slot will ever contain tabs, but it could happen anyway, and in that case it’d be nice if all implementations behaved the same way.

Possible options:

A) We strictly follow the IANA description of the TSV format: tab characters are not allowed anywhere within the fields of a SSSOM/TSV file. A SSSOM/TSV writer should refuse to write a mapping set where mapping metadata values contain tab characters.

B) C-style escaping: A SSSOM/TSV writer must write tab characters within a field as \t. A SSSOM/TSV parser must recognise such escaped sequences when parsing and convert them back to a normal tab character.

C) CSV-style quoting: We apply to TSV the same quoting rules as defined for the CSV format (RFC 4180). If the value of a field contains tab characters, line breaks, and/or double quotes, then the entire field must be written between double quotes; internal double quotes must be escaped by being preceded by another double quote. A SSSOM/TSV parser must recognise a quoted field and remove the quotes as needed to pass to the application the real value of the field.
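To make option C concrete, here is a minimal sketch using Python's standard `csv` module with a tab delimiter. The helper names `write_tsv` and `read_tsv` and the sample row are made up for illustration; this is not how sssom-py or sssom-java are actually implemented, only a demonstration of RFC 4180-style quoting applied to TSV.

```python
import csv
import io


def write_tsv(rows):
    """Serialise rows as TSV, quoting only the fields that need it (hypothetical helper)."""
    buf = io.StringIO()
    writer = csv.writer(
        buf,
        delimiter="\t",
        quotechar='"',
        quoting=csv.QUOTE_MINIMAL,  # quote only fields containing tab, quote or newline
        doublequote=True,           # an inner " is written as ""
        lineterminator="\n",
    )
    writer.writerows(rows)
    return buf.getvalue()


def read_tsv(text):
    """Parse TSV text, removing quotes to recover the real field values (hypothetical helper)."""
    reader = csv.reader(io.StringIO(text), delimiter="\t", quotechar='"', doublequote=True)
    return [row for row in reader]


# Illustrative data: the second field contains a tab and double quotes.
rows = [
    ["subject_id", "comment"],
    ["ex:1", 'has\ttab and "quote"'],
]
text = write_tsv(rows)
parsed = read_tsv(text)
```

In this sketch the second field of the data row is written between double quotes with its inner quotes doubled, and the reader hands back the original value, embedded tab included.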

In practice, sssom-py (or rather the underlying Pandas library) already supports CSV-style quoting for both reading and writing. sssom-java supports CSV-style quoting when reading, but not yet when writing. So I believe the spec should formalise option C as the expected behaviour.

@gouttegd gouttegd self-assigned this Jan 1, 2024
@matentzn
Collaborator

matentzn commented Jan 3, 2024

If you think C is the most widely understood option, I give my support as well.

@gouttegd
Contributor Author

gouttegd commented Jan 3, 2024

Not sure if it is the “most widely understood” (I didn’t test all the available TSV parsing libraries out there! :D ), but that’s for sure the one understood by both pandas (used under the hood by SSSOM-Py as far as I know) and FasterXML’s Jackson (used under the hood by SSSOM-Java).

Both libraries also support C-style escaping (option B), but they both default to CSV-style quoting. Developers have to explicitly select C-style escaping if that’s what they want to use.
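For comparison, this is roughly what option B would involve if written by hand. It is only a sketch: `escape_field` and `unescape_field` are hypothetical names, and it does not reproduce the escaping mode of pandas or Jackson. Escaping the backslash itself and newlines goes beyond what option B states; it is an assumption made here so that the round trip stays unambiguous.

```python
# Assumed escape table: option B only mentions \t; \\ and \n are additions for this sketch.
_UNESCAPE = {"t": "\t", "n": "\n", "\\": "\\"}


def escape_field(value: str) -> str:
    """Write tabs as \\t (backslash first, so existing backslashes are not misread later)."""
    return (
        value.replace("\\", "\\\\")
        .replace("\t", "\\t")
        .replace("\n", "\\n")
    )


def unescape_field(field: str) -> str:
    """Convert recognised escape sequences back to the characters they stand for."""
    out = []
    i = 0
    while i < len(field):
        ch = field[i]
        if ch == "\\" and i + 1 < len(field) and field[i + 1] in _UNESCAPE:
            out.append(_UNESCAPE[field[i + 1]])
            i += 2  # consume the backslash and the escape letter
        else:
            out.append(ch)  # ordinary character, or an unrecognised escape kept as-is
            i += 1
    return "".join(out)


original = "C:\\temp\tsecond part"   # contains a literal backslash and a real tab
escaped = escape_field(original)
restored = unescape_field(escaped)
```

The escaped form contains no raw tab characters, so splitting a line on tabs remains safe, but every reader has to know that the convention is in use.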

Given that both SSSOM-Py and SSSOM-Java have been implicitly using CSV-style quoting from the beginning, I see no reason to change that now. So if we opt for (C) then the only action item needed here is a spec/documentation update to officially “ratify” what is the de facto standard behaviour.

@matentzn
Collaborator

matentzn commented Jan 3, 2024

Ok, sounds good!

@gouttegd
Contributor Author

Done as part of #368.
