Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CSV output for sourmash search needs upgrading #1390

Open
ctb opened this issue Mar 12, 2021 · 5 comments
Open

CSV output for sourmash search needs upgrading #1390

ctb opened this issue Mar 12, 2021 · 5 comments
Labels
5.0 issues to address for a 5.0 release
Milestone

Comments

@ctb
Copy link
Contributor

ctb commented Mar 12, 2021

(Some of this might be 5.0 material, because they change the file format in backwards-incompatible ways)

A few issues --

related to #1247, #410, and #448.

It's not really clear what to do here. The addition of prefetch #1370 might provide a useful alternative here, and/or we could provide JSON output that has more ...flexibility per #448.

@ctb ctb added the 5.0 issues to address for a 5.0 release label Mar 12, 2021
@ctb
Copy link
Contributor Author

ctb commented Mar 12, 2021

Oh, and also, we're inconsistent with md5sum output per #1346 (review)

@ctb
Copy link
Contributor Author

ctb commented Apr 27, 2022

we could/should also consider including the metric used - jaccard, containment, max containment.

and/or just, like, calculate all of those.

@ctb
Copy link
Contributor Author

ctb commented Apr 27, 2022

(sigh, for scaled sketches; more than jaccard not possible with regular MinHash)

@luizirber
Copy link
Member

CSVs are hard (impossible?) to version, but we should have some way of doing that too. Or do we just keep ever growing the CSV and never removing columns? 🙃

@ctb
Copy link
Contributor Author

ctb commented Apr 27, 2022

thoughts on approach in #1555?

Basically, I think it's OK to pin column names to sourmash versions, with appropriate deprecation approaches and command-line upgrade flags. That fits with their use in workflows.

In manifests, we are using:

# SOURMASH-MANIFEST-VERSION: 1.0

but I'm pretty confident that this breaks pandas/Python header detection, sigh. IMO it was OK to do this for manifests because these are not intended to be end-user-consumable.

#416 has the idea of building standard pandas/CSV loading functions for sourmash output, which is something I'm trying out over in genome-grist for gather output - dib-lab/genome-grist#176. But I'd be loathe to break all CSV readers everywhere :(.

I guess... we could include a "version for this CSV format" in the first column in the first row, and leave that column blank, or something? or do the same but for the last column in the first row (so, less visible, but leaving it blank is less annoying for manual inspection of the CSV). This would make it a header but that's ok.

@luizirber luizirber added this to the 5.0 milestone Dec 1, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5.0 issues to address for a 5.0 release
Projects
None yet
Development

No branches or pull requests

2 participants