CSV output for `sourmash search` needs upgrading #1390

ctb · 2021-03-12T15:34:45Z

(Some of this might be 5.0 material, because these changes would alter the file format in backwards-incompatible ways.)

A few issues --

- In the review of "[MRG] add max_containment to MinHash class" (#1346), @bluegenes notes that the CSV output contains the header 'similarity' and says "It would be nice to modify similarity to containment / max_containment for csv output" when `--max-containment` or `--containment` is specified.
- Note also that we can't always provide containment numbers, since search supports regular MinHashes.
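A rough sketch of what the first point could look like, using only the standard library. The names `score_column` and `write_results` and the two-column layout are hypothetical, made up for illustration; this is not existing sourmash code.

```python
import csv
import io


def score_column(containment=False, max_containment=False):
    """Pick the header for the score column (hypothetical helper)."""
    if containment and max_containment:
        raise ValueError("choose only one of containment / max_containment")
    if max_containment:
        return "max_containment"
    if containment:
        return "containment"
    return "similarity"


def write_results(rows, fp, **mode):
    """Write (score, name) rows under a header that matches the search mode."""
    writer = csv.writer(fp, lineterminator="\n")
    writer.writerow([score_column(**mode), "name"])
    writer.writerows(rows)


# Toy data: a single match found in containment mode.
out = io.StringIO()
write_results([(0.25, "genome_a")], out, containment=True)
csv_text = out.getvalue()
```

The catch is the one raised at the top of this issue: any downstream code that looks up a 'similarity' column would break when the header changes.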

related to #1247, #410, and #448.

It's not really clear what to do here. The addition of prefetch (#1370) might provide a useful alternative, and/or we could provide JSON output that has more flexibility, per #448.


ctb · 2021-03-12T15:35:58Z

Oh, and also, we're inconsistent with md5sum output per #1346 (review)

ctb · 2022-04-27T14:07:32Z

we could/should also consider including the metric used - jaccard, containment, max containment.

and/or just, like, calculate all of those.

ctb · 2022-04-27T14:08:12Z

(sigh, for scaled sketches only; anything beyond Jaccard is not possible with regular MinHash)
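For illustration, here is how all three metrics fall out of the same two hash sets. This is a minimal sketch in which plain Python sets stand in for the hashes of two scaled sketches; `all_metrics` is a hypothetical name, not a sourmash function.

```python
def all_metrics(query_hashes, match_hashes):
    """Compute Jaccard, containment and max containment from two hash sets."""
    q, m = set(query_hashes), set(match_hashes)
    if not q or not m:
        raise ValueError("both hash sets must be non-empty")
    shared = len(q & m)
    return {
        # shared hashes over the union of both sets
        "jaccard": shared / len(q | m),
        # fraction of the query found in the match
        "containment": shared / len(q),
        # the larger of the two containments, i.e. shared over the smaller set
        "max_containment": shared / min(len(q), len(m)),
    }


# Toy example: 8 query hashes, 4 match hashes, 2 shared.
metrics = all_metrics(range(1, 9), {7, 8, 9, 10})
```

Writing all three as separate columns would avoid having to rename a single score column depending on the search mode.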

luizirber · 2022-04-27T15:10:47Z

CSVs are hard (impossible?) to version, but we should have some way of doing that too. Or do we just keep ever growing the CSV and never removing columns? 🙃

ctb · 2022-04-27T15:19:59Z

thoughts on approach in #1555?

Basically, I think it's OK to pin column names to sourmash versions, with appropriate deprecation approaches and command-line upgrade flags. That fits with their use in workflows.

In manifests, we are using:

`# SOURMASH-MANIFEST-VERSION: 1.0`

but I'm pretty confident that this breaks pandas/Python header detection, sigh. IMO it was OK to do this for manifests because these are not intended to be end-user-consumable.
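A small standard-library sketch of the problem: a naive `csv.DictReader` takes the version line as the header row, so a reader has to consume that line first. The column names `md5` and `name` are toy values and `read_with_version` is a hypothetical helper, not how sourmash loads manifests.

```python
import csv
import io

# Toy manifest-style file with a leading version comment line.
text = "# SOURMASH-MANIFEST-VERSION: 1.0\nmd5,name\nabc123,genome_a\n"

# Naive read: the comment line is mistaken for the header.
naive_header = csv.DictReader(io.StringIO(text)).fieldnames


def read_with_version(fp):
    """Consume the version line, then parse the rest as ordinary CSV."""
    first = fp.readline()
    if not first.startswith("#"):
        raise ValueError("expected a leading version comment line")
    version = first.lstrip("#").split(":", 1)[1].strip()
    return version, list(csv.DictReader(fp))


version, rows = read_with_version(io.StringIO(text))
```

With pandas, the equivalent workaround would presumably be passing `comment='#'` or `skiprows=1` to `read_csv`, which is exactly the kind of extra step end users shouldn't need for regular output.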

#416 has the idea of building standard pandas/CSV loading functions for sourmash output, which is something I'm trying out over in genome-grist for gather output (dib-lab/genome-grist#176). But I'd be loath to break all CSV readers everywhere :(.

I guess... we could include a "version for this CSV format" in the first column in the first row, and leave that column blank, or something? or do the same but for the last column in the first row (so, less visible, but leaving it blank is less annoying for manual inspection of the CSV). This would make it a header but that's ok.
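To make the last-column variant concrete, a minimal sketch: the final header cell carries the format version and the cells under it stay blank. The `csv_format_v` prefix and both function names are hypothetical, chosen only for this example.

```python
import csv
import io

VERSION_PREFIX = "csv_format_v"  # hypothetical marker, not a sourmash convention


def write_versioned(rows, columns, version, fp):
    """Append a version-carrying header cell; leave its data cells blank."""
    writer = csv.writer(fp, lineterminator="\n")
    writer.writerow(list(columns) + [f"{VERSION_PREFIX}{version}"])
    for row in rows:
        writer.writerow(list(row) + [""])


def read_versioned(fp):
    """Return (version, rows); version is None if no marker column exists."""
    reader = csv.DictReader(fp)
    fieldnames = reader.fieldnames or []
    marker = fieldnames[-1] if fieldnames else ""
    if not marker.startswith(VERSION_PREFIX):
        return None, list(reader)
    version = marker[len(VERSION_PREFIX):]
    # Drop the blank marker column from each row.
    rows = [{k: v for k, v in row.items() if k != marker} for row in reader]
    return version, rows


out = io.StringIO()
write_versioned([(0.25, "genome_a")], ["similarity", "name"], "2", out)
csv_text = out.getvalue()
version, rows = read_versioned(io.StringIO(csv_text))
```

Existing readers that select columns by name would keep working and just see one extra, empty column.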

ctb added the 5.0 label (issues to address for a 5.0 release) Mar 12, 2021

ctb mentioned this issue Mar 12, 2021

[MRG] add max_containment to MinHash class. #1346

Merged

11 tasks

ctb mentioned this issue Apr 27, 2022

upgrade search to display more information? #2002

Open

luizirber added this to the 5.0 milestone Dec 1, 2023
