Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] add flexible & iterative support for outputting signatures in variety of formats #1493

Merged
merged 19 commits into from
May 5, 2021

Conversation

ctb
Copy link
Contributor

@ctb ctb commented May 1, 2021

This PR adds support for a variety of output match formats using sourmash_args.SaveSignaturesToLocation(...) -

  • in sourmash search and sourmash gather when using --save-matches.
  • in most sourmash signature functions
  • potentially in sourmash compute and sourmash sketch

This does not support indexed format output like .sbt.zip or .lca.json because those only work with one ksize/moltype/etc.

This functionalty was requested by @bluegenes in #1440.

usage

if --save-matches is given a filename with a trailing /, it saves matches as files in that directory.

for example,

sourmash prefetch -k 31 tests/test-data/47.fa.sig tests/test-data/63.fa.sig  --save-matches=bar/

saves the matching signature as bar/38729c63.sig.gz. Similarly, this works for .zip and .sig and .sig.gz files, making the appropriate format choices as you'd expect.

This class also lets us avoid keeping matching signatures in memory where possible, which is extra nice for large databases / long-running queries such as those supported by prefetch in #1370; ref also better UX for massive collections #1350.

misc TODO:

@codecov
Copy link

codecov bot commented May 1, 2021

Codecov Report

Merging #1493 (c158c69) into latest (5b698a6) will increase coverage by 5.17%.
The diff coverage is 96.59%.

Impacted file tree graph

@@            Coverage Diff             @@
##           latest    #1493      +/-   ##
==========================================
+ Coverage   89.75%   94.93%   +5.17%     
==========================================
  Files         123       97      -26     
  Lines       19579    16266    -3313     
  Branches     1498     1515      +17     
==========================================
- Hits        17574    15442    -2132     
+ Misses       1778      595    -1183     
- Partials      227      229       +2     
Flag Coverage Δ
python 94.93% <96.59%> (+0.05%) ⬆️
rust ?

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
src/sourmash/cli/sig/cat.py 100.00% <ø> (ø)
src/sourmash/cli/sig/downsample.py 100.00% <ø> (ø)
src/sourmash/cli/sig/extract.py 100.00% <ø> (ø)
src/sourmash/cli/sig/filter.py 100.00% <ø> (ø)
src/sourmash/cli/sig/flatten.py 100.00% <ø> (ø)
src/sourmash/cli/sig/rename.py 100.00% <ø> (ø)
src/sourmash/sourmash_args.py 94.37% <90.97%> (-1.67%) ⬇️
src/sourmash/command_compute.py 95.09% <100.00%> (+0.13%) ⬆️
src/sourmash/commands.py 84.52% <100.00%> (+0.68%) ⬆️
src/sourmash/sig/__main__.py 90.79% <100.00%> (ø)
... and 32 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5b698a6...c158c69. Read the comment docs.

@ctb
Copy link
Contributor Author

ctb commented May 2, 2021

This also now fixes #1428 by specifying utf-8 as the default output encoding. Is this a good idea, @luizirber? All the tests pass with only one rather simple fix needed, so... 🤷

A remaining nit in this PR is a Rust panic triggered in the sourmash compute code -- it appears that asking for an md5sum representation of a signature containing multiple MinHashes is not handled in the rust code. @luizirber is there any possible fix for this or should I abandon this functionality for output signatures for now?

error triggered by test test_do_sourmash_sketchdna_output_zipfile:

Traceback (most recent call last):
  File "/Users/t/dev/sourmash/tests/sourmash_tst_utils.py", line 105, in runscript  
    status = _runscript(scriptname)
  File "/Users/t/dev/sourmash/tests/sourmash_tst_utils.py", line 59, in _runscript  
    namespace)
  File "/Users/t/miniconda3/envs/py37/bin/sourmash", line 33, in <module>
    sys.exit(load_entry_point('sourmash', 'console_scripts', 'sourmash')())
  File "/Users/t/dev/sourmash/src/sourmash/__main__.py", line 13, in main
    return mainmethod(args)
  File "/Users/t/dev/sourmash/src/sourmash/cli/sketch/dna.py", line 89, in main
    return sourmash.command_sketch.dna(args)
  File "/Users/t/dev/sourmash/src/sourmash/command_sketch.py", line 216, in dna
    _execute_sketch(args, signatures_factory)
  File "/Users/t/dev/sourmash/src/sourmash/command_sketch.py", line 196, in _execute_sketch
    _compute_individual(args, signatures_factory)
  File "/Users/t/dev/sourmash/src/sourmash/command_compute.py", line 214, in _compute_individual
    save_siglist(siglist, args.output)
  File "/Users/t/dev/sourmash/src/sourmash/command_compute.py", line 273, in save_siglist
    save_sig.add(ss)
  File "/Users/t/dev/sourmash/src/sourmash/sourmash_args.py", line 670, in add
    md5 = ss.md5sum()
  File "/Users/t/dev/sourmash/src/sourmash/signature.py", line 82, in md5sum
    return decode_str(self.minhash._methodcall(lib.kmerminhash_md5sum))
  File "/Users/t/dev/sourmash/src/sourmash/signature.py", line 46, in minhash
    self._methodcall(lib.signature_first_mh)
  File "/Users/t/dev/sourmash/src/sourmash/utils.py", line 25, in _methodcall
    return rustcall(func, self._get_objptr(), *args)
  File "/Users/t/dev/sourmash/src/sourmash/utils.py", line 78, in rustcall
    raise exc
sourmash.exceptions.Panic: sourmash panicked: thread 'unnamed' panicked with 'not implemented' at src/core/src/ffi/signature.rs:172

@ctb ctb changed the title [WIP] add flexible & iterative support for saving signatures in variety of formats [WIP] add flexible & iterative support for outputting signatures in variety of formats May 3, 2021
@ctb
Copy link
Contributor Author

ctb commented May 3, 2021

A remaining nit in this PR is a Rust panic triggered in the sourmash compute code -- it appears that asking for an md5sum representation of a signature containing multiple MinHashes is not handled in the rust code. @luizirber is there any possible fix for this or should I abandon this functionality for output signatures for now?

error triggered by test test_do_sourmash_sketchdna_output_zipfile:

...

File "/Users/t/dev/sourmash/src/sourmash/utils.py", line 78, in rustcall
raise exc
sourmash.exceptions.Panic: sourmash panicked: thread 'unnamed' panicked with 'not implemented' at src/core/src/ffi/signature.rs:172

ah-hah, found the old issue - #1167 - and also #1159 (comment) and #616. I thought it was familiar!

I think I'm going to hack and slash in a fix at the Python layer, for now.

@ctb
Copy link
Contributor Author

ctb commented May 3, 2021

ready for review!

@ctb ctb changed the title [WIP] add flexible & iterative support for outputting signatures in variety of formats [MRG] add flexible & iterative support for outputting signatures in variety of formats May 3, 2021
Copy link
Contributor

@bluegenes bluegenes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Auto-detect desired output format and make it happen:: a beautiful UX that I'm looking forward to using!!

Mostly made some minor comment suggestions. My main concern is duplicated md5sums, since I seem to run into that fairly often, even with small(er) databases.

I don't have much to say on the implementation -- looks clean and clear to me, and I learned about super() and class inheritance :). Not sure if that means Luiz's eyes will be more helpful!

doc/command-line.md Outdated Show resolved Hide resolved
src/sourmash/cli/sig/rename.py Outdated Show resolved Hide resolved
tests/test_sourmash_args.py Outdated Show resolved Hide resolved
tests/test_sourmash_args.py Outdated Show resolved Hide resolved
tests/test_sourmash_args.py Outdated Show resolved Hide resolved
tests/test_sourmash_args.py Outdated Show resolved Hide resolved
tests/test_sourmash_args.py Outdated Show resolved Hide resolved
doc/command-line.md Show resolved Hide resolved
assert self.zf
super().add(ss)

md5 = ss.md5sum()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm worried about using md5sums for zipfiles, given that we have duplicate md5sums reasonably often in our databases.

Can we implement something to avoid overwriting duplicates?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

k, will work on it as a PR into this 'un!

ctb and others added 8 commits May 3, 2021 21:08
Co-authored-by: Tessa Pierce Ward <[email protected]>
Co-authored-by: Tessa Pierce Ward <[email protected]>
Co-authored-by: Tessa Pierce Ward <[email protected]>
Co-authored-by: Tessa Pierce Ward <[email protected]>
Co-authored-by: Tessa Pierce Ward <[email protected]>
Co-authored-by: Tessa Pierce Ward <[email protected]>
Co-authored-by: Tessa Pierce Ward <[email protected]>
@luizirber
Copy link
Member

LGTM. I think this could also support the scanpy-like usage described in #1353 (comment) and #939 (comment)

@ctb
Copy link
Contributor Author

ctb commented May 5, 2021

🎉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants