[MRG] Add StandaloneManifestIndex class for direct loading of manifest CSVs #1891

Merged: ctb merged 45 commits into latest from add/manifestindex on Mar 28, 2022

@ctb ctb commented Mar 22, 2022

This PR enables direct loading of manifest CSV files by adding a StandaloneManifestIndex class. The class supports manifest operations directly and loads the files referenced by a manifest lazily, only when their contents are actually needed.

This is a simpler alternative to DirectoryIndex from #1619.

A future effort is to use this to better support manifests-of-manifests functionality #1671, and/or implement a sqlite-based manifest storage, perhaps in conjunction with #1808.

This PR:

  • creates a new Index class, StandaloneManifestIndex, and enables its loading from the command line.
  • somewhat better documents some of the internal aspects of Index implementation.
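
The lazy-loading idea described above can be sketched in a few lines. This is only an illustration, not the StandaloneManifestIndex code from this PR: the class name `LazyManifestIndex`, the `loader` callable, and the `internal_location` column name are all assumptions made for the example.

```python
import csv
from typing import Callable, Dict, Iterator, List


class LazyManifestIndex:
    """Hypothetical stand-in for a manifest-backed index (not sourmash's class)."""

    def __init__(self, rows: List[Dict[str, str]], loader: Callable[[str], object]):
        self.rows = rows      # manifest rows: cheap metadata only
        self.loader = loader  # called only when a file's content is needed

    @classmethod
    def from_csv(cls, text: str, loader: Callable[[str], object]) -> "LazyManifestIndex":
        # Assumption: comment/version lines start with '#' and can be skipped.
        lines = [ln for ln in text.splitlines() if not ln.startswith("#")]
        return cls(list(csv.DictReader(lines)), loader)

    def __len__(self) -> int:
        return len(self.rows)

    def select(self, **criteria: str) -> "LazyManifestIndex":
        # Filtering looks at manifest rows only; no file is opened here.
        kept = [row for row in self.rows
                if all(row.get(key) == value for key, value in criteria.items())]
        return LazyManifestIndex(kept, self.loader)

    def signatures(self) -> Iterator[object]:
        # Files are loaded one at a time, and only for rows that survived selection.
        for row in self.rows:
            yield self.loader(row["internal_location"])
```

With a loader that records its calls, `select(ksize="31")` can be seen to open nothing, and iterating `signatures()` afterwards opens only the selected files.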

Closes #1096
Closes #1641

TODO

  • write docs
  • write some basic tests for searching/loading/etc. a la test_index.
  • add tests for 'location' reporting, and add info to docstrings.
  • make sure we have a redundant test for lazy load that checks actual signatures, vs the existing test_sig_describe_3_manifest_fails_when_moved
  • think about how to produce better errors from pathlists, manifests with unresolvable signature locations, etc. (ref loading signatures from pathlist fails confusingly if pathlist contains bad paths #1845)

Example

This PR lets you do:

sourmash sig manifest tests/test-data/track_abund -o tests/test-data/track_abund/mf.csv
...
sourmash sig describe tests/test-data/track_abund/mf.csv

that is, tests/test-data/track_abund/mf.csv becomes a loadable sourmash collection. 😎

Importantly, that collection contains a manifest, because it is a manifest. This means that all the signature selection commands work on it without necessarily loading the files underneath.

If you give sourmash sig manifest a directory, it will produce the manifest relative to that directory - that is, the manifest internal locations will all be relative to that top-level directory.
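
A minimal sketch of what "relative to that top-level directory" could mean when a location is resolved. It assumes the manifest file sits in the top-level directory, as in the example above; the function name `resolve_location` is hypothetical and this is not sourmash's own resolution code.

```python
from pathlib import Path


def resolve_location(manifest_path: str, internal_location: str) -> Path:
    """Resolve a manifest entry to a file path (illustrative assumption only)."""
    location = Path(internal_location)
    if location.is_absolute():
        # Absolute entries are used as-is, wherever the manifest lives.
        return location
    # Relative entries are interpreted against the manifest's own directory.
    return Path(manifest_path).parent / location
```

Under this assumption, moving the manifest out of its directory changes where every relative entry points, which is why relative manifests are only portable together with the directory they describe.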

That's useful but... the proximal motivation is to help with #1671. And for that, the cool functionality provided here is for absolute paths. If you do:

ls -1 $(pwd)/tests/test-data/track_abund/* > test-pathlist.txt
sourmash sig manifest test-pathlist.txt -o test-pathlist.mf.csv

now you have a manifest containing absolute paths. This is obviously much less portable, but it is extremely useful when you have very large collections of signatures sitting somewhere and you want to be able to operate on them without reloading the files (i.e., reparsing the JSON, or reloading all of the manifests from each file).
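
For anyone who prefers to build the pathlist from Python rather than with `ls`, here is a rough equivalent using only the standard library. The function names are made up for the example, and unlike `ls -1 dir/*` it lists only regular files directly inside the directory.

```python
from pathlib import Path
from typing import List


def absolute_pathlist(directory: str) -> List[str]:
    """Return sorted absolute paths of the regular files directly in `directory`."""
    base = Path(directory).resolve()  # resolve() yields an absolute path
    return sorted(str(path) for path in base.iterdir() if path.is_file())


def write_pathlist(directory: str, output: str) -> None:
    """Write one absolute path per line, the plain-text pathlist layout used above."""
    Path(output).write_text("\n".join(absolute_pathlist(directory)) + "\n")
```

The resulting file can then be passed to `sourmash sig manifest` exactly like `test-pathlist.txt` in the shell example.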

I don't currently think we should try to add a one-step CLI mechanism to generate a manifest with absolute paths, because it's going to lead to... challenges.

codecov bot commented Mar 22, 2022

Codecov Report

Merging #1891 (26e919b) into latest (aef0036) will increase coverage by 0.08%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           latest    #1891      +/-   ##
==========================================
+ Coverage   82.74%   82.83%   +0.08%     
==========================================
  Files         122      122              
  Lines       13203    13257      +54     
  Branches     1779     1789      +10     
==========================================
+ Hits        10925    10981      +56     
+ Misses       2014     2013       -1     
+ Partials      264      263       -1     
Flag     Coverage Δ
python   90.73% <100.00%> (+0.07%) ⬆️
rust     65.80% <ø> (ø)

Impacted Files                    Coverage Δ
src/sourmash/index/__init__.py    96.75% <100.00%> (+0.29%) ⬆️
src/sourmash/lca/lca_db.py        91.30% <100.00%> (+0.02%) ⬆️
src/sourmash/sourmash_args.py     93.55% <100.00%> (+0.04%) ⬆️
src/sourmash/manifest.py          93.54% <0.00%> (+1.61%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

ctb commented Mar 23, 2022

Hmm, I wonder if this is just another use case for MultiIndex? Except that MultiIndex keeps the signatures in memory, I think. So this is a lazy loading version.

@ctb ctb changed the title [EXP] Add StandaloneManifestIndex class for direct loading of manifest CSVs [WIP] Add StandaloneManifestIndex class for direct loading of manifest CSVs Mar 25, 2022
@ctb ctb changed the title [WIP] Add StandaloneManifestIndex class for direct loading of manifest CSVs [MRG] Add StandaloneManifestIndex class for direct loading of manifest CSVs Mar 27, 2022
ctb commented Mar 27, 2022

This is ready for review & merge, @sourmash-bio/devs.

Co-authored-by: Tessa Pierce Ward

@bluegenes bluegenes left a comment

looks good to me

@bluegenes

Some mechanism to change/reset paths in a manifest might be nice in the future, to help switch between abs paths and rel paths. Or not, at least you're very clear about the dir restriction for rel paths for dir collections!

ctb commented Mar 28, 2022

> Some mechanism to change/reset paths in a manifest might be nice in the future, to help switch between abs paths and rel paths. Or not, at least you're very clear about the dir restriction for rel paths for dir collections!

Yes! I kind of left it up in the air (beyond documenting it...) because as soon as I started to think about nailing it down further, it became complicated. I opted for defaulting to the "hey you can generate a manifest for a directory! just put it under the TLD!" functionality. The good news is it's all CSV format so it's easy to manipulate the paths if you really have to.
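
To illustrate the "it's all CSV" point, here is a small sketch that rewrites relative locations in a manifest to absolute ones. The `internal_location` column name and the treatment of `#` lines as header comments are assumptions, and `absolutize_locations` is a hypothetical helper, not part of sourmash.

```python
import csv
import io
from pathlib import Path


def absolutize_locations(manifest_text: str, base_dir: str,
                         column: str = "internal_location") -> str:
    """Return manifest text with `column` rewritten relative to `base_dir`."""
    lines = manifest_text.splitlines()
    # Assumption: lines starting with '#' are comment headers to copy through.
    comments = [ln for ln in lines if ln.startswith("#")]
    body = [ln for ln in lines if not ln.startswith("#")]

    reader = csv.DictReader(body)
    out = io.StringIO()
    for ln in comments:
        out.write(ln + "\n")
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # Joining onto an already-absolute path leaves that path unchanged.
        row[column] = str(Path(base_dir) / row[column])
        writer.writerow(row)
    return out.getvalue()
```

Going the other way (absolute to relative) would be a similar loop using `Path.relative_to`, with the caveat that it raises `ValueError` for files outside the chosen directory.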

@ctb ctb merged commit f05e4bd into latest Mar 28, 2022
@ctb ctb deleted the add/manifestindex branch March 28, 2022 23:36
ctb commented Mar 28, 2022

🎉
