[MRG] Add StandaloneManifestIndex class for direct loading of manifest CSVs #1891

Merged
merged 45 commits into from
Mar 28, 2022
cc0db3a
add -d/--debug to various commands
ctb Jan 4, 2022
906ef0b
initial implementation of StandaloneManifestIndex
ctb Mar 22, 2022
72e9523
support prefix if not abspath
ctb Mar 22, 2022
0d79fb6
clean up
ctb Mar 23, 2022
b65e428
some standalone manifests tests - incl CLI
ctb Mar 23, 2022
56a31ad
iterate over internal locations instead
ctb Mar 23, 2022
1cfaab8
switch to picklist API
ctb Mar 23, 2022
da27a1b
aaaaand swap out for load_file_as_index :tada:
ctb Mar 23, 2022
a031d57
remove unnecessary spaces
ctb Mar 23, 2022
9a939a1
more tests
ctb Mar 23, 2022
47b1f40
more better prefix test
ctb Mar 23, 2022
a75815e
remove unnec space
ctb Mar 24, 2022
8a77943
Merge branch 'latest' of https://github.com/sourmash-bio/sourmash int…
ctb Mar 25, 2022
e1a1975
upgrade output error messages
ctb Mar 25, 2022
da682b2
Merge branch 'add/debug' into add/manifestindex
ctb Mar 25, 2022
6104a18
fix SBT subdir loading error
ctb Mar 25, 2022
d9d3bff
add message about using --debug
ctb Mar 25, 2022
112dd3b
Merge branch 'add/debug' into add/manifestindex
ctb Mar 25, 2022
e34bd1a
Merge branch 'add/test_sbt_load_fail' into add/manifestindex
ctb Mar 25, 2022
87b72b8
doc etc
ctb Mar 25, 2022
f4546de
rationalize _signatures_with_internal
ctb Mar 25, 2022
cd9e670
test describe and fileinfo on manifests
ctb Mar 25, 2022
5147dc4
think through more manifest stuff
ctb Mar 25, 2022
50e87c2
Merge branch 'latest' of https://github.com/sourmash-bio/sourmash int…
ctb Mar 25, 2022
59cfe9f
fix descr
ctb Mar 25, 2022
84b200b
rationalize _signatures_with_internal
ctb Mar 25, 2022
bdff48b
Merge branch 'refactor/mf_internal' into add/manifestindex
ctb Mar 25, 2022
3b591b8
fix docstring
ctb Mar 25, 2022
7e6caa9
add heading anchors config; fix napoleon package ref
ctb Mar 25, 2022
785e7c9
pin versions for doc building
ctb Mar 25, 2022
3e6872a
fix internal refs
ctb Mar 25, 2022
e8763b9
fix one last ref target
ctb Mar 25, 2022
6ead927
add docs
ctb Mar 25, 2022
85b2c12
clarify language
ctb Mar 25, 2022
de0b7b2
add docs
ctb Mar 25, 2022
b03ba2f
add more/better tests for lazy loading
ctb Mar 25, 2022
c1ada69
clarify
ctb Mar 25, 2022
1bd133d
a few more tests
ctb Mar 25, 2022
ab882cc
Merge branch 'fix/docs' into add/manifestindex
ctb Mar 26, 2022
c6a7e24
update docs
ctb Mar 26, 2022
38593b6
add explicit test for lazy-loading prefetch on StandaloneManifestIndex
ctb Mar 26, 2022
c126437
Merge branch 'latest' of https://github.com/sourmash-bio/sourmash int…
ctb Mar 27, 2022
a463fd0
update comments/docstrings
ctb Mar 27, 2022
da093e3
Update doc/command-line.md
ctb Mar 28, 2022
26e919b
update comments/docstrings
ctb Mar 28, 2022
65 changes: 65 additions & 0 deletions doc/command-line.md
Expand Up @@ -1621,3 +1621,68 @@ sig` commands will output to stdout. So, for example,

`sourmash sketch ... -o - | sourmash sig describe -` will describe the
signatures that were just created.

### Using manifests to explicitly refer to collections of files

(sourmash v4.4.0 and later)

Manifests are metadata catalogs of signatures that are used for
signature selection and loading. They are used extensively by sourmash
internals to speed up signature selection through picklists and
pattern matching.

Manifests can _also_ be used externally (via the command-line), and
may be useful for organizing large collections of signatures.

Suppose you have a large collection of signatures (`.sig` or `.sig.gz`
files) under a directory. You can create a manifest file for them like so:
```
sourmash sig manifest <dir> -o <dir>/manifest.csv
```
and then use the manifest directly for sourmash operations:
```
sourmash sig fileinfo <dir>/manifest.csv
```
This manifest can be used as a database target for most sourmash
operations - search, gather, etc. Note that manifests for directories
must be placed within (and loaded from) the directory from which the
manifest was generated; the specific manifest filename does not
matter.

A more advanced and slightly tricky way to use explicit manifest files
is with lists of files. If you create a file with a path list
containing the locations of loadable sourmash collections, you can run
`sourmash sig manifest pathlist.txt -o mf.csv` to generate a manifest
of all of the files. The resulting manifest in `mf.csv` can then be
loaded directly. This is very handy when you have many sourmash
signatures, or large signature files. The tricky part in doing this
is that the manifest will store the same paths listed in the pathlist
file - whether they are relative or absolute paths - and these paths
must be resolvable by sourmash from the current working directory.
This makes explicit manifests built from pathlist files less portable
within or across systems than the other sourmash collections, which
are all relocatable.

For example, if you create a pathlist file `paths.txt` containing the
following:
```
/path/to/zipfile.zip
local_directory/some_signature.sig.gz
local_dir2/
```
and then run:
```
sourmash sig manifest paths.txt -o mf.csv
```
you will be able to use `mf.csv` as a database for `sourmash search`
and `sourmash gather` commands. But, because it contains two relative paths,
you will only be able to use it _from the directory that contains those
two relative paths_.
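
One way to check ahead of time whether a pathlist will tie the resulting manifest to a particular working directory is to look for relative entries. The following is a minimal sketch using only the Python standard library; the helper name `relative_entries` is hypothetical and not part of sourmash, and it assumes POSIX-style paths:

```python
import posixpath


def relative_entries(pathlist_text):
    """Return the pathlist entries that are relative paths.

    Hypothetical helper, not part of sourmash: it only inspects the text
    of a pathlist and never touches the filesystem.
    """
    entries = [line.strip() for line in pathlist_text.splitlines()]
    # skip blank lines; keep anything that is not an absolute POSIX path
    return [e for e in entries if e and not posixpath.isabs(e)]


paths_txt = """\
/path/to/zipfile.zip
local_directory/some_signature.sig.gz
local_dir2/
"""

print(relative_entries(paths_txt))
```

For the `paths.txt` example above, this reports the two relative entries, which are the ones that restrict where `mf.csv` can be used.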

**Our advice:** We suggest using zip file collections for most
situations. We primarily recommend explicit manifests when you have
a **very large** collection of signatures (1000s or more) and want to
refer to different subsets of the collection without making multiple
copies of the signatures, as you would have to with zip files.
147 changes: 143 additions & 4 deletions src/sourmash/index/__init__.py
Expand Up @@ -25,11 +25,14 @@

ZipFileLinearIndex - simple on-disk storage of signatures.

class MultiIndex - in-memory storage and selection of signatures from multiple
index objects, using manifests.
MultiIndex - in-memory storage and selection of signatures from multiple
index objects, using manifests. All signatures are kept in memory.

LazyLoadedIndex - selection on manifests with loading of index on demand.

StandaloneManifestIndex - load manifests directly, and do lazy loading of
signatures on demand. No signatures are kept in memory.

CounterGather - an ancillary class returned by the 'counter_gather()' method.
"""

Expand All @@ -39,6 +42,7 @@ class MultiIndex - in-memory storage and selection of signatures from multiple
from collections import namedtuple, Counter
import csv
from io import TextIOWrapper
from collections import defaultdict

from ..search import make_jaccard_search_query, make_gather_query
from ..manifest import CollectionManifest
Expand All @@ -49,7 +53,12 @@ class MultiIndex - in-memory storage and selection of signatures from multiple
IndexSearchResult = namedtuple('Result', 'score, signature, location')

class Index(ABC):
# this will be removed soon; see sourmash#1894.
is_database = False

# 'manifest', when set, implies efficient selection and direct
# access to signatures. Signatures may be stored in the manifest
# or loaded on demand from disk depending on the class, however.
manifest = None

@abstractmethod
Expand Down Expand Up @@ -933,6 +942,11 @@ def sigloc_iter():

# build manifest; note, signatures are stored in memory.
# CTB: could do this on demand?
# CTB: should we use get_manifest functionality?
# CTB: note here that the manifest is created by iteration
# *even if it already exists.* This could be changed to be more
# efficient... but for now, use StandaloneManifestIndex if you
# want to avoid this when loading from multiple files.
manifest = CollectionManifest.create_manifest(sigloc_iter())

# create!
Expand All @@ -945,6 +959,8 @@ def load_from_directory(cls, pathname, *, force=False):
Takes directory path plus optional boolean 'force'. Attempts to
load all files ending in .sig or .sig.gz, by default; if 'force' is
True, will attempt to load _all_ files, ignoring errors.

Will not load anything other than JSON signature files.
"""
from ..sourmash_args import traverse_find_sigs

Expand Down Expand Up @@ -1007,8 +1023,8 @@ def load_from_path(cls, pathname, force=False):
def load_from_pathlist(cls, filename):
"""Create a MultiIndex from all files listed in a text file.

Note: this will load signatures from directories and databases, too,
if they are listed in the text file; it uses 'load_file_as_index'
Note: this will attempt to load signatures from each file,
including zip collections, etc; it uses 'load_file_as_index'
underneath.
"""
from ..sourmash_args import (load_pathlist_from_file,
Expand Down Expand Up @@ -1139,3 +1155,126 @@ def select(self, **kwargs):
new_manifest = manifest.select_to_manifest(**kwargs)

return LazyLoadedIndex(self.filename, new_manifest)


class StandaloneManifestIndex(Index):
    """Load a standalone manifest as an Index.

    This class is useful for the situation where you have a directory
    with many signature collections underneath it, and you don't want to load
    every collection each time you run sourmash.

    Instead, you can run 'sourmash sig manifest <directory> -o mf.csv' to
    output a manifest and then use this class to load 'mf.csv' directly.
    Sketch type selection, picklists, and pattern matching will all work
    directly on the manifest and will load signatures only upon demand.

    One feature of this class is that absolute paths to sketches in
    the 'internal_location' field of the manifests will be loaded properly.
    This permits manifests to be constructed for various collections of
    signatures that reside elsewhere, and not just below a single directory
    prefix.

    StandaloneManifestIndex does _not_ store signatures in memory.

    This class overlaps in concept with LazyLoadedIndex and behaves
    identically when a manifest contains only rows from a single
    on-disk Index object. However, unlike LazyLoadedIndex, this class
    can be used to reference multiple on-disk Index objects.

    This class also overlaps in concept with MultiIndex when
    MultiIndex.load_from_pathlist is used to load other Index
    objects. However, this class does not store any signatures in
    memory, unlike MultiIndex.
    """
    is_database = True

    def __init__(self, manifest, location, *, prefix=None):
        """Create object. 'location' is path of manifest file, 'prefix' is
        prepended to signature paths when loading non-abspaths."""
        assert manifest is not None
        self.manifest = manifest
        self._location = location
        self.prefix = prefix

    @classmethod
    def load(cls, location, *, prefix=None):
        """Load manifest file from given location.

        If prefix is None (default), it is automatically set from dirname.
        Set prefix='' to avoid this, or provide an explicit prefix.
        """
        if not os.path.isfile(location):
            raise ValueError(f"provided manifest location '{location}' is not a file")

        with open(location, newline='') as fp:
            m = CollectionManifest.load_from_csv(fp)

        if prefix is None:
            prefix = os.path.dirname(location)

        return cls(m, location, prefix=prefix)

    @property
    def location(self):
        "Return the path to this manifest."
        return self._location

    def signatures_with_location(self):
        "Return an iterator over all signatures and their locations."
        for ss, loc in self._signatures_with_internal():
            yield ss, loc

    def signatures(self):
        "Return an iterator over all signatures."
        for ss, loc in self._signatures_with_internal():
            yield ss

    def _signatures_with_internal(self):
        """Return an iterator over all sigs of (sig, internal_location)

        Note that this is implemented differently from most Index
        objects in that it only lists subselected parts of the
        manifest, and not the original manifest. This was done out of
        convenience: we don't currently have access to the original
        manifest in this class.
        """
        # collect all internal locations
        iloc_to_rows = defaultdict(list)
        for row in self.manifest.rows:
            iloc = row['internal_location']
            iloc_to_rows[iloc].append(row)

        # iterate over internal locations, selecting relevant sigs
        for iloc, iloc_rows in iloc_to_rows.items():
            # prepend with prefix?
            if not iloc.startswith('/') and self.prefix:
                iloc = os.path.join(self.prefix, iloc)

            sub_mf = CollectionManifest(iloc_rows)
            picklist = sub_mf.to_picklist()

            idx = sourmash.load_file_as_index(iloc)
            idx = idx.select(picklist=picklist)
            for ss in idx.signatures():
                yield ss, iloc

    def __len__(self):
        "Number of signatures in this manifest (after any select)."
        return len(self.manifest)

    def __bool__(self):
        "Is this manifest empty?"
        return bool(self.manifest)

    def save(self, *args):
        raise NotImplementedError

    def insert(self, *args):
        raise NotImplementedError

    def select(self, **kwargs):
        "Run 'select' on the manifest."
        new_manifest = self.manifest.select_to_manifest(**kwargs)
        return StandaloneManifestIndex(new_manifest, self._location,
                                       prefix=self.prefix)
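
The grouping and prefix handling in `_signatures_with_internal` can be illustrated without sourmash. The following is a minimal standard-library sketch, not the class's actual code path: `group_by_location` is a hypothetical helper, the rows are plain dicts rather than a `CollectionManifest`, POSIX-style paths are assumed, and paths are resolved before grouping (equivalent here because there is a single prefix):

```python
import posixpath
from collections import defaultdict


def group_by_location(rows, prefix=None):
    """Group manifest rows by their resolved 'internal_location'.

    Hypothetical helper for illustration only. Relative locations are
    joined to 'prefix' (if one is given); absolute ones are left alone.
    """
    grouped = defaultdict(list)
    for row in rows:
        iloc = row['internal_location']
        if prefix and not posixpath.isabs(iloc):
            iloc = posixpath.join(prefix, iloc)
        grouped[iloc].append(row)
    return dict(grouped)


# made-up rows; only the two columns used below are included
rows = [
    {'internal_location': 'all.sbt.zip', 'md5short': '684aa226'},
    {'internal_location': 'all.sbt.zip', 'md5short': '7f7835d2'},
    {'internal_location': '/abs/genome-s10.fa.gz.sig', 'md5short': '684aa226'},
]

grouped = group_by_location(rows, prefix='tests/test-data/scaled')
for location, location_rows in grouped.items():
    print(location, [r['md5short'] for r in location_rows])
```

Grouping first means each on-disk collection would only need to be opened once per iteration, however many of its rows survive selection.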
4 changes: 4 additions & 0 deletions src/sourmash/lca/lca_db.py
Expand Up @@ -58,6 +58,10 @@ class LCA_Database(Index):
"""
is_database = True

# we set manifest to None to avoid implication of fast on-disk access to
# sketches. This may be revisited later.
manifest = None

def __init__(self, ksize, scaled, moltype='DNA'):
self.ksize = int(ksize)
self.scaled = int(scaled)
Expand Down
7 changes: 7 additions & 0 deletions src/sourmash/sourmash_args.py
Expand Up @@ -364,6 +364,12 @@ def _load_stdin(filename, **kwargs):
return db


def _load_standalone_manifest(filename, **kwargs):
    from sourmash.index import StandaloneManifestIndex
    idx = StandaloneManifestIndex.load(filename)
    return idx


def _multiindex_load_from_pathlist(filename, **kwargs):
"Load collection from a list of signature/database files"
db = MultiIndex.load_from_pathlist(filename)
Expand Down Expand Up @@ -416,6 +422,7 @@ def _load_zipfile(filename, **kwargs):
# all loader functions, in order.
_loader_functions = [
("load from stdin", _load_stdin),
("load from standalone manifest", _load_standalone_manifest),
("load from path (file or directory)", _multiindex_load_from_path),
("load from file list", _multiindex_load_from_pathlist),
("load SBT", _load_sbt),
Expand Down
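
An ordered list of `(description, function)` pairs like `_loader_functions` suggests a "try each loader in turn" dispatch. The code that consumes the list is not shown in this diff, so the following is only a sketch of that general pattern under stated assumptions: `load_with_loaders`, `load_manifest`, and `load_zip` are hypothetical names, and catching `ValueError` is an assumption rather than sourmash's documented behavior.

```python
def load_with_loaders(filename, loaders):
    """Try each (description, function) pair in order; return the first success.

    Hypothetical dispatcher: a loader signals "not my format" by raising
    ValueError, and the next loader in the list is tried.
    """
    errors = []
    for description, loader in loaders:
        try:
            return loader(filename)
        except ValueError as exc:
            errors.append(f"{description}: {exc}")
    raise ValueError(f"could not load '{filename}':\n" + "\n".join(errors))


# toy loaders that decide based on the file extension alone
def load_manifest(filename):
    if not filename.endswith('.csv'):
        raise ValueError("not a manifest")
    return ('manifest', filename)


def load_zip(filename):
    if not filename.endswith('.zip'):
        raise ValueError("not a zip file")
    return ('zip', filename)


loaders = [
    ("load from standalone manifest", load_manifest),
    ("load from zipfile", load_zip),
]

print(load_with_loaders('db.zip', loaders))
```

With this pattern, list order matters: placing the standalone manifest loader early means a manifest CSV is recognized before later loaders get a chance to interpret the same file, for example as a pathlist.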
17 changes: 17 additions & 0 deletions tests/test-data/scaled/mf.csv
@@ -0,0 +1,17 @@
# SOURMASH-MANIFEST-VERSION: 1.0
internal_location,md5,md5short,ksize,moltype,num,scaled,n_hashes,with_abundance,name,filename
all.lca.json,455c2f95f2d0a95e176870659119f170,455c2f95,31,DNA,0,10000,93,0,,
all.lca.json,684aa226f843eaa7e1e40fc5603d5f2a,684aa226,31,DNA,0,10000,48,0,,
all.lca.json,7f7835d2dd27ba703e843eee4757f3c2,7f7835d2,31,DNA,0,10000,8,0,,
all.lca.json,7ffcfaa4027d4153a991b6bd78cf39fe,7ffcfaa4,31,DNA,0,10000,45,0,,
all.lca.json,d84ef28f610b1783f801734699cf7e40,d84ef28f,31,DNA,0,10000,45,0,,
genome-s10+s11.fa.gz.sig,455c2f95f2d0a95e176870659119f170,455c2f95,31,DNA,0,10000,93,0,,../genome-s10+s11.fa.gz
genome-s11.fa.gz.sig,7ffcfaa4027d4153a991b6bd78cf39fe,7ffcfaa4,31,DNA,0,10000,45,0,,../genome-s11.fa.gz
all.sbt.zip,684aa226f843eaa7e1e40fc5603d5f2a,684aa226,31,DNA,0,10000,48,0,,../genome-s10.fa.gz
all.sbt.zip,7f7835d2dd27ba703e843eee4757f3c2,7f7835d2,31,DNA,0,10000,8,0,,../genome-s10-small.fa.gz
all.sbt.zip,7ffcfaa4027d4153a991b6bd78cf39fe,7ffcfaa4,31,DNA,0,10000,45,0,,../genome-s11.fa.gz
all.sbt.zip,455c2f95f2d0a95e176870659119f170,455c2f95,31,DNA,0,10000,93,0,,../genome-s10+s11.fa.gz
all.sbt.zip,d84ef28f610b1783f801734699cf7e40,d84ef28f,31,DNA,0,10000,45,0,,../genome-s12.fa.gz
genome-s10-small.fa.gz.sig,7f7835d2dd27ba703e843eee4757f3c2,7f7835d2,31,DNA,0,10000,8,0,,../genome-s10-small.fa.gz
genome-s12.fa.gz.sig,d84ef28f610b1783f801734699cf7e40,d84ef28f,31,DNA,0,10000,45,0,,../genome-s12.fa.gz
genome-s10.fa.gz.sig,684aa226f843eaa7e1e40fc5603d5f2a,684aa226,31,DNA,0,10000,48,0,,../genome-s10.fa.gz
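
To make the file layout above concrete, here is a minimal standard-library sketch that reads a manifest of this shape and counts rows per `internal_location`. It is not sourmash's own loader (`CollectionManifest.load_from_csv` plays that role in the diff); `read_manifest_rows` is a hypothetical helper, and the sample text reuses three rows from `mf.csv`:

```python
import csv
import io
from collections import Counter

MANIFEST_TEXT = """\
# SOURMASH-MANIFEST-VERSION: 1.0
internal_location,md5,md5short,ksize,moltype,num,scaled,n_hashes,with_abundance,name,filename
all.lca.json,455c2f95f2d0a95e176870659119f170,455c2f95,31,DNA,0,10000,93,0,,
genome-s11.fa.gz.sig,7ffcfaa4027d4153a991b6bd78cf39fe,7ffcfaa4,31,DNA,0,10000,45,0,,../genome-s11.fa.gz
all.sbt.zip,7ffcfaa4027d4153a991b6bd78cf39fe,7ffcfaa4,31,DNA,0,10000,45,0,,../genome-s11.fa.gz
"""


def read_manifest_rows(fp):
    """Read manifest rows as dicts (hypothetical helper, not sourmash API).

    The first line is the version comment; the second is the CSV header.
    """
    first = fp.readline()
    if not first.startswith('# SOURMASH-MANIFEST-VERSION'):
        raise ValueError("missing manifest version header")
    return list(csv.DictReader(fp))


rows = read_manifest_rows(io.StringIO(MANIFEST_TEXT, newline=''))
per_location = Counter(row['internal_location'] for row in rows)
print(per_location)
```

Note that the same md5 can appear under more than one `internal_location`, as it does for `7ffcfaa4` here; each row refers to one sketch in one on-disk collection.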
7 changes: 7 additions & 0 deletions tests/test-data/scaled/pathlist.txt
@@ -0,0 +1,7 @@
all.lca.json
all.sbt.zip
genome-s10+s11.fa.gz.sig
genome-s10-small.fa.gz.sig
genome-s10.fa.gz.sig
genome-s11.fa.gz.sig
genome-s12.fa.gz.sig
32 changes: 32 additions & 0 deletions tests/test_cmd_signature.py
Expand Up @@ -3376,6 +3376,31 @@ def test_sig_describe_2_exclude_db_pattern(runtmp):
assert line.strip() in out


def test_sig_describe_3_manifest_works(runtmp):
    # test on a manifest with relative paths, in proper location
    mf = utils.get_test_data('scaled/mf.csv')
    runtmp.sourmash('sig', 'describe', mf, '--csv', 'out.csv')

    out = runtmp.last_result.out
    print(out)

    with open(runtmp.output('out.csv'), newline='') as fp:
        r = csv.reader(fp)
        rows = list(r)
        assert len(rows) == 16          # 15 signatures, plus head


def test_sig_describe_3_manifest_fails_when_moved(runtmp):
    # test on a manifest with relative paths, when in wrong place;
    # should fail, because actual signatures cannot be loaded now.
    # note: this tests lazy loading.
    mf = utils.get_test_data('scaled/mf.csv')
    shutil.copyfile(mf, runtmp.output('mf.csv'))

    with pytest.raises(SourmashCommandFailed):
        runtmp.sourmash('sig', 'describe', 'mf.csv')


@utils.in_tempdir
def test_sig_overlap(c):
# get overlap details
Expand Down Expand Up @@ -3566,6 +3591,13 @@ def test_sig_manifest_6_pathlist(runtmp):
    assert '16869d2c8a1d29d1c8e56f5c561e585e' in md5_list
    assert '120d311cc785cc9d0df9dc0646b2b857' in md5_list

    # note: the manifest output for pathlists will contain the locations
    # used in the pathlist. This is required by StandaloneManifestIndex.
    for row in manifest.rows:
        iloc = row['internal_location']
        print(iloc)
        assert iloc.startswith('/'), iloc


def test_sig_manifest_does_not_exist(runtmp):
with pytest.raises(SourmashCommandFailed):
Expand Down