[EXP] a simple JSON RPC client for sourmash search #1644

Open
wants to merge 119 commits into
base: add/manifest_lazy_sigfile
119 commits
0997834
various cleanups of sourmash_args
ctb Jun 12, 2021
66b0599
cleanup flakes errors
ctb Jun 12, 2021
3a583a9
clean up sourmash.sig submodule
ctb Jun 12, 2021
bb794ec
initial picklist implementation
ctb Jun 12, 2021
3ecfb48
integrate picklists into sourmash sig extract
ctb Jun 12, 2021
505b04f
basic tests for picklist functionality
ctb Jun 12, 2021
74f31f5
track found etc
ctb Jun 12, 2021
b1fc982
add picklists to selectors
ctb Jun 12, 2021
a817843
split pickfile out a little bit
ctb Jun 12, 2021
def1933
split column_type out of SignaturePicklist a bit
ctb Jun 12, 2021
de6fc06
picklist tests for .signatures() methods on Index classes
ctb Jun 12, 2021
1bdf88e
split pickfile out a little bit
ctb Jun 12, 2021
3c05f95
split column_type out of SignaturePicklist a bit
ctb Jun 12, 2021
03cc61b
Merge branch 'add/picklist' into add/picklist_selectors
ctb Jun 12, 2021
54407a3
test 'Index.find' on picklists for SBTs and LCAs
ctb Jun 12, 2021
a88b66d
factor out picklist checks to 'passes_all_picklists' fn
ctb Jun 13, 2021
b57b2b3
support special picklist interactions with zipfile collections
ctb Jun 13, 2021
e205e64
special case md5 prefixes, for prefetch
ctb Jun 13, 2021
6593a42
try out manifests
ctb Jun 13, 2021
14a5ee1
hacky but functional manifest support
ctb Jun 13, 2021
b2547f3
add missing manifest CLI file
ctb Jun 14, 2021
17b9576
build out a manifest class a bit
ctb Jun 14, 2021
01d33fc
provide 'select' more generically on manifests
ctb Jun 14, 2021
cb8e28d
get started adding manifests to MultiIndex
ctb Jun 14, 2021
2f2269b
work through manifests for MultiIndex
ctb Jun 14, 2021
1d7e0cf
update comment about picklist.found
ctb Jun 14, 2021
67a9be1
more comment
ctb Jun 14, 2021
031522c
Merge branch 'latest' of github.com:dib-lab/sourmash into add/picklist
ctb Jun 14, 2021
5ac4671
Merge branch 'add/picklist' into add/picklist_selectors
ctb Jun 14, 2021
23c1531
Merge branch 'add/picklist_selectors' into add/picklist_zf_manifests
ctb Jun 14, 2021
3c0c9cf
try making manifests obligatory for MultiIndex
ctb Jun 15, 2021
be9ef77
create LoadedCollection to replace MultiIndex non-lazy loading
ctb Jun 15, 2021
915f847
cleanup/simplification of LoadedCollection
ctb Jun 15, 2021
c6cb1af
fix all the tests
ctb Jun 15, 2021
af5eb86
fix test names for new LoadedCollection
ctb Jun 15, 2021
509eb45
remove MultiIndex
ctb Jun 15, 2021
c3b6fc0
more cleanup
ctb Jun 15, 2021
ab0fc0e
misc cleanup
ctb Jun 15, 2021
8a8c3b2
shift signature metadata matching from manifests over to picklist
ctb Jun 15, 2021
230c793
cleanup and simplification of ZipFile stuff
ctb Jun 15, 2021
730a717
more cleanup and docs
ctb Jun 15, 2021
a4057e6
create LazyMultiIndex
ctb Jun 15, 2021
72d8497
move manifest stuff into manifest class
ctb Jun 16, 2021
1dd8170
add manifests to SBTs
ctb Jun 16, 2021
39abe57
CSV output function
ctb Jun 16, 2021
75dc079
Merge branch 'add/picklist_zf_manifests' into add/picklist_manifests_sbt
ctb Jun 16, 2021
c356842
done, I think?
ctb Jun 16, 2021
a7e153a
fix tests
ctb Jun 16, 2021
aaa4548
update comments, constructor, etc.
ctb Jun 16, 2021
9b50748
fix tests :)
ctb Jun 16, 2021
207a813
more picklist tests
ctb Jun 16, 2021
14a88a7
verify output
ctb Jun 16, 2021
3d23d87
add --picklist-require-all &c
ctb Jun 16, 2021
9d60e32
documentation
ctb Jun 16, 2021
8f65f22
test with --md5 selector
ctb Jun 16, 2021
4f8e20c
cover untested code with tests
ctb Jun 16, 2021
14b87d4
trap errors and be nice to users
ctb Jun 16, 2021
04c209c
remove comment
ctb Jun 16, 2021
8e5fb8d
Merge branch 'add/picklist' into add/picklist_selectors
ctb Jun 16, 2021
b8f4bb8
Merge branch 'latest' of github.com:dib-lab/sourmash into add/picklis…
ctb Jun 16, 2021
21ce4b7
fix tests for new SignaturePicklist
ctb Jun 16, 2021
b3c6bb9
move picklist.py from sourmash.sig into sourmash
ctb Jun 17, 2021
fddf141
move picklist reporting into sourmash_args
ctb Jun 17, 2021
984a557
fix space
ctb Jun 17, 2021
ced72d2
add picklist args throughout, eek.
ctb Jun 17, 2021
7a30b20
add picklists and tests for search, gather, index
ctb Jun 17, 2021
c0e5781
add picklists to prefetch
ctb Jun 17, 2021
a0335a3
add picklists to sourmash compare
ctb Jun 17, 2021
a074127
add picklists to lca index
ctb Jun 17, 2021
ba5c8bc
block multiple picklists on SBTs and LCAs, for now
ctb Jun 17, 2021
bba101c
Merge branch 'add/picklist_selectors' into add/picklist_zf_manifests
ctb Jun 17, 2021
ca6ea4f
add picklist test that checks indexing-and-then-search == index
ctb Jun 17, 2021
c965648
add a test for using prefetch CSV as picklist
ctb Jun 17, 2021
ab286cf
remove debugging print
ctb Jun 17, 2021
4d156e9
add docs
ctb Jun 17, 2021
7937292
Merge branch 'add/picklist_selectors' into add/picklist_zf_manifests
ctb Jun 17, 2021
f697ec4
fix coltypes
ctb Jun 17, 2021
de6f3c4
remove order dependence from test
ctb Jun 17, 2021
122d043
Merge branch 'add/picklist_selectors' into add/picklist_zf_manifests
ctb Jun 17, 2021
54ea3f9
only match picklist at end of 'select'
ctb Jun 17, 2021
8812142
further attempt to fix test
ctb Jun 17, 2021
5cad5ff
Merge branch 'add/picklist_selectors' into add/picklist_zf_manifests
ctb Jun 17, 2021
e1e367a
remove @CTB comments
ctb Jun 17, 2021
9e46ff8
cleanup of comments etc.
ctb Jun 17, 2021
2da0085
Merge branch 'add/picklist_zf_manifests' into add/picklist_manifests_sbt
ctb Jun 17, 2021
d4a9a2e
fix test for manifests
ctb Jun 17, 2021
31018df
Merge branch 'latest' of github.com:dib-lab/sourmash into add/picklis…
ctb Jun 18, 2021
4221fc9
Merge branch 'add/picklist_zf_manifests' into add/picklist_manifests_sbt
ctb Jun 18, 2021
d95813e
add manifest versions
ctb Jun 19, 2021
287cb7b
a lazy signature loading class, using manifests
ctb Jun 20, 2021
e315c90
Merge branch 'latest' of github.com:dib-lab/sourmash into add/picklis…
ctb Jun 22, 2021
e301645
add a test for sig manifest
ctb Jun 22, 2021
c243b0e
add manifest tests
ctb Jun 22, 2021
ba2e53c
Merge branch 'latest' of github.com:dib-lab/sourmash into add/picklis…
ctb Jun 22, 2021
2756e7d
add save/load test
ctb Jun 22, 2021
ed5fb7a
rename matches_siginfo to matches_manifest_row
ctb Jun 22, 2021
71b81ed
add docstring
ctb Jun 22, 2021
c3f1a3d
reverse order of adding to seen set
ctb Jun 22, 2021
7486871
Merge branch 'add/picklist_zf_manifests' into add/picklist_manifests_sbt
ctb Jun 22, 2021
9bb6a9b
fix header writing
ctb Jun 22, 2021
6ebec9c
Merge branch 'add/picklist_zf_manifests' into add/manifest_lazy_sigfile
ctb Jun 22, 2021
60a6eec
change LoadedCollection back over to MultiIndex; remove LazyMultiIndex
ctb Jun 22, 2021
fe83b68
revert collection to multiindex
ctb Jun 23, 2021
83e387e
Merge branch 'add/picklist_manifests_sbt' into add/picklist_zf_manifests
ctb Jun 23, 2021
0adee52
remove print
ctb Jun 23, 2021
99199ee
move manifest stuff to manifest.py
ctb Jun 23, 2021
0a982cb
Merge branch 'add/picklist_manifests_sbt' into add/manifest_lazy_sigfile
ctb Jun 23, 2021
da381a1
Merge branch 'add/picklist_zf_manifests' into add/manifest_lazy_sigfile
ctb Jun 23, 2021
6e17c1a
re-add LazyMultiIndex here
ctb Jun 23, 2021
b633db1
Merge branch 'latest' of https://github.com/sourmash-bio/sourmash int…
ctb Jun 24, 2021
a60438e
fix manifest code
ctb Jun 24, 2021
cc44288
Merge branch 'latest' of https://github.com/sourmash-bio/sourmash int…
ctb Jun 26, 2021
75ba72b
upgrade debugging output on _load_database
ctb Jun 26, 2021
02c485d
restore LazyLoadedSigfile, add comments/docstrings
ctb Jun 26, 2021
61c1b24
doc strings, comments, etc
ctb Jun 27, 2021
cc3b643
some basic tests of the lazy loading indices
ctb Jun 27, 2021
3a25095
change from sigfile to lca file for no manifests :)
ctb Jun 27, 2021
cb1594f
add manifest_of_manifests
ctb Jun 27, 2021
7718240
support a simple JSON RPC mechanism for databases
ctb Jun 27, 2021
234 changes: 228 additions & 6 deletions src/sourmash/index.py
@@ -1,4 +1,42 @@
"An Abstract Base Class for collections of signatures."
"""An Abstract Base Class for collections of signatures, plus implementations.

APIs and functionality
----------------------

Index classes support three sets of API functionality -

'select(...)', which selects subsets of signatures based on ksize, moltype,
and other criteria, including picklists.

'find(...)', and the 'search', 'gather', and 'counter_gather' implementations
built on top of 'find', which search for signatures that match a query.

'signatures()', which yields all signatures in the Index subject to the
selection criteria.

Classes defined in this file
----------------------------

Index - abstract base class for all Index objects.

LinearIndex - simple in-memory storage of signatures.

LazyLinearIndex - lazy selection and linear search of signatures.

ZipFileLinearIndex - simple on-disk storage of signatures.

MultiIndex - in-memory storage and selection of signatures from multiple
index objects, using manifests.

LazyLoadedIndex - lazy-loading wrapper for on-disk indices, using
manifests. Signatures are kept on disk until requested; only manifests are
retained in memory, with no open file handles or loaded signatures.

LazyMultiIndex - lazy-loading wrapper for many on-disk indices.
Signatures are kept on disk until requested.

CounterGather - an ancillary class returned by the 'counter_gather()' method.
"""

import os
import sourmash
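The three API families named in the module docstring ('select', 'find', 'signatures') can be illustrated with a toy index. This is a minimal sketch and not sourmash's implementation: `ToySig` and `ToyIndex` are hypothetical names, signatures are reduced to a set of integer hashes, and 'find' is simplified to a linear Jaccard search with a threshold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToySig:
    # Hypothetical stand-in for a signature; not sourmash's SourmashSignature.
    name: str
    ksize: int
    moltype: str
    hashes: frozenset


class ToyIndex:
    def __init__(self, sigs):
        self._sigs = list(sigs)

    def signatures(self):
        "Yield every signature currently in the index."
        yield from self._sigs

    def select(self, *, ksize=None, moltype=None):
        "Return a new index restricted to the matching signatures."
        keep = [s for s in self._sigs
                if (ksize is None or s.ksize == ksize)
                and (moltype is None or s.moltype == moltype)]
        return ToyIndex(keep)

    def find(self, query, threshold):
        "Linear search: yield (score, sig) where Jaccard >= threshold."
        for s in self.signatures():
            union = len(query.hashes | s.hashes)
            score = len(query.hashes & s.hashes) / union if union else 0.0
            if score >= threshold:
                yield score, s


idx = ToyIndex([
    ToySig("a", 31, "DNA", frozenset({1, 2, 3, 4})),
    ToySig("b", 31, "DNA", frozenset({3, 4, 5, 6})),
    ToySig("c", 21, "DNA", frozenset({1, 2, 3, 4})),
])
query = ToySig("q", 31, "DNA", frozenset({1, 2, 3, 4}))

# select narrows the collection first; find then searches what remains.
hits = list(idx.select(ksize=31).find(query, threshold=0.5))
```

Here 'select' returns a new index rather than mutating the old one, which matches the pattern the classes in this diff follow when they return a new object from 'select'.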
Expand All @@ -17,6 +55,7 @@

class Index(ABC):
is_database = False
manifest = None

@property
def location(self):
Expand Down Expand Up @@ -390,9 +429,20 @@ def select(self, **kwargs):
class LazyLinearIndex(Index):
"""An Index for lazy linear search of another database.

The defining feature of this class is that 'find' is inherited
from the base Index class, which does a linear search with
signatures().
One of the main purposes of this class is to _force_ linear 'find'
on index objects. So if this class wraps an SBT, for example, the
SBT find method will be overridden with the linear 'find' from the
base class. There are very few situations where this is an improvement,
so use this class wisely!

A few notes:
* selection criteria defined by 'select' are only executed when
signatures are actually requested (hence, 'lazy').
* this class stores the provided index 'db' in memory. If you need
a class that does lazy loading of signatures from disk and does not
store signatures in memory, see LazyLoadedIndex.
* if you want efficient in-memory manifest-based selection, consider
LazyMultiIndex.
"""

def __init__(self, db, selection_dict={}):
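The "lazy" behaviour described in the docstring (selection criteria are recorded by 'select' and only applied when signatures are requested) can be sketched roughly as follows. `LazySelect` is a hypothetical class, signatures are plain dicts, and matching is simple equality; none of this is taken from the sourmash code.

```python
class LazySelect:
    "Hypothetical wrapper: records criteria now, filters on iteration."

    def __init__(self, db, selection_dict=None):
        self.db = db
        # Copy the criteria so no dict is shared between instances.
        self.selection_dict = dict(selection_dict or {})

    def select(self, **kwargs):
        # No filtering happens here; the criteria are only merged.
        merged = dict(self.selection_dict)
        merged.update(kwargs)
        return LazySelect(self.db, merged)

    def signatures(self):
        # Criteria are applied only now, when signatures are requested.
        for sig in self.db:
            if all(sig.get(k) == v for k, v in self.selection_dict.items()):
                yield sig


db = [
    {"name": "a", "ksize": 31, "moltype": "DNA"},
    {"name": "b", "ksize": 21, "moltype": "DNA"},
    {"name": "c", "ksize": 31, "moltype": "protein"},
]
lazy = LazySelect(db).select(ksize=31).select(moltype="DNA")
names = [sig["name"] for sig in lazy.signatures()]
```

The sketch uses `None` as the default for `selection_dict` and copies the dict on construction, a common way to avoid accidentally sharing one mutable default across instances.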
Expand Down Expand Up @@ -474,7 +524,7 @@ def __init__(self, zf, *, selection_dict=None,
if self.manifest is not None:
assert not self.selection_dict, self.selection_dict
if self.selection_dict:
assert manifest is None
assert self.manifest is None

def _load_manifest(self):
"Load a manifest if one exists"
Expand Down Expand Up @@ -783,13 +833,17 @@ def signatures(self):
yield row['signature']

def signatures_with_location(self):
for row in self.manifest.rows:
yield row['signature'], row['internal_location']

def _signatures_with_internal(self):
"""Return an iterator of tuples (ss, location)

CTB note: here, 'internal_location' is the source file for the
index. This is a special feature of this (in memory) class.
"""
for row in self.manifest.rows:
yield row['signature'], row['internal_location']
yield row['signature'], "", row['internal_location']

def __len__(self):
return len(self.manifest)
Expand Down Expand Up @@ -884,3 +938,171 @@ def select(self, **kwargs):
"Run 'select' on the manifest."
new_manifest = self.manifest.select_to_manifest(**kwargs)
return MultiIndex(new_manifest)


class LazyLoadedIndex(Index):
"""Given an index location and a manifest, do select only on the manifest
until signatures are actually requested, and only then load the index.

This class is useful when you have index objects that consume
memory when they are loaded (e.g. JSON signature files, or LCA
databases) and you want to avoid keeping them in memory. The
downside of using this class is that it will load the signatures
from disk every time they are needed (e.g. 'find(...)', 'signatures()').

Can be used with LazyMultiIndex to support many such indices at once.

"""
def __init__(self, filename, manifest):
"Create an Index with given filename and manifest."
self.filename = filename
self.manifest = manifest

@property
def location(self):
"the 'location' attribute for this index will be the filename."
return self.filename

def signatures(self):
"yield all signatures from the manifest."
if not len(self):
# nothing in manifest? done!
return

# ok - something in manifest, let's go get those signatures!
picklist = self.manifest.to_picklist()
idx = sourmash.load_file_as_index(self.location)

# convert remaining manifest into picklist
idx = idx.select(picklist=picklist)

# extract signatures.
for ss in idx.signatures():
yield ss

def __len__(self):
"track index size based on the manifest."
return len(self.manifest)
def __bool__(self):
"an index is true if its manifest is non-empty; must return a bool."
return len(self) > 0

@classmethod
def load(cls, location, *, create_manifest=False):
"Load manifest from given location, and then unload."
idx = sourmash.load_file_as_index(location)
manifest = idx.manifest

# do we need to create the manifest?
if manifest is None:
if create_manifest:
sig_iter = idx._signatures_with_internal()
manifest = CollectionManifest.create_manifest(sig_iter,
include_signature=False)
else:
raise ValueError(f"no manifest on index at {location}")

# NOTE: index is not retained outside this scope, just location.

return cls(location, manifest)

def insert(self, *args):
raise NotImplementedError

def save(self, *args):
raise NotImplementedError

def select(self, **kwargs):
"Run 'select' on manifest, return new object with new manifest."
manifest = self.manifest
new_manifest = manifest.select_to_manifest(**kwargs)

return LazyLoadedIndex(self.filename, new_manifest)
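One point worth checking for any class that derives truthiness from its manifest size: Python requires `__bool__` to return an actual `bool`, so binding `__bool__` directly to `__len__` raises `TypeError` when the object is used in a boolean context. A minimal standalone demonstration, using the hypothetical classes `AliasBool` and `LenOnly` rather than the classes in this diff:

```python
class AliasBool:
    "Hypothetical class that binds __bool__ to __len__."

    def __init__(self, rows):
        self.rows = rows

    def __len__(self):
        return len(self.rows)

    __bool__ = __len__


class LenOnly:
    "Hypothetical class that defines only __len__."

    def __init__(self, rows):
        self.rows = rows

    def __len__(self):
        return len(self.rows)


# __bool__ returns an int here, which Python rejects.
try:
    bool(AliasBool([1, 2]))
    alias_error = None
except TypeError as exc:
    alias_error = exc

# With no __bool__ defined, Python falls back to __len__() != 0.
len_only_results = (bool(LenOnly([1, 2])), bool(LenOnly([])))
```

If a base class already defines its own `__bool__`, an explicit override that returns `len(self) > 0` keeps the manifest-based behaviour without triggering the error.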


class LazyMultiIndex(Index):
"""
Do lazy selection of multiple index objects w/manifests.

Maintains a manifest per collection, and touches the index objects
only when actual signatures are needed. This permits lazy loading
when wrapping Index classes that are on disk, e.g. ZipFileLinearIndex
and SBTs.

Differs from MultiIndex in that only the manifests are held in memory,
not any of the signatures.

CTB: could we get the same functionality from MultiIndex + LazyLoadedIndex?
"""
def __init__(self, index_list, manifest_list):
assert len(index_list) == len(manifest_list)
self.index_list = index_list
self.manifest_list = manifest_list

def signatures(self):
for ss, loc in self.signatures_with_location():
yield ss

def signatures_with_location(self):
for idx, manifest in zip(self.index_list, self.manifest_list):
# convert manifest to picklist:
picklist = manifest.to_picklist()

# select using picklist:
idx_new = idx.select(picklist=picklist)

# yield all remaining signatures:
for ss, loc in idx_new.signatures_with_location():
yield ss, loc

def __len__(self):
return sum(len(m) for m in self.manifest_list)

def insert(self, *args):
raise NotImplementedError

@classmethod
def load(cls, index_list):
"""Create a LazyMultiIndex from a loaded list of index objects.

All index objects must have manifests already.
"""

manifest_list = []
for idx in index_list:
if idx.manifest is None:
raise ValueError(f"no manifest on {repr(idx)}")
manifest_list.append(idx.manifest)

# create obj!
return cls(index_list, manifest_list)

@classmethod
def load_from_pathlist(cls, filename):
"""Create a LazyMultiIndex from all files listed in a text file.

Note, this will not currently work for indices without manifests.
"""
from .sourmash_args import (load_pathlist_from_file,
load_file_as_index)
idx_list = []

file_list = load_pathlist_from_file(filename)
for fname in file_list:
idx = load_file_as_index(fname)
manifest = getattr(idx, 'manifest', None)
if manifest is None:
raise ValueError(f"index at '{fname}' has no manifest")
idx_list.append(idx)

return cls.load(idx_list)

def save(self, *args):
raise NotImplementedError

def select(self, **kwargs):
"Run 'select' on all manifests."
new_manifests = []
for idx, manifest in zip(self.index_list, self.manifest_list):
new_manifest = manifest.select_to_manifest(**kwargs)
new_manifests.append(new_manifest)

return LazyMultiIndex(self.index_list, new_manifests)
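The manifest-per-index idea behind LazyMultiIndex can be sketched with toy classes: 'select' touches only the in-memory manifests, and the backing stores are read only when signatures are requested. `ToyManifest`, `ToyStore` and `ToyLazyMulti` are hypothetical; the `reads` counter exists only to make the laziness observable, and skipping stores whose manifest is empty is a choice of this sketch, not something the diff does.

```python
class ToyManifest:
    "Hypothetical manifest: metadata rows only, no signature data."

    def __init__(self, rows):
        self.rows = list(rows)

    def __len__(self):
        return len(self.rows)

    def select_to_manifest(self, **kwargs):
        keep = [r for r in self.rows
                if all(r[k] == v for k, v in kwargs.items())]
        return ToyManifest(keep)

    def md5s(self):
        return {r["md5"] for r in self.rows}


class ToyStore:
    "Hypothetical on-disk index; counts how often it is read."

    def __init__(self, sigs):
        self._sigs = sigs
        self.reads = 0

    def load(self, md5s):
        self.reads += 1
        return [s for s in self._sigs if s["md5"] in md5s]


class ToyLazyMulti:
    def __init__(self, stores, manifests):
        assert len(stores) == len(manifests)
        self.stores = stores
        self.manifests = manifests

    def __len__(self):
        # Size comes from the manifests alone.
        return sum(len(m) for m in self.manifests)

    def select(self, **kwargs):
        # Only manifests are filtered; no store is read.
        new = [m.select_to_manifest(**kwargs) for m in self.manifests]
        return ToyLazyMulti(self.stores, new)

    def signatures(self):
        for store, manifest in zip(self.stores, self.manifests):
            if len(manifest):  # sketch-only shortcut: skip empty manifests
                yield from store.load(manifest.md5s())


sigs1 = [{"md5": "aa", "ksize": 31}, {"md5": "bb", "ksize": 21}]
sigs2 = [{"md5": "cc", "ksize": 21}]
stores = [ToyStore(sigs1), ToyStore(sigs2)]
manifests = [ToyManifest(sigs1), ToyManifest(sigs2)]

multi = ToyLazyMulti(stores, manifests).select(ksize=31)
reads_after_select = [s.reads for s in stores]
found = [s["md5"] for s in multi.signatures()]
reads_after_iteration = [s.reads for s in stores]
```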
56 changes: 56 additions & 0 deletions src/sourmash/manifest.py
Expand Up @@ -2,6 +2,7 @@
Manifests for collections of signatures.
"""
import csv
from collections import defaultdict

from sourmash.picklist import SignaturePicklist

Expand Down Expand Up @@ -183,3 +184,58 @@ def to_picklist(self):
picklist.pickset = set(self._md5_set)

return picklist


class ManifestOfManifests:
# @CTB rename to MultiManifest?
def __init__(self, locations, manifests):
assert len(locations) == len(manifests)
self.locations = locations
self.manifests = manifests

def __len__(self):
return sum(len(m) for m in self.manifests)

@classmethod
def load_from_sqlite(cls, filename):
import sqlite3
db = sqlite3.connect(filename)
cursor = db.cursor()
cursor.execute('SELECT DISTINCT index_location, internal_location, md5, md5short, ksize, moltype, num, scaled, n_hashes, with_abundance, name, filename FROM manifest')

d = defaultdict(list)
rowkeys = 'internal_location, md5, md5short, ksize, moltype, num, scaled, n_hashes, with_abundance, name, filename'.split(', ')
print(rowkeys)
for result in cursor:
loc, *rest = result
mrow = dict(zip(rowkeys, rest))
d[loc].append(mrow)

locs = []
manifests = []
for loc, value in d.items():
manifest = CollectionManifest(value)
locs.append(loc)
manifests.append(manifest)

return cls(locs, manifests)

def select_to_manifest(self, **kwargs):
new_manifests = []
for m in self.manifests:
m = m.select_to_manifest(**kwargs)
new_manifests.append(m)
return ManifestOfManifests(self.locations, new_manifests)

def locations(self):
raise NotImplementedError

def locations_and_manifests(self):
for (l, m) in zip(self.locations, self.manifests):
yield l, m

def __contains__(self, ss):
for m in self.manifests:
if ss in m:
return True
return False
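The grouping step in 'load_from_sqlite' (one list of manifest rows per 'index_location') can be tried out against an in-memory database. This is a sketch under assumptions: the table below has only four of the columns named in the query above, the sample rows are invented, and the result is left as plain dicts instead of being wrapped in CollectionManifest objects.

```python
import sqlite3
from collections import defaultdict

# Reduced, hypothetical column set; the first column is the grouping key.
COLUMNS = ["index_location", "internal_location", "md5", "ksize"]

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE manifest "
    "(index_location TEXT, internal_location TEXT, md5 TEXT, ksize INTEGER)"
)
db.executemany(
    "INSERT INTO manifest VALUES (?, ?, ?, ?)",
    [
        ("a.zip", "sig1.sig", "aaaa", 31),
        ("a.zip", "sig2.sig", "bbbb", 31),
        ("b.zip", "sig3.sig", "cccc", 21),
        ("a.zip", "sig1.sig", "aaaa", 31),  # duplicate, dropped by DISTINCT
    ],
)

cursor = db.execute(f"SELECT DISTINCT {', '.join(COLUMNS)} FROM manifest")

# Split each result into its location and the remaining row values,
# then collect the rows under their location.
rows_by_location = defaultdict(list)
for loc, *rest in cursor:
    rows_by_location[loc].append(dict(zip(COLUMNS[1:], rest)))
db.close()
```

Deriving both the SELECT column list and the row keys from one `COLUMNS` list keeps the two from drifting apart, which is a risk when they are written out separately.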
5 changes: 3 additions & 2 deletions src/sourmash/search.py
Expand Up @@ -2,13 +2,13 @@
Code for searching collections of signatures.
"""
from collections import namedtuple
from enum import Enum
from enum import IntEnum
import numpy as np

from .signature import SourmashSignature


class SearchType(Enum):
class SearchType(IntEnum):
JACCARD = 1
CONTAINMENT = 2
MAX_CONTAINMENT = 3
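The diff does not state why SearchType moves from Enum to IntEnum. One plausible motivation, given the JSON RPC goal of this PR, is that IntEnum members are integers and so pass through the standard `json` module unchanged, while plain Enum members do not. A small sketch with two hypothetical copies of the enum (`PlainSearchType`, `IntSearchType`) and an invented payload layout:

```python
import json
from enum import Enum, IntEnum


class PlainSearchType(Enum):
    JACCARD = 1
    CONTAINMENT = 2
    MAX_CONTAINMENT = 3


class IntSearchType(IntEnum):
    JACCARD = 1
    CONTAINMENT = 2
    MAX_CONTAINMENT = 3


# An IntEnum member is an int, so json can encode it directly.
payload = json.dumps(
    {"search_type": IntSearchType.CONTAINMENT, "threshold": 0.1}
)

# The receiving side turns the integer back into the enum member.
decoded = IntSearchType(json.loads(payload)["search_type"])

# A plain Enum member is not JSON serializable without a custom encoder.
try:
    json.dumps({"search_type": PlainSearchType.CONTAINMENT})
    plain_error = None
except TypeError as exc:
    plain_error = type(exc).__name__
```

One trade-off of IntEnum is that members compare equal to bare integers, so `IntSearchType.JACCARD == 1` is true, which a plain Enum deliberately avoids.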
Expand Down Expand Up @@ -95,6 +95,7 @@ def __init__(self, search_type, threshold=None):
require_scaled = True
self.score_fn = score_fn
self.require_scaled = require_scaled
self.search_type = search_type

if threshold is None:
threshold = 0