[WIP] add zipfile collection support #1429

Closed
wants to merge 87 commits into from
87 commits
6bacb41
implement a simple ZipFileLinearIndex class
ctb Feb 25, 2021
61245eb
fix load_file_as_signatures
ctb Feb 25, 2021
31b0ca4
Merge branch 'latest' of github.com:dib-lab/sourmash into add/zipfile…
ctb Feb 26, 2021
b22248b
add tests for zipfile searching etc.
ctb Feb 26, 2021
8c9c9ae
add sig describe test for loading from zipfile
ctb Feb 26, 2021
7087d82
fix load_file_as_index to support zipfiles
ctb Feb 26, 2021
c75bd83
rename force; add docstrings
ctb Feb 26, 2021
da3a58c
Merge branch 'latest' into add/zipfile_index
ctb Mar 5, 2021
92e5fdc
add an IndexOfIndexes class
ctb Mar 6, 2021
5c71e11
rename to MultiIndex
ctb Mar 7, 2021
85efdaf
switch to using MultiIndex for loading from a directory
ctb Mar 7, 2021
04f9de1
some more MultiIndex tests
ctb Mar 7, 2021
201a89a
add test of MultiIndex.signatures
ctb Mar 7, 2021
07d2c32
add docstring for MultiIndex
ctb Mar 7, 2021
61d15c3
stop special-casing SIGLISTs
ctb Mar 7, 2021
16f9ee2
fix test to match more informative error message
ctb Mar 7, 2021
c6bf314
switch to using LinearIndex.load for stdin, too
ctb Mar 7, 2021
dd0f3b8
add __len__ to MultiIndex
ctb Mar 8, 2021
9211a74
add check_csv to check for appropriate filename loading info
ctb Mar 8, 2021
75069ff
add comment
ctb Mar 8, 2021
d2294fb
Merge branch 'latest' of github.com:dib-lab/sourmash into add/multi_i…
ctb Mar 9, 2021
9f39623
fix databases load
ctb Mar 9, 2021
ac63cf8
more tests needed
ctb Mar 9, 2021
d5059eb
Merge branch 'latest' into add/multi_index
ctb Mar 9, 2021
3e06dbf
Merge branch 'latest' of github.com:dib-lab/sourmash into add/multi_i…
ctb Mar 9, 2021
5590d70
add tests for incompatible signatures
ctb Mar 9, 2021
14891bd
add filter to LinearIndex and MultiIndex
ctb Mar 9, 2021
40395ff
clean up sourmash_args some more
ctb Mar 9, 2021
8c51452
Merge branch 'latest' of github.com:dib-lab/sourmash into add/multi_i…
ctb Mar 9, 2021
fbf3bb9
Merge branch 'latest' into add/multi_index
ctb Mar 12, 2021
abd84b2
Merge branch 'latest' of github.com:dib-lab/sourmash into add/zipfile…
ctb Mar 12, 2021
dd52be6
Merge branch 'latest' of github.com:dib-lab/sourmash into add/multi_i…
ctb Mar 24, 2021
f377dc4
shift loading over to Index classes
ctb Mar 24, 2021
250c49a
refactor, fix tests
ctb Mar 24, 2021
9a921f9
switch to a list of loader functions
ctb Mar 25, 2021
780fb71
comments, docstrings, and tests passing
ctb Mar 26, 2021
d261963
update to use f strings throughout sourmash_args.py
ctb Mar 26, 2021
4b4174e
Merge branch 'latest' of github.com:dib-lab/sourmash into refactor/db…
ctb Mar 26, 2021
93fca04
add docstrings
ctb Mar 26, 2021
0203357
update comments
ctb Mar 26, 2021
cd53f02
Merge branch 'latest' of github.com:dib-lab/sourmash into refactor/db…
ctb Mar 26, 2021
8a0200a
remove unnecessary changes
ctb Mar 26, 2021
e9df90f
revert to original test
ctb Mar 26, 2021
9e427e3
remove unneeded comment
ctb Mar 26, 2021
0dd390a
clean up a bit
ctb Mar 26, 2021
2c0ee29
debugging update
ctb Mar 27, 2021
edcb483
better exception raising and capture for signature parsing
ctb Mar 27, 2021
3f6c3f2
more specific error message
ctb Mar 27, 2021
78dbb1d
revert change in favor of creating new issue
ctb Mar 27, 2021
229b1d7
add commentary => TODO
ctb Mar 28, 2021
20ed9f0
add tests for MultiIndex.load_from_directory; fix traverse code
ctb Mar 28, 2021
16a119e
switch lca summarize over to usig MultiIndex
ctb Mar 28, 2021
cb1e8a3
switch to using MultiIndex in categorize
ctb Mar 28, 2021
c9e176d
remove LoadSingleSignatures
ctb Mar 28, 2021
8f914f1
test errors in lca database loading
ctb Mar 28, 2021
a43b011
remove unneeded categorize code
ctb Mar 28, 2021
15328ae
add testme info
ctb Mar 28, 2021
f674232
verified that this was tested
ctb Mar 28, 2021
01c54c0
remove testme comments
ctb Mar 28, 2021
ae3f66d
add tests for MultiIndex.load_from_file_list
ctb Mar 28, 2021
7f52d7c
refactor select, add scaled/num/abund
ctb Mar 28, 2021
dde14fd
more work
ctb Mar 28, 2021
3f498a4
catch ValueError from db.select
ctb Mar 29, 2021
df19926
update debug print to sys.stder
ctb Mar 29, 2021
e8233ca
fix scaled check for LCA database
ctb Mar 29, 2021
b44c3cf
add debug_literal
ctb Mar 29, 2021
7133ac1
break things when filter returns empty Index
ctb Mar 29, 2021
f5f1c9c
fix scaled check for SBT
ctb Mar 29, 2021
d6f156f
fix a few tests
ctb Mar 30, 2021
785a9a4
fix LCA database ksize message & test
ctb Mar 30, 2021
23d7ac4
flag for removal
ctb Mar 30, 2021
efc07cd
Merge branch 'latest' of github.com:dib-lab/sourmash into refactor/si…
ctb Mar 30, 2021
12399e7
add 'containment' to 'select'
ctb Mar 31, 2021
b6a4dff
Merge branch 'latest' into refactor/db_load_multiindex
ctb Mar 31, 2021
2b7acb9
fix remaining tests
ctb Mar 31, 2021
f663426
Merge branch 'refactor/db_load_multiindex' into refactor/siglist_loading
ctb Mar 31, 2021
9aae1cb
update comments
ctb Mar 31, 2021
2630be2
remove all the cruft, yay
ctb Mar 31, 2021
4f1a7fe
added 'is_database' flag for nicer UX
ctb Mar 31, 2021
736ddf3
remove overly broad exception catching
ctb Mar 31, 2021
16719ce
add docstrings
ctb Mar 31, 2021
6d8663e
document downsampling foo
ctb Mar 31, 2021
9832810
Merge branch 'latest' of github.com:dib-lab/sourmash into add/zipfile…
ctb Apr 1, 2021
4854325
Merge branch 'refactor/siglist_loading' into merge
ctb Apr 1, 2021
c4de8fb
update for additional test files
ctb Apr 1, 2021
31194bf
update ZipFileLinearIndex for new selector criteria
ctb Apr 1, 2021
be502ab
remove leftover code fragment
ctb Apr 1, 2021
32 changes: 16 additions & 16 deletions src/sourmash/commands.py
@@ -519,6 +519,8 @@ def search(args):

def categorize(args):
"Use a database to find the best match to many signatures."
from .index import MultiIndex

set_quiet(args.quiet)
moltype = sourmash_args.calculate_moltype(args)

@@ -533,24 +535,27 @@ def categorize(args):
# load search database
tree = load_sbt_index(args.sbt_name)

# load query filenames
inp_files = set(sourmash_args.traverse_find_sigs(args.queries))
inp_files = inp_files - already_names

notify('found {} files to query', len(inp_files))

loader = sourmash_args.LoadSingleSignatures(inp_files,
args.ksize, moltype)
# utility function to load & select relevant signatures.
def _yield_all_sigs(queries, ksize, moltype):
for filename in queries:
mi = MultiIndex.load_from_path(filename, False)
mi = mi.select(ksize=ksize, moltype=moltype)
for ss, loc in mi.signatures_with_location():
yield ss, loc

csv_w = None
csv_fp = None
if args.csv:
csv_fp = open(args.csv, 'w', newline='')
csv_w = csv.writer(csv_fp)

for queryfile, query, query_moltype, query_ksize in loader:
for query, loc in _yield_all_sigs(args.queries, args.ksize, moltype):
# skip if we've already done signatures from this file.
if loc in already_names:
continue

notify('loaded query: {}... (k={}, {})', str(query)[:30],
query_ksize, query_moltype)
query.minhash.ksize, query.minhash.moltype)

results = []
search_fn = SearchMinHashesFindBest().search
@@ -575,14 +580,9 @@ def categorize(args):
notify('for {}, no match found', query)

if csv_w:
csv_w.writerow([queryfile, query, best_hit_query_name,
csv_w.writerow([loc, query, best_hit_query_name,
best_hit_sim])

if loader.skipped_ignore:
notify('skipped/ignore: {}', loader.skipped_ignore)
if loader.skipped_nosig:
notify('skipped/nosig: {}', loader.skipped_nosig)

if csv_fp:
csv_fp.close()
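
The change to `categorize` above replaces the `LoadSingleSignatures` loader with a generator that pairs each signature with the location it came from, and then skips locations that were already processed. Note that the old code subtracted `already_names` before loading, whereas the new loop loads first and skips afterwards. A minimal sketch of that pattern follows; the `sources` structure, the `unprocessed` helper, and the string "signatures" are hypothetical stand-ins, not sourmash objects.

```python
def signatures_with_location(sources):
    """Yield (item, location) pairs from a list of (location, items) tuples."""
    for location, items in sources:
        for item in items:
            yield item, location


def unprocessed(sources, already_done):
    """Skip every item whose source location was handled in an earlier run."""
    return [(loc, item)
            for item, loc in signatures_with_location(sources)
            if loc not in already_done]


# Hypothetical data: two files, one of which was categorized previously.
sources = [("a.sig", ["sigA1", "sigA2"]), ("b.sig", ["sigB1"])]
todo = unprocessed(sources, already_done={"a.sig"})
```

Here `todo` holds only the entries from `b.sig`, which mirrors the `if loc in already_names: continue` check in the diff.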

182 changes: 170 additions & 12 deletions src/sourmash/index.py
@@ -1,10 +1,15 @@
"An Abstract Base Class for collections of signatures."

import sourmash
from abc import abstractmethod, ABC
from collections import namedtuple
import zipfile
import os


class Index(ABC):
is_database = False

@abstractmethod
def signatures(self):
"Return an iterator over all signatures in the Index object."
@@ -122,8 +127,55 @@ def gather(self, query, *args, **kwargs):
return results

@abstractmethod
def select(self, ksize=None, moltype=None):
""
def select(self, ksize=None, moltype=None, scaled=None, num=None,
abund=None, containment=None):
"""Return Index containing only signatures that match requirements.

Current arguments can be any or all of:
* ksize
* moltype
* scaled
* num
* containment

'select' will raise ValueError if the requirements are incompatible
with the Index subclass.

'select' may return an empty object or None if no matches can be
found.
"""


def select_signature(ss, ksize=None, moltype=None, scaled=0, num=0,
containment=False):
"Check that the given signature matches the specified requirements."
# ksize match?
if ksize and ksize != ss.minhash.ksize:
return False

# moltype match?
if moltype and moltype != ss.minhash.moltype:
return False

# containment requires scaled; similarity does not.
if containment:
if not scaled:
raise ValueError("'containment' requires 'scaled' in Index.select")
if not ss.minhash.scaled:
return False

# 'scaled' and 'num' are incompatible
if scaled:
if ss.minhash.num:
return False
if num:
# note, here we check if 'num' is identical; this can be
# changed later.
if ss.minhash.scaled or num != ss.minhash.num:
return False

return True
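
To make the selection rules above easier to review, here is a self-contained sketch that reproduces the logic of `select_signature` against stand-in objects. `FakeMinHash` and `FakeSig` are hypothetical namedtuples invented for this illustration; real sourmash signatures carry more state.

```python
from collections import namedtuple

# Hypothetical stand-ins for sourmash's signature and MinHash objects.
FakeMinHash = namedtuple("FakeMinHash", "ksize moltype scaled num")
FakeSig = namedtuple("FakeSig", "name minhash")


def select_signature(ss, ksize=None, moltype=None, scaled=0, num=0,
                     containment=False):
    "Same rules as the function in the diff, restated for a runnable demo."
    if ksize and ksize != ss.minhash.ksize:
        return False
    if moltype and moltype != ss.minhash.moltype:
        return False
    if containment:
        # containment searches only make sense on scaled sketches
        if not scaled:
            raise ValueError("'containment' requires 'scaled'")
        if not ss.minhash.scaled:
            return False
    if scaled and ss.minhash.num:
        return False          # asked for scaled, got a num sketch
    if num and (ss.minhash.scaled or num != ss.minhash.num):
        return False          # num must match exactly
    return True


scaled_sig = FakeSig("a", FakeMinHash(31, "DNA", 1000, 0))
num_sig = FakeSig("b", FakeMinHash(31, "DNA", 0, 500))

results = {
    "ksize_match": select_signature(scaled_sig, ksize=31),
    "ksize_mismatch": select_signature(scaled_sig, ksize=21),
    "scaled_rejects_num": select_signature(num_sig, scaled=1000),
    "num_exact": select_signature(num_sig, num=500),
    "num_differs": select_signature(num_sig, num=1000),
}
```

Under these rules a `scaled` request only checks that the sketch is not a `num` sketch; it does not compare scaled values, which matches the diff as written.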


class LinearIndex(Index):
"An Index for a collection of signatures. Can load from a .sig file."
@@ -155,23 +207,76 @@ def load(cls, location):
lidx = LinearIndex(si, filename=location)
return lidx

def select(self, ksize=None, moltype=None):
def select_sigs(ss, ksize=ksize, moltype=moltype):
if (ksize is None or ss.minhash.ksize == ksize) and \
(moltype is None or ss.minhash.moltype == moltype):
return True
def select(self, **kwargs):
"""Return new LinearIndex containing only signatures that match req's.

return self.filter(select_sigs)
Does not raise ValueError, but may return an empty Index.
"""
# eliminate things from kwargs with None or zero value
kw = { k : v for (k, v) in kwargs.items() if v }

def filter(self, filter_fn):
siglist = []
for ss in self._signatures:
if filter_fn(ss):
if select_signature(ss, **kw):
siglist.append(ss)

return LinearIndex(siglist, self.filename)


class ZipFileLinearIndex(Index):
"""\
A read-only collection of signatures in a zip file.

Does not support `insert` or `save`.
"""
is_database = True

def __init__(self, zf, selection_dict=None,
traverse_yield_all=False):
self.zf = zf
self.selection_dict = selection_dict
self.traverse_yield_all = traverse_yield_all

@property
def filename(self):
return self.zf.filename

def insert(self, signature):
raise NotImplementedError

def save(self, path):
raise NotImplementedError

@classmethod
def load(cls, location, traverse_yield_all=False):
"Class method to load a zipfile."
zf = zipfile.ZipFile(location, 'r')
return cls(zf, traverse_yield_all=traverse_yield_all)

def signatures(self):
"Load all signatures in the zip file."
from .signature import load_signatures
for zipinfo in self.zf.infolist():
# should we load this file? if it ends in .sig OR we are forcing:
if zipinfo.filename.endswith('.sig') or self.traverse_yield_all:
fp = self.zf.open(zipinfo)

# now load all the signatures and select on ksize/moltype:
selection_dict = self.selection_dict
for ss in load_signatures(fp):
if selection_dict:
if select_signature(ss, **self.selection_dict):
yield ss
else:
yield ss

def select(self, **kwargs):
"Select signatures in zip file based on ksize/moltype."
return ZipFileLinearIndex(self.zf,
selection_dict=kwargs,
traverse_yield_all=self.traverse_yield_all)
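
`ZipFileLinearIndex.select` does no filtering itself: it returns a new wrapper around the same open zip file and defers the actual matching until `signatures()` is iterated. The sketch below illustrates that lazy pattern with the standard-library `zipfile` module. `LazyZipCollection`, the JSON record layout, and the equality-based matching are all assumptions made for this example, not sourmash's format. One deliberate difference: the sketch merges criteria across chained `select` calls, whereas the diff passes only the latest `kwargs` as `selection_dict`.

```python
import io
import json
import zipfile


class LazyZipCollection:
    "Hypothetical read-only collection; records are JSON dicts, not signatures."

    def __init__(self, zf, selection=None):
        self.zf = zf
        self.selection = selection or {}

    def select(self, **kwargs):
        # Cheap: no file is read here, only the criteria are recorded.
        merged = {**self.selection, **kwargs}
        return LazyZipCollection(self.zf, selection=merged)

    def records(self):
        # The zip members are opened and filtered only on iteration.
        for info in self.zf.infolist():
            if not info.filename.endswith(".sig"):
                continue
            with self.zf.open(info) as fp:
                for rec in json.load(fp):
                    if all(rec.get(k) == v for k, v in self.selection.items()):
                        yield rec


# Build a small in-memory zip file to search.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("a.sig", json.dumps([{"name": "a", "ksize": 31, "moltype": "DNA"}]))
    zf.writestr("b.sig", json.dumps([{"name": "b", "ksize": 21, "moltype": "DNA"}]))
    zf.writestr("README.txt", "not a signature file")

coll = LazyZipCollection(zipfile.ZipFile(buf, "r"))
picked = coll.select(ksize=31).select(moltype="DNA")
names = [rec["name"] for rec in picked.records()]
```

Because every derived object shares one `ZipFile` handle, selection stays cheap, at the cost of re-reading the archive on each iteration.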


class MultiIndex(Index):
"""An Index class that wraps other Index classes.

@@ -193,6 +298,11 @@ def signatures(self):
for ss in idx.signatures():
yield ss

def signatures_with_location(self):
for idx, loc in zip(self.index_list, self.source_list):
for ss in idx.signatures():
yield ss, loc

def __len__(self):
return sum([ len(idx) for idx in self.index_list ])

@@ -203,14 +313,62 @@ def insert(self, *args):
def load(self, *args):
raise NotImplementedError

@classmethod
def load_from_path(cls, pathname, force=False):
"Create a MultiIndex from a path (filename or directory)."
from .sourmash_args import traverse_find_sigs
if not os.path.exists(pathname):
raise ValueError(f"'{pathname}' must be a directory")

index_list = []
source_list = []
for thisfile in traverse_find_sigs([pathname], yield_all_files=force):
try:
idx = LinearIndex.load(thisfile)
index_list.append(idx)
source_list.append(thisfile)
except (IOError, sourmash.exceptions.SourmashError):
if force:
continue # ignore error
else:
raise # re-raise: do not continue past the error

db = None
if index_list:
db = cls(index_list, source_list)
else:
raise ValueError(f"no signatures to load under directory '{pathname}'")

return db
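
The error handling in `load_from_path` keeps two parallel lists (indices and their sources) and uses `force` to decide whether a file that fails to load is skipped or aborts the whole load. A minimal sketch of that control flow, assuming a hypothetical `loader` callable and a toy `toy_loader` in place of `LinearIndex.load`:

```python
def load_many(paths, loader, force=False):
    "Load every path; with force=True, skip files that fail to load."
    loaded, sources = [], []
    for path in paths:
        try:
            item = loader(path)
        except (OSError, ValueError):
            if force:
                continue      # ignore the unreadable file
            raise             # strict mode: propagate the error
        # Append to both lists together so they stay aligned.
        loaded.append(item)
        sources.append(path)
    if not loaded:
        raise ValueError("nothing could be loaded")
    return loaded, sources


def toy_loader(path):
    "Hypothetical loader: accepts only names ending in .sig."
    if path.endswith(".sig"):
        return f"index:{path}"
    raise ValueError(f"cannot parse {path}")


loaded, sources = load_many(["a.sig", "notes.txt", "b.sig"],
                            toy_loader, force=True)
```

With `force=False` the same call raises on `notes.txt`, which is the behavior the `raise` branch in the diff preserves.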

@classmethod
def load_from_file_list(cls, filename):
"Create a MultiIndex from all files listed in a text file."
from .sourmash_args import (load_file_list_of_signatures,
load_file_as_index)
idx_list = []
src_list = []

file_list = load_file_list_of_signatures(filename)
for fname in file_list:
idx = load_file_as_index(fname)
src = fname

idx_list.append(idx)
src_list.append(src)

db = MultiIndex(idx_list, src_list)
return db

def save(self, *args):
raise NotImplementedError

def select(self, ksize=None, moltype=None):
def select(self, **kwargs):
"Run 'select' on all indices within this MultiIndex."
new_idx_list = []
new_src_list = []
for idx, src in zip(self.index_list, self.source_list):
idx = idx.select(ksize=ksize, moltype=moltype)
idx = idx.select(**kwargs)
new_idx_list.append(idx)
new_src_list.append(src)

20 changes: 8 additions & 12 deletions src/sourmash/lca/command_summarize.py
@@ -10,6 +10,7 @@
from ..logging import notify, error, print_results, set_quiet, debug
from . import lca_utils
from .lca_utils import check_files_exist
from sourmash.index import MultiIndex


DEFAULT_THRESHOLD=5
@@ -61,20 +62,15 @@ def load_singletons_and_count(filenames, ksize, scaled, ignore_abundance):
total_count = 0
n = 0

# in order to get the right reporting out of this function, we need
# to do our own traversal to expand the list of filenames, as opposed
# to using load_file_as_signatures(...)
filenames = sourmash_args.traverse_find_sigs(filenames)
filenames = list(filenames)

total_n = len(filenames)

for query_filename in filenames:
for filename in filenames:
n += 1
for query_sig in sourmash_args.load_file_as_signatures(query_filename,
ksize=ksize):
mi = MultiIndex.load_from_path(filename)
mi = mi.select(ksize=ksize)

for query_sig, query_filename in mi.signatures_with_location():
notify(u'\r\033[K', end=u'')
notify('... loading {} (file {} of {})', query_sig, n,
notify(f'... loading {query_sig} (file {n} of {total_n})',
total_n, end='\r')
total_count += 1

@@ -87,7 +83,7 @@ def load_singletons_and_count(filenames, ksize, scaled, ignore_abundance):
yield query_filename, query_sig, hashvals

notify(u'\r\033[K', end=u'')
notify('loaded {} signatures from {} files total.', total_count, n)
notify(f'loaded {total_count} signatures from {n} files total.')


def count_signature(sig, scaled, hashvals):