[MRG] refactor & clean up database loading around MultiIndex class #1406

Merged
merged 56 commits into from
Apr 2, 2021
Changes from 52 commits
Commits
92e5fdc
add an IndexOfIndexes class
ctb Mar 6, 2021
5c71e11
rename to MultiIndex
ctb Mar 7, 2021
85efdaf
switch to using MultiIndex for loading from a directory
ctb Mar 7, 2021
04f9de1
some more MultiIndex tests
ctb Mar 7, 2021
201a89a
add test of MultiIndex.signatures
ctb Mar 7, 2021
07d2c32
add docstring for MultiIndex
ctb Mar 7, 2021
61d15c3
stop special-casing SIGLISTs
ctb Mar 7, 2021
16f9ee2
fix test to match more informative error message
ctb Mar 7, 2021
c6bf314
switch to using LinearIndex.load for stdin, too
ctb Mar 7, 2021
dd0f3b8
add __len__ to MultiIndex
ctb Mar 8, 2021
9211a74
add check_csv to check for appropriate filename loading info
ctb Mar 8, 2021
75069ff
add comment
ctb Mar 8, 2021
d2294fb
Merge branch 'latest' of github.com:dib-lab/sourmash into add/multi_i…
ctb Mar 9, 2021
9f39623
fix databases load
ctb Mar 9, 2021
ac63cf8
more tests needed
ctb Mar 9, 2021
d5059eb
Merge branch 'latest' into add/multi_index
ctb Mar 9, 2021
3e06dbf
Merge branch 'latest' of github.com:dib-lab/sourmash into add/multi_i…
ctb Mar 9, 2021
5590d70
add tests for incompatible signatures
ctb Mar 9, 2021
14891bd
add filter to LinearIndex and MultiIndex
ctb Mar 9, 2021
40395ff
clean up sourmash_args some more
ctb Mar 9, 2021
8c51452
Merge branch 'latest' of github.com:dib-lab/sourmash into add/multi_i…
ctb Mar 9, 2021
fbf3bb9
Merge branch 'latest' into add/multi_index
ctb Mar 12, 2021
dd52be6
Merge branch 'latest' of github.com:dib-lab/sourmash into add/multi_i…
ctb Mar 24, 2021
f377dc4
shift loading over to Index classes
ctb Mar 24, 2021
250c49a
refactor, fix tests
ctb Mar 24, 2021
9a921f9
switch to a list of loader functions
ctb Mar 25, 2021
780fb71
comments, docstrings, and tests passing
ctb Mar 26, 2021
d261963
update to use f strings throughout sourmash_args.py
ctb Mar 26, 2021
4b4174e
Merge branch 'latest' of github.com:dib-lab/sourmash into refactor/db…
ctb Mar 26, 2021
93fca04
add docstrings
ctb Mar 26, 2021
0203357
update comments
ctb Mar 26, 2021
cd53f02
Merge branch 'latest' of github.com:dib-lab/sourmash into refactor/db…
ctb Mar 26, 2021
8a0200a
remove unnecessary changes
ctb Mar 26, 2021
e9df90f
revert to original test
ctb Mar 26, 2021
9e427e3
remove unneeded comment
ctb Mar 26, 2021
0dd390a
clean up a bit
ctb Mar 26, 2021
2c0ee29
debugging update
ctb Mar 27, 2021
edcb483
better exception raising and capture for signature parsing
ctb Mar 27, 2021
3f6c3f2
more specific error message
ctb Mar 27, 2021
78dbb1d
revert change in favor of creating new issue
ctb Mar 27, 2021
229b1d7
add commentary => TODO
ctb Mar 28, 2021
20ed9f0
add tests for MultiIndex.load_from_directory; fix traverse code
ctb Mar 28, 2021
16a119e
switch lca summarize over to using MultiIndex
ctb Mar 28, 2021
cb1e8a3
switch to using MultiIndex in categorize
ctb Mar 28, 2021
c9e176d
remove LoadSingleSignatures
ctb Mar 28, 2021
8f914f1
test errors in lca database loading
ctb Mar 28, 2021
a43b011
remove unneeded categorize code
ctb Mar 28, 2021
15328ae
add testme info
ctb Mar 28, 2021
f674232
verified that this was tested
ctb Mar 28, 2021
01c54c0
remove testme comments
ctb Mar 28, 2021
ae3f66d
add tests for MultiIndex.load_from_file_list
ctb Mar 28, 2021
b6a4dff
Merge branch 'latest' into refactor/db_load_multiindex
ctb Mar 31, 2021
e4e20de
Expand signature selection and compatibility checking in database loa…
ctb Apr 1, 2021
65d79b9
Merge branch 'latest' of github.com:dib-lab/sourmash into refactor/db…
ctb Apr 2, 2021
3229b5b
fix file_list -> pathlist
ctb Apr 2, 2021
6d6eb42
fix typo
ctb Apr 2, 2021
32 changes: 16 additions & 16 deletions src/sourmash/commands.py
@@ -519,6 +519,8 @@ def search(args):

 def categorize(args):
     "Use a database to find the best match to many signatures."
+    from .index import MultiIndex
+
     set_quiet(args.quiet)
     moltype = sourmash_args.calculate_moltype(args)

@@ -533,24 +535,27 @@ def categorize(args):
     # load search database
     tree = load_sbt_index(args.sbt_name)

-    # load query filenames
-    inp_files = set(sourmash_args.traverse_find_sigs(args.queries))
-    inp_files = inp_files - already_names
-
-    notify('found {} files to query', len(inp_files))
-
-    loader = sourmash_args.LoadSingleSignatures(inp_files,
-                                                args.ksize, moltype)
+    # utility function to load & select relevant signatures.
+    def _yield_all_sigs(queries, ksize, moltype):
+        for filename in queries:
+            mi = MultiIndex.load_from_path(filename, False)
+            mi = mi.select(ksize=ksize, moltype=moltype)
+            for ss, loc in mi.signatures_with_location():
+                yield ss, loc

     csv_w = None
     csv_fp = None
     if args.csv:
         csv_fp = open(args.csv, 'w', newline='')
         csv_w = csv.writer(csv_fp)

-    for queryfile, query, query_moltype, query_ksize in loader:
+    for query, loc in _yield_all_sigs(args.queries, args.ksize, moltype):
+        # skip if we've already done signatures from this file.
+        if loc in already_names:
+            continue
+
         notify('loaded query: {}... (k={}, {})', str(query)[:30],
-               query_ksize, query_moltype)
+               query.minhash.ksize, query.minhash.moltype)

         results = []
         search_fn = SearchMinHashesFindBest().search
@@ -575,14 +580,9 @@ def categorize(args):
             notify('for {}, no match found', query)

         if csv_w:
-            csv_w.writerow([queryfile, query, best_hit_query_name,
+            csv_w.writerow([loc, query, best_hit_query_name,
                            best_hit_sim])

-    if loader.skipped_ignore:
-        notify('skipped/ignore: {}', loader.skipped_ignore)
-    if loader.skipped_nosig:
-        notify('skipped/nosig: {}', loader.skipped_nosig)
-
     if csv_fp:
         csv_fp.close()
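
The reworked `categorize` loop streams `(signature, location)` pairs and skips any location already handled. Below is a minimal, self-contained sketch of that pattern. It is an illustration only: `Sig`, `yield_all_sigs`, and the `collections` dict are hypothetical stand-ins for sourmash signatures and for the `MultiIndex.load_from_path(...).select(...)` calls shown in the diff, not the project's actual API.

```python
from collections import namedtuple

# Hypothetical stand-in for a signature; real sourmash signatures wrap a MinHash.
Sig = namedtuple("Sig", "name ksize moltype")


def yield_all_sigs(collections, ksize, moltype):
    """Yield (signature, location) pairs matching ksize and moltype.

    `collections` maps a location (for example a filename) to its signatures.
    The inner filter plays the role of the select(ksize=..., moltype=...) call.
    """
    for loc, sigs in collections.items():
        for ss in sigs:
            if ss.ksize == ksize and ss.moltype == moltype:
                yield ss, loc


collections = {
    "a.sig": [Sig("genomeA", 31, "DNA"), Sig("genomeA", 21, "DNA")],
    "b.sig": [Sig("genomeB", 31, "DNA")],
    "c.sig": [Sig("proteomeC", 10, "protein")],
}
already_names = {"b.sig"}  # locations categorized in an earlier run

# Same shape as the loop in the diff: skip on location, then process the query.
processed = [
    (ss.name, loc)
    for ss, loc in yield_all_sigs(collections, ksize=31, moltype="DNA")
    if loc not in already_names
]
print(processed)
```

Because the skip test runs on the location rather than on a pre-expanded file set, every signature from an already-processed file is skipped together.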

54 changes: 54 additions & 0 deletions src/sourmash/index.py
@@ -1,7 +1,9 @@
 "An Abstract Base Class for collections of signatures."

+import sourmash
 from abc import abstractmethod, ABC
 from collections import namedtuple
+import os


 class Index(ABC):
@@ -193,6 +195,11 @@ def signatures(self):
             for ss in idx.signatures():
                 yield ss

+    def signatures_with_location(self):
+        for idx, loc in zip(self.index_list, self.source_list):
+            for ss in idx.signatures():
+                yield ss, loc
+
     def __len__(self):
         return sum([ len(idx) for idx in self.index_list ])
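
To make the pairing in `signatures_with_location` concrete, here is a small sketch with stand-in classes. `ListIndex` and `MiniMultiIndex` are hypothetical names invented for this illustration. The length check in `__init__` is an assumption added here because `zip` silently truncates to the shorter list; the diff does not show whether the real `MultiIndex` constructor performs such a check.

```python
class ListIndex:
    """Hypothetical minimal index holding signatures in a list."""

    def __init__(self, sigs):
        self._sigs = list(sigs)

    def signatures(self):
        yield from self._sigs

    def __len__(self):
        return len(self._sigs)


class MiniMultiIndex:
    """Sketch of a container that tracks which source each sub-index came from."""

    def __init__(self, index_list, source_list):
        # Assumption of this sketch: refuse mismatched lists up front,
        # since zip() below would otherwise drop trailing entries silently.
        if len(index_list) != len(source_list):
            raise ValueError("index_list and source_list must have equal length")
        self.index_list = list(index_list)
        self.source_list = list(source_list)

    def signatures_with_location(self):
        # Pair each sub-index with its source, then tag every signature.
        for idx, loc in zip(self.index_list, self.source_list):
            for ss in idx.signatures():
                yield ss, loc

    def __len__(self):
        return sum(len(idx) for idx in self.index_list)


mi = MiniMultiIndex(
    [ListIndex(["s1", "s2"]), ListIndex(["s3"])],
    ["a.sig", "b.sig"],
)
pairs = list(mi.signatures_with_location())
print(pairs, len(mi))
```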

@@ -203,6 +210,53 @@ def insert(self, *args):
     def load(self, *args):
         raise NotImplementedError

+    @classmethod
+    def load_from_path(cls, pathname, force=False):
+        "Create a MultiIndex from a path (filename or directory)."
+        from .sourmash_args import traverse_find_sigs
+        if not os.path.exists(pathname):
+            raise ValueError(f"'{pathname}' must be a directory")
+
+        index_list = []
+        source_list = []
+        for thisfile in traverse_find_sigs([pathname], yield_all_files=force):
+            try:
+                idx = LinearIndex.load(thisfile)
+                index_list.append(idx)
+                source_list.append(thisfile)
+            except (IOError, sourmash.exceptions.SourmashError):
+                if force:
+                    continue    # ignore error
+                else:
+                    raise       # re-raise the error
+
+        db = None
+        if index_list:
+            db = cls(index_list, source_list)
+        else:
+            raise ValueError(f"no signatures to load under directory '{pathname}'")
+
+        return db
+
+    @classmethod
+    def load_from_file_list(cls, filename):
+        "Create a MultiIndex from all files listed in a text file."
+        from .sourmash_args import (load_file_list_of_signatures,
+                                    load_file_as_index)
+        idx_list = []
+        src_list = []
+
+        file_list = load_file_list_of_signatures(filename)
+        for fname in file_list:
+            idx = load_file_as_index(fname)
+            src = fname
+
+            idx_list.append(idx)
+            src_list.append(src)
+
+        db = MultiIndex(idx_list, src_list)
+        return db
+
     def save(self, *args):
         raise NotImplementedError
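
The `force` flag in `load_from_path` decides whether a file that fails to load is skipped or aborts the whole load. The sketch below isolates that control flow. `load_all` and `fake_loader` are hypothetical helpers, and `(OSError, ValueError)` stands in for the `(IOError, sourmash.exceptions.SourmashError)` pair caught in the diff.

```python
def load_all(paths, loader, force=False):
    """Load every path with `loader`; return (objects, sources) as parallel lists.

    With force=True a failing path is skipped; otherwise the error propagates.
    """
    loaded, sources = [], []
    for path in paths:
        try:
            obj = loader(path)
        except (OSError, ValueError):
            if force:
                continue  # ignore this file and keep going
            raise         # strict mode: surface the original error
        loaded.append(obj)
        sources.append(path)

    if not loaded:
        # Mirrors the "no signatures to load" guard: an empty result is an error.
        raise ValueError("nothing could be loaded")
    return loaded, sources


def fake_loader(path):
    # Hypothetical loader for the example: only ".sig" files parse.
    if not path.endswith(".sig"):
        raise ValueError(f"cannot parse '{path}'")
    return path.upper()


lenient = load_all(["a.sig", "notes.txt"], fake_loader, force=True)
print(lenient)

try:
    load_all(["a.sig", "notes.txt"], fake_loader)
    strict_error = None
except ValueError as exc:
    strict_error = str(exc)
print(strict_error)
```

Appending to both lists only after a successful load keeps the objects and their sources aligned even when files are skipped.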

20 changes: 8 additions & 12 deletions src/sourmash/lca/command_summarize.py
@@ -10,6 +10,7 @@
 from ..logging import notify, error, print_results, set_quiet, debug
 from . import lca_utils
 from .lca_utils import check_files_exist
+from sourmash.index import MultiIndex


 DEFAULT_THRESHOLD=5
@@ -61,20 +62,15 @@ def load_singletons_and_count(filenames, ksize, scaled, ignore_abundance):
     total_count = 0
     n = 0

-    # in order to get the right reporting out of this function, we need
-    # to do our own traversal to expand the list of filenames, as opposed
-    # to using load_file_as_signatures(...)
-    filenames = sourmash_args.traverse_find_sigs(filenames)
-    filenames = list(filenames)
-
     total_n = len(filenames)
-
-    for query_filename in filenames:
+    for filename in filenames:
         n += 1
-        for query_sig in sourmash_args.load_file_as_signatures(query_filename,
-                                                               ksize=ksize):
+        mi = MultiIndex.load_from_path(filename)
+        mi = mi.select(ksize=ksize)
+
+        for query_sig, query_filename in mi.signatures_with_location():
             notify(u'\r\033[K', end=u'')
-            notify('... loading {} (file {} of {})', query_sig, n,
+            notify(f'... loading {query_sig} (file {n} of {total_n})',
                    total_n, end='\r')
             total_count += 1

@@ -87,7 +83,7 @@ def load_singletons_and_count(filenames, ksize, scaled, ignore_abundance):
             yield query_filename, query_sig, hashvals

     notify(u'\r\033[K', end=u'')
-    notify('loaded {} signatures from {} files total.', total_count, n)
+    notify(f'loaded {total_count} signatures from {n} files total.')


 def count_signature(sig, scaled, hashvals):
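
The progress lines in `load_singletons_and_count` rely on two terminal tricks: `\r` returns the cursor to column 0, and the ANSI sequence `\033[K` erases to the end of the line. Here is a rough standalone sketch of that pattern. `show_progress` and `load_with_progress` are hypothetical helpers, not sourmash's `notify`. Once the message is an f-string no positional format arguments are needed, so the `total_n` left in the hunk above looks redundant, though whether `notify` tolerates it depends on its implementation, which is not shown here.

```python
import io
import sys

CLEAR_LINE = "\r\033[K"  # carriage return + ANSI "erase to end of line"


def show_progress(message, stream=sys.stderr):
    """Overwrite the current terminal line with `message`."""
    stream.write(CLEAR_LINE)      # wipe whatever the previous update left
    stream.write(message + "\r")  # park the cursor at column 0 for the next update
    stream.flush()


def load_with_progress(names, stream=sys.stderr):
    total_n = len(names)
    for n, name in enumerate(names, start=1):
        show_progress(f"... loading {name} (file {n} of {total_n})", stream)
    # Final summary: clear the progress line, then end with a real newline.
    stream.write(CLEAR_LINE)
    stream.write(f"loaded {total_n} signatures total.\n")


buf = io.StringIO()  # capture output so the example is checkable
load_with_progress(["sigA", "sigB"], stream=buf)
```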
5 changes: 4 additions & 1 deletion src/sourmash/lca/lca_db.py
@@ -1,5 +1,5 @@
 "LCA database class and utilities."
-
+import os
 import json
 import gzip
 from collections import OrderedDict, defaultdict, Counter
@@ -187,6 +187,9 @@ def load(cls, db_name):
         "Load LCA_Database from a JSON file."
         from .lca_utils import taxlist, LineagePair

+        if not os.path.isfile(db_name):
+            raise ValueError(f"'{db_name}' is not a file and cannot be loaded as an LCA database")
+
         xopen = open
         if db_name.endswith('.gz'):
             xopen = gzip.open
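
The new guard turns "this path is a directory or does not exist" into a `ValueError` before any file is opened. A minimal sketch of the guard combined with the `.gz` opener switch, using only the standard library: `load_json_db` is a hypothetical function written for illustration, not the real `LCA_Database.load`.

```python
import gzip
import json
import os
import tempfile


def load_json_db(db_name):
    """Load a JSON document from a plain or gzip-compressed file."""
    if not os.path.isfile(db_name):
        # Fail early with a clear message rather than an opaque open() error.
        raise ValueError(f"'{db_name}' is not a file and cannot be loaded")

    # Pick the opener by extension, as in the hunk above.
    xopen = gzip.open if db_name.endswith(".gz") else open
    with xopen(db_name, "rt") as fp:
        return json.load(fp)


with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "db.json.gz")
    with gzip.open(path, "wt") as fp:
        json.dump({"version": "2.0"}, fp)

    loaded = load_json_db(path)

    # Passing the directory itself should trip the guard.
    try:
        load_json_db(tmpdir)
        dir_error = None
    except ValueError as exc:
        dir_error = str(exc)

print(loaded, dir_error is not None)
```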
2 changes: 1 addition & 1 deletion src/sourmash/signature.py
@@ -252,7 +252,7 @@ def load_signatures(
     input_type = _detect_input_type(data)
     if input_type == SigInput.UNKNOWN:
         if do_raise:
-            raise Exception("Error in parsing signature; quitting. Cannot open file or invalid signature")
+            raise ValueError("Error in parsing signature; quitting. Cannot open file or invalid signature")
         return

     size = ffi.new("uintptr_t *")
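
Raising `ValueError` instead of a bare `Exception` lets callers catch parse failures specifically without also swallowing unrelated bugs. The sketch below shows one way a caller could use that when trying several loaders in turn. All function names are hypothetical, and the loader-list idea is an assumption drawn from the commit titles above, not code shown in this diff.

```python
def strict_parse(data):
    # Hypothetical parser: only accepts text that looks like a JSON list.
    if not data.lstrip().startswith("["):
        raise ValueError("invalid signature")
    return ("signature", data)


def fallback_parse(data):
    # Hypothetical last-resort loader that accepts anything.
    return ("raw", data)


def buggy_parse(data):
    return data + 1  # programming error: raises TypeError for str input


def first_successful(data, loaders):
    """Return the result of the first loader that does not raise ValueError."""
    for loader in loaders:
        try:
            return loader(data)
        except ValueError:
            continue  # a parse failure: try the next loader
    raise ValueError("no loader could handle the input")


result = first_successful("not json", [strict_parse, fallback_parse])

# A genuine bug is not masked, because only ValueError is caught.
try:
    first_successful("not json", [buggy_parse, fallback_parse])
    bug_hidden = True
except TypeError:
    bug_hidden = False

print(result, bug_hidden)
```

Had the loop caught `Exception`, the `TypeError` from `buggy_parse` would have been silently treated as a parse failure.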