"reverse" containment #1198
Comments
hi @phiweger, no, nothing is available via the command line, I'm afraid. It's fairly straightforward via the API, tho. I'm wondering if |
how would you solve this using the API (I just need a pointer to the main functions to look at)? |
On Wed, Sep 23, 2020 at 08:57:26AM -0700, Adrian Viehweger wrote:
how would you solve this using the API (I just need a pointer to the main functions to look at)?
I would iterate over all the signatures in the database or directory using the
load_file_as_signatures in sourmash_args, and then run sig.contained_by(query).
I'll try to give you some working code later today or tomorrow :)
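To make the direction of the question concrete, here is a toy sketch of the loop described above. It uses plain Python sets in place of sketches and does not call sourmash; `containment` and `reverse_containment` are hypothetical helper names, and this is not the script posted later in the thread.

```python
def containment(a, b):
    """Fraction of the hashes in `a` that are also present in `b`."""
    if not a:
        return 0.0
    return len(a & b) / len(a)


def reverse_containment(query, collection, threshold=0.0):
    """Rank each named hash set by how much of it is contained in `query`."""
    hits = []
    for name, hashes in collection.items():
        score = containment(hashes, query)
        if score > threshold:
            hits.append((score, name))
    # (score, name) tuples sort cleanly because names are strings
    hits.sort(reverse=True)
    return hits


# Toy data: integers stand in for k-mer hashes.
genome = set(range(100))
phages = {
    "phage_A": set(range(10)),        # fully inside the genome
    "phage_B": set(range(95, 105)),   # half inside the genome
    "phage_C": set(range(200, 210)),  # no overlap
}

hits = reverse_containment(genome, phages)
for score, name in hits:
    print(f"{score:.2f}  {name}")
```

Note the asymmetry: `containment(phages["phage_A"], genome)` is 1.0, while `containment(genome, phages["phage_A"])` is only 0.1, which is why the direction of the search matters.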
|
This at least executes 😁 file
|
Oh wow, thank you very much. I'll give this a spin and report back! |
welcome & please do!
|
@ctb is |
Ah yes, just tested this empirically ;) It works! Thanks a lot, this will be very useful to me. |
Excellent! This script is using some of the newer API calls behind the following statement here in the docs --
so you should be able to use any database or collection of sigs as input for |
(and glad it's helpful!) |
Sorry to bother. I initially commented out a line, so for completeness: there seems to be a minor issue.
python reverse-gather.py some.sig some.sbt.zip
Traceback (most recent call last):
results.sort(reverse=True)
TypeError: '<' not supported between instances of 'SourmashSignature' and 'SourmashSignature'
Commenting out Thanks @ctb |
Does |
the results are sorted for a single database, but not when retrieved for multiple databases, I think. Not sure why the sort isn't working... OH! It's because there are conditions where sort is looking at the second element of the list (because the containment, the first element, is equal). If you do something like
that should fix it. |
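A minimal, self-contained sketch of the failure described above and one way around it. `FakeSignature` is a hypothetical stand-in for `SourmashSignature`, and the `(containment, signature)` tuple layout is an assumption based on the description, since the original script is not shown here.

```python
class FakeSignature:
    """Stand-in for a signature object: it defines no ordering (no __lt__)."""

    def __init__(self, name):
        self.name = name


results = [
    (0.5, FakeSignature("a")),
    (0.9, FakeSignature("b")),
    (0.5, FakeSignature("c")),  # ties with "a" on containment
]

# Plain tuple sorting falls through to the second element on a tie,
# and the signature objects cannot be compared with '<'.
try:
    sorted(results, reverse=True)
    tie_error = None
except TypeError as exc:
    tie_error = exc

# Sorting on the containment value alone never compares the signatures.
results.sort(reverse=True, key=lambda item: item[0])
names = [sig.name for _, sig in results]
print(names)
```

Because Python's sort is stable, entries with equal containment keep their original relative order.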
👋 Hello! I'm also interested in a CLI-level usage. Here's the use case:
Is the Python API level the main way to go right now? |
for now, yes. I'd suggest writing/modifying a script (and posting it here :), and then updating it as you find it does (or doesn't) suit your needs - then we can make it more generic and include it in sourmash proper! |
personally I'm a fan of the CLI usage in the prefetch script posted in #1126:
as it gives you a lot of flexibility. That code also makes full use of the |
note that |
@ctb I tried
and get something that looks like what I am looking for. Am I correct to think that
|
hi @phiweger please try this, which makes use of APIs available in sourmash 4 -

#! /usr/bin/env python
import sys
import sourmash
from sourmash.sourmash_args import load_file_as_signatures, load_dbs_and_sigs
import argparse


def main():
    p = argparse.ArgumentParser()
    p.add_argument('many_sigs', help='collection to query')
    p.add_argument('dbs', help='databases to query', nargs='+')
    args = p.parse_args()

    query_list = list(load_file_as_signatures(args.many_sigs))

    # grab first query to use for load_dbs_and_sigs
    first_query = query_list[0]
    dbs = load_dbs_and_sigs(args.dbs, first_query, False)

    for query_sig in query_list:
        results = []
        for db in dbs:
            results.extend(db.prefetch(query_sig, threshold_bp=0))

        print(results[:5])
        results.sort(reverse=True, key=lambda x: x.score)

        print('query:', query_sig.name)
        if not results:
            print(' ** no matches **')
        else:
            for sr in results:
                print(f' {sr.score:.3f} {sr.signature.name} in {sr.location}')

    return 0


if __name__ == '__main__':
    sys.exit(main())
|
note that if you run this with a set of query signatures with different k-mer or scaled or whatnot, you'll get an incompatible MinHash error somewhere in there, because the script doesn't make sure that all query sketches are compatible. I can fix this up if it's important, but it adds a lot of boilerplate code that's annoying to look at so I left it out for now :) |
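For anyone who wants that check, a rough sketch of what it could look like. It assumes each signature exposes `minhash.ksize`, `minhash.scaled`, and `minhash.moltype`; the helper names and the `SimpleNamespace` stand-ins are hypothetical, so the snippet runs without sourmash installed and is not the boilerplate referred to above.

```python
from types import SimpleNamespace


def sketch_params(sig):
    """Return the parameters that must match for two sketches to be comparable."""
    mh = sig.minhash
    return (mh.ksize, mh.scaled, mh.moltype)


def split_by_compatibility(sigs):
    """Split `sigs` into those matching the first sketch's parameters and the rest."""
    if not sigs:
        return [], []
    reference = sketch_params(sigs[0])
    compatible = [s for s in sigs if sketch_params(s) == reference]
    skipped = [s for s in sigs if sketch_params(s) != reference]
    return compatible, skipped


def fake_sig(name, ksize, scaled, moltype="DNA"):
    """Build a stand-in object shaped like a signature (illustration only)."""
    minhash = SimpleNamespace(ksize=ksize, scaled=scaled, moltype=moltype)
    return SimpleNamespace(name=name, minhash=minhash)


queries = [
    fake_sig("q1", 31, 1000),
    fake_sig("q2", 31, 1000),
    fake_sig("q3", 21, 1000),  # different k-mer size
]

compatible, skipped = split_by_compatibility(queries)
print("using:", [s.name for s in compatible])
print("skipping:", [s.name for s in skipped])
```

Filtering against the first query mirrors how the script above picks `first_query` to load the databases.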
Thanks! Prefetch will treat each sig independently correct? Like, say sig A has 10 hashes in common w/ my query ("QandA") and sig B has 15 in common ("QandB"), then the intersection of QandA and QandB is not necessarily empty. Like, for recursive gather, I have to take the top prefetch hit, remove the hashes from the query, then repeat? |
yes, correct. we do have internal APIs for gather that you can use if you want to do the min-set-cov/gather; lmk. |
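The "take the top hit, remove its hashes from the query, repeat" loop from the question can be sketched with plain sets. This is a toy greedy cover for illustration only, not sourmash's internal gather code; `greedy_gather` and the example data are made up.

```python
def greedy_gather(query, collection, min_overlap=1):
    """Repeatedly pick the set sharing the most hashes with what is left of the query."""
    remaining = set(query)
    picked = []
    while remaining:
        best_name, best_overlap = None, 0
        for name in sorted(collection):  # sorted for deterministic tie-breaking
            overlap = len(collection[name] & remaining)
            if overlap > best_overlap:
                best_name, best_overlap = name, overlap
        if best_name is None or best_overlap < min_overlap:
            break
        picked.append((best_name, best_overlap))
        # Remove the matched hashes so later hits only count what is new.
        remaining -= collection[best_name]
    return picked, remaining


# Integers stand in for hashes. sig_A shares 10 hashes with the query,
# sig_B shares 15, and 5 of those are shared by both.
query = set(range(30))
collection = {
    "sig_A": set(range(10, 20)) | {200},
    "sig_B": set(range(0, 15)) | {100, 101},
}

picked, remaining = greedy_gather(query, collection)
print(picked)
print(len(remaining), "query hashes left unassigned")
```

Scored independently, `sig_A` overlaps the query by 10 hashes, but once `sig_B` has been picked it only accounts for 5 new ones. That difference is what separates the independent prefetch scores from a gather-style result.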
Would be really great if you could point me to some code to implement this/ not reinvent the wheel. Thx a lot! |
try out the below! I'm not sure it does what you want - it does treat each query independently - but hopefully the code is reasonably clear as to what it's actually doing :)

#! /usr/bin/env python
import sys
import sourmash
from sourmash.sourmash_args import load_file_as_signatures, load_dbs_and_sigs
from sourmash.search import GatherDatabases
import argparse


def main():
    p = argparse.ArgumentParser()
    p.add_argument('many_sigs', help='collection to query')
    p.add_argument('dbs', help='databases to query', nargs='+')
    args = p.parse_args()

    query_list = list(load_file_as_signatures(args.many_sigs))

    # grab first query to use for load_dbs_and_sigs
    first_query = query_list[0]
    dbs = load_dbs_and_sigs(args.dbs, first_query, False)

    for query_sig in query_list:
        results = []
        for gather_result, weighted_missed in GatherDatabases(query_sig, dbs,
                                                              threshold_bp=0):
            results.append(gather_result)

        print(results[:5])
        # see src/sourmash/search.py, GatherResult namedtuple
        results.sort(reverse=True, key=lambda x: x.intersect_bp)

        print('query:', query_sig.name)
        if not results:
            print(' ** no matches **')
        else:
            for gr in results:
                print(f' {gr.intersect_bp} {gr.match.name}')

    return 0


if __name__ == '__main__':
    sys.exit(main())
|
thanks a lot @ctb I will report |
I think the new plugin #2970 provides the necessary functionality in a nicely packaged way. 🎉 |
I recently had this use case: I have a large database of, say, phages, and I want to know if they are contained in my query genome.
Ideally, I'd like to index them in an SBT and then search using containment, but currently this is not possible, right? Because
sourmash search --containment genome index
asks whether the genome is contained in records from the index. Is there a way to efficiently do the reverse? That is:
sourmash search --containment index genome
Thanks!