gather does not break ties in any consistent manner #1366
Comments
yes, I think what is happening is pretty much what you say: the containment of the two matches is equal, and sourmash does not "tie break" equal matches, it just picks the first one it finds! @bluegenes off the top of my head it seems like the max containment approach #1343 would not solve this, right? #278 is probably the right approach; also see #707 for more motivation.
the implementation challenge until recently has been that doing this for really large collections of signatures is hard. however, we have some forthcoming solutions that help with this. it's probably not a next-week kind of feature tho, sorry :(
Thanks for the swift answer! #1343 would not solve it because I only want to know the containment of plasmids in my assembly (not the reverse). #278 would not help at all because my smaller plasmid would already have a much larger containment score (if it were to win the hashes); the winner of the hashes is determined not by containment score but by the number of shared hashes. I will probably do the (suboptimal) following:
It looks good to me. But if I only want the 'match containment', how does it differ with
Thanks for taking a look! I think you need match containment and match bp to do what you want (which is tie break), and I'd be in favor of upgrading Actually, now that I think of it,
#1370 will indeed provide a first-cut solution to this.
similar issue over at marbl/Mash#159
Hi,
I'm using sourmash (3.5) to investigate plasmids in assemblies.
Currently, I think that the 'gather' function is best suited to this goal: I want to find multiple plasmids if they are contained, but kmers that are shared among multiple plasmids should only go to one plasmid.
However, I'm having a problem. I have two plasmids (NC_011078 (larger), NZ_WYDM02000026 (smaller)) and for scaling x100, the small one is fully contained in the larger one.
However, when I gather the smaller plasmid against the file that includes both the smaller and the larger plasmid, it returns the larger plasmid.
I guess that in the gather function a tie (identical numbers of shared hashes) somehow goes in favour of the larger plasmid (either because a random match or the alphabetically first one is selected as the winner). I would like that, in case of a tie, the plasmid with the fewest kmers (and hence the largest containment score) wins the hashes. What do you think of this, and is it a possible (optional) feature?
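To make the request concrete, here is a minimal sketch of the tie-break I have in mind, written against plain Python sets of hashes rather than sourmash's own data structures. The function name `gather_with_tiebreak` and the toy hash values are made up for illustration; this is not how sourmash implements gather, only an assumption about how a greedy selection could rank equal matches.

```python
def gather_with_tiebreak(query, collection):
    """Greedy gather over plain hash sets (illustrative only).

    Each round picks the candidate sharing the most hashes with the
    remaining query. Ties are broken in favour of the candidate with
    the fewest hashes, i.e. the highest match containment
    (shared hashes / candidate size).
    """
    remaining = set(query)
    candidates = dict(collection)
    results = []
    while remaining and candidates:
        # Rank by (shared hashes, smaller candidate first).
        best = max(
            candidates,
            key=lambda name: (len(remaining & candidates[name]),
                              -len(candidates[name])),
        )
        shared = remaining & candidates[best]
        if not shared:
            break
        match_containment = len(shared) / len(candidates[best])
        results.append((best, len(shared), match_containment))
        remaining -= shared  # assigned hashes go to one match only
        del candidates[best]
    return results


# Toy data: the smaller plasmid is fully contained in the larger one.
large = set(range(10))
small = set(range(4))
collection = {"NC_011078": large, "NZ_WYDM02000026": small}

print(gather_with_tiebreak(small, collection))
```

With the smaller plasmid as the query, both candidates share the same four hashes, so the secondary key decides and the smaller plasmid is reported. With the larger plasmid as the query there is no tie, and the larger plasmid still wins on shared hashes alone.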