MRG: add skipmers; switch to reading frame approach for translation, skipmers #3395
CodSpeed Performance Report: Merging #3395 will not alter performance.
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## latest #3395 +/- ##
==========================================
+ Coverage 86.42% 86.45% +0.03%
==========================================
Files 137 137
Lines 16103 16156 +53
Branches 2219 2219
==========================================
+ Hits 13917 13968 +51
- Misses 1879 1881 +2
Partials 307 307
Hi Tessa, |
Hi Mo! At the moment I just got the basics working, and am testing it out over in branchwater. Is there a strong reason for flexible n,m?
By "hash value range" do you mean the FracMinHash selection process (i.e. max hash)? I'm just using the SeqToHashes approach, so my reading of it is that we construct all, then add if it meets the threshold. Do you have something more efficient/flexible already implemented? And/or what things do you think would be useful here? |
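For readers following along, the "construct all, then add if it meets the threshold" reading can be sketched roughly as below. This is only an illustration: `hash_kmer` and `fracminhash_sketch` are hypothetical names, the hash is a standard-library stand-in rather than the hash sourmash actually uses, and the threshold is assumed to be `u64::MAX / scaled`.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Stand-in hash for illustration only (an assumption, NOT sourmash's hash function).
fn hash_kmer(kmer: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    kmer.hash(&mut hasher);
    hasher.finish()
}

/// Hypothetical helper: hash every k-mer, then keep only hashes at or below
/// the FracMinHash threshold (assumed here to be u64::MAX / scaled).
fn fracminhash_sketch(seq: &[u8], k: usize, scaled: u64) -> Vec<u64> {
    assert!(k > 0 && scaled > 0, "k and scaled must be positive");
    let max_hash = u64::MAX / scaled;
    seq.windows(k)
        .map(hash_kmer)
        .filter(|&h| h <= max_hash)
        .collect()
}
```

With `scaled = 1` every hash passes the threshold; larger `scaled` values keep roughly a `1/scaled` fraction of the hashes.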
Not a strong one. Adding skipmers to sourmash sketching is an excellent addition. Having the flexibility to change n, m, and k would let users tune the dispersity/contiguity of the extracted skipmers, which could help in different applications.
Gotcha! Just expect that small n,m will have a noticeable slowdown in sketching time.
Not really flexible in that context, but you might find the skipmers implementation in kmerDecoder helpful. https://github.com/dib-lab/kmerDecoder/blob/master/src/KD_skipmers.cpp |
Got it. I think this shouldn't be too hard if we get a good implementation in, and we could have users specify
Good point. I haven't done any thinking about optimization yet.
After reading your implementation, it seems that the main difference is that you take the entire sequence and skipmerize it (remove the skipped bases), then take k-mers/hashes from that sequence as usual. Is that right? Was that a pretty significant speedup compared with just generating skipmers as you go? |
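A minimal sketch of the "skipmerize first, then take k-mers as usual" idea described above. It assumes the usual skipmer definition (keep the first `m` bases of every `n`-base cycle) and uses hypothetical function names; it is not the kmerDecoder or sourmash code.

```rust
/// Keep the first `m` bases of every `n`-base cycle, starting at offset `frame`.
/// (Hypothetical helper; assumes 1 <= m < n.)
fn skipmerize(seq: &[u8], m: usize, n: usize, frame: usize) -> Vec<u8> {
    assert!(m >= 1 && m < n, "need 1 <= m < n");
    seq.iter()
        .skip(frame)
        .enumerate()
        .filter(|(i, _)| i % n < m) // position within the cycle decides keep vs. skip
        .map(|(_, &b)| b)
        .collect()
}

/// Build one skipmerized sequence per frame offset (0..n), then take
/// ordinary k-mers from each of them.
fn skipmers(seq: &[u8], k: usize, m: usize, n: usize) -> Vec<Vec<u8>> {
    assert!(k > 0, "k must be positive");
    (0..n)
        .flat_map(|frame| {
            let kept = skipmerize(seq, m, n, frame);
            kept.windows(k).map(|w| w.to_vec()).collect::<Vec<_>>()
        })
        .collect()
}
```

For example, with m=2, n=3 the sequence `ACGTACGTA` skipmerizes to `ACTAGT` in frame 0, and with m=1, n=3 it becomes `ATG`.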
Parameterizing it should be an ideal solution, yes! Thank you!
I haven't documented any benchmark here, but I believe I did it that way for performance. |
Hey Mo! Reading through the 2017 skipmer paper again, they note that triplet (n=3) skipmer patterns performed best, namely m=2,n=3 and m=1,n=3. Do you have a good argument for allowing more patterns than that? I'm using the hashfunctions (moltype) enum to build sketches and ensure that only compatible sketches are comparable down the road. We don't want incompatible skipmer sketches to be compared. I think unless we currently have evidence to show other combos are useful, I may just enable these two and make them two enums, e.g. Open to other ideas, though! |
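To illustrate the "one enum value per pattern" idea, here is a rough sketch of how a fixed set of patterns can guard against comparing incompatible sketches. The names `SkipmerPattern`, `SketchParams`, and `check_comparable` are hypothetical and are not the enum values proposed in this PR.

```rust
/// Hypothetical enum: one variant per supported (m, n) skipmer pattern.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SkipmerPattern {
    M1N3,
    M2N3,
}

impl SkipmerPattern {
    /// Number of bases kept per cycle.
    fn m(self) -> usize {
        match self {
            SkipmerPattern::M1N3 => 1,
            SkipmerPattern::M2N3 => 2,
        }
    }

    /// Cycle length (both supported patterns are triplet patterns).
    fn n(self) -> usize {
        3
    }
}

/// Hypothetical sketch parameters used only for the compatibility check.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct SketchParams {
    pattern: SkipmerPattern,
    ksize: usize,
}

/// Refuse to compare sketches built with different patterns or k sizes.
fn check_comparable(a: &SketchParams, b: &SketchParams) -> Result<(), String> {
    if a.pattern != b.pattern {
        return Err(format!("pattern mismatch: {:?} vs {:?}", a.pattern, b.pattern));
    }
    if a.ksize != b.ksize {
        return Err(format!("ksize mismatch: {} vs {}", a.ksize, b.ksize));
    }
    Ok(())
}
```

Keeping the pattern in the type means a flexible (m, n) would need either more variants or a different representation later.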
I don't really have a use case in mind for different configurations. So this is good enough for the implementation, and as you said, we could add more later if needed. |
Make skipmers robust, but keep #3395 functional in the meantime.

This PR:
- enables a second skipmer type, so we have m1n3 in addition to m2n3
- switches to a reading frame approach for both translation and skipmers: we first build the reading frame, then kmerize, rather than building k-mers and translating/skipping on the fly
- avoids the "extended length" needed for skipping on the fly

Since this changes the `SeqToHashes` strategy a bit, there's one Python test where we now see a different error.

Future thoughts:
- with the new structure, it would be straightforward to add validation to exclude protein k-mers with invalid amino acids (`X`). I guess I'm not entirely sure what happens to those atm...
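The "build the reading frame, then kmerize" approach for translation, together with the possible `X` validation mentioned under future thoughts, might look roughly like the sketch below. All names are hypothetical, only forward frames are handled, and marking codons with unrecognized bases as `X` is an assumption made for illustration rather than a statement about what sourmash does.

```rust
/// Standard genetic code, indexed by 16*b1 + 4*b2 + b3 with T=0, C=1, A=2, G=3.
const CODON_TABLE: &[u8] =
    b"FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG";

fn base_index(b: u8) -> Option<usize> {
    match b.to_ascii_uppercase() {
        b'T' => Some(0),
        b'C' => Some(1),
        b'A' => Some(2),
        b'G' => Some(3),
        _ => None,
    }
}

/// Build one translated reading frame (forward strand only in this sketch).
/// Codons containing an unrecognized base become `X` (an assumption).
fn translate_frame(dna: &[u8], frame: usize) -> Vec<u8> {
    if frame > dna.len() {
        return Vec::new();
    }
    dna[frame..]
        .chunks_exact(3) // a trailing partial codon is dropped
        .map(|codon| {
            match (base_index(codon[0]), base_index(codon[1]), base_index(codon[2])) {
                (Some(a), Some(b), Some(c)) => CODON_TABLE[16 * a + 4 * b + c],
                _ => b'X',
            }
        })
        .collect()
}

/// Kmerize an already-built frame, skipping k-mers that contain `X`.
fn protein_kmers(protein: &[u8], k: usize) -> Vec<Vec<u8>> {
    assert!(k > 0, "k must be positive");
    protein
        .windows(k)
        .filter(|w| !w.contains(&b'X'))
        .map(|w| w.to_vec())
        .collect()
}
```

Because the frame is built up front, the kmerizing step is the same plain windowing regardless of whether the frame came from translation or skipping.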
@ctb @mr-eyes @luizirber ready for review. |
On a first pass, looks good to me! I'm not thrilled with the Now I'm curious how hard it would be to add these to the Python layer 🤔 |
@luizirber any concerns, at least a hot-take level? |
ref #659 -- I think might need minor modification for this... |
This PR enables skipmers ONLY in the Rust code.

It switches `SeqToHashes` to use a reading frame struct, which simplifies/unifies the code across the different methods. The reading frame code handles any modifications needed, i.e. translation or skipping. Then we just kmerize the reading frame as usual. The main difference for translation is that we no longer need to store a buffer of all hashes from the reading frames.

Since this changes the `SeqToHashes` strategy a bit, there's one Python test where we now see a different error (modified).

Future thoughts:
Skipmer References: