MRG: add skipmers; switch to reading frame approach for translation, skipmers #3395

bluegenes · 2024-11-13T00:19:52Z

This PR enables skipmers ONLY in the Rust code.

- enables two skipmer types: m1n3, m2n3
- switches SeqToHashes to use a reading frame struct, which simplifies/unifies the code across the different methods. The reading frame code handles any modifications needed, i.e. translation or skipping. Then we just kmerize the reading frame as usual. The main difference for translation is that we no longer need to store a buffer of all hashes from the reading frames.
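For readers who want a concrete picture of "build the reading frame, then kmerize", here is a minimal Python sketch. It is not the Rust code in this PR: `skipmer_frame` and `kmerize` are hypothetical names, the use of `n` start offsets is an assumption, and reverse complements and hashing are left out.

```python
def skipmer_frame(seq: str, m: int, n: int, start: int = 0) -> str:
    """Keep the first m bases of every n-base cycle, beginning at offset `start`."""
    if not 0 < m < n:
        raise ValueError("need 0 < m < n")
    return "".join(base for i, base in enumerate(seq[start:]) if i % n < m)


def kmerize(frame: str, k: int) -> list[str]:
    """Slide a window of length k over an already-built reading frame."""
    return [frame[i:i + k] for i in range(len(frame) - k + 1)]


seq = "AGTCGTCAGTC"

# Assumption for illustration: one frame per start offset 0..n-1.
frames_m2n3 = [skipmer_frame(seq, 2, 3, start) for start in range(3)]
print(frames_m2n3)                 # ['AGCGCATC', 'GTGTAGC', 'TCTCGT']
print(kmerize(frames_m2n3[0], 4))  # ['AGCG', 'GCGC', 'CGCA', 'GCAT', 'CATC']
```

The point of the split is that `kmerize` does not need to know whether the frame came from skipping or from translation.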

Since this changes the SeqToHashes strategy a bit, there's one python test where we now see a different error (modified).

Future thoughts:

with the new structure, it would be straightforward to add validation for protein k-mers. I guess I'm not entirely sure what happens to those atm...
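A rough Python sketch of what such validation could look like, assuming the standard genetic code, uppercase input, and that untranslatable codons become `X`. The names `translate_frame` and `valid_protein_kmers` are hypothetical, and this does not describe what sourmash currently does with those k-mers.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_frame(dna: str, start: int = 0) -> str:
    """Translate one forward reading frame; unknown codons become 'X'."""
    codons = (dna[i:i + 3] for i in range(start, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def valid_protein_kmers(frame: str, k: int) -> list[str]:
    """Kmerize a protein frame, dropping k-mers that contain 'X'."""
    kmers = (frame[i:i + k] for i in range(len(frame) - k + 1))
    return [kmer for kmer in kmers if "X" not in kmer]


protein = translate_frame("ATGGCCNNNAAA")
print(protein)                          # MAXK
print(valid_protein_kmers(protein, 2))  # ['MA']
```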

Skipmer References:

codspeed-hq · 2024-11-13T00:22:03Z

CodSpeed Performance Report

Merging #3395 will not alter performance

Comparing try-skipmers (8cb0bea) with latest (d22b860)

Summary

✅ 21 untouched benchmarks

codecov · 2024-11-13T00:26:33Z

Codecov Report

Attention: Patch coverage is 89.47368% with 12 lines in your changes missing coverage. Please review.

Project coverage is 86.45%. Comparing base (d22b860) to head (8cb0bea).

Files with missing lines	Patch %	Lines
src/core/src/signature.rs	91.11%	8 Missing ⚠️
src/core/src/encodings.rs	66.66%	2 Missing ⚠️
src/core/src/sketch/minhash.rs	0.00%	2 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           latest    #3395      +/-   ##
==========================================
+ Coverage   86.42%   86.45%   +0.03%     
==========================================
  Files         137      137              
  Lines       16103    16156      +53     
  Branches     2219     2219              
==========================================
+ Hits        13917    13968      +51     
- Misses       1879     1881       +2     
  Partials      307      307

Flag	Coverage Δ
hypothesis-py	`25.43% <ø> (ø)`
python	`92.40% <ø> (ø)`
rust	`62.67% <89.47%> (+0.55%)`	⬆️


mr-eyes · 2024-11-19T23:52:07Z

Hi Tessa,
Will you be allowing user-defined n, m, and k here? And will you construct a skipmer only after checking it against the accepted hash value range, or will you construct all skipmers, hash them, and then either accept or skip each one?

bluegenes · 2024-11-20T01:17:21Z

> Hi Tessa, Will you be allowing user-defined n, m,k here?

Hi Mo! At the moment I just got the basics working, and am testing it out over in branchwater. Is there a strong reason for flexible n,m?

> And will you decide to construct the skipmer after accepting the hash value range or you will construct all Skipmers then hash them and either accept or skip?

By "hash value range" do you mean the FracMinHash selection process (i.e. max hash)? I'm just using the SeqToHashes approach, so my reading of it is that we construct all, then add if it meets the threshold.

Do you have something more efficient/flexible already implemented? And/or what things do you think would be useful here?
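The "construct all, then add if it meets the threshold" flow described above can be sketched in Python as follows. This is only an illustration: `hash_kmer`, `max_hash_for_scaled`, and `select_hashes` are hypothetical names, `blake2b` from the standard library stands in for the MurmurHash3 64-bit hash that sourmash uses, and the threshold is approximated as 2^64 / scaled.

```python
import hashlib

MAX_U64 = 2**64 - 1


def hash_kmer(kmer: str) -> int:
    """Stand-in 64-bit hash (NOT MurmurHash3; assumption for illustration)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def max_hash_for_scaled(scaled: int) -> int:
    """Approximate FracMinHash threshold: keep roughly 1/scaled of all hashes."""
    return MAX_U64 // scaled


def select_hashes(kmers: list[str], scaled: int) -> list[int]:
    """Hash every k-mer first, then keep only hashes at or below the threshold."""
    threshold = max_hash_for_scaled(scaled)
    hashes = (hash_kmer(kmer) for kmer in kmers)  # every k-mer is built and hashed
    return sorted({h for h in hashes if h <= threshold})


kmers = ["AGCG", "GCGC", "CGCA", "GCAT", "CATC"]
kept = select_hashes(kmers, scaled=2)
print(len(kept), "of", len(kmers), "hashes kept")
```

Because the filter runs after hashing, the cost of building skipmers is paid for every k-mer regardless of how many hashes survive.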

mr-eyes · 2024-11-20T02:02:19Z

> Hi Mo! At the moment I just got the basics working, and am testing it out over in branchwater. Is there a strong reason for flexible n,m?

Not a strong one. Adding skipmers to sourmash sketching is an excellent addition. So, having the flexibility to change n,m, and k would be good for changing the dispersity/contiguity of the extracted skipmers and, therefore, helping in different applications.

> By "hash value range" do you mean the FracMinHash selection process (i.e. max hash)? I'm just using the SeqToHashes approach, so my reading of it is that we construct all, then add if it meets the threshold.

Gotcha! Just expect that small n,m will have a noticeable slowdown in sketching time.

> Do you have something more efficient/flexible already implemented? And/or what things do you think would be useful here?

Not really flexible in that context, but you might find the skipmers implementation in kmerDecoder helpful. https://github.com/dib-lab/kmerDecoder/blob/master/src/KD_skipmers.cpp
and this very old example: https://github.com/mr-eyes/OLD_kmerDecoder/blob/d5eb475875ecbe1f3440e1448a56a5ab3b1984fc/python_preview/skipmers.ipynb

bluegenes · 2024-11-21T01:04:59Z

> > Hi Mo! At the moment I just got the basics working, and am testing it out over in branchwater. Is there a strong reason for flexible n,m?
>
> Not a strong one. Adding skipmers to sourmash sketching is an excellent addition. So, having the flexibility to change n,m, and k would be good for changing the dispersity/contiguity of the extracted skipmers and, therefore, helping in different applications.

Got it. I think this shouldn't be too hard if we get a good implementation in, and we could have users specify m= and n= in the param string. I think I'll probably leave this to the future, but I can try to add m,n variables in to make future changes easier.

> > By "hash value range" do you mean the FracMinHash selection process (i.e. max hash)? I'm just using the SeqToHashes approach, so my reading of it is that we construct all, then add if it meets the threshold.
>
> Gotcha! Just expect that small n,m will have a noticeable slowdown in sketching time.

Good point. I haven't done any thinking about optimization yet.

> > Do you have something more efficient/flexible already implemented? And/or what things do you think would be useful here?
>
> Not really flexible in that context, but you might find the skipmers implementation in kmerDecoder helpful. https://github.com/dib-lab/kmerDecoder/blob/master/src/KD_skipmers.cpp and this very old example: https://github.com/mr-eyes/OLD_kmerDecoder/blob/d5eb475875ecbe1f3440e1448a56a5ab3b1984fc/python_preview/skipmers.ipynb

After reading your implementation, it seems that the main difference is that you take the entire sequence and skipmerize it (remove the skipped bases), then take k-mers/hashes from that sequence as usual. Is that right? Was that a pretty significant speedup compared with just generating skipmers as you go?
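To make the two strategies in this question concrete, here is a toy Python comparison. It is a sketch under assumptions, not code from sourmash or kmerDecoder: all function names are hypothetical, and "as you go" is taken to mean reading a longer window of the original sequence at every position and dropping the skipped bases.

```python
def skipmer_span(k: int, m: int, n: int) -> int:
    """Original bases covered by one skipmer of k kept bases (the longer window)."""
    return ((k - 1) // m) * n + ((k - 1) % m) + 1


def skipmers_on_the_fly(seq: str, k: int, m: int, n: int) -> list[str]:
    """At each position, read a span-long window and drop the skipped bases."""
    span = skipmer_span(k, m, n)
    out = []
    for p in range(len(seq) - span + 1):
        window = seq[p:p + span]
        out.append("".join(b for i, b in enumerate(window) if i % n < m))
    return out


def skipmers_frame_first(seq: str, k: int, m: int, n: int) -> list[str]:
    """Build each skipped reading frame once, then take plain k-mers from it."""
    out = []
    for start in range(n):
        frame = "".join(b for i, b in enumerate(seq[start:]) if i % n < m)
        out.extend(frame[i:i + k] for i in range(len(frame) - k + 1))
    return out


seq = "AGTCGTCAGTC"
print(skipmer_span(3, 1, 3))                    # 7
print(sorted(skipmers_on_the_fly(seq, 3, 1, 3)))   # ['ACC', 'CCT', 'GAC', 'GGA', 'TTG']
print(sorted(skipmers_frame_first(seq, 3, 1, 3)))  # ['ACC', 'CCT', 'GAC', 'GGA', 'TTG']
```

In this toy, both functions return the same k-mers for m=1. For m>1 the frame-first version can also start a window in the middle of a cycle, so the two lists need not be identical.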

mr-eyes · 2024-11-21T01:35:12Z

> Got it. I think this shouldn't be too hard if we get a good implementation in, and we could have users specify m= and n= in the param string. I think I'll probably leave this to the future, but I can try to add m,n variables in to make future changes easier.

Parameterizing it should be an ideal solution, yes! Thank you!

> After reading your implementation, it seems that the main difference is that you take the entire sequence and skipmerize it (remove the skipped bases), then take k-mers/hashes from that sequence as usual. Is that right? Was that a pretty significant speedup compared with just generating skipmers as you go?

I haven't documented any benchmark here, but I believe I did it that way for performance.

bluegenes · 2024-11-21T21:13:42Z

> Parameterizing it should be an ideal solution, yes! Thank you!

Hey Mo! Reading through the 2017 skipmer paper again, they note that triplet (n=3) skipmer patterns performed best, namely m=2,n=3 and m=1,n=3. Do you have a good argument for allowing more patterns than that?

I'm using the hashfunctions (moltype) enum to build sketches and ensure that only compatible sketches are comparable down the road. We don't want incompatible skipmer sketches to be compared. I think unless we currently have evidence to show other combos are useful, I may just enable these two and make them two enums, e.g. Murmur64Skipm2n3 and Murmur64Skipm1n3 or similar. I don't think there's any reason we couldn't add more later.

Open to other ideas, though!
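As a rough illustration of the "one enum variant per skipmer pattern" idea, here is a Python sketch. The Rust variants proposed above are `Murmur64Skipm2n3` and `Murmur64Skipm1n3`; everything else here (`SketchParams`, `check_compatible`, the string values) is hypothetical and only shows how the variant could gate comparisons.

```python
from dataclasses import dataclass
from enum import Enum


class HashFunctions(Enum):
    """Hypothetical Python mirror of the hash function / moltype enum."""
    MURMUR64_DNA = "dna"
    MURMUR64_SKIPM1N3 = "skipm1n3"
    MURMUR64_SKIPM2N3 = "skipm2n3"


@dataclass(frozen=True)
class SketchParams:
    ksize: int
    hash_function: HashFunctions


def check_compatible(a: SketchParams, b: SketchParams) -> None:
    """Refuse to compare sketches built with different skipmer patterns or ksizes."""
    if a.hash_function is not b.hash_function:
        raise ValueError(
            f"incompatible hash functions: {a.hash_function.value} vs {b.hash_function.value}"
        )
    if a.ksize != b.ksize:
        raise ValueError(f"incompatible ksizes: {a.ksize} vs {b.ksize}")


m1n3 = SketchParams(ksize=21, hash_function=HashFunctions.MURMUR64_SKIPM1N3)
m2n3 = SketchParams(ksize=21, hash_function=HashFunctions.MURMUR64_SKIPM2N3)
check_compatible(m1n3, m1n3)      # same pattern and ksize: no error
try:
    check_compatible(m1n3, m2n3)  # different patterns: rejected
except ValueError as err:
    print(err)
```

Adding another pattern later would mean adding one more variant, which matches the "we could add more later" plan.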

mr-eyes · 2024-11-21T21:35:57Z

> > Parameterizing it should be an ideal solution, yes! Thank you!
>
> Hey Mo! Reading through the 2017 skipmer paper again, they note that triplet (n=3) skipmer patterns performed best, namely m=2,n=3 and m=1,n=3. Do you have a good argument for allowing more patterns than that?
>
> I'm using the hashfunctions (moltype) enum to build sketches and ensure that only compatible sketches are comparable down the road. We don't want incompatible skipmer sketches to be compared. I think unless we currently have evidence to show other combos are useful, I may just enable these two and make them two enums, e.g. Murmur64Skipm2n3 and Murmur64Skipm1n3 or similar. I don't think there's any reason we couldn't add more later.
>
> Open to other ideas, though!

I don't really have a use case in mind for different configurations. So this is good enough for the implementation, and as you said, we could add more later if needed.

Make skipmers robust, but keep #3395 functional in the meantime.

This PR:

- enables a second skipmer type, so we have m1n3 in addition to m2n3
- switches to a reading frame approach for both translation + skipmers, which means we first build the reading frame, then kmerize, rather than building kmers + translating/skipping on the fly
- avoids the "extended length" needed for skipping on the fly

Since this changes the `SeqToHashes` strategy a bit, there's one python test where we now see a different error.

Future thoughts:

- with the new structure, it would be straightforward to add validation to exclude protein k-mers with invalid amino acids (`X`). I guess I'm not entirely sure what happens to those atm...

bluegenes · 2024-12-12T22:01:50Z

@ctb @mr-eyes @luizirber ready for review.

ctb · 2024-12-13T14:16:21Z

On a first pass, looks good to me! I'm not thrilled with the Murmur64Skipm1n3 abbreviation style and would prefer something longer, but I don't have good suggestions and am not particularly against it, either; so, if you had a longer option in mind that you like, please consider it :)

Now I'm curious how hard it would be to add these to the Python layer 🤔

ctb · 2024-12-16T15:10:00Z

@luizirber any concerns, at least a hot-take level?

bluegenes · 2024-12-18T18:16:34Z

ref #659 -- I think it might need minor modification for this...

bluegenes added 2 commits November 12, 2024 16:13

try skipmers

910b76c

Merge branch 'latest' into try-skipmers

117cd0d

bluegenes and others added 6 commits November 13, 2024 16:57

use hash fn encoding instead; init testing

2bac5fd

fix test

fdd1a44

Merge branch 'latest' into try-skipmers

55435d1

cont

ff33f7b

test

11978ed

skipmers must be at least ksize 3

2de520e

bluegenes mentioned this pull request Nov 20, 2024

EXP: skipmer sketching sourmash-bio/sourmash_plugin_branchwater#531

Open

Merge branch 'latest' into try-skipmers

cb2e7e1

clippy fix

d4a6200

Merge branch 'latest' into try-skipmers

d7f59cf

bluegenes mentioned this pull request Dec 3, 2024

WIP: skipmer improvements #3415

Merged

bluegenes changed the title ~~EXP: skipmers~~ MRG: add skipmers; switch to reading frame approach for translation, skipmers Dec 12, 2024

Merge branch 'latest' into try-skipmers

25a16ae

bluegenes added 3 commits December 12, 2024 14:58

try fix branchwater incompat

6359e16

roll back web-sys too

43e069c

add js-sys, web-sys to dependabot ignore

6008646

ctb added the rust label Dec 13, 2024

bluegenes and others added 4 commits December 16, 2024 11:44

add more tests

b20479c

more tests

e5ea8af

test protein rf display

8a0600d

Merge branch 'latest' into try-skipmers

8cb0bea
