Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sourmash scripts singlesketch output differs from sourmash sketch #538

Open
peterjc opened this issue Dec 4, 2024 · 7 comments
Open

sourmash scripts singlesketch output differs from sourmash sketch #538

peterjc opened this issue Dec 4, 2024 · 7 comments

Comments

@peterjc
Copy link

peterjc commented Dec 4, 2024

I would expect sourmash scripts singlesketch and sourmash sketch to give identical signature files. However, they differ slightly (in the metadata).

Using viral genome https://github.com/pyani-plus/pyani-plus/blob/main/tests/fixtures/viral_example/OP073605.fasta as a test case. This is deliberately not in the current directory to clarify the issue (see below). Testing here on macOS with Python 3.12:

sourmash sketch dna -p 'k=31,scaled=300' viral_example/OP073605.fasta -o OP073605.sketch.sig 

== This is sourmash version 4.8.11. ==
== Please cite Irber et. al (2024), doi:10.21105/joss.06830. ==

computing signatures for files: viral_example/OP073605.fasta
Computing a total of 1 signature(s) for each input.
... reading sequences from viral_example/OP073605.fasta
calculated 1 signatures for 1 sequences in viral_example/OP073605.fasta
saved 1 signature(s) to 'OP073605.sketch.sig'. Note: signature license is CC0.
sourmash scripts singlesketch -I DNA -p 'k=31,scaled=300' viral_example/OP073605.fasta -o OP073605.singlesketch.sig 

== This is sourmash version 4.8.11. ==
== Please cite Irber et. al (2024), doi:10.21105/joss.06830. ==

=> sourmash_plugin_branchwater 0.9.11; cite Irber et al., doi: 10.1101/2022.11.02.514947

params: ['k=31,scaled=300,dna']
sketching file 'viral_example/OP073605.fasta' (DNA) with params 'k=31,scaled=300,dna' and name 'OP073605.fasta' using a single thread
calculated 1 signatures for 1 sequences in viral_example/OP073605.fasta
...singlesketch is done! results in 'OP073605.singlesketch.sig'

We expect the signatures to be identical, but they are not:

diff OP073605.sketch.sig OP073605.singlesketch.sig  
1c1
< [{"class":"sourmash_signature","email":"","hash_function":"0.murmur64","filename":"viral_example/OP073605.fasta","license":"CC0","signatures":[{"num":0,"ksize":31,"seed":42,"max_hash":61489146912365176,"mins":[223359313939310,255902985413107,268549649144099,277155917919381,292071552996963,444322978038914,1616280023507850,3338671461323924,3594648714110559,3767382997029753,4868685502036020,5642382402699381,5700777553149780,5972184149542315,6326728594480526,6338102459668065,6838640760708711,6905228029855391,7056722554158822,7083895583502315,7122767561569484,7490631495995146,7624321967367682,8290078934157253,8504151967019293,8531270803413845,8717589411516309,8782186786512532,9655844910273299,9801312498245497,10209573619716001,10408846965293042,10507208265957512,10609932742420686,10729976505865435,10759310176931332,11233350849067448,11331425680613431,11711745670093358,12472363265616169,12722779547192287,13405548663537054,13847215456149809,15041634566599075,15543439418800804,15811858956390139,16265171681686249,16334217997760957,16354114234807245,16490909594784257,16640218161331565,17525036616496808,17719265726961591,17893294877791227,19366071964868141,20088790250464445,20419785472256286,20667117500119697,20941542834691179,21019437484675754,21058005145139261,21794377672504840,22518964677804040,23076327295556434,23077941487370526,23224076542621517,23824069078472694,23959923543888189,24867670871191163,25036812798791158,25569903435819362,25863942190740389,26195141647980582,26360799809443306,26577470667397182,27301878787922600,27857720679982305,29230098017264601,30711259692148281,31272392498980882,31733185105963213,32056899553713148,32362935566690854,32591563100768670,32636101274110673,33029824628758318,33144758166638682,34184661051294128,34412533422473598,35106600441884417,35637777702266964,35648671620141666,36002518049757271,36464333068958669,36471775348901262,36481272099771423,36657555677080860,36759815633975038,36815885555406920,37169409193713995,37688325078557754,38030141248893033,38141402202973639,38267700422845032,38297943632893068,38878791748781861,39349699755120026,39360393585236754,39459108645090597,40281021913814517,40704204521500943,40722165817366063,41735319249928298,41905945525144455,42162537580448710,42284876976660591,42453798040381783,42687390066347756,42717102413904164,42722239161607688,42983158138618876,43197066353860896,43288796269537732,43730024918430635,43849097058857837,44216749845806884,44417115582010216,44493993723316576,44631197968284924,44750023189096775,44829980260841425,45058678011578378,45852195707351631,45935444527533915,46434225938123577,46685021396965900,46997356559997252,47918040042351454,47942068430343494,48261739483325728,48413851538054504,48598642419960020,48677883696808794,49131444174370685,49274650486150772,49589074888364950,49707802113686359,50211081817285644,50249815009869785,50466116901561557,50481308429691029,50576497643378616,50794910100360829,50951311995651054,51259682933467668,51359628524153101,51739818876174155,52552397806122612,52771404823192343,53103572518491137,53285113143935373,53305079721659707,53364013318216039,53390979069982797,53862843851759723,53913787171437978,54082122717633960,54210389765451613,54604121376089049,55296957247028152,55789214349096241,56626338220073205,57424616189959906,57633875222488487,57780362742444893,57927038037401643,58357800680333129,58767990981390156,58815063282682639,58886925963663061,59959524285991627,60877100994411872,61052699136230028],"md5sum":"4078d475da429726799c896ec1bf17ec","molecule":"DNA"}],"version":0.4}]
\ No newline at end of file
---
> [{"class":"sourmash_signature","email":"","hash_function":"0.murmur64","filename":"viral_example/OP073605.fasta","name":"OP073605.fasta","license":"CC0","signatures":[{"num":0,"ksize":31,"seed":42,"max_hash":61489146912365176,"mins":[223359313939310,255902985413107,268549649144099,277155917919381,292071552996963,444322978038914,1616280023507850,3338671461323924,3594648714110559,3767382997029753,4868685502036020,5642382402699381,5700777553149780,5972184149542315,6326728594480526,6338102459668065,6838640760708711,6905228029855391,7056722554158822,7083895583502315,7122767561569484,7490631495995146,7624321967367682,8290078934157253,8504151967019293,8531270803413845,8717589411516309,8782186786512532,9655844910273299,9801312498245497,10209573619716001,10408846965293042,10507208265957512,10609932742420686,10729976505865435,10759310176931332,11233350849067448,11331425680613431,11711745670093358,12472363265616169,12722779547192287,13405548663537054,13847215456149809,15041634566599075,15543439418800804,15811858956390139,16265171681686249,16334217997760957,16354114234807245,16490909594784257,16640218161331565,17525036616496808,17719265726961591,17893294877791227,19366071964868141,20088790250464445,20419785472256286,20667117500119697,20941542834691179,21019437484675754,21058005145139261,21794377672504840,22518964677804040,23076327295556434,23077941487370526,23224076542621517,23824069078472694,23959923543888189,24867670871191163,25036812798791158,25569903435819362,25863942190740389,26195141647980582,26360799809443306,26577470667397182,27301878787922600,27857720679982305,29230098017264601,30711259692148281,31272392498980882,31733185105963213,32056899553713148,32362935566690854,32591563100768670,32636101274110673,33029824628758318,33144758166638682,34184661051294128,34412533422473598,35106600441884417,35637777702266964,35648671620141666,36002518049757271,36464333068958669,36471775348901262,36481272099771423,36657555677080860,36759815633975038,36815885555406920,37169409193713995,37688325078557754,38030141248893033,38141402202973639,38267700422845032,38297943632893068,38878791748781861,39349699755120026,39360393585236754,39459108645090597,40281021913814517,40704204521500943,40722165817366063,41735319249928298,41905945525144455,42162537580448710,42284876976660591,42453798040381783,42687390066347756,42717102413904164,42722239161607688,42983158138618876,43197066353860896,43288796269537732,43730024918430635,43849097058857837,44216749845806884,44417115582010216,44493993723316576,44631197968284924,44750023189096775,44829980260841425,45058678011578378,45852195707351631,45935444527533915,46434225938123577,46685021396965900,46997356559997252,47918040042351454,47942068430343494,48261739483325728,48413851538054504,48598642419960020,48677883696808794,49131444174370685,49274650486150772,49589074888364950,49707802113686359,50211081817285644,50249815009869785,50466116901561557,50481308429691029,50576497643378616,50794910100360829,50951311995651054,51259682933467668,51359628524153101,51739818876174155,52552397806122612,52771404823192343,53103572518491137,53285113143935373,53305079721659707,53364013318216039,53390979069982797,53862843851759723,53913787171437978,54082122717633960,54210389765451613,54604121376089049,55296957247028152,55789214349096241,56626338220073205,57424616189959906,57633875222488487,57780362742444893,57927038037401643,58357800680333129,58767990981390156,58815063282682639,58886925963663061,59959524285991627,60877100994411872,61052699136230028],"md5sum":"4078d475da429726799c896ec1bf17ec","molecule":"DNA"}],"version":0.4}]
\ No newline at end of file

The only difference is singlesketch has added ,"name":"OP073605.fasta" which seems almost redundant given the "filename":"viral_example/OP073605.fasta" entry.

@ctb
Copy link
Collaborator

ctb commented Dec 4, 2024

Try adding --name to sourmash sketch dna.

@peterjc
Copy link
Author

peterjc commented Dec 4, 2024

That works, thanks. Then the two signatures are identical.

So, if you want consistency, should sourmash scripts singlesketch drop the default name behaviour?

Or, should the same default behaviour be added to sourmash sketch?

If not we can just close this issue.

@ctb
Copy link
Collaborator

ctb commented Dec 4, 2024

I'll think on't!

One thing that we may need to write more clearly in the docs here: the branchwater plugin is not intended to be fully identical to sourmash in its particulars. Our medium term goal is to integrate the speed improvements from branchwater into sourmash proper, but that involves a still rather tremendous amount of detail work 😭 .

The reasons why are that sourmash has much more functionality and many more tests as well, for many more edge cases. Satisfying all of those edge cases will take a lot! In the meantime, the branchwater plugin supports massive speed advantages and is helping us identify exactly what we need to add to the higher level sourmash Rust API. It's also served as a testbed for some of us to learn Rust ;).

That all having been said: the core comparisons should yield identical results - search and containment and gather. If they don't, that is a bug! And if there are features in sourmash that you need in branchwater, please let us know!

@peterjc
Copy link
Author

peterjc commented Dec 11, 2024

After some head-scratching I realised that this "trivial" difference in the signatures is important with branchwater. This might be better filed as a separate issue (just say the word).

Classical sourmash compare will accept either form of *.sig file via all_sigs.csv (with and without the name field):

sourmash compare all_sigs.csv --csv sourmash.csv --estimate-ani --containment -k=31

== This is sourmash version 4.8.11. ==
== Please cite Irber et. al (2024), doi:10.21105/joss.06830. ==

loaded 3 signatures total.

0-MGV-GENOME-0264...    [1.    0.996 0.984]
1-MGV-GENOME-0266...    [0.998 1.    0.982]
2-OP073605.fasta        [1.    0.996 1.   ]
min similarity in matrix: 0.982

However, manysearch will only accept the signatures with the name field! e.g.

sourmash scripts manysearch -m DNA -o manysearch.csv all_sigs.csv all_sigs.csv --scaled 300 -k 31 --debug

== This is sourmash version 4.8.11. ==
== Please cite Irber et. al (2024), doi:10.21105/joss.06830. ==

=> sourmash_plugin_branchwater 0.9.11; cite Irber et al., doi: 10.1101/2022.11.02.514947

ksize: 31 / scaled: 300 / moltype: DNA / threshold: 0.01
searching all sketches in 'all_sigs.csv' against 'all_sigs.csv' using 8 threads
selection scaled: Some(300)
Reading query(s) from: 'all_sigs.csv'
Loaded 3 query signature(s)
Reading search(s) from: 'all_sigs.csv'
Loaded 3 search signature(s)
DONE. Processed 3 search sigs
...manysearch is done! results in 'manysearch.csv'

query             p_genome avg_abund   p_metag   metagenome name
--------          -------- ---------   -------   ---------------
MGV-GENOME-026457  100.0%     N/A        N/A     MGV-GENOME-026457
MGV-GENOME-026457   99.1%     N/A        N/A     OP073605.fasta
MGV-GENOME-026457   93.7%     N/A        N/A     MGV-GENOME-026645
MGV-GENOME-026645   88.9%     N/A        N/A     MGV-GENOME-026457
MGV-GENOME-026645   88.9%     N/A        N/A     OP073605.fasta
MGV-GENOME-026645  100.0%     N/A        N/A     MGV-GENOME-026645
OP073605.fasta      60.1%     N/A        N/A     MGV-GENOME-026457
OP073605.fasta     100.0%     N/A        N/A     OP073605.fasta
OP073605.fasta      56.8%     N/A        N/A     MGV-GENOME-026645

If I generated the *.sig files like this (without setting a name), it fails:

rm *.sig *.csv sourmash sketch dna -p "k=31,scaled=300" OP073605.fasta -o OP073605.sig
...sourmash sketch dna -p "k=31,scaled=300" MGV-GENOME-0264574.fas -o MGV-GENOME-0264574.sig
...sourmash sketch dna -p "k=31,scaled=300" MGV-GENOME-0266457.fna -o MGV-GENOME-0266457.sig
...sourmash sig collect --quiet -F csv -o all_sigs.csv *.sigLoading signature information from MGV-GENOME-0264574.sig.
Loading signature information from MGV-GENOME-0266457.sig.
Loading signature information from OP073605.sig.
saved 3 manifest rows to 'all_sigs.csv'sourmash scripts manysearch -m DNA -o manysearch.csv all_sigs.csv all_sigs.csv --scaled 300 -k 31 --debug

== This is sourmash version 4.8.11. ==
== Please cite Irber et. al (2024), doi:10.21105/joss.06830. ==

=> sourmash_plugin_branchwater 0.9.11; cite Irber et al., doi: 10.1101/2022.11.02.514947

ksize: 31 / scaled: 300 / moltype: DNA / threshold: 0.01
searching all sketches in 'all_sigs.csv' against 'all_sigs.csv' using 8 threads
selection scaled: Some(300)
Reading query(s) from: 'all_sigs.csv'
Error: No query signatures loaded, exiting.

The error message is unhelpful here.

@ctb
Copy link
Collaborator

ctb commented Dec 11, 2024

well that's a straight up bug, methinks ;).

@ctb
Copy link
Collaborator

ctb commented Dec 15, 2024

Fascinating. The problem only occurs when you use a manifest, it seems. It works fine with a .sig or a .sig.zip file.

I've tracked it down in Collection::intersect_manifest() in sourmash, but even though the code is simple I don't understand where it's going awry 😅 .

@ctb
Copy link
Collaborator

ctb commented Dec 15, 2024

(I've figured it out - sourmash-bio/sourmash#3434)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants