Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pairwise producing empty csv #368

Open
AroneyS opened this issue Jun 24, 2024 · 6 comments
Open

Pairwise producing empty csv #368

AroneyS opened this issue Jun 24, 2024 · 6 comments

Comments

@AroneyS
Copy link

AroneyS commented Jun 24, 2024

I am using sourmash to compare some custom kmer sets. I've used sourmash compare successfully, but in scaling up, I hit memory issues. To fix this, I tried sourmash scripts pairwise, which is much faster and less memory intensive in general, but I am getting some odd results in testing.

I have an example signature-set with 5 signatures, two identical. When I run sourmash compare, I get expected results, a Jaccard distance of 0 between the identical signatures and other comparisons as expected. When I run sourmash scripts pairwise, I get an empty csv. Any ideas?

example.sig: [{"class":"sourmash_signature","email":"","hash_function":"0.murmur64","filename":null,"name":"sample_1","license":"CC0","signatures":[{"num":0,"ksize":60,"seed":42,"max_hash":18446744073709551615,"mins":[4635994617403463936,14326303558305821343],"md5sum":"2ce198fcc7ea357fb5069a0e9448516c","molecule":"DNA"}],"version":0.4},{"class":"sourmash_signature","email":"","hash_function":"0.murmur64","filename":null,"name":"sample_5","license":"CC0","signatures":[{"num":0,"ksize":60,"seed":42,"max_hash":18446744073709551615,"mins":[749074125177198013,4736707972582783194,14326303558305821343],"md5sum":"d915856b77f4c6b57b612093d402decc","molecule":"DNA"}],"version":0.4},{"class":"sourmash_signature","email":"","hash_function":"0.murmur64","filename":null,"name":"sample_2","license":"CC0","signatures":[{"num":0,"ksize":60,"seed":42,"max_hash":18446744073709551615,"mins":[4635994617403463936,14326303558305821343],"md5sum":"2ce198fcc7ea357fb5069a0e9448516c","molecule":"DNA"}],"version":0.4},{"class":"sourmash_signature","email":"","hash_function":"0.murmur64","filename":null,"name":"sample_3","license":"CC0","signatures":[{"num":0,"ksize":60,"seed":42,"max_hash":18446744073709551615,"mins":[428577040998145274,749074125177198013,4736707972582783194],"md5sum":"f0a7bceac43003ed9287bec5b5003c22","molecule":"DNA"}],"version":0.4}]

Sourmash compare command

sourmash compare example.sig -o test.mat -k 60 --distance-matrix

Result

0-sample_1              [0.   0.75 0.   1.  ]
1-sample_5              [0.75 0.   0.75 0.5 ]
2-sample_2              [0.   0.75 0.   1.  ]
3-sample_3              [1.  0.5 1.  0. ]

Sourmash pairwise command

sourmash scripts pairwise example.sig -o test.csv -k 60

Log, produces empty file

== This is sourmash version 4.8.9. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

=> sourmash_plugin_branchwater 0.9.5; cite Irber et al., doi: 10.1101/2022.11.02.514947

ksize: 60 / scaled: 1000 / moltype: DNA / threshold: 0.01
pairwise-comparing all sketches in 'example.sig' using 128 threads
Reading analysis(s) from: 'example.sig'
Loaded 4 analysis signature(s)
DONE. Processed 6 comparisons
...pairwise is done! results in 'test.csv'
@ctb
Copy link
Collaborator

ctb commented Jun 25, 2024

well, that is very weird - I have some ideas about identical sketches maybe causing problems with the way we load them - but I am traveling and can't debug in detail at the moment.

One question for you - have you tried using --write-all and does that fix or change anything useful?

potentially related issues / comments -

thank you for reporting this!

@AroneyS
Copy link
Author

AroneyS commented Jun 25, 2024

Adding --write-all produces the following csv. So, just the self-comparisons.

query_name,query_md5,match_name,match_md5,containment,max_containment,jaccard,intersect_hashes
sample_3,f0a7bceac43003ed9287bec5b5003c22,sample_3,f0a7bceac43003ed9287bec5b5003c22,1.0,1.0,1.0,0.0
sample_2,2ce198fcc7ea357fb5069a0e9448516c,sample_2,2ce198fcc7ea357fb5069a0e9448516c,1.0,1.0,1.0,0.0
sample_5,d915856b77f4c6b57b612093d402decc,sample_5,d915856b77f4c6b57b612093d402decc,1.0,1.0,1.0,0.0
sample_1,2ce198fcc7ea357fb5069a0e9448516c,sample_1,2ce198fcc7ea357fb5069a0e9448516c,1.0,1.0,1.0,0.0

@AroneyS
Copy link
Author

AroneyS commented Jun 25, 2024

If I add an extra hash to "mins" in both sample_1 and sample_3, then I get a comparison between sample_1 and sample_3 in the output. Though it has a Jaccard of 1.0 even though the threshold is the default at 0.01?

New csv

query_name,query_md5,match_name,match_md5,containment,max_containment,jaccard,intersect_hashes
sample_1,2ce198fcc7ea357fb5069a0e9448516c,sample_3,f0a7bceac43003ed9287bec5b5003c22,1.0,1.0,1.0,1.0

New signature

[{"class":"sourmash_signature","email":"","hash_function":"0.murmur64","filename":null,"name":"sample_1","license":"CC0","signatures":[{"num":0,"ksize":60,"seed":42,"max_hash":18446744073709551615,"mins":[4635994617403463936,14326303558305821343,1],"md5sum":"2ce198fcc7ea357fb5069a0e9448516c","molecule":"DNA"}],"version":0.4},{"class":"sourmash_signature","email":"","hash_function":"0.murmur64","filename":null,"name":"sample_3","license":"CC0","signatures":[{"num":0,"ksize":60,"seed":42,"max_hash":18446744073709551615,"mins":[428577040998145274,749074125177198013,4736707972582783194,1],"md5sum":"f0a7bceac43003ed9287bec5b5003c22","molecule":"DNA"}],"version":0.4},{"class":"sourmash_signature","email":"","hash_function":"0.murmur64","filename":null,"name":"sample_2","license":"CC0","signatures":[{"num":0,"ksize":60,"seed":42,"max_hash":18446744073709551615,"mins":[4635994617403463936,14326303558305821343],"md5sum":"2ce198fcc7ea357fb5069a0e9448516c","molecule":"DNA"}],"version":0.4},{"class":"sourmash_signature","email":"","hash_function":"0.murmur64","filename":null,"name":"sample_5","license":"CC0","signatures":[{"num":0,"ksize":60,"seed":42,"max_hash":18446744073709551615,"mins":[749074125177198013,4736707972582783194,14326303558305821343],"md5sum":"d915856b77f4c6b57b612093d402decc","molecule":"DNA"}],"version":0.4}]

@ctb
Copy link
Collaborator

ctb commented Jun 26, 2024

I took a look at this just now - based on the sourmash compare output above, the first problem is that you have sketches with a jaccard of 0.75 that aren't being reported!

I did a conversion and it looks like you are using a scaled of 1, right? Try specifying -s 1 to pairwise and see if that resolves this first issue. I suspect that sourmash may be automatically downsampling to scaled=1000 which results in empty sketches.

(I'll debug this more myself when I have a moment ;))

@AroneyS
Copy link
Author

AroneyS commented Jun 26, 2024

Perfect, thanks! That works. I now see scaled: 1000 in the log haha. Although the jaccard values seem inverted compared to sourmash compare.

query_name,query_md5,match_name,match_md5,containment,max_containment,jaccard,intersect_hashes
sample_2,2ce198fcc7ea357fb5069a0e9448516c,sample_5,d915856b77f4c6b57b612093d402decc,0.5,0.5,0.25,1.0
sample_3,f0a7bceac43003ed9287bec5b5003c22,sample_5,d915856b77f4c6b57b612093d402decc,0.6666666666666666,0.6666666666666666,0.5,2.0
sample_1,2ce198fcc7ea357fb5069a0e9448516c,sample_2,2ce198fcc7ea357fb5069a0e9448516c,1.0,1.0,1.0,2.0
sample_1,2ce198fcc7ea357fb5069a0e9448516c,sample_5,d915856b77f4c6b57b612093d402decc,0.5,0.5,0.25,1.0

@ctb
Copy link
Collaborator

ctb commented Jun 27, 2024

thanks for this - we really do need to document the distance vs similarity thing better! ref sourmash-bio/sourmash#2406

please leave this issue open for a bit - I'm going to create some new issues to fix and document the various things you've highlighted for us! thanks again!

  • "silent" downsampling should be less silent
  • comparing/using empty sketches should not be done silently

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants