`sourmash compare` runs out of memory on large comparisons #3134

yuzie0314 · 2024-04-30T03:49:03Z

Hi @ctb the sourmash author,

Currently we are working on using your tool to find the representative MAGs within a customed data set assembled from several deeply sequenced stool shotgun samples. Those MAGs actually are classified as the same family level using gtdbtk reference genomes.
We have up to 14,500 genomes in this data set, and we want to compute an ani pair-wise matrix using the following command.

sourmash compare -p 8 -k 31 --ani -o ani_matrix.numpy --csv ani_matrix.csv cluster_mash/*.sig

However, after around 1 hr processing, we got a weird error called BrokenPipeError, so we started to think if there is any limitation when using sourmash compare to generate an ani matrix. I think this kind of error is dereived from out off memory, correct me if I am wrong.

P.S. We are using 16 cores and 32 Gb ram, aws EC2 Linux.
we also saw a message called Killedison for index 886 done in 9.36945 seconds, which might be another reason why this error happend.
Current version is v4.8.2 sourmash.

The text was updated successfully, but these errors were encountered:

ctb · 2024-04-30T13:47:12Z

hi @yuzie0314 - yes, the Killed means sourmash used too much memory.

there is a long issue #2299 about this. we are still in a bit of a confused state in terms of recommendations, but the gist of that issue is:

you should be able to use pairwise and cluster from the branchwater plugin to do your clustering very quickly and in very low memory.

@bluegenes any words of wisdom here?

yuzie0314 · 2024-05-13T08:50:48Z

Hi @ctb,
I learned that commands pairwise and cluster might be useful, but I have some doubts about the results from them. The main products from sourmash compare is a pairwise ani matrix in csv and a numpy pickle file + label text file, so could we use pairwise to generate this? Another question is that I am a little confused about cluster command, what is the main cutoff/ threshold/ metrix to cluster genomes/signatures? is the clustering result based on ANI similarity or not merely rely on ANI? could you help us to clarify this?

This is a really good news to us,
since we want to improve the speed of our customed pipelines.
Yuzie

yuzie0314 · 2024-05-14T07:59:09Z

Hi @ctb the author,

I used the following command and generated a csv result.

sourmash scripts pairwise --ani \
        --cores 16 \
        --output ani_matrix.csv \
        --scaled 1000000 \
        --ksize 21 \
        sig.path \
        --write-all

The result is different from sourmash compare method:

Thanks for your help,
Yuzie

ctb · 2024-05-14T14:02:56Z

matrix vs CSV output

hi @yuzie0314, yep, the output of pairwise is a different format - this is because the numpy matrix format is not a sparse matrix, and for large comparisons/large collections it will have a lot of zeroes in it. In this case the CSV format is better, because it only stores pairs where there is a match.

Please see sourmash-bio/sourmash_plugin_branchwater#198 for a script that converts the CSV file into a numpy matrix. I haven't tried it yet myself, I'm afraid, but if you run into problems please feel free to post here and we'll see what we can do!

how does `cluster` work?

here is the basic description from sourmash-bio/sourmash_plugin_branchwater#234 -

clusteruses rustworkx-core (which internally uses petgraph) to build a graph, adding edges between nodes when the similarity exceeds the user-defined threshold. It can work on any of the similarity columns output by pairwise or multisearch, and will add all nodes to the graph to preserve singleton 'clusters' in the output.

from the docs -

cluster takes a --similarity_column argument to specify which of the similarity columns, with the following choices: containment, max_containment, jaccard, average_containment_ani, maximum_containment_ani. All values should be input as fractions (e.g. 0.9 for 90%)

per @mr-eyes in sourmash-bio/sourmash_plugin_branchwater#252,

Community Detection: The current clustering algorithm is weakly_connected_component, @bluegenes tried it before with kSpider, and -as far as I remember- it did a great job in the ANI-based clustering of the GTDB-207. Here, I propose adopting community detection methods, which have been proven very useful in DBRetina, but I haven't tried them on DNA data.

ordination

we have a script here that can build an ordination view (MDS) from a matrix, and, with some small edits, should work on the output of pairwise - lmk if this is something you would be interested in.

further questions :)

what's your end goal? then we can see if we can help you get there!

cluster visualization?
and/or cluster extraction?

and/or something else?

ctb · 2024-05-19T15:13:11Z

hi @yuzie0314 I got inspired by your question (and also by some of my own research needs ;)) and built a plugin that I think will help you - see https://github.com/sourmash-bio/sourmash_plugin_betterplot/.

Specifically:

the command sourmash scripts pairwise_to_comparison will convert the CSV output of pairwise to standard sourmash compare format. So you can do fast comparisons with pairwise and then turn them into regular matrices.
you might also be interested in the --cut-line and cluster output functionality :)
I mean I'm also kind of happy with the mds commands...

If you have suggestions or requests for further functionality, please let me know! It's easy and fun to add new stuff to this plugin!

yuzie0314 · 2024-05-20T04:14:43Z

wow, a little complicated answers, I would need some time to test on my environment, but thanks for your contribution again, really helpful @ctb.

I think the solutions you provided is worth us to explore. Will update you once we have any doubts and good news.

yuzie0314 · 2024-05-20T06:40:45Z

hi @yuzie0314 I got inspired by your question (and also by some of my own research needs ;)) and built a plugin that I think will help you - see https://github.com/sourmash-bio/sourmash_plugin_betterplot/.

Specifically:

the command sourmash scripts pairwise_to_comparison will convert the CSV output of pairwise to standard sourmash compare format. So you can do fast comparisons with pairwise and then turn them into regular matrices.

you might also be interested in the --cut-line and cluster output functionality :)

I mean I'm also kind of happy with the mds commands...

If you have suggestions or requests for further functionality, please let me know! It's easy and fun to add new stuff to this plugin!

Hi @ctb,
I had a quick test on your command sourmash compare and sourmash scripts pairwise --ani combined with sourmash scripts pairwise_to_compare using four metagenomes.

The results from sourmash compare command alone:

The results from sourmash scripts pairwise --ani combined with sourmash scripts pairwise_to_compare:

As you can notice that compare would outputs the pair-wise ani comparison results even the ani score is low, but for the new combination commands we would only obtain partial information, which would be a little hard for us to interpret the results from the sourmash scripts pairwise. We also tried to tweak the -t 0 and expected that whether sourmash would greedily show the results between different comparison, but the results were all the same as we didn't assigned any threshold.

The commands we used:
1. sourmash compare -p 8 --ani -o ani_matrix.numpy --csv ani_matrix.csv cluster_mash/*

2. sourmash scripts pairwise --ani \
        --cores 16 \
        --output ani_pairwise.csv \
        --scaled 1000000 \
        --ksize 21 \
        sig.path \
        --write-all \
        -t 0

3. sourmash scripts pairwise_to_compare ani_pairwise.csv -o ani_matrix.numpy

ctb · 2024-05-20T13:05:28Z

That's a great test! You should get identical results (although I will confess I have not tried it myself). I will try it out on my own set of data, but - I'm curious - why set --scaled 1000000 with pairwise? That would be my main guess as to why you're getting different results.

yuzie0314 · 2024-05-21T07:54:48Z

ohh ya,
I only wated to test whether adding --scaled 1000000 would change the results or not.
BTW, I also tested the command that dropped off this flag, the results are still different the original one.
The row and column are in the same order so we can clearly observed that the ani values in pairwise command are much lower than compare command.

Is there anything I missing? just let me know I would like to test in my environment.

The commands we used:
1. sourmash scripts pairwise --ani \
        --cores 16 \
        --output ani_pairwise.csv \
        --ksize 21 \
        sig.path \
        --write-all \
        -t 0

2. sourmash scripts pairwise_to_compare ani_pairwise.csv -o ani_matrix.numpy

ctb · 2024-05-21T13:51:58Z

I'll have to take a look. Have you compared the Jaccard index or containment matrices, rather than the ANI? I'm wondering if there's a difference in the ANI calculations - then Jaccard/containment would be the same, but ANI would be different. (Which would be a bug, just to be clear!)

yuzie0314 · 2024-05-29T01:27:23Z

Sorry for the late, we were only focusing on the ani comparison results, and didn't check other methods.
Is containment metric similar to ani? because from your document I learned that you have a containment ani column in gather results.

ctb · 2024-06-18T14:16:51Z

I am a little bit worried that the ANI numbers in sourmash_plugin_branchwater are incorrect - we are seeing differences in sourmash gather output per sourmash-bio/sourmash_plugin_branchwater#331 (comment). Sorry, just put 2+2 together and realized it may be an issue here! Will investigate and report back.

ctb · 2024-06-18T22:00:58Z

hi! I did some quick validation on a subset of Thermotoga genomes I had lying around.

`sourmash compare --ani --containment`

% sourmash compare /tmp/thermo.sig.zip -k 31 --ani --containment --labels-to /tmp/thermo.labels.csv

== This is sourmash version 4.8.10.dev0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loaded 3 signatures total.

0-CP000916.1 Ther...    [1.    0.915 0.925]
1-CP000702.1 Ther...    [0.914 1.    0.981]
2-CP000969.1 Ther...    [0.925 0.982 1.   ]

`sourmash scripts multisearch /tmp/thermo.sig.zip /tmp/thermo.sig.zip -k 31 -o /tmp/thermo.multisearch.csv --ani`

so the numbers look pretty similar.

I'm using multisearch here because pairwise doesn't return self-comparisons.

But, when I use pairwise, the numbers hold up:

aaaand... I think I've figured it out 😭 . The betterplot plugin command pairwise_to_matrix only pulls out the Jaccard column, not the ANI column.

Sigh. OK, I added the -u/--use-column to pairwise_to_matrix in the latest version of betterplot (which you can install with pip install -U sourmash_plugin_betterplot.

Now, when I run:

sourmash scripts pairwise thermo.sig.zip -k 31 -o thermo.pairwise.csv --ani --write-all
sourmash scripts pairwise_to_matrix thermo.pairwise.csv -o thermo.pairwise.mat -u average_containment_ani

and then look at the matrix:

>>> import numpy
>>> numpy.load('thermo.pairwise.mat')
array([[1.        , 0.91432939, 0.98119423],
       [0.91432939, 1.        , 0.9250674 ],
       [0.98119423, 0.9250674 , 1.        ]])

it all looks good.

Apologies for this misunderstanding!

ctb · 2024-06-20T13:19:27Z

ok, I sat down and did a much more thorough evaluation over in sourmash-bio/sourmash_plugin_branchwater#366, for the average_containment_ani column.

This comparison generated sourmash compare --ani --avg-containment results and compared them to the average_containment column output by sourmash scripts pairwise on the same data.

They are identical 🎉

I'll close this issue when I add a mention of the pairwise command to the sourmash documentation. Happy to take more questions here or in a new issue!

yuzie0314 · 2024-09-05T08:22:29Z

Hi @ctb sorry for the late,
I tested on your command several times, which works fine in sourmash scripts pairwise command, but I found that when I used the same command namely sourmash scripts pairwise_to_matrix ani_pairwise_u.csv -o ani_matrix_u.numpy -u average_containment_ani the results are still the jacarrd? and I think I've update the package you asked me to update pip install -U sourmash_plugin_betterplot.

my command :

sourmash scripts pairwise --cores 16 --ksize 31 --output ani_pairwise_u.csv --ani --write-all sig.path
sourmash scripts pairwise_to_matrix ani_pairwise_u.csv -o ani_matrix_u.numpy -u **average_containment_ani**

The results of the original command sourmash compare -p 16 -k 31 --ani -o ani_matrix.numpy --csv ani_matrix.csv cluster_mash/*:

In addition, I also checked their labels are in the same order....so what I could image the reasons are:

The results from sourmash scripts pairwis and sourmash compare are not always the same
sourmash scripts pairwise_to_matrix still using the jacarrd values in sourmash scripts pairwis.

but chances are that I think the main reason might be the former, because I also wrote a very ugly py code to transfer the results from pairwise to np.matrix....got the same results as sourmash scripts pairwise_to_matrix.
My comparisons are quite large, it's around 15,000 near completed genomes.
If you want me to provide a subset of our data, please let me know.

P.S.
The version in my singularity image:
sourmash: 0.8.11
branchwater: 0.9.7
sourmash-plugin-betterplot: 0.4.4

ctb changed the title ~~BrokenPipeError: [Errno 32] Broken pipe happend in sourmash compare command~~ sourmash compare runs out of memory on large comparisons May 14, 2024

ctb mentioned this issue May 20, 2024

write automated tests for pairwise_to_compare sourmash-bio/sourmash_plugin_betterplot#24

Open

This was referenced Jun 18, 2024

differences between this plugin's gather and sourmash gather sourmash-bio/sourmash_plugin_branchwater#331

Closed

MRG: add use-column to pairwise_to_matrix sourmash-bio/sourmash_plugin_betterplot#41

Merged

ctb mentioned this issue Jun 20, 2024

WIP: check pairwise vs sourmash compare output sourmash-bio/sourmash_plugin_branchwater#366

Open

3 tasks

This was referenced Jun 20, 2024

how much memory does sourmash compare need? #2299

Open

Pairwise producing empty csv sourmash-bio/sourmash_plugin_branchwater#368

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`sourmash compare` runs out of memory on large comparisons #3134

`sourmash compare` runs out of memory on large comparisons #3134

yuzie0314 commented Apr 30, 2024

ctb commented Apr 30, 2024

yuzie0314 commented May 13, 2024

yuzie0314 commented May 14, 2024

ctb commented May 14, 2024

ctb commented May 19, 2024

yuzie0314 commented May 20, 2024 •

edited

Loading

yuzie0314 commented May 20, 2024

ctb commented May 20, 2024

yuzie0314 commented May 21, 2024

ctb commented May 21, 2024

yuzie0314 commented May 29, 2024

ctb commented Jun 18, 2024

ctb commented Jun 18, 2024

ctb commented Jun 20, 2024

yuzie0314 commented Sep 5, 2024 •

edited

Loading

sourmash compare runs out of memory on large comparisons #3134

sourmash compare runs out of memory on large comparisons #3134

Comments

yuzie0314 commented Apr 30, 2024

ctb commented Apr 30, 2024

yuzie0314 commented May 13, 2024

yuzie0314 commented May 14, 2024

ctb commented May 14, 2024

matrix vs CSV output

how does cluster work?

ordination

further questions :)

ctb commented May 19, 2024

yuzie0314 commented May 20, 2024 • edited Loading

yuzie0314 commented May 20, 2024

ctb commented May 20, 2024

yuzie0314 commented May 21, 2024

ctb commented May 21, 2024

yuzie0314 commented May 29, 2024

ctb commented Jun 18, 2024

ctb commented Jun 18, 2024

sourmash compare --ani --containment

sourmash scripts multisearch /tmp/thermo.sig.zip /tmp/thermo.sig.zip -k 31 -o /tmp/thermo.multisearch.csv --ani

ctb commented Jun 20, 2024

yuzie0314 commented Sep 5, 2024 • edited Loading

`sourmash compare` runs out of memory on large comparisons #3134

`sourmash compare` runs out of memory on large comparisons #3134

how does `cluster` work?

yuzie0314 commented May 20, 2024 •

edited

Loading

`sourmash compare --ani --containment`

`sourmash scripts multisearch /tmp/thermo.sig.zip /tmp/thermo.sig.zip -k 31 -o /tmp/thermo.multisearch.csv --ani`

yuzie0314 commented Sep 5, 2024 •

edited

Loading