benchmark pairwise --> cluster #247

Open
bluegenes opened this issue Feb 27, 2024 · 4 comments

@bluegenes
Contributor

bluegenes commented Feb 27, 2024

review comment: #234 (comment)

could you add some minimal benchmarks (time/memory) for a standard-ish comparison, e.g. gtdb-reps, so that users know what to expect from both pairwise and cluster for a real-ish analysis? ISTR it's pretty fast against gtdb-reps.

If benchmark is slow, consider parallelizing reading. It was originally done in #234 but removed for simplicity.

pairwise files can be millions of lines long. Would it be faster to read them in parallel, store the results in an edges vector, and then add nodes/edges sequentially? Note that we would probably need to either (1) store all edges, including those that do not pass the threshold, or (2) after building the graph from edges, add any nodes from names_to_node that are not already in the graph, to preserve singletons.
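
A minimal sketch of option 2, written in Python rather than the plugin's Rust and using a small union-find in place of `rustworkx-core`. The function name `cluster_edges`, the tuple layout, and the use of `>=` for the threshold test are all assumptions for illustration, not the actual `cluster` implementation. Edges could be collected by any number of reader threads; only the sequential merge step is shown.

```python
from collections import defaultdict


def cluster_edges(names, edges, threshold):
    """Group names into connected components using only edges >= threshold.

    names: every name seen in the input, so that singletons survive
    edges: iterable of (name_a, name_b, similarity) tuples
    """
    # Every name starts as its own component (this is what keeps singletons).
    parent = {name: name for name in names}

    def find(x):
        # Path halving keeps the trees shallow.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, similarity in edges:
        if similarity >= threshold:  # assumption: inclusive threshold
            parent[find(a)] = find(b)

    components = defaultdict(list)
    for name in names:
        components[find(name)].append(name)
    # Sort for a stable, readable result.
    return sorted(sorted(members) for members in components.values())


names = ["A", "B", "C", "D"]
edges = [("A", "B", 0.97), ("B", "C", 0.96), ("C", "D", 0.50)]
clusters = cluster_edges(names, edges, threshold=0.95)
print(clusters)  # [['A', 'B', 'C'], ['D']]
```

Here `D` only appears in an edge below the threshold, yet it is still reported as its own cluster because it was registered as a node up front.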

bluegenes added a commit that referenced this issue Feb 27, 2024
This PR adds a new command, `cluster`, that can be used to cluster the output from `pairwise` and `multisearch`.

`cluster` uses `rustworkx-core` (which internally uses `petgraph`) to build a graph, adding edges between nodes when the similarity exceeds the user-defined threshold. It can work on any of the similarity columns output by `pairwise` or `multisearch`, and will add all nodes to the graph to preserve singleton 'clusters' in the output.

`cluster` outputs two files: 
1. cluster identities file: `Component_X, name1;name2;name3...`
2. cluster size histogram `cluster_size, count`
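
For illustration only, the two outputs could be derived from a list of components roughly as follows. The helper names, the 1-based component numbering, and the exact separators are assumptions based solely on the two formats listed above, not the plugin's real writer code.

```python
from collections import Counter


def format_clusters(components):
    """One 'Component_X,name1;name2;...' line per component (numbering is assumed)."""
    return [
        f"Component_{i},{';'.join(members)}"
        for i, members in enumerate(components, start=1)
    ]


def size_histogram(components):
    """(cluster_size, count) pairs, sorted by cluster size."""
    counts = Counter(len(members) for members in components)
    return sorted(counts.items())


components = [["A", "B", "C"], ["D"], ["E"]]
cluster_lines = format_clusters(components)
histogram = size_histogram(components)
print(cluster_lines)  # ['Component_1,A;B;C', 'Component_2,D', 'Component_3,E']
print(histogram)      # [(1, 2), (3, 1)]
```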

context for some things I tried:
- try using petgraph directly and removing rustworkx dependency
> nope, `rustworkx-core` adds `connected_components`, which returns the connected components themselves rather than just their number. Could reimplement if `rustworkx-core` brings in a lot of deps
- try using 'extend_with_edges' instead of add_edge logic.
> nope, only in `petgraph`

**Punted Issues:**
- develop clustering visualizations (ref @mr-eyes kSpider/dbretina work). Optionally output dot file of graph? (#248)
- enable updating clusters, rather than always regenerating from scratch (#249)
- benchmark `cluster` (#247)
>  `pairwise` files can be millions of lines long. Would it be faster to read them in parallel, store the results in an `edges` vector, and then add nodes/edges sequentially? Note that we would probably need to either (1) store all edges, including those that do not pass the threshold, or (2) after building the graph from edges, add any nodes from `names_to_node` that are not already in the graph, to preserve singletons.


Related issues:

* #219
* sourmash-bio/sourmash#2271
* sourmash-bio/sourmash#700
* sourmash-bio/sourmash#225
* sourmash-bio/sourmash#274


---------

Co-authored-by: C. Titus Brown <[email protected]>
@bluegenes
Contributor Author

bluegenes commented Feb 28, 2024

🚀 5 seconds on gtdb-rs214-reps with average_containment_ani default threshold (0.95)

I used 16 threads but %CPU was 123% (which makes sense, since cluster is not actually parallelized)

generating clusters for comparisons in 'gtdb-rs214-reps.k31.pairwise-ani-all.csv' using 16 threads
...clustering is done! results in 'gtdb-rs214-reps.k31.pairwise-ani-all.clusters.csv'
                       cluster counts in 'gtdb-rs214-reps.k31.pairwise-ani-all.clusters.sizes.csv'
        Command being timed: "sourmash scripts cluster gtdb-rs214-reps.k31.pairwise-ani-all.csv -o gtdb-rs214-reps.k31.pairwise-ani-all.clusters.csv --similarity-column average_containment_ani --cluster-sizes gtdb-rs214-reps.k31.pairwise-ani-all.clusters.sizes.csv"
        User time (seconds): 4.03
        System time (seconds): 2.07
        Percent of CPU this job got: 123%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:04.95
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 109292
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 1
        Minor (reclaiming a frame) page faults: 36690
        Voluntary context switches: 3028
        Involuntary context switches: 424
        Swaps: 0
        File system inputs: 0
        File system outputs: 3264
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0
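
The 123% figure can be reproduced from the other numbers in the report: it is CPU time (user plus system) divided by wall-clock time. A small check, assuming the percentage is truncated rather than rounded (which is what the logged values suggest); `percent_cpu` is a hypothetical helper, not part of any tool used here.

```python
def percent_cpu(user_s, system_s, elapsed_s):
    """CPU utilisation as a percentage: (user + system) / elapsed, truncated."""
    return int(100.0 * (user_s + system_s) / elapsed_s)


# Values from the `cluster` run above: 4.03 s user, 2.07 s system, 4.95 s elapsed.
cluster_pct = percent_cpu(4.03, 2.07, 4.95)
print(f"{cluster_pct}%")  # 123%
```

In other words, on average only about 1.2 cores were busy, regardless of the 16 threads requested.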

doesn't change much for lowered threshold

generating clusters for comparisons in 'gtdb-rs214-reps.k31.pairwise-ani-all.csv' using 16 threads
...clustering is done! results in 'gtdb-rs214-reps.k31.pairwise-ani-all.clusters0.8.csv'
                       cluster counts in 'None'
        Command being timed: "sourmash scripts cluster gtdb-rs214-reps.k31.pairwise-ani-all.csv -o gtdb-rs214-reps.k31.pairwise-ani-all.clusters0.8.csv --similarity-column average_containment_ani --threshold 0.8"
        User time (seconds): 3.76
        System time (seconds): 1.87
        Percent of CPU this job got: 125%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:04.47
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 126164
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 1
        Minor (reclaiming a frame) page faults: 41457
        Voluntary context switches: 3204
        Involuntary context switches: 716
        Swaps: 0
        File system inputs: 0
        File system outputs: 1968
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

@bluegenes bluegenes changed the title from "benchmark cluster" to "benchmark pairwise --> cluster" Feb 28, 2024
@bluegenes
Contributor Author

bluegenes commented Feb 28, 2024

Running pairwise to build the cluster input file took much longer, of course: roughly 2 to 2.5 hours for gtdb-rs214-reps using 16 threads.

No ANI, no write-all:

DONE. Processed 3629903410 comparisons
...pairwise is done! results in 'gtdb-rs214-reps.k31.pairwise.csv'
        Command being timed: "sourmash scripts pairwise gtdb-rs214-reps.k31.zip -o gtdb-rs214-reps.k31.pairwise.csv"
        User time (seconds): 143454.56
        System time (seconds): 136.08
        Percent of CPU this job got: 1562%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 2:33:08
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 4573808
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 17555
        Minor (reclaiming a frame) page faults: 79509579
        Voluntary context switches: 1134599
        Involuntary context switches: 1262188
        Swaps: 0
        File system inputs: 4486144
        File system outputs: 412944
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

ANI, no write-all

DONE. Processed 3629903410 comparisons
...pairwise is done! results in 'gtdb-rs214-reps.k31.pairwise-ani.csv'
        Command being timed: "sourmash scripts pairwise gtdb-rs214-reps.k31.zip --ani -o gtdb-rs214-reps.k31.pairwise-ani.csv"
        User time (seconds): 143272.02
        System time (seconds): 80.51
        Percent of CPU this job got: 1562%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 2:32:51
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 4573456
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 32007457
        Voluntary context switches: 1181205
        Involuntary context switches: 1298635
        Swaps: 0
        File system inputs: 0
        File system outputs: 528008
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

ANI + write-all:

DONE. Processed 3629903410 comparisons
...pairwise is done! results in 'gtdb-rs214-reps.k31.pairwise-ani-all.csv'
        Command being timed: "sourmash scripts pairwise gtdb-rs214-reps.k31.zip --write-all --ani -o gtdb-rs214-reps.k31.pairwise-ani-all.csv"
        User time (seconds): 107618.34
        System time (seconds): 245.74
        Percent of CPU this job got: 1551%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 1:55:51
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 4575736
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 89
        Minor (reclaiming a frame) page faults: 63129113
        Voluntary context switches: 1118873
        Involuntary context switches: 1699826
        Swaps: 0
        File system inputs: 13792
        File system outputs: 547384
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

@ctb
Collaborator

ctb commented Feb 28, 2024

it seems a little weird that ANI + write-all took half an hour less wall time, right? But that could just be fluctuations on the computer running things.

@bluegenes
Contributor Author

bluegenes commented Mar 21, 2024

benchmarking pairwise using GTDB-rs214 reps on 64 threads for comparison with multisearch (#89)

85205 x 85205 pairwise comparisons (3.6 billion non-self, non-redundant comparisons) in 44m with 64 threads (and 4.56 GB RAM).

DONE. Processed 3629903410 comparisons
...pairwise is done! results in 'gtdb-rs214-reps.k31.pairwise-ani-all.csv'
        Command being timed: "sourmash scripts pairwise gtdb-rs214-reps.k31.zip --write-all --ani -o gtdb-rs214-reps.k31.pairwise-ani-all.csv"
        User time (seconds): 149275.64
        System time (seconds): 54.49
        Percent of CPU this job got: 5612%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 44:20.68
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 4566188
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 18246
        Minor (reclaiming a frame) page faults: 6700145
        Voluntary context switches: 1193450
        Involuntary context switches: 1579877
        Swaps: 0
        File system inputs: 4610752
        File system outputs: 547336
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0
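
As a sanity check on the comparison count, the number of non-self, non-redundant pairs among n signatures is n(n-1)/2. The short sketch below reproduces the logged total for n = 85205 and derives an approximate throughput from the 64-thread wall time; the variable and function names are illustrative only.

```python
def n_pairwise(n):
    """Number of unordered pairs among n items, excluding self-comparisons."""
    return n * (n - 1) // 2


n_sigs = 85205
total = n_pairwise(n_sigs)
print(total)  # 3629903410

# Wall-clock time of the 64-thread run: 44:20.68
elapsed_s = 44 * 60 + 20.68
throughput = total / elapsed_s
print(f"{throughput:,.0f} comparisons/second")
```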
