
MRG: parallelize loading sketches into memory #292

Merged: 3 commits merged from parallelize-load-sketches into main on Apr 15, 2024
Conversation

@bluegenes (Contributor) commented Mar 25, 2024

+1: faster loading
-1: likely does not preserve ordering?

ref #268
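To illustrate the trade-off, here is a minimal Rust sketch of the general pattern, not the plugin's actual code: `Sketch` and `load_sketch` are placeholder names, and a rayon dependency is assumed.

```rust
use rayon::prelude::*;

#[derive(Debug)]
struct Sketch {
    name: String,
    hashes: Vec<u64>,
}

// Stand-in for whatever actually parses one sketch from the zip/file.
fn load_sketch(path: &str) -> Sketch {
    Sketch {
        name: path.to_string(),
        hashes: Vec::new(),
    }
}

fn load_all(paths: &[String]) -> Vec<Sketch> {
    // Sequential version (old behavior): one thread, input order preserved.
    // paths.iter().map(|p| load_sketch(p)).collect()

    // Parallel version: rayon spreads the per-sketch work across threads.
    // Note: collect() on an indexed parallel iterator still yields results
    // in input order; ordering is only lost if results are instead pushed
    // into a shared Vec (e.g. behind a Mutex) in completion order.
    paths.par_iter().map(|p| load_sketch(p)).collect()
}

fn main() {
    let paths: Vec<String> = (0..8).map(|i| format!("sketch{i}.sig")).collect();
    let sketches = load_all(&paths);
    println!("loaded {} sketches", sketches.len());
}
```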

Benchmarks

HG38 entire vs GTDB rs214 @ k=51

| code | Walltime | % CPU | RAM/RSS |
| --- | --- | --- | --- |
| v0.9.3 | 50m 47s | 100% | 24.5 GB |
| this PR | 30m 38s | 213% | 24.7 GB |

@bluegenes changed the title from "WIP: parallelize loading sketches into memory" to "MRG: parallelize loading sketches into memory" on Mar 26, 2024
@ctb (Collaborator) commented Apr 1, 2024

in re preserving ordering of signatures, here's what we have in the sourmash internals doc:

> Gather on multiple collections, and order of search and reporting
>
> Since sourmash gather will pick only one "best match" if there are several (and will ignore the others), the order of searching can matter for large collections. How does this work?
>
> In brief, sourmash doesn't guarantee a particular load order for sketches in a single collection, but it does guarantee that collections are loaded and searched in their entirety in the order that you provide them. So, for example, if you have a large zipfile database of sketches that contains duplicates, you can't predict which of the duplicates will be chosen as a match; but you can build your own collection of prioritized matches as a separate database, and put it first on the command line. A practical application of this might be to list the GTDB "representatives" database first on the command line, with the full GTDB database second, in order to prioritize choosing representative genomes as matches over the rest.
>
> This also plays a role in the order of reporting for prefetch output - prefetch will report matching sketches in the order it encounters them, which will match the order in which collections are given to sourmash prefetch on the command line.

So anyway I think it's fine if ordering isn't preserved when loading :).

Do you have any speed benchmarks?

@ctb (Collaborator) left a review comment:

I like how simple the changes are 😆

@ctb (Collaborator) commented Apr 15, 2024

Mind if I merge this @bluegenes? I'm running into #268 myself - it's quite noticeable for 400k genomes!

@bluegenes (Contributor, Author) commented:

@ctb - go for it. I just didn't get around to benchmarking, so I really have no idea how much it will help. Would appreciate you dropping the time here, if you're running it!

@ctb (Collaborator) commented Apr 15, 2024

optimized (I think) wheel built here:

/home/ctbrown/sourmash_plugin_branchwater/target/wheels/sourmash_plugin_branchwater-0.9.3-cp311-cp311-manylinux_2_35_x86_64.whl

@ctb (Collaborator) commented Apr 15, 2024

last released version, v0.9.3: 50m 47s of compute, with one thread.

        Command being timed: "sourmash scripts fastmultigather hg38-entire.sig.zip /group/ctbrowngrp/sourmash-db/gtdb-rs214/gtdb-rs214-k51.zip -k 51 -c 64"
        User time (seconds): 3027.95
        System time (seconds): 26.46
        Percent of CPU this job got: 100%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 50:47.46
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 24498088
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 65294
        Minor (reclaiming a frame) page faults: 9933810
        Voluntary context switches: 6128
        Involuntary context switches: 6310
        Swaps: 0
        File system inputs: 15958816
        File system outputs: 48
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

@ctb (Collaborator) commented Apr 15, 2024

This branch: 30m 38s with up to 64 threads (but, realistically, about 2).

        User time (seconds): 3890.21
        System time (seconds): 33.18
        Percent of CPU this job got: 213%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 30:37.63
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 24650256
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 879
        Minor (reclaiming a frame) page faults: 8784612
        Voluntary context switches: 157801
        Involuntary context switches: 130521
        Swaps: 0
        File system inputs: 173672
        File system outputs: 48
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

@ctb (Collaborator) commented Apr 15, 2024

So I don't see any downsides to this (and I'll update the PR description at the top with the benchmarks) but it's not a panacea :(

@ctb merged commit 4bdb73e into main on Apr 15, 2024 (1 check passed)
@ctb deleted the parallelize-load-sketches branch on April 15, 2024 at 20:19