
MRG: parallelize loading sketches into memory #292

Merged: 3 commits merged from parallelize-load-sketches into main on Apr 15, 2024
Conversation

@bluegenes (Contributor) commented Mar 25, 2024

+1: faster loading
-1: likely does not preserve ordering?

ref #268
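To illustrate the trade-off, here is a minimal Rust sketch of the general pattern, not the plugin's actual code: `Sketch` and `load_sketch` are placeholder names, and a rayon dependency is assumed.

```rust
use rayon::prelude::*;

#[derive(Debug)]
struct Sketch {
    name: String,
    hashes: Vec<u64>,
}

// Stand-in for whatever actually parses one sketch from the zip/file.
fn load_sketch(path: &str) -> Sketch {
    Sketch {
        name: path.to_string(),
        hashes: Vec::new(),
    }
}

fn load_all(paths: &[String]) -> Vec<Sketch> {
    // Sequential version (old behavior): one thread, input order preserved.
    // paths.iter().map(|p| load_sketch(p)).collect()

    // Parallel version: rayon spreads the per-sketch work across threads.
    // Note: collect() on an indexed parallel iterator still yields results
    // in input order; ordering is only lost if results are instead pushed
    // into a shared Vec (e.g. behind a Mutex) in completion order.
    paths.par_iter().map(|p| load_sketch(p)).collect()
}

fn main() {
    let paths: Vec<String> = (0..8).map(|i| format!("sketch{i}.sig")).collect();
    let sketches = load_all(&paths);
    println!("loaded {} sketches", sketches.len());
}
```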

Benchmarks

HG38 entire vs GTDB rs214 @ k=51

| code | Walltime | % CPU | RAM/RSS |
| --- | --- | --- | --- |
| v0.9.3 | 50m 47s | 100% | 24.5 GB |
| this PR | 30m 38s | 213% | 24.7 GB |

@bluegenes changed the title from "WIP: parallelize loading sketches into memory" to "MRG: parallelize loading sketches into memory" on Mar 26, 2024
@ctb (Collaborator) commented Apr 1, 2024

in re preserving ordering of signatures, here's what we have in the sourmash internals doc:

> Gather on multiple collections, and order of search and reporting
>
> Since sourmash gather will pick only one "best match" if there are several (and will ignore the others), the order of searching can matter for large collections. How does this work?
>
> In brief, sourmash doesn't guarantee a particular load order for sketches in a single collection, but it does guarantee that collections are loaded and searched in their entirety in the order that you provide them. So, for example, if you have a large zipfile database of sketches that contains duplicates, you can't predict which of the duplicates will be chosen as a match; but you can build your own collection of prioritized matches as a separate database, and put it first on the command line. A practical application of this might be to list the GTDB "representatives" database first on the command line, with the full GTDB database second, in order to prioritize choosing representative genomes as matches over the rest.
>
> This also plays a role in the order of reporting for prefetch output - prefetch will report matching sketches in the order it encounters them, which will match the order in which collections are given to sourmash prefetch on the command line.

So anyway I think it's fine if ordering isn't preserved when loading :).

Do you have any speed benchmarks?

@ctb (Collaborator) left a review comment:

I like how simple the changes are 😆

@ctb (Collaborator) commented Apr 15, 2024

Mind if I merge this @bluegenes? I'm running into #268 myself - it's quite noticeable for 400k genomes!

@bluegenes (Contributor, Author) commented:

@ctb - go for it. I just didn't get around to benchmarking, so I really have no idea how much it will help. Would appreciate you dropping the time here, if you're running it!

@ctb (Collaborator) commented Apr 15, 2024

optimized (I think) wheel built here:

/home/ctbrown/sourmash_plugin_branchwater/target/wheels/sourmash_plugin_branchwater-0.9.3-cp311-cp311-manylinux_2_35_x86_64.whl

@ctb (Collaborator) commented Apr 15, 2024

last released version, v0.9.3: 50m 47s of compute, with one thread.

        Command being timed: "sourmash scripts fastmultigather hg38-entire.sig.zip /group/ctbrowngrp/sourmash-db/gtdb-rs214/gtdb-rs214-k51.zip -k 51 -c 64"
        User time (seconds): 3027.95
        System time (seconds): 26.46
        Percent of CPU this job got: 100%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 50:47.46
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 24498088
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 65294
        Minor (reclaiming a frame) page faults: 9933810
        Voluntary context switches: 6128
        Involuntary context switches: 6310
        Swaps: 0
        File system inputs: 15958816
        File system outputs: 48
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

@ctb (Collaborator) commented Apr 15, 2024

This branch: 30m 38s with up to 64 threads (but, realistically, about 2).

        User time (seconds): 3890.21
        System time (seconds): 33.18
        Percent of CPU this job got: 213%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 30:37.63
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 24650256
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 879
        Minor (reclaiming a frame) page faults: 8784612
        Voluntary context switches: 157801
        Involuntary context switches: 130521
        Swaps: 0
        File system inputs: 173672
        File system outputs: 48
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

@ctb (Collaborator) commented Apr 15, 2024

So I don't see any downsides to this (and I'll update the PR description at the top with the benchmarks) but it's not a panacea :(

@ctb merged commit 4bdb73e into main on Apr 15, 2024 (1 check passed)
@ctb deleted the parallelize-load-sketches branch on April 15, 2024 at 20:19