parallelizing large database search with snakemake #1690

Open
ctb opened this issue Jul 28, 2021 · 2 comments

ctb commented Jul 28, 2021

See https://github.com/ctb/2021-sourmash-greymake2. The README, in summary:

2021-sourmash-greymake2

parallelize containment searches of large sourmash databases using
manifests, picklists, and snakemake.

Briefly, this code -

  • splits a database manifest into 25 batches and saves the batches into CSV files
  • uses the CSV files as picklists to search the database in parallel with prefetch
  • combines the resulting prefetch output into a single picklist and then uses that to search the database again, to generate the final output
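
As a rough illustration of the split and combine steps above, here is a minimal Python sketch. It is not the greymake2 code: the function names, the round-robin batching, and the toy `md5`/`name` columns are assumptions made for the example, and it works on generic CSV rows rather than reading a real sourmash manifest or prefetch output.

```python
import csv
import io


def split_rows(rows, n_batches):
    """Distribute rows round-robin into n_batches lists (hypothetical helper)."""
    batches = [[] for _ in range(n_batches)]
    for i, row in enumerate(rows):
        batches[i % n_batches].append(row)
    return batches


def write_batch(fieldnames, rows, fp):
    """Write one batch as a CSV file with a header row."""
    writer = csv.DictWriter(fp, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)


def combine_rows(batch_results, key):
    """Merge per-batch result rows, keeping the first row seen per key value."""
    seen = set()
    combined = []
    for rows in batch_results:
        for row in rows:
            if row[key] not in seen:
                seen.add(row[key])
                combined.append(row)
    return combined


# Toy stand-in for manifest rows; the column names are assumed, not
# taken from the real manifest format.
manifest = [{"md5": f"{i:04x}", "name": f"genome{i}"} for i in range(103)]

# Step 1: split into 25 batches, each of which would be saved as a CSV.
batches = split_rows(manifest, 25)
first_batch_csv = io.StringIO()
write_batch(["md5", "name"], batches[0], first_batch_csv)

# Step 3: merge per-batch results back into a single list of rows.
# Here the "results" are simply the batches themselves.
combined = combine_rows(batches, key="md5")
print(len(batches), len(combined))
```

Round-robin assignment keeps batch sizes within one row of each other, which should help balance the parallel jobs when the signatures are of similar size.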

ref #1664 which did something similar with manifests-of-manifests, but in a more general (and more complex) way.

ctb commented Jul 29, 2021

7 minutes for all GTDB with greymake2 and 4 threads,

        Command being timed: "snakemake -j 4"
        User time (seconds): 1496.79
        System time (seconds): 21.72
        Percent of CPU this job got: 359%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 7:02.35
        Maximum resident set size (kbytes): 775924

vs 21 minutes with normal ol' sourmash (single-threaded).

        Command being timed: "sourmash gather 63.fa.sig gtdb-rs202.genomic.k31.zip -o matches2.csv"
        User time (seconds): 1238.62
        System time (seconds): 7.19
        Percent of CPU this job got: 99%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 20:49.51
        Maximum resident set size (kbytes): 780244

Interesting to note that max memory is the same; that's probably some combination of manifest + matching signatures, although I'm not sure why you wouldn't get num_threads x manifest memory 🤔
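
For reference, the two `time` reports above work out to roughly a 3x wall-clock speedup on 4 threads, at the cost of about 20% more user CPU time. A small sketch of that arithmetic, using only the numbers quoted above (the helper name `to_seconds` is made up for the example):

```python
def to_seconds(clock):
    """Convert an 'm:ss' or 'h:mm:ss' wall-clock string to seconds."""
    total = 0.0
    for part in clock.split(":"):
        total = total * 60 + float(part)
    return total


threads = 4
parallel_wall = to_seconds("7:02.35")   # snakemake -j 4
serial_wall = to_seconds("20:49.51")    # single-threaded sourmash gather

speedup = serial_wall / parallel_wall   # wall-clock speedup
efficiency = speedup / threads          # fraction of ideal 4x scaling
cpu_overhead = 1496.79 / 1238.62        # user time, parallel vs serial
rss_ratio = 775924 / 780244             # max resident set size ratio

print(f"speedup: {speedup:.2f}x, efficiency: {efficiency:.0%}")
print(f"user CPU overhead: {cpu_overhead:.2f}x, max RSS ratio: {rss_ratio:.3f}")
```

The extra CPU time is presumably the cost of starting many jobs and running the second, combining search, though nothing in the timings themselves confirms where it goes.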

ctb commented Mar 26, 2022

After #1891 is merged, we can update this to use the manifests directly, w/o picklists.
