sourmash sketch & search use one thread only #2458
Comments
hi @jianshu93 yes, 'tis true! right now there are two suggested solutions -
I've also built a simple plugin, sketchall, to do it, but it's not really ready for anyone to use just yet 😓 - the plugin framework isn't released in any versions of sourmash yet, in particular! tl;dr parallel should work great!
some backstory
The main blocker for me adding this into
Also relevant: #1703 - not sure what's going on here!
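For readers who prefer a script to GNU parallel, the task-level approach (one `sourmash sketch` process per genome) can be sketched roughly as below. This is a minimal sketch, not the maintainers' method: the helper names `build_sketch_command` and `sketch_all`, the `.sig` output naming, and the default of 8 jobs are all assumptions made for illustration. The `-p k=16,noabund` parameter string is taken from the original question.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def build_sketch_command(fasta_path, out_dir, params="k=16,noabund"):
    """Build one `sourmash sketch dna` command for a single genome file.

    The output name (<input name>.sig inside out_dir) is an assumed convention.
    """
    out_path = Path(out_dir) / (Path(fasta_path).name + ".sig")
    return ["sourmash", "sketch", "dna", "-p", params,
            str(fasta_path), "-o", str(out_path)]


def sketch_all(list_file, out_dir, jobs=8):
    """Run one sketch subprocess per path listed in list_file, `jobs` at a time.

    Returns a dict mapping each input path to the exit code of its process.
    """
    lines = Path(list_file).read_text().splitlines()
    paths = [line.strip() for line in lines if line.strip()]
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    commands = [build_sketch_command(p, out_dir) for p in paths]
    # Threads only wait on child processes here, so the real work runs
    # in parallel in the separate sourmash processes.
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        codes = list(pool.map(lambda cmd: subprocess.run(cmd).returncode, commands))
    return dict(zip(paths, codes))
```

Calling `sketch_all("gtdb_v207_name.txt", "gtdb_v207_sourmash", jobs=24)` would then keep 24 sketch processes busy, assuming `sourmash` is on the `PATH`. Each process is independent, so memory use scales with the number of jobs.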
Hello Prof. C. Titus Brown, Thanks for the quick response; it is helpful. I have no problems running sketch via parallel. However, the search command (after indexing the database, which is very fast: 20 minutes for all GTDB genomes) is also not parallelized, meaning that when searching multiple queries I still have to use parallel to run multiple searches. I am curious: compared to searching the database in parallel (even for one query), task-level parallelism will be slower, right? We need to initialize 8000 jobs to search 8000 queries, and because processes cannot share memory with each other, we need (number of threads × database size) memory to search (number of threads) genomes at once. An SBT can easily be parallelized for search, right, since it is essentially a tree-like structure? Thanks, Jianshu
fantastic - glad the sketch stuff worked out! Please see #2071 re our previous answer on search parallelization! The short version is:
There are other technologies coming along but we don't have them at a good level, I'm afraid!
Hello Prof. C. Titus Brown, a single-query search takes 4.20 minutes. I use GNU parallel for process-level parallelism (launching multiple jobs), which is much slower and requires much more memory: for example, searching 24 queries at the same time with GNU parallel needs 4.5G * 24 = 108G. It takes about 20 hours to search 8000 queries against GTDB. Is this normal, or am I missing something? Thanks, Jianshu
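As a quick sanity check on these numbers, the arithmetic can be written out as a small back-of-the-envelope estimate. This is only a rough sketch under stated assumptions: it reads "4.20 minutes" as 4.2 minutes, assumes every job takes the same time and memory, and assumes the 24 jobs scale perfectly with no I/O contention. The function name `parallel_search_estimate` is made up for illustration.

```python
def parallel_search_estimate(n_queries, minutes_per_query, jobs, gb_per_job):
    """Estimate peak memory (GB) and wall-clock time (hours) for
    process-level parallel searching, assuming ideal scaling."""
    # Processes do not share memory, so each job loads its own copy.
    total_memory_gb = jobs * gb_per_job
    # Total work divided evenly across the concurrent jobs.
    wall_hours = n_queries * minutes_per_query / jobs / 60
    return total_memory_gb, wall_hours


memory_gb, hours = parallel_search_estimate(
    n_queries=8000, minutes_per_query=4.2, jobs=24, gb_per_job=4.5
)
print(f"{memory_gb:.0f} GB, {hours:.1f} h")  # prints "108 GB, 23.3 h"
```

Under these assumptions the estimate lands at roughly 23 hours, which is in the same range as the reported "about 20 hours", so the observed run time looks consistent with simple per-query cost times the number of queries.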
hi @jianshu93 per #1958, this sounds about right; those benchmarks are not for entire GTDB, but the numbers align with my expectations! You could potentially speed things up (while reducing sensitivity a bit) by using
Thanks for reporting this! Gives us some targets!
Hello Prof. C. Titus Brown, I find this recently published paper very interesting: https://dl.acm.org/doi/abs/10.1145/3448016.3457333 It is not an SBT, but it seems to beat SBT in many ways (N^(1/2) * log(N), a very good sublinear algorithm). I am not aware of any Rust implementation of this data structure, though. Thanks, Jianshu
Hello Prof. C. Titus Brown, Thanks, Jianshu
hi @jianshu93, responses to a few of your questions - just remind me if I missed something!
On to some practical advice - If you want to index just a subset of a large database, you can do that with picklists - see docs. Basically, you create one or more CSV files containing the names or identifiers for the subset you want, and then run sourmash index like so:
where
Partly motivated by your interest, I made some progress this morning on a Rust-based
It's not really ready for anyone but me to use yet, and there are a few drawbacks to it, but I will keep you in the loop in this issue as it matures! (I'm working on it over here)
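To make the picklist advice above a little more concrete, here is a minimal sketch of writing the kind of CSV file described there: one column holding the identifiers of the subset to index. The helper name `write_picklist`, the column name `ident`, and the example identifiers are hypothetical; how the file is then passed to sourmash (via its `--picklist` option) is covered in the picklist docs and is not reproduced here.

```python
import csv


def write_picklist(identifiers, csv_path, column="ident"):
    """Write a one-column CSV with a header row, one identifier per line."""
    with open(csv_path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow([column])      # header: the column name to pick on
        for ident in identifiers:
            writer.writerow([ident])   # one genome identifier per row
    return csv_path


# Placeholder identifiers for illustration only.
write_picklist(["GCF_000000001.1", "GCF_000000002.1"], "subset.csv")
```

The column name matters because the picklist is selected by column, so it should match whatever name is given on the command line.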
Thanks for the info. I will try and report back. Thanks, Jianshu
hi @jianshu93 the pyo3_branchwater plugin is getting pretty mature - you might be interested in the |
Hello Titus, Thanks, Jianshu
pyo3_branchwater now supports massively parallel sketching - e.g. all of GTDB rs17 in 40 minutes and 2.7 GB of RAM. see sourmash-bio/sourmash_plugin_branchwater#122 and sourmash-bio/sourmash_plugin_branchwater#96 (comment) for some numbers. I'm leaving this open because it's not integrated into sourmash yet, tho :). That's coming eventually!
Hello Titus, This is amazing news, and it seems it is time for me to run a real dataset, e.g., the entire set of NCBI/RefSeq genomes (318K). I will get back to you when I have some results. Jianshu
great! please feel free to post questions here (in this issue tracker) since we monitor this more closely - and you can tag in @bluegenes if you like :)
Dear Sourmash team,
I want to create sketches for all GTDB genomes, and I am using the following command, following the tutorial (I want one sketch per FASTA file):
time sourmash sketch dna -p k=16,noabund --from-file ./gtdb_v207_name.txt -o ./gtdb_v207_sourmash
gtdb_v207_name.txt lists the paths of all GTDB genome files. However, I noticed that sourmash always uses only one thread to sketch all the files. Is this the default behavior, or do we need to parallelize at the task level ourselves, e.g. with the parallel command, to use all cores/threads?
Thanks,
Jianshu