sourmash sketch & search use one thread only #2458

jianshu93 · 2023-02-04T02:19:06Z

Dear Sourmash team,

I want to create sketches for all GTDB genomes, and I am using the following command according to the tutorial (I want one sketch per FASTA file):

time sourmash sketch dna -p k=16,noabund --from-file ./gtdb_v207_name.txt -o ./gtdb_v207_sourmash

gtdb_v207_name.txt lists the paths of all GTDB genome files. However, I noticed that sourmash always uses only one thread to sketch all the files. Is this the default behavior, or do we need to parallelize at the task level ourselves, e.g. with the parallel command, to use all cores/threads?

Thanks,

Jianshu

ctb · 2023-02-04T16:29:14Z

hi @jianshu93 yes, 'tis true!

right now there are two suggested solutions -

use GNU parallel, following the example in #1796 ("example: using GNU parallel to sketch signatures in parallel")
use snakemake or another workflow engine to run things in parallel (e.g. here is a snakemake Snakefile)

I've also built a simple plugin, sketchall, to do it, but it's not really ready for anyone to use just yet 😓 - in particular, the plugin framework isn't released in any version of sourmash yet!

tl;dr parallel should work great!
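
As a rough illustration of the process-level parallelism suggested above, here is a minimal Python sketch that launches one single-threaded `sourmash sketch dna` process per genome. It is not the GNU parallel recipe from #1796. The helper names, the `.sig` output naming, and the `jobs` default are assumptions made for illustration; only the `sourmash sketch dna -p k=16,noabund ... -o ...` invocation comes from this thread.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def build_sketch_command(fasta_path, out_dir):
    """Build one `sourmash sketch dna` command for a single FASTA file."""
    # Assumption: one output file per genome, named <fasta name>.sig
    out_path = Path(out_dir) / (Path(fasta_path).name + ".sig")
    return ["sourmash", "sketch", "dna", "-p", "k=16,noabund",
            str(fasta_path), "-o", str(out_path)]


def sketch_all(list_file, out_dir, jobs=8):
    """Run one sourmash process per genome, at most `jobs` at a time."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    with open(list_file) as fh:
        fasta_paths = [line.strip() for line in fh if line.strip()]
    commands = [build_sketch_command(p, out_dir) for p in fasta_paths]
    # Threads suffice here: each thread only waits on an external process.
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        results = list(pool.map(lambda c: subprocess.run(c, check=False), commands))
    # Return the inputs whose sketch command exited with a non-zero status.
    return [p for p, r in zip(fasta_paths, results) if r.returncode != 0]
```

Calling `sketch_all("gtdb_v207_name.txt", "gtdb_v207_sigs", jobs=24)` would then write one signature file per genome and return the list of inputs that failed, which sidesteps the single-output-file problem described below because no two processes write to the same file.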

some backstory

The main blocker for me adding this into sourmash sketch has just been this issue: #1911 - we don't have a good multiprocess/multithread way to write sketches to a single file, and I am not enthusiastic about writing up something more clever (multiple consumers, one producer).

Also relevant: #1703 - not sure what's going on here!

jianshu93 · 2023-02-05T22:13:14Z

Hello Prof. C. Titus Brown,

Thanks for the quick response; it is helpful. I have no problems running sketch via parallel. However, the search command (after indexing the database, which is very fast: 20 minutes for all GTDB genomes) is also not parallelized, meaning that when searching multiple queries I still have to use parallel to run multiple searches. I am curious: compared to searching the database in parallel (even for one query), task-level parallelism will be slower, right? We need to initialize 8000 jobs to search 8000 queries. Also, because processes cannot share memory with each other, we need (number of threads) * (database size) memory to search (number of threads) genomes at once. An SBT should be easy to search in parallel, right, since it is essentially a tree-like structure?

Thanks,

Jianshu

ctb · 2023-02-05T22:22:04Z

fantastic - glad the sketch stuff worked out!

Please see #2071 re our previous answer on search parallelization!

The short version is:

We have a couple of different technologies we've been trying out for parallel search.
Different database types have very different search performance, so you get to "pick" your problem - I/O? or memory?
For now, the simplest, most reliable, and most robust approach is probably still snakemake or parallel - i.e. process-level parallelism. If you're searching small query sketches against many small sketches, e.g. 8000 genomes against GTDB, then the memory and I/O considerations aren't too bad that way.

There are other technologies coming along but we don't have them at a good level, I'm afraid!

jianshu93 · 2023-02-06T17:45:57Z

Hello Prof. C. Titus Brown,

For a single-query search, it takes 4.20 minutes, and I use GNU parallel for process-level parallelism (launching multiple jobs). This is much slower and requires much more memory: searching, for example, 24 queries at the same time with GNU parallel needs 4.5G * 24 = 108G. It takes about 20 hours to search 8000 queries against GTDB. Is this normal, or am I missing something?

Thanks,

Jianshu
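
The figures in the comment above can be sanity-checked with a few lines of arithmetic. This is a back-of-the-envelope sketch only: it reads "4.20 minutes" as 4.2 decimal minutes (an assumption), and it assumes perfect load balancing with no process startup cost. The function name is made up for illustration.

```python
def parallel_search_estimate(n_queries, minutes_per_query, gb_per_process, n_processes):
    """Estimate wall time (hours) and peak memory (GB) for process-level parallelism."""
    # Idealized: total work divided evenly over the processes, no startup overhead.
    wall_hours = n_queries * minutes_per_query / n_processes / 60
    # Processes do not share memory, so their footprints simply add up.
    peak_memory_gb = gb_per_process * n_processes
    return wall_hours, peak_memory_gb


hours, mem = parallel_search_estimate(8000, 4.2, 4.5, 24)
print(f"{hours:.1f} h, {mem:.0f} GB")  # prints "23.3 h, 108 GB"
```

Under these assumptions the estimate lands in the same range as the roughly 20 hours and 108G reported.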

ctb · 2023-02-06T18:57:21Z

hi @jianshu93 per #1958, this sounds about right; those benchmarks are not for entire GTDB, but the numbers align with my expectations!

You could potentially speed things up (while reducing sensitivity a bit) by using --scaled=10000. sqldb would also support faster search, but at the cost of more memory and a LOT more disk space.

Thanks for reporting this! Gives us some targets!

jianshu93 · 2023-02-06T19:23:44Z

Hello Prof. C. Titus Brown,

I find this paper very interesting, published recently: https://dl.acm.org/doi/abs/10.1145/3448016.3457333

It is not an SBT, but it seems to beat SBT in many ways (N^(1/2) * log(N), a very good sublinear algorithm). I am not aware of any Rust implementation of this data structure, though.

Thanks,

Jianshu

ctb · 2023-02-06T19:25:34Z

thank you!

(Fast Processing and Querying of 170TB of Genomics Data via a Repeated And Merged BloOm Filter (RAMBO))

also ref #1110, #545

jianshu93 · 2023-02-18T04:52:20Z

Hello Prof. C. Titus Brown,
I used sourmash index to index all NCBI prokaryotic assemblies/genomes, sketched the same way as above. That is everything in RefSeq + GenBank, about 300k genomes in total, and the index size is about 15G. If I want to search 24 queries at a time, I will need 24 * 15 = 360G, which is quite a lot for only 24 queries (I have 24 threads). Is there a way to reduce this somehow? For example, I could split the database into pieces, search each piece, collect the results from each piece, and sort them by output distance or something similar. It seems to take some time to split the database, though. Is there a better way to automate this process? I think giving all the queries access to the database at the same time is quite important for reducing memory, that is, searching the database in parallel. The RAMBO paper mentions that SBT was designed for a single thread, which is the bottleneck. That is still the bottleneck now, right?

Thanks,

Jianshu

ctb · 2023-02-20T16:48:24Z

hi @jianshu93,

responses to a few of your questions - just remind me if I missed something!

SBT search uses lazy loading from disk, so the memory usage is not related to the size of the SBT on disk.
search is, generally speaking, fully scatter/gather compatible. That is, you can search any query against any subset of the database (scatter the search against a database shard) and then combine results afterwards. Likewise, you can search multiple queries against the same database and combine results afterwards.
SBTs themselves are not readily amenable to searching subsets (shards). That's because of their hierarchical nature; the top node of an SBT contains a Bloom filter with all the k-mers in the entire database. So you would want to build separate subsets of the entire database into their own SBTs.
I don't know how to address the question of SBTs and threading - sourmash doesn't support multithreading in general, so it's kind of a moot point here? More generally, it's true that SBTs don't take any advantage of multiple threads in a search, but if you are searching with multiple queries you could do multiple queries in multiple threads or processes and parallelize that way.
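
The scatter/gather point above can be illustrated with a minimal Python sketch of the "gather" step: read the result CSVs produced by searching each shard separately, merge the rows, and rank them. The function name and the toy data are made up for illustration, and the `similarity` column name is an assumption about the CSV layout; adjust it to whatever score column your result files actually contain.

```python
import csv
import io


def gather_results(shard_csvs, score_column="similarity", top_n=None):
    """Merge per-shard result rows and rank them by score, best first."""
    rows = []
    for handle in shard_csvs:
        rows.extend(csv.DictReader(handle))
    rows.sort(key=lambda row: float(row[score_column]), reverse=True)
    return rows if top_n is None else rows[:top_n]


# Toy stand-ins for the CSV output of one query searched against two shards.
shard_a = io.StringIO("similarity,name\n0.91,genome_A\n0.40,genome_B\n")
shard_b = io.StringIO("similarity,name\n0.75,genome_C\n")
best = gather_results([shard_a, shard_b], top_n=2)
print([row["name"] for row in best])  # prints ['genome_A', 'genome_C']
```

Merging like this works because each match's score depends only on the query and the matched sketch, not on which other sketches happen to share its shard.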

On to some practical advice -

If you want to index just a subset of a large database, you can do that with picklists - see docs. Basically, you create one or more CSV files containing the names or identifiers for the subset you want, and then run sourmash index like so:

sourmash index subset.sbt.zip all_signatures.zip --picklist filename.csv:identCol:ident

where all_signatures.zip is the entire database and subset.sbt.zip is the subset SBT you want to build.
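
To build several shard SBTs this way, the picklist CSVs have to be generated first. Here is a minimal Python sketch of one possible way to do that, assuming you already have a list of identifiers that match the signatures in the database. The helper names, the round-robin split, and the example accessions are illustrative assumptions; only the `filename.csv:identCol:ident` picklist form comes from the command above.

```python
import csv
import io


def write_picklist(identifiers, handle, column="identCol"):
    """Write a one-column CSV usable as `--picklist <file>:identCol:ident`."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow([column])  # header row naming the picklist column
    for ident in identifiers:
        writer.writerow([ident])


def split_round_robin(identifiers, n_shards):
    """Deal identifiers into n_shards lists of near-equal size."""
    return [identifiers[i::n_shards] for i in range(n_shards)]


idents = ["GCF_000000001.1", "GCF_000000002.1", "GCF_000000003.1"]  # made-up accessions
shards = split_round_robin(idents, 2)
buffer = io.StringIO()
write_picklist(shards[0], buffer)
print(buffer.getvalue())
```

In practice each shard would be written to its own file (for example with `open("shard_0.csv", "w", newline="")`) and passed to a separate `sourmash index` run.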

Partly motivated by your interest, I made some progress this morning on a Rust-based manysearch plugin that builds on sourmash branchwater to do multithreaded searching - here are some stats:

| impl | time | memory | notes |
| --- | --- | --- | --- |
| sourmash search | 12m 43s | 3.67 GB | single genome x 65k |
| manysearch | 36s | 139 MB | 5 genomes x 65k; 32 threads |

It's not really ready for anyone but me to use yet, and there are a few drawbacks to it, but I will keep you in the loop in this issue as it matures!

(I'm working on it over here)

jianshu93 · 2023-02-21T03:55:45Z

Thanks for the info. I will try and report back.

Thanks,

Jianshu

ctb · 2023-09-04T00:08:55Z

hi @jianshu93 the pyo3_branchwater plugin is getting pretty mature - you might be interested in the manysearch and multisearch commands. In particular, you can do 80k x 80k genome comparisons in under 5 GB RAM in 90 minutes on 64 CPUs with multisearch. It's still got some inconveniences compared to the full sourmash CLI, but it's coming along!

jianshu93 · 2023-09-04T00:30:19Z

Hello Titus,
It is really nice to hear about the new commands, and I will definitely try them for 80K x 80K.

Thanks,

Jianshu

ctb · 2023-09-23T16:23:49Z

pyo3_branchwater now supports massively parallel sketching - e.g. all of GTDB rs17 in 40 minutes and 2.7 GB of RAM.

see sourmash-bio/sourmash_plugin_branchwater#122 and sourmash-bio/sourmash_plugin_branchwater#96 (comment) for some numbers.

I'm leaving this open because it's not integrated into sourmash yet, tho :). That's coming eventually!

jianshu93 · 2023-09-24T04:10:32Z

Hello Titus,

This is amazing news, and it seems it is time for me to run a real dataset, e.g. all NCBI/RefSeq genomes (318K). I will get back to you when I have some results.

Jianshu

ctb · 2023-09-24T13:44:13Z

great! please feel free to post questions here (in this issue tracker) since we monitor this more closely - and you can tag in @bluegenes if you like :)

ctb changed the title ~~sourish sketch one thread only~~ sourmash sketch uses one thread only Feb 4, 2023

ctb mentioned this issue Feb 23, 2023

Search a database with multiple genomes #2069

Open

ctb changed the title ~~sourmash sketch uses one thread only~~ sourmash sketch & search use one thread only Aug 2, 2023

ctb mentioned this issue Aug 2, 2023

sourmash plugins - ideas dumping ground #2453

Open
