sourmash sketch & search use one thread only #2458

Open
jianshu93 opened this issue Feb 4, 2023 · 15 comments

@jianshu93

Dear Sourmash team,

I want to create sketches for all GTDB genomes, and I am using the following command, based on the tutorial (I want one sketch per FASTA file):

time sourmash sketch dna -p k=16,noabund --from-file ./gtdb_v207_name.txt -o ./gtdb_v207_sourmash

gtdb_v207_name.txt lists the paths of all GTDB genome files. However, I noticed that sourmash always uses only one thread to sketch all the files. Is this the default behavior, or do we need to parallelize at the task level ourselves, e.g. with the parallel command, to use all cores/threads?

Thanks,

Jianshu

@ctb changed the title from "sourish sketch one thread only" to "sourmash sketch uses one thread only" on Feb 4, 2023
@ctb
Contributor

ctb commented Feb 4, 2023

hi @jianshu93 yes, 'tis true!

right now there are two suggested solutions -

I've also built a simple plugin, sketchall to do it, but it's not really ready for anyone to use just yet 😓 - the plugin framework isn't released in any versions of sourmash yet, in particular!

tl;dr parallel should work great!
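
For readers who prefer to stay in Python rather than use GNU parallel, the same process-level parallelism can be sketched roughly as below. This is only an illustration, not part of sourmash: the helper names (`build_sketch_cmd`, `read_paths`, `sketch_all`), the one-`.sig`-file-per-genome output layout, and the default worker count are all assumptions. It only shells out to the `sourmash sketch dna -p ... -o ...` command form used earlier in this thread.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def build_sketch_cmd(fasta_path, out_dir, params="k=16,noabund"):
    """Build the command that sketches one FASTA file into its own .sig file.

    The output naming scheme (<fasta name>.sig) is an assumption for this sketch.
    """
    out_path = Path(out_dir) / (Path(fasta_path).name + ".sig")
    return ["sourmash", "sketch", "dna", "-p", params,
            str(fasta_path), "-o", str(out_path)]


def read_paths(list_file):
    """Read one genome path per line, skipping blank lines."""
    with open(list_file) as fh:
        return [line.strip() for line in fh if line.strip()]


def sketch_all(list_file, out_dir, workers=8):
    """Run one sourmash process per genome, at most `workers` at a time."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    cmds = [build_sketch_cmd(p, out_dir) for p in read_paths(list_file)]
    # The threads only wait on child processes, so the GIL is not a bottleneck.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda c: subprocess.run(c).returncode, cmds))
```

A call such as `sketch_all("./gtdb_v207_name.txt", "./gtdb_v207_sourmash", workers=24)` would return one exit code per genome. Because every process writes its own file, this sidesteps the single-output-file problem rather than solving it.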

some backstory

The main blocker for me adding this into sourmash sketch has just been this issue: #1911 - we don't have a good multiprocess/multithread way to write sketches to a single file, and I am not enthusiastic about writing up something more clever (multiple consumers, one producer).

Also relevant: #1703 - not sure what's going on here!

@jianshu93
Author

Hello Prof. C. Titus Brown,

Thanks for the quick response; it is helpful. I have no problems running sketch via parallel. However, the search command (after indexing the database, which is very fast: 20 minutes for all GTDB genomes) is also not parallelized, meaning that when searching multiple queries I still have to use parallel to run multiple searches. I am curious: compared to searching the database in parallel (even for one query), task-level parallelism will be slower, right? We need to initialize 8000 jobs to search 8000 queries. Also, because processes cannot share memory with each other, we need (number of threads) * (database size) memory to search (number of threads) genomes at once. An SBT should be easy to search in parallel, right, since it is essentially a tree-like structure?

Thanks,

Jianshu

@ctb
Contributor

ctb commented Feb 5, 2023

fantastic - glad the sketch stuff worked out!

Please see #2071 re our previous answer on search parallelization!

The short version is:

  • We have a couple different technologies we've been trying out for parallel search;
  • Different database types have very different search performance, so you get to "pick" your problem - I/O? or memory?
  • for now, the most simple and reliable and robust is probably still snakemake or parallel - i.e. process-level parallelism. If you're doing small query sketches against many small sketches, e.g. 8000 genomes against GTDB, then the memory and I/O considerations aren't too bad that way.

There are other technologies coming along but we don't have them at a good level, I'm afraid!

@jianshu93
Author

Hello Prof. C. Titus Brown,

A single-query search takes 4.20 minutes. I use GNU parallel for process-level parallelism (launching multiple jobs), which is much slower and requires much more memory: for example, searching 24 queries at the same time needs 4.5G * 24 = 108G. It takes about 20 hours to search 8000 queries against GTDB. Is this normal, or am I missing something?

Thanks,

Jianshu
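
As a rough check on the figures in this comment, the cost of process-level parallelism can be estimated with simple arithmetic. The sketch below is a back-of-the-envelope model only: it assumes that every process holds its own full copy of the data, that queries are spread evenly across jobs, and that startup overhead and I/O contention are zero. The function name is made up for illustration.

```python
def process_parallel_cost(n_queries, minutes_per_query, gb_per_process, n_jobs):
    """Estimate wall-clock hours and peak memory (GB) for independent processes."""
    # Each process loads its own copy of the data, so memory scales with n_jobs.
    peak_gb = gb_per_process * n_jobs
    # Queries are spread evenly across jobs; overhead and contention are ignored.
    hours = n_queries * minutes_per_query / n_jobs / 60
    return hours, peak_gb


# Figures quoted in this thread: 8000 queries, 4.2 min each, 4.5 GB, 24 jobs.
hours, peak_gb = process_parallel_cost(8000, 4.2, 4.5, 24)
print(f"{hours:.1f} h, {peak_gb:.0f} GB")  # prints "23.3 h, 108 GB"
```

Under these assumptions the estimate is in the same range as the reported "about 20 hours", which suggests the run time is mostly search time rather than job startup cost.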

@ctb
Contributor

ctb commented Feb 6, 2023

hi @jianshu93 per #1958, this sounds about right; those benchmarks are not for entire GTDB, but the numbers align with my expectations!

You could potentially speed things up (while reducing sensitivity a bit) by using --scaled=10000. sqldb would also support faster search, but at the cost of more memory and a LOT more disk space.

Thanks for reporting this! Gives us some targets!

@jianshu93
Author

Hello Prof. C. Titus Brown,

I find this paper very interesting, published recently: https://dl.acm.org/doi/abs/10.1145/3448016.3457333

It is not an SBT, but it seems to beat SBTs in many ways (N^(1/2) * log(N), a very good sublinear algorithm). I am not aware of any Rust implementation of this data structure, though.

Thanks,

Jianshu

@ctb
Contributor

ctb commented Feb 6, 2023

thank you!

(Fast Processing and Querying of 170TB of Genomics Data via a Repeated And Merged BloOm Filter (RAMBO))

also ref #1110, #545

@jianshu93
Author

Hello Prof. C. Titus Brown,
I used sourmash index to index all NCBI prokaryotic assemblies/genomes with the same sketch step as above, that is, everything in RefSeq + GenBank, about 300k genomes in total. The index size is about 15G. If I want to search 24 queries at a time, I will need 24 * 15 = 360G, which is quite a lot for only 24 queries (I have 24 threads). Is there a way to reduce this somehow? For example, I could split the database into pieces, search each piece, collect the results from each piece, and sort them by output distance or something similar. It seems to take some time to split the database, though. Is there a better way to automate this process? I think giving all queries access to the database at the same time is quite important for reducing memory, that is, searching the database in parallel. The RAMBO paper mentions that the SBT was designed for a single thread, which is the bottleneck. It is still the bottleneck now, right?

Thanks,

Jianshu

@ctb
Contributor

ctb commented Feb 20, 2023

hi @jianshu93,

responses to a few of your questions - just remind me if I missed something!

  • SBT search uses lazy loading from disk, so the memory usage is not related to the size of the SBT on disk.
  • search is, generally speaking, fully scatter/gather compatible. That is, you can search any query against any subset of the database (scatter the search against a database shard) and then combine results afterwards. Likewise, you can search multiple queries against the same database and combine results afterwards.
  • SBTs themselves are not readily amenable to searching subsets (shards). That's because of their hierarchical nature; the top node of an SBT contains a Bloom filter with all the k-mers in the entire database. So you would want to build separate subsets of the entire database into their own SBTs.
  • I don't know how to address the question of SBTs and threading - sourmash doesn't support multithreading in general, so it's kind of a moot point here? More generally, it's true that SBTs don't take any advantage of multiple threads in a search, but if you are searching with multiple queries you could do multiple queries in multiple threads or processes and parallelize that way.
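
The scatter/gather point in the list above can be illustrated with a small gather step: search each database shard separately, then merge the per-shard result tables and re-rank. The sketch below is an assumption-laden example, not sourmash code. It assumes each shard's results are available as CSV text with a `similarity` column (the column name is a parameter in case it differs), and the genome names are made up.

```python
import csv
import io


def gather_results(shard_csvs, score_column="similarity", top_n=None):
    """Merge per-shard result tables and rank all hits by score, best first."""
    rows = []
    for text in shard_csvs:
        # Each shard was searched independently; just pool the rows.
        rows.extend(csv.DictReader(io.StringIO(text)))
    rows.sort(key=lambda r: float(r[score_column]), reverse=True)
    return rows[:top_n] if top_n is not None else rows


# Hypothetical results from two shards of the same database.
shard_a = "similarity,name\n0.91,genome_A\n0.40,genome_B\n"
shard_b = "similarity,name\n0.75,genome_C\n"
merged = gather_results([shard_a, shard_b])
print([r["name"] for r in merged])  # prints ['genome_A', 'genome_C', 'genome_B']
```

Because each hit's score depends only on the query and that one database entry, merging after the fact gives the same ranking as a single search over the whole database would, as long as any per-search thresholds are applied consistently.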

On to some practical advice -

If you want to index just a subset of a large database, you can do that with picklists - see docs. Basically, you create one or more CSV files containing the names or identifiers for the subset you want, and then run sourmash index like so:

sourmash index subset.sbt.zip all_signatures.zip --picklist filename.csv:identCol:ident

where all_signatures.zip is the entire database and subset.sbt.zip is the subset SBT you want to build.
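
One way to produce the picklist CSV files for several shards is sketched below. It is illustrative only: the function names, the `shard_N.csv` file names, and the round-robin split are assumptions. The column header `identCol` simply matches the column name used in the command above.

```python
import csv
from pathlib import Path


def split_into_shards(idents, n_shards):
    """Deal identifiers round-robin into n_shards roughly equal lists."""
    return [idents[i::n_shards] for i in range(n_shards)]


def write_picklists(idents, n_shards, out_dir, column="identCol"):
    """Write one single-column CSV per shard and return the file paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, shard in enumerate(split_into_shards(idents, n_shards)):
        path = out_dir / f"shard_{i}.csv"
        with open(path, "w", newline="") as fh:
            writer = csv.writer(fh)
            writer.writerow([column])  # header row names the picklist column
            writer.writerows([ident] for ident in shard)
        paths.append(path)
    return paths
```

Each resulting file could then be passed to a separate `sourmash index` run in the same form as above, e.g. `--picklist shard_0.csv:identCol:ident`, giving one SBT per shard.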

Partly motivated by your interest, I made some progress this morning on a Rust-based manysearch plugin that builds on sourmash branchwater to do multithreaded searching - here are some stats:

| impl | time | memory | notes |
| --- | --- | --- | --- |
| sourmash search | 12m 43s | 3.67 GB | single genome x 65k |
| manysearch | 36s | 139 MB | 5 genomes x 65k; 32 threads |

It's not really ready for anyone but me to use yet, and there are a few drawbacks to it, but I will keep you in the loop in this issue as it matures!
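
Since the two rows of the stats above cover different numbers of queries and threads, a per-query comparison may be easier to read. The arithmetic below uses only the numbers quoted in the stats; the per-thread figure is a crude normalization that assumes all 32 threads were fully used.

```python
def seconds_per_query(total_seconds, n_queries):
    """Average wall-clock seconds spent per query genome."""
    return total_seconds / n_queries


search = seconds_per_query(12 * 60 + 43, 1)  # sourmash search: 763 s for 1 genome
many = seconds_per_query(36, 5)              # manysearch: 36 s for 5 genomes
speedup = search / many
per_thread = speedup / 32                    # crude: assumes ideal use of 32 threads
print(f"{speedup:.0f}x overall, {per_thread:.1f}x per thread")
```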

(I'm working on it over here)

@jianshu93
Author

Thanks for the info. I will try and report back.

Thanks,

Jianshu

@ctb changed the title from "sourmash sketch uses one thread only" to "sourmash sketch & search use one thread only" on Aug 2, 2023
@ctb
Contributor

ctb commented Sep 4, 2023

hi @jianshu93 the pyo3_branchwater plugin is getting pretty mature - you might be interested in the manysearch and multisearch commands. In particular, you can do 80k x 80k genome comparisons in under 5 GB RAM in 90 minutes on 64 CPUs with multisearch. It's still got some inconveniences compared to the full sourmash CLI, but it's coming along!

@jianshu93
Author

Hello Titus,
It is really nice to hear about the new commands, and I will definitely try them for 80K x 80K.

Thanks,

Jianshu

@ctb
Contributor

ctb commented Sep 23, 2023

pyo3_branchwater now supports massively parallel sketching - e.g. all of GTDB rs17 in 40 minutes and 2.7 GB of RAM.

see sourmash-bio/sourmash_plugin_branchwater#122 and sourmash-bio/sourmash_plugin_branchwater#96 (comment) for some numbers.

I'm leaving this open because it's not integrated into sourmash yet, tho :). That's coming eventually!

@jianshu93
Author

Hello Titus,

This is amazing news, and it seems it is time for me to run some real datasets, e.g., the entire set of NCBI/RefSeq genomes (318K). I will get back to you when I have some results.

Jianshu

@ctb
Contributor

ctb commented Sep 24, 2023

great! please feel free to post questions here (in this issue tracker) since we monitor this more closely - and you can tag in @bluegenes if you like :)
