
Running gather on extremely big index file #2095

Closed
trickovicmatija opened this issue Jun 22, 2022 · 11 comments
@trickovicmatija

trickovicmatija commented Jun 22, 2022

Hello,
Amazing software! I am new to the k-mer field, so excuse me if I make some mistakes.
My idea is to quantify the abundance of specific MAGs in metagenomic samples. For that, I have a custom MAG catalog, which contains multiple MAGs per species. I sketched the MAGs at the species level using sketch fromfile (getting one sig file per species). After that, I indexed all of those signatures together into one big (really big) .sbt.zip file. My idea is to use gather on that index file together with sketches of metagenomic samples in order to get abundances per MAG.
My problem is that it takes an extremely long time to do that. I understand that this is (most probably) abusing sourmash considering the size of the index, but I am wondering if I can somehow parallelize it myself. My idea is, instead of indexing all MAG signatures into one file, I could separate them (split all species-level signatures into n groups, and then run index n times). I am not sure how the abundance estimate would be affected by that? I understand that I would need to do some post-processing, but is it even possible?

Thanks!
Matija

@ctb
Contributor

ctb commented Jun 23, 2022

tl;dr some details on how many genomes and what parameters you're using would be very welcome :). but maybe we have some solutions to offer!

> Hello, Amazing software! I am new to the k-mer field, so excuse me if I make some mistakes.

well, if you start with "amazing software" we will forgive you most anything 😂

> My idea is to quantify the abundance of specific MAGs in metagenomic samples.

yes!

> For that, I have a custom MAG catalog, which contains multiple MAGs per species. I sketched the MAGs at the species level using sketch fromfile (got one sig file per species).

ok!

> After that, I indexed all of those signatures together into one big (really big) .sbt.zip file. My idea is to use gather on that index file together with sketches of metagenomic samples in order to get abundances per MAG. My problem is that it takes an extremely long time to do that.

hmm, that's not necessarily expected... what number of MAGs are we talking about here? 1,000? 100,000? and what parameters are you using (k-mer size, etc.) for building the sketches?

we have run metagenomes against 1.3m genomes (see dib-lab/2020-paper-sourmash-gather#47) but it does sometimes take a while...

> I understand that this is (most probably) abusing sourmash considering the size of the index, but I am wondering if I can somehow parallelize it myself. My idea is, instead of indexing all MAG signatures into one file, I could separate them (split all species-level signatures into n groups, and then run index n times). I am not sure how the abundance estimate would be affected by that? I understand that I would need to do some post-processing, but is it even possible?

this is definitely not abusing sourmash! we expect to scale to ~millions of genomes easily!

so, hmm, if I understand you, there's two ways you can do this.

hacky option

one is the slightly hacky way, which will give you slightly "incorrect" results (this might or might not be too bad, depending on how you split things up). To restate what I think you want to do -

  1. build a separate SBT for all MAGs at each genus or species level.
  2. run gather with the metagenome against each SBT.
  3. concatenate the gather results (using e.g. csvtk concat).

This should work, and the only problem is that if the k-mers in one index overlap with the k-mers in another index, those shared k-mers may be double-counted in the metagenome.

Or, to rephrase, for each k-mer in the metagenome that is shared with genomes present in separate SBTs, that k-mer will be assigned to at least two different genomes if gather is run separately.

this will happen VERY frequently if you split the genomes up by species, less frequently if you split them up by genus, and infrequently if you split the genomes up by family, because the k-mer overlaps will be close to zero.
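The double-counting effect can be illustrated with plain Python sets standing in for sketches (toy hash values and hypothetical genome names, not real sourmash data structures):

```python
# Toy illustration: k-mers shared between genomes in SEPARATE databases get
# counted twice; searched together, each k-mer is assigned exactly once.
metagenome = {1, 2, 3, 4, 5, 6}
genome_a = {1, 2, 3, 4}   # lives in database/SBT 1
genome_b = {3, 4, 5, 6}   # lives in database/SBT 2; shares k-mers 3 and 4 with A

# Searched together (gather's greedy assignment): each k-mer counted once.
remaining = set(metagenome)
assigned_together = {}
for name, genome in [("A", genome_a), ("B", genome_b)]:
    hit = remaining & genome
    assigned_together[name] = len(hit)
    remaining -= hit          # claimed k-mers are removed before the next match

# Searched separately (one gather run per database): k-mers 3 and 4 are
# credited to BOTH genomes, so the totals overcount the metagenome.
assigned_separately = {"A": len(metagenome & genome_a),
                       "B": len(metagenome & genome_b)}

print(assigned_together)    # sums to 6, the metagenome size
print(assigned_separately)  # sums to 8, overcounting the 2 shared k-mers
```

The overcount is exactly the size of the cross-database k-mer overlap, which is why splitting by family (near-zero overlap) is much safer than splitting by species.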

formally correct option

the formally correct alternative (where you would get identical results at the end, but potentially much faster) would be to do the following:

  • split the genomes up into N databases however you like
  • run sourmash prefetch <metagenome> <database-split-N> -o <database-split-N.csv>
  • concatenate the resulting <database-split-N.csv> files with csvtk concat
  • run sourmash gather --picklist <combined-prefetch.csv>::prefetch <metagenome> <database-split-*.zip> so that sourmash just uses the genomes that have overlaps

this last one is something I once automated using snakemake; see #1664.

the problem is that this formally correct solution may not be that much faster, depending on where the slowdown is; there is a known problem (that we haven't thoroughly explored) where gather seems to be quite slow on many very closely related genomes, ref #1771.
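The concatenation step in the recipe above can also be done with the Python stdlib instead of csvtk (the file contents and column names below are illustrative; a real sourmash prefetch CSV has more columns than shown):

```python
# Sketch: concatenate per-split prefetch CSVs that share a header,
# keeping the header line only once (what `csvtk concat` does).
import csv
import io

def concat_csvs(csv_texts):
    """Merge CSV texts with identical headers into one CSV string."""
    rows, fieldnames = [], None
    for text in csv_texts:
        reader = csv.DictReader(io.StringIO(text))
        if fieldnames is None:
            fieldnames = reader.fieldnames
        rows.extend(reader)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()

# Two toy per-split prefetch results (hypothetical names and values):
split1 = "match_name,intersect_bp\ngenomeA,50000\n"
split2 = "match_name,intersect_bp\ngenomeB,15000\n"
combined = concat_csvs([split1, split2])
print(combined)
```

The combined file is then what gets handed to `--picklist <combined-prefetch.csv>::prefetch`.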

@trickovicmatija
Author

Hey,
Thanks for the detailed answer!
Regarding the details, I have ~170k MAGs. Since I want to retain intra-species variability, I have multiple MAGs per species. The idea is to make the k-mers (more or less) MAG-specific, so I am using longer k-mers (41-51), and also trying both scaled=1000 and scaled=100. I understand that lowering the scaled value would increase the final index size by ~10x, but I want to be sure that I can actually get (again, more or less) MAG-level resolution.
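For scale, a rough back-of-the-envelope on those parameters (the ~5 Mbp genome size is an assumption for illustration, not a number from this thread; FracMinHash keeps roughly 1/scaled of the distinct k-mers):

```python
# Rough expected sketch size per MAG under FracMinHash downsampling:
# a sketch keeps about (number of distinct k-mers) / scaled hashes.
genome_bp = 5_000_000        # assumed typical MAG size (hypothetical)
k = 41
n_kmers = genome_bp - k + 1  # upper bound on distinct k-mers in the genome

for scaled in (1000, 100):
    expected_hashes = n_kmers // scaled
    print(f"scaled={scaled}: ~{expected_hashes:,} hashes per MAG")
```

Multiplied across ~170k MAGs, the scaled=100 choice means on the order of 10x more hashes to index and compare than scaled=1000, which is consistent with the slowdowns reported below.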

Thanks for the tips. I will try first the option with prefetch, to see if I get any improvements. If not, I will try the other one. Since I am using a bit longer kmers, I suppose they should be specific on family level. I am also using some custom synthetic samples with known composition, so I will benchmark both approaches.

I will write back with the results.
Thanks!

@trickovicmatija
Author

trickovicmatija commented Jul 5, 2022

Hello again,
I have tried the things you suggested. Initially I tried picklists, but that didn't give much of an improvement. I then proceeded to make batches of MAGs at the family level. Still, gather isn't finishing, even for families with 1-3 MAGs.
My commands:
Params: kmer_len=41, scaled=100
I concatenate the paired-end reads with BBMap's reformat.sh script before running trim-low-abund.py.

```
sourmash sketch fromfile {txt file with paths to MAGs on family level} -p dna,k={params.kmer_len},scaled={params.scaled} -o {output.family_batch}
trim-low-abund.py -C 3 -Z 18 -V -M {params.mem} -T {params.tmp_dir} --gzip -o {params.tmp_dir}/{wildcards.sample}_abundtrim.fastq.gz {params.tmp_dir}/{wildcards.sample}.fastq.gz
sourmash sketch dna -p k={params.kmer_len},abund,scaled={params.scaled} -o {output.sketch} {params.tmp_dir}/{wildcards.sample}_abundtrim.fastq.gz
sourmash gather -k {params.kmer_len} -o {output} {input.sketched_reads} {input.family_batch}
```

I understand that I am using scaled=100, but I'd guess it should still finish for those small families.

Do you see something obviously wrong with my workflow? If not, I guess it is something due to the formatting of my MAGs and/or sample composition.

Thanks,
Matija

@ctb
Contributor

ctb commented Jul 5, 2022

quick question - have you been able to use sourmash prefetch successfully? Same command line syntax as sourmash gather. If that runs in reasonable time, it would tell us something useful.

@trickovicmatija
Author

Yeah, I tried that too. For a batch of 100 MAGs it takes ~50min.

@ctb
Contributor

ctb commented Jul 5, 2022

yeesh, that's terrible... does it go faster for higher --scaled values? (you should be able to specify --scaled=1000 on the command line and have it downsample on the fly)
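The on-the-fly downsampling works because FracMinHash sketches are nested: a sketch keeps a hash h iff h falls below H/scaled for hash range H, so a scaled=1000 sketch is exactly the subset of a scaled=100 sketch below a tighter threshold. A minimal sketch of the idea with random stand-in hash values (not the actual sourmash implementation):

```python
# FracMinHash keeps a hash h iff h < H // scaled. Downsampling from
# scaled=100 to scaled=1000 just tightens that threshold, so it can be
# applied to an existing sketch without re-reading the sequences.
import random

H = 2**64  # range of the 64-bit hash function

def downsample(hashes, scaled):
    max_hash = H // scaled
    return {h for h in hashes if h < max_hash}

random.seed(0)
all_hashes = {random.randrange(H) for _ in range(100_000)}  # stand-in k-mer hashes

sk100 = downsample(all_hashes, 100)    # what a scaled=100 sketch would keep
sk1000 = downsample(all_hashes, 1000)  # what a scaled=1000 sketch would keep

assert sk1000 <= sk100                     # nested: coarser sketch is a subset
assert downsample(sk100, 1000) == sk1000   # downsampling the sketch is lossless
print(len(sk100), len(sk1000))             # roughly 1000 vs roughly 100
```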

@ctb
Contributor

ctb commented Jul 18, 2022

hi @trickovicmatija there are some performance improvements coming in #2123, and I have some other ideas, too. Will keep you updated.

@ctb
Contributor

ctb commented Jul 21, 2022

hello @trickovicmatija we've just released sourmash v4.4.2, which contains #2123 - this is a substantial speedup in some circumstances. Our next release (not yet scheduled) will contain #2132, which is a further speedup of some different code.

benchmark results

using the command in #2123,

with v4.4.1: 205.90s
with v4.4.2: 67.12s
latest branch, next release: 46.15s

Hopefully this helps you as well! Let me know if you get a chance to try v4.4.2, which is now available via pip and conda!

@trickovicmatija
Author

Hey,
Thank you very much for the notification!
In the end, I am using scaled=1000, with --threshold-bp=0 for prefetch and --threshold-bp=15000 for gather afterwards. For now, I am getting nice results quantifying the same MAGs I used to build the reference (a collection of family-level SBT files). Now I am creating artificial fastq files with MAGs not in the reference, and I will report the results.
I will give the new version a chance ASAP, since I don't need to change any of my code to use it :).

Thanks again for notifying me,
Matija

@ctb
Contributor

ctb commented Jul 25, 2022

really glad to hear it :). v4.4.3 is almost out - it includes #2132 - and I'll close this when it's up on conda. (You can pip install it now, but conda-forge and bioconda take a bit longer.)

@ctb
Contributor

ctb commented Jul 26, 2022

hi @trickovicmatija, sourmash v4.4.3 is available for your installation pleasure. Enjoy, and please let us know if you see any speedups, or run into show-stopping speed problems elsewhere!
