
Running gather on extremely big index file #2095

Closed
trickovicmatija opened this issue Jun 22, 2022 · 11 comments
@trickovicmatija

trickovicmatija commented Jun 22, 2022

Hello,
Amazing software! I am new to the k-mer field, so excuse me if I make some mistakes.
My idea is to quantify the abundance of specific MAGs in metagenomic samples. For that, I have a custom MAG catalog, which contains multiple MAGs per species. I sketched the MAGs at the species level using sketch fromfile (getting one sig file per species). After that, I indexed all of those signatures together into one big (really big) .sbt.zip file. My idea is to use gather on that index file together with sketches of metagenomic samples in order to get abundances per MAG.
My problem is that it takes an extremely long time to do that. I understand that this is (most probably) abusing sourmash considering the size of the index, but I am wondering if I can somehow parallelize it myself. My idea is, instead of indexing all MAG signatures into one file, I could separate them (split all species-level signatures into n groups, and then run index n times). I am not sure how the abundance estimate would be affected by that? I understand that I would need to do some post-processing, but is it even possible?

Thanks!
Matija

@ctb
Contributor

ctb commented Jun 23, 2022

tl;dr some details on how many genomes and what parameters you're using would be very welcome :). but maybe we have some solutions to offer!

> Hello, Amazing software! I am new to the k-mer field, so excuse me if I make some mistakes.

well, if you start with "amazing software" we will forgive you most anything 😂

> My idea is to quantify the abundance of specific MAGs in metagenomic samples.

yes!

> For that, I have a custom MAG catalog, which contains multiple MAGs per species. I sketched the MAGs at the species level using sketch fromfile (got one sig file per species).

ok!

> After that, I indexed all of those signatures together into one big (really big) .sbt.zip file. My idea is to use gather on that index file together with sketches of metagenomic samples in order to get abundances per MAG. My problem is that it takes an extremely long time to do that.

hmm, that's not necessarily expected... what number of MAGs are we talking about here? 1,000? 100,000? and what parameters are you using (k-mer size, etc.) for building the sketches?

we have run metagenomes against 1.3m genomes (see dib-lab/2020-paper-sourmash-gather#47) but it does sometimes take a while...

> I understand that this is (most probably) abusing sourmash considering the size of the index, but I am wondering if I can somehow parallelize it myself. My idea is, instead of indexing all MAG signatures into one file, I could separate them (split all species-level signatures into n groups, and then run index n times). I am not sure how the abundance estimate would be affected by that? I understand that I would need to do some post-processing, but is it even possible?

this is definitely not abusing sourmash! we expect to scale to ~millions of genomes easily!

so, hmm, if I understand you, there's two ways you can do this.

hacky option

one is the slightly hacky way, which will give you slightly "incorrect" results (this might or might not be too bad, depending on how you split things up). To restate what I think you want to do -

  1. build a separate SBT for all MAGs at each genus or species level.
  2. run gather with the metagenome against each SBT.
  3. concatenate the gather results (using e.g. csvtk concat).

This should work, and the only problem is that if the k-mers in one index overlap with the k-mers in another index, those shared k-mers may be double-counted in the metagenome.

Or, to rephrase, for each k-mer in the metagenome that is shared with genomes present in separate SBTs, that k-mer will be assigned to at least two different genomes if gather is run separately.

this will happen VERY frequently if you split the genomes up by species, less frequently if you split them up by genus, and infrequently if you split the genomes up by family, because the k-mer overlaps will be close to zero.
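The double-counting effect can be illustrated with plain Python sets standing in for sketches (toy hash values and hypothetical genome names, not real sourmash data structures):

```python
# Toy illustration: k-mers shared between genomes in SEPARATE databases get
# counted twice; searched together, each k-mer is assigned exactly once.
metagenome = {1, 2, 3, 4, 5, 6}
genome_a = {1, 2, 3, 4}   # lives in database/SBT 1
genome_b = {3, 4, 5, 6}   # lives in database/SBT 2; shares k-mers 3 and 4 with A

# Searched together (gather's greedy assignment): each k-mer counted once.
remaining = set(metagenome)
assigned_together = {}
for name, genome in [("A", genome_a), ("B", genome_b)]:
    hit = remaining & genome
    assigned_together[name] = len(hit)
    remaining -= hit          # claimed k-mers are removed before the next match

# Searched separately (one gather run per database): k-mers 3 and 4 are
# credited to BOTH genomes, so the totals overcount the metagenome.
assigned_separately = {"A": len(metagenome & genome_a),
                       "B": len(metagenome & genome_b)}

print(assigned_together)    # sums to 6, the metagenome size
print(assigned_separately)  # sums to 8, overcounting the 2 shared k-mers
```

The overcount is exactly the size of the cross-database k-mer overlap, which is why splitting by family (near-zero overlap) is much safer than splitting by species.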

formally correct option

the formally correct alternative (where you would get identical results at the end, but potentially much faster) would be to do the following:

  • split the genomes up into N databases however you like
  • run sourmash prefetch <metagenome> <database-split-N> -o <database-split-N.csv>
  • concatenate the resulting <database-split-N.csv> files with csvtk concat
  • run sourmash gather --picklist <combined-prefetch.csv>::prefetch <metagenome> <database-split-*.zip> so that sourmash just uses the genomes that have overlaps

this last one is something I once automated using snakemake; see #1664.

the problem is that this formally correct solution may not be that much faster, depending on where the slowdown is; there is a known problem (that we haven't thoroughly explored) where gather seems to be quite slow on many very closely related genomes, ref #1771.
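The concatenation step in the recipe above can also be done with the Python stdlib instead of csvtk (the file contents and column names below are illustrative; a real sourmash prefetch CSV has more columns than shown):

```python
# Sketch: concatenate per-split prefetch CSVs that share a header,
# keeping the header line only once (what `csvtk concat` does).
import csv
import io

def concat_csvs(csv_texts):
    """Merge CSV texts with identical headers into one CSV string."""
    rows, fieldnames = [], None
    for text in csv_texts:
        reader = csv.DictReader(io.StringIO(text))
        if fieldnames is None:
            fieldnames = reader.fieldnames
        rows.extend(reader)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()

# Two toy per-split prefetch results (hypothetical names and values):
split1 = "match_name,intersect_bp\ngenomeA,50000\n"
split2 = "match_name,intersect_bp\ngenomeB,15000\n"
combined = concat_csvs([split1, split2])
print(combined)
```

The combined file is then what gets handed to `--picklist <combined-prefetch.csv>::prefetch`.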

@trickovicmatija
Author

Hey,
Thanks for the detailed answer!
Regarding the details, I have ~170k MAGs. Since I want to retain intra-species variability, I have multiple MAGs per species. The idea is to make the k-mers (more or less) MAG-specific, so I am using longer k-mers (41-51), and also trying both scaled=1000 and scaled=100. I understand that lowering the scaled value would increase the final index size by ~10x, but I want to be sure that I can actually get (again, more or less) MAG-level resolution.
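For scale, a rough back-of-the-envelope on those parameters (the ~5 Mbp genome size is an assumption for illustration, not a number from this thread; FracMinHash keeps roughly 1/scaled of the distinct k-mers):

```python
# Rough expected sketch size per MAG under FracMinHash downsampling:
# a sketch keeps about (number of distinct k-mers) / scaled hashes.
genome_bp = 5_000_000        # assumed typical MAG size (hypothetical)
k = 41
n_kmers = genome_bp - k + 1  # upper bound on distinct k-mers in the genome

for scaled in (1000, 100):
    expected_hashes = n_kmers // scaled
    print(f"scaled={scaled}: ~{expected_hashes:,} hashes per MAG")
```

Multiplied across ~170k MAGs, the scaled=100 choice means on the order of 10x more hashes to index and compare than scaled=1000, which is consistent with the slowdowns reported below.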

Thanks for the tips. I will try first the option with prefetch, to see if I get any improvements. If not, I will try the other one. Since I am using a bit longer kmers, I suppose they should be specific on family level. I am also using some custom synthetic samples with known composition, so I will benchmark both approaches.

I will write back with the results.
Thanks!

@trickovicmatija
Author

trickovicmatija commented Jul 5, 2022

Hello again,
I have tried the things you suggested. Initially I tried picklists, but that didn't give much of an improvement. I then proceeded to make batches of MAGs at the family level. Still, gather isn't finishing, even for families with 1-3 MAGs.
My commands:
Params: kmer_len=41, scaled=100
I concatenate the paired-end reads with BBMap's reformat.sh script before running trim-low-abund.py.

```
sourmash sketch fromfile {txt file with paths to MAGs on family level} -p dna,k={params.kmer_len},scaled={params.scaled} -o {output.family_batch}
trim-low-abund.py -C 3 -Z 18 -V -M {params.mem} -T {params.tmp_dir} --gzip -o {params.tmp_dir}/{wildcards.sample}_abundtrim.fastq.gz {params.tmp_dir}/{wildcards.sample}.fastq.gz
sourmash sketch dna -p k={params.kmer_len},abund,scaled={params.scaled} -o {output.sketch} {params.tmp_dir}/{wildcards.sample}_abundtrim.fastq.gz
sourmash gather -k {params.kmer_len} -o {output} {input.sketched_reads} {input.family_batch}
```

I understand that I am using scaled=100, but I'd guess it should still finish for those small families.

Do you see something obviously wrong with my workflow? If not, I guess it is something due to the formatting of my MAGs and/or sample composition.

Thanks,
Matija

@ctb
Contributor

ctb commented Jul 5, 2022

quick question - have you been able to use sourmash prefetch successfully? Same command line syntax as sourmash gather. If that runs in reasonable time, it would tell us something useful.

@trickovicmatija
Author

Yeah, I tried that too. For a batch of 100 MAGs it takes ~50min.

@ctb
Contributor

ctb commented Jul 5, 2022

yeesh, that's terrible... does it go faster for higher --scaled values? (you should be able to specify --scaled=1000 on the command line and have it downsample on the fly)
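The on-the-fly downsampling works because FracMinHash sketches are nested: a sketch keeps a hash h iff h falls below H/scaled for hash range H, so a scaled=1000 sketch is exactly the subset of a scaled=100 sketch below a tighter threshold. A minimal sketch of the idea with random stand-in hash values (not the actual sourmash implementation):

```python
# FracMinHash keeps a hash h iff h < H // scaled. Downsampling from
# scaled=100 to scaled=1000 just tightens that threshold, so it can be
# applied to an existing sketch without re-reading the sequences.
import random

H = 2**64  # range of the 64-bit hash function

def downsample(hashes, scaled):
    max_hash = H // scaled
    return {h for h in hashes if h < max_hash}

random.seed(0)
all_hashes = {random.randrange(H) for _ in range(100_000)}  # stand-in k-mer hashes

sk100 = downsample(all_hashes, 100)    # what a scaled=100 sketch would keep
sk1000 = downsample(all_hashes, 1000)  # what a scaled=1000 sketch would keep

assert sk1000 <= sk100                     # nested: coarser sketch is a subset
assert downsample(sk100, 1000) == sk1000   # downsampling the sketch is lossless
print(len(sk100), len(sk1000))             # roughly 1000 vs roughly 100
```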

@ctb
Contributor

ctb commented Jul 18, 2022

hi @trickovicmatija there are some performance improvements coming in #2123, and I have some other ideas, too. Will keep you updated.

@ctb
Contributor

ctb commented Jul 21, 2022

hello @trickovicmatija we've just released sourmash v4.4.2, which contains #2123 - this is a substantial speedup in some circumstances. Our next release (not yet scheduled) will contain #2132, which is a further speedup of some different code.

benchmark results

using the command in #2123,

with v4.4.1: 205.90s
with v4.4.2: 67.12s
latest branch, next release: 46.15s

Hopefully this helps you as well! Let me know if you get a chance to try v4.4.2, which is now available via pip and conda!

@trickovicmatija
Author

Hey,
Thank you very much for the notification!
In the end, I am using scaled=1000, with --threshold-bp=0 for prefetch and --threshold-bp=15000 for gather afterwards. For now, I am getting nice results quantifying the same MAGs I used to build the reference (a collection of family-level SBT files). Now I am creating artificial fastq files with MAGs not in the reference, and I will report the results.
I will give the new version a chance ASAP, since I don't need to change any of my code to use it :).

Thanks again for notifying me,
Matija

@ctb
Contributor

ctb commented Jul 25, 2022

really glad to hear it :). v4.4.3 is almost out - it includes #2132 - and I'll close this when it's up on conda. (You can pip install it now, but conda-forge and bioconda take a bit longer.)

@ctb
Contributor

ctb commented Jul 26, 2022

hi @trickovicmatija, sourmash v4.4.3 is available for your installation pleasure. Enjoy, and please let us know if you see any speedups, or run into show-stopping speed problems elsewhere!
