Running gather on extremely big index file #2095
tl;dr some details on how many genomes and what parameters you're using would be very welcome :). but maybe we have some solutions to offer!
well, if you start with "amazing software" we will forgive you most anything 😂
yes!
ok!
hmm, that's not necessarily expected... what number of MAGs are we talking about here? 1,000? 100,000? and what parameters are you using (k-mer size, etc.) for building the sketches? we have run metagenomes against 1.3m genomes (see dib-lab/2020-paper-sourmash-gather#47), but it does sometimes take a while...
this is definitely not abusing sourmash! we expect to scale to ~millions of genomes easily! so, hmm, if I understand you, there are two ways you can do this.

Hacky option: one is the slightly hacky way that will give you slightly "incorrect" results (which might or might not be too bad, depending on how you split things up). To restate what I think you want to do -

This should work, and the only problem is that if the k-mers in one index overlap with the k-mers in another index, those shared k-mers may be double-counted in the metagenome. Or, to rephrase: for each k-mer in the metagenome that is shared with genomes present in separate SBTs, that k-mer will be assigned to at least two different genomes if gather is run separately on each SBT. This will happen VERY frequently if you split the genomes up by species, less frequently if you split them up by genus, and infrequently if you split the genomes up by family, because at the family level the k-mer overlaps will be close to zero.

Formally correct option: the formally correct alternative (where you would get identical results at the end, but potentially much faster) would be to do the following:
this last one is something I once automated using snakemake; see #1664. the problem is that this formally correct solution may not be that much faster, depending on where the slowdown is; there is a known problem (that we haven't thoroughly explored) where ...
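For the prefetch-based route described above, here is a minimal sketch of how it could be wired up - an assumption-laden illustration rather than the exact recipe from #1664: the metagenome sketch name (reads.sig), the split index names (family*.sbt.zip), and the k-mer size are all placeholders. (The hacky option would simply be running gather separately against each split index and accepting some double counting.)

```bash
# Sketch only: reads.sig, family*.sbt.zip, and k=31 are placeholders.

# 1. Find overlapping genomes in each index separately with prefetch;
#    prefetch only reports overlaps, so nothing is assigned (or double-counted) yet.
for db in family*.sbt.zip; do
    sourmash prefetch reads.sig "$db" -k 31 -o "prefetch.${db%.sbt.zip}.csv"
done

# 2. Combine the per-index prefetch CSVs into a single picklist,
#    keeping only one header line (uses GNU tail's -q).
head -n 1 "$(ls prefetch.family*.csv | head -n 1)" > combined_matches.csv
tail -q -n +2 prefetch.family*.csv >> combined_matches.csv

# 3. Run gather once across all the indexes, restricted to the prefetch
#    matches via the picklist, to get one consistent set of assignments.
sourmash gather reads.sig family*.sbt.zip \
    --picklist combined_matches.csv:match_md5:md5 \
    -k 31 -o gather.csv
```

If the slow step is gather's iterative assignment rather than the database search, this split-prefetch approach may not help much, which is the caveat noted above.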
Hey, thanks for the tips. I will first try the option with prefetch, to see if I get any improvements; if not, I will try the other one. Since I am using somewhat longer k-mers, I suppose they should be specific at the family level. I am also using some custom synthetic samples with known composition, so I will benchmark both approaches. I will write back with the results.
Hello again,

These are the commands I am running (with Snakemake-style placeholders):

sourmash sketch fromfile {txt file with paths to MAGs on family level} -p dna,k={params.kmer_len},scaled={params.scaled} -o {output.family_batch}

trim-low-abund.py -C 3 -Z 18 -V -M {params.mem} -T {params.tmp_dir} --gzip -o {params.tmp_dir}/{wildcards.sample}_abundtrim.fastq.gz {params.tmp_dir}/{wildcards.sample}.fastq.gz

sourmash sketch dna -p k={params.kmer_len},abund,scaled={params.scaled} -o {output.sketch} {params.tmp_dir}/{wildcards.sample}_abundtrim.fastq.gz

sourmash gather -k {params.kmer_len} -o {output} {input.sketched_reads} {input.family_batch}

I understand that I am using scaled=100, but I guess it should finish for those small families. Do you see something obviously wrong with my workflow? If not, I guess it is something due to the formatting of my MAGs and/or sample composition.

Thanks,
quick question - have you been able to use prefetch?
Yeah, I tried that too. For a batch of 100 MAGs it takes ~50 min.
yeesh, that's terrible... does it go faster for higher --scaled values? (you should be able to specify a higher --scaled value directly on the gather command line.)
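As a hedged aside (placeholder file names, not taken from this thread): gather accepts a --scaled option, so sketches built at scaled=100 can be downsampled on the fly to a larger scaled value without re-sketching, along these lines:

```bash
# Downsample to scaled=10000 at gather time; this works because the
# original sketches were built at a smaller scaled value (100).
sourmash gather reads.sig family_batch.zip -k 31 --scaled 10000 -o gather_scaled10k.csv
```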
hi @trickovicmatija - there are some performance improvements coming in #2123, and I have some other ideas, too. Will keep you updated.
hello @trickovicmatija - we've just released sourmash v4.4.2, which contains #2123; this is a substantial speedup in some circumstances. Our next release (not yet scheduled) will contain #2132, which is a further speedup of some different code.

Benchmark results: using the command in #2123, with v4.4.1: 205.90s.

Hopefully this helps you as well! Let me know if you get a chance to try v4.4.2, which is now available via pip and conda!
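(For anyone following along, upgrading would look roughly like the below - standard pip/conda invocations, not commands taken from this thread:)

```bash
# Upgrade with pip ...
pip install -U "sourmash>=4.4.2"

# ... or with conda from conda-forge/bioconda once the packages are available there.
conda install -c conda-forge -c bioconda "sourmash>=4.4.2"

# Check which version is installed.
sourmash --version
```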
Hey, thanks again for notifying me, ...
really glad to hear it :). v4.4.3 is almost out - it includes #2132 - and I'll close this when it's up on conda. (You can pip install it now, but conda-forge and bioconda take a bit longer.)
hi @trickovicmatija, sourmash v4.4.3 is available for your installation pleasure. Enjoy, and please let us know if you see any speedups, or run into show-stopping speed problems elsewhere!
Hello,
Amazing software! I am new to the k-mer field, so excuse me if I make some mistakes.

My idea is to quantify the abundance of specific MAGs in metagenomic samples. For that, I have a custom MAG catalog, which contains multiple MAGs per species. I sketched the MAGs at the species level using sketch fromfile (and got one sig file per species). After that, I indexed all of those signatures together into one big (really big) .sbt.zip file. My idea is to use gather on that index file together with sketches of the metagenomic samples in order to get abundances per MAG.

My problem is that it takes an extremely long time to do that. I understand that this is (most probably) abusing sourmash, considering the size of the index, but I am wondering if I can somehow parallelize it myself. My idea is, instead of indexing all MAG signatures into one file, to separate them (split all species-level signatures into n groups, and then run index n times). I am not sure how the abundance estimates would be affected by that. I understand that I would need to do some post-processing, but is it even possible?
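For concreteness, a rough sketch of the workflow described above - all file names, the k-mer size, and the scaled value here are placeholders, not the actual parameters used:

```bash
# 1. Sketch each species' MAGs from a fromfile CSV (one collection per species).
sourmash sketch fromfile species1.csv -p dna,k=31,scaled=1000 -o species1.sig.zip

# 2. Index all the species-level signature collections into one big SBT.
sourmash index all_mags.sbt.zip species*.sig.zip -k 31

# 3. Sketch a metagenomic sample with abundance tracking.
sourmash sketch dna -p k=31,scaled=1000,abund -o sample.sig sample_reads.fastq.gz

# 4. Run gather against the big index to estimate per-MAG abundances.
sourmash gather sample.sig all_mags.sbt.zip -k 31 -o sample_gather.csv
```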
Thanks!
Matija