refine_metabat2 running out of memory #208
Comments
Hey Cameron,

Thanks for using Aviary, and apologies you've come across this issue. It seems like the refinement step crashing on the first iteration is due to out-of-memory issues, which then don't get properly exited until later in the pipeline, which is interesting. How big are the MAGs being produced by metabat2, and how many reads are you using for binning? It might be that I underestimated how memory intensive some refinement processes can get when provided a lot of samples, but we shall see.

Cheers,
Hey Rhys,

Thanks for getting back to me! Metabat2 produced 397 bins, the largest being 387MB, with an average of about 5MB. I used one long-read dataset (PacBio HiFi, ~2M reads, ~16Gbp) and one short-read dataset (originally ~80M reads, ~25Gbp) for binning. I was hoping to include more short-read sets from additional timepoints at some point too. Is that too much data? Do you think the 115 MAGs I've got out are complete, or does the failed iteration mean I've missed out on some? I ended up just creating the dummy 'done' file in the metabat2_refined directory to make the rest of the pipeline run from there, as I did have some MAGs (rough sketch of that at the end of this comment).

On a side note, I also ran it with maximum memory (512G, apparently reduced to ~502G in reality) and it failed OOM with the MaxRSS being less than what was requested. I know MaxRSS can be unreliable/inaccurate, but I still thought this was odd.
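For reference, this is roughly all that was needed for the dummy file (a minimal sketch; the data/metabat2_refined/done path and filename are my guess at what Snakemake checks for, so they may differ):

```python
from pathlib import Path

# Create an empty sentinel file so Snakemake treats refine_metabat2 as finished.
# The exact path and filename the rule checks for are assumptions here.
done_flag = Path("data/metabat2_refined/done")
done_flag.parent.mkdir(parents=True, exist_ok=True)
done_flag.touch()
```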
Oh okay, that's interesting... It definitely should be handling this okay; something must be going on. Is that 387MB bin an actual bin, or is it the unbinned contigs? And are there any other exceptionally large bins? 5MB is a very standard size for MAGs, but 387MB is huge and would definitely cause some issues, I think.
Ahh, yes, there are quite a few very large bins. There are 33 bins larger than 10MB in the metabat_bins_2 folder. All the versions of metabat1 are also producing large bins; not quite as big as the 387MB one, but many over 10MB too. I just did a quick CheckM run on some of these large bins and, unsurprisingly, they're all highly contaminated (over 100% in some cases; not sure how that's possible), with genome sizes far larger than any known microbial genome. I'm not really sure what the solution is here. I will delete the excessively large bins and give it another run to see if that at least alleviates the memory issue. Do you think an additional CheckM step before the first Rosella iteration could be a potential feature addition, to filter excessively large and contaminated bins prior to refinement? I can't find anything online about Metabat producing excessively large bins, but I am concerned that this is a case of garbage in, garbage out, and that an issue with my data is the ultimate culprit.
Still, 10MB should be absolutely fine; I'm talking more about excessively large bins like the 387MB one. Is it just Metabat producing these bins? Where are the contigs in these bins going when binned by the other binning programs? I would try just deleting those super large bins from the metabat output, plus deleting any of the rosella refine output, and then letting it try again (rough sketch at the end of this comment). I know it's annoying, but it might be best here. The refine step is there to try to break apart contaminated bins, so having a step that throws out things that look contaminated kind of defeats the purpose. It's still very odd that it would cause such extreme memory issues. If you have a way of providing me the assembly, metabat2 bins, and coverage files, I'd be happy to have a look at this more closely and try to figure out what is happening.
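Something along these lines is what I mean by setting the big bins aside (a rough sketch only; the directory names, the .fa extension, and the 100MB cutoff are illustrative, so adjust them to your layout):

```python
import shutil
from pathlib import Path

# Set aside suspiciously large metabat2 bins before re-running refinement.
# Paths, the ".fa" extension, and the 100 MB cutoff are illustrative only.
bin_dir = Path("data/metabat_bins_2")
quarantine = Path("data/oversized_bins")
quarantine.mkdir(parents=True, exist_ok=True)

size_cutoff = 100 * 1024 * 1024  # 100 MB

for bin_fasta in bin_dir.glob("*.fa"):
    size = bin_fasta.stat().st_size
    if size > size_cutoff:
        print(f"setting aside {bin_fasta.name} ({size / 1e6:.0f} MB)")
        shutil.move(str(bin_fasta), str(quarantine / bin_fasta.name))

# Remember to also delete the old rosella refine / metabat2_refined outputs
# so Snakemake re-runs the refinement step from scratch.
```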
Only metabat2 seems to be making giant ones; another sample has a 66MB bin from metabat2. The metabat1 variants make some bigger ones (63-80MB), and those rules are notably not running out of memory. I had a quick look at the rosella bins as an example, and the contigs from the 387MB metabat2 bin are found across many different rosella bins as well as in the unbinned output, which is also obviously a concern (rough sketch of how I checked this below). I'll try what you've suggested and see what happens. I'll send you a data link via email too. Cheers!
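For what it's worth, this is roughly how I checked where those contigs ended up (a quick sketch; the bin filenames and extensions here are placeholders rather than my actual paths):

```python
from pathlib import Path

def contig_names(fasta_path):
    """Return the set of contig IDs (FASTA header names) in a file."""
    with open(fasta_path) as handle:
        return {line[1:].split()[0] for line in handle if line.startswith(">")}

# Placeholder filename for the 387MB metabat2 bin; adjust to the real name.
big_bin = contig_names("data/metabat_bins_2/big_bin.fa")

# Count how many of its contigs turn up in each rosella bin
# (adjust the directory and extension to your rosella output).
for rosella_bin in sorted(Path("data/rosella_bins").glob("*.fna")):
    shared = big_bin & contig_names(rosella_bin)
    if shared:
        print(f"{rosella_bin.name}: {len(shared)} contigs from the big metabat2 bin")
```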
You can also tell aviary to skip metabat2 altogether for that particular sample; that might be easiest to do for now while I tinker.
To update: no issues with refine_metabat2 after removing the two mega bins (~380MB and ~120MB) from that sample and running refine again. Also no issues when running the recover workflow with just a single set of short reads for binning against the long-read assembly. Metabat2 did produce two large bins (96MB and 43MB) on this run, but there were no issues with refine. It seems that throwing the kitchen sink at metabat2 (i.e. three sets of short reads and one set of long reads) could be at least part of the reason for these massive contaminated bins. I may end up just running each read set alone and then dereplicating the final bins from multiple runs to get around this.
Hi Rhys,
I'm running aviary recover via Slurm, submitting the pipeline as a single job which then submits each rule as its own job. The refine_metabat2 rule is running out of memory and being killed by Slurm.
The first time it failed, I edited the rule so it would request 480G from my config instead of the default 128G:
mem_mb = lambda wildcards, attempt: max(int(config["max_memory"])*1024, 128*1024*attempt)
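For context, here is how that expression evaluates across retry attempts (a standalone illustration only, not the actual Aviary rule definition; 480 is just the max_memory value from my config):

```python
# Standalone illustration of the edited resource expression (not the actual
# Aviary rule). With max(), the request is pinned at config["max_memory"]
# and no longer scales up with the retry attempt.
config = {"max_memory": "480"}  # value from my config; yours will differ

mem_mb = lambda wildcards, attempt: max(int(config["max_memory"]) * 1024, 128 * 1024 * attempt)

for attempt in (1, 2, 3):
    print(f"attempt {attempt}: requests {mem_mb(None, attempt)} MB")  # 491520 MB (~480G) each time
```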
It has since failed twice with the same errors. Notably, for both the 128G and 480G runs, the MaxRSS was only slightly above the requested memory. I don't have the 128G log file, though, as it was overwritten by the subsequent run.

Looking at the refine_metabat2.log file, Rosella takes ~6 hours to fail on iteration 0 before successfully running with CheckM for two more iterations in about 20 minutes. I've cut the CheckM output for brevity.
refine_metabat2.log
Three seconds after the log suggests it has run successfully, the .err file shows that it gets killed for running out of memory.
refine_metabat2.err
The data/metabat2_refined/final_bins folder has 115 refined bins from iteration 1 and 4 from iteration 2, so it seems to have more or less worked. However, because the job thinks it has failed after getting killed (see the cluster error below, which deletes some outputs of refine_metabat2), when I go to rerun the pipeline it starts the rule again and fails again after 6-7 hours.
cluster.err
Is it likely that the first Rosella iteration failing and the excess memory requirement at the end are linked?
Thanks,
Cam