refine_metabat2 running out of memory #208

Open
CJREID opened this issue Jun 19, 2024 · 8 comments

CJREID commented Jun 19, 2024

Hi Rhys,

I'm running aviary recover via Slurm, submitting the pipeline as a single job that then submits each rule as its own job. The rule refine_metabat2 is running out of memory and being killed by Slurm.

JobID           JobName  Partition      User  AllocCPUS   NNodes    Elapsed   TotalCPU      State  MaxVMSize     MaxRSS     ReqMem        NodeList 
------------ ---------- ---------- --------- ---------- -------- ---------- ---------- ---------- ---------- ---------- ---------- --------------- 
39665747     snakejob.+       defq    rei219         64        1   07:44:52 1-00:52:56 OUT_OF_ME+                             480G            c320 
39665747.ba+      batch                              64        1   07:44:52 1-00:52:56 OUT_OF_ME+ 113578094+ 500873864K                       c320 
39665747.ex+     extern                              64        1   07:44:52  00:00.001  COMPLETED    124080K      1092K                       c320 

The first time it failed, I edited the rule to request 480G from my config instead of the default 128G (mem_mb = lambda wildcards, attempt: max(int(config["max_memory"])*1024, 128*1024*attempt)), but it has since failed twice with the same errors. Notably, for both the 128G and 480G runs MaxRSS was only slightly above the requested memory. I don't have the 128G log file, though, as it was overwritten by the subsequent run.
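For reference, here is a minimal sketch (plain Python, not the actual Snakefile) of how that edited resource lambda resolves when max_memory is 480; the config value wins on every retry until the attempt-scaled 128G default would exceed it:

# Sketch only: the edited mem_mb lambda reproduced outside Snakemake to show
# what it requests per retry; "max_memory" mirrors the -m 480 value in the config.
config = {"max_memory": "480"}

# Same form as the line in the Snakefile
mem_mb = lambda wildcards, attempt: max(int(config["max_memory"]) * 1024,
                                        128 * 1024 * attempt)

for attempt in (1, 2, 3, 4):
    print(attempt, mem_mb(None, attempt))
# attempts 1-3 all request 491520 MB (480G, matching the mem_mb=491520 in the logs);
# only at attempt 4 does 128G * attempt (524288 MB) take over.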

Looking at the refine_metabat2.log file, Rosella takes ~6 hours to fail on iteration 0, then runs successfully with CheckM for two more iterations in about 20 minutes. I've cut the CheckM output for brevity.

refine_metabat2.log

INFO: 08:56:20 - Refining iteration 0

INFO: 08:56:26 - Rosella refine iteration 0
[2024-06-17T22:56:26Z INFO  rosella] rosella version 0.5.3
[2024-06-17T22:56:27Z INFO  rosella::refine::refinery] Reading TNF table.
[2024-06-17T22:56:27Z INFO  rosella::refine::refinery] Beginning refinement of 68 MAGs
[2024-06-17T23:10:59Z ERROR rosella::refine::refinery] Last message from flight: Refining bins...
[2024-06-17T23:10:59Z ERROR rosella::refine::refinery] Run in debug mode to get full output.
[2024-06-18T06:18:04Z ERROR rosella] Refine Failed with error: Flight failed with exit code: signal: 9 (SIGKILL)

INFO: 16:18:04 - CheckM iteration 0
Running CheckM on 119 bins
[2024-06-18 16:18:06] INFO: CheckM v1.1.3
[2024-06-18 16:18:06] INFO: checkm lineage_wf -t 64 --pplacer_threads 48 -x fna --tab_table -f data/metabat2_refined//rosella_refined_1/refined_bins//checkm.out data/metabat2_refined//rosella_refined_1/refined_bins/ data/metabat2_refined//rosella_refined_1/refined_bins//checkm
[2024-06-18 16:18:06] INFO: [CheckM - tree] Placing bins in reference genome tree.
[2024-06-18 16:18:07] INFO: Identifying marker genes in 119 bins with 64 threads:

...

[2024-06-18 16:27:33] INFO: QA information written to: data/metabat2_refined//rosella_refined_1/refined_bins//checkm.out
[2024-06-18 16:27:33] INFO: { Current stage: 0:00:07.470 || Total: 0:09:27.247 }
INFO: 16:27:34 - Ending iteration: 0

INFO: 16:27:34 - Rosella refine iteration 1
[2024-06-18T06:27:34Z INFO  rosella] rosella version 0.5.3
[2024-06-18T06:27:35Z INFO  rosella::refine::refinery] Reading TNF table.
[2024-06-18T06:27:36Z INFO  rosella::refine::refinery] Beginning refinement of 17 MAGs

INFO: 16:35:18 - CheckM iteration 1
Running CheckM on 5 bins
[2024-06-18 16:35:19] INFO: CheckM v1.1.3
[2024-06-18 16:35:19] INFO: checkm lineage_wf -t 64 --pplacer_threads 48 -x fna --tab_table -f data/metabat2_refined//rosella_refined_2/refined_bins//checkm.out data/metabat2_refined//rosella_refined_2/refined_bins/ data/metabat2_refined//rosella_refined_2/refined_bins//checkm
[2024-06-18 16:35:19] INFO: [CheckM - tree] Placing bins in reference genome tree.
[2024-06-18 16:35:19] INFO: Identifying marker genes in 5 bins with 64 threads:

...

[2024-06-18 16:37:15] INFO: QA information written to: data/metabat2_refined//rosella_refined_2/refined_bins//checkm.out
[2024-06-18 16:37:15] INFO: { Current stage: 0:00:00.372 || Total: 0:01:55.817 }
INFO: 16:37:15 - Ending iteration: 1

INFO: 16:37:15 - Refining iteration 2

INFO: 16:37:15 - Rosella refine iteration 2
[2024-06-18T06:37:15Z INFO  rosella] rosella version 0.5.3
[2024-06-18T06:37:16Z INFO  rosella::refine::refinery] Reading TNF table.
[2024-06-18T06:37:17Z INFO  rosella::refine::refinery] Beginning refinement of 2 MAGs

INFO: 16:41:02 - CheckM iteration 2
No bins to refine
Skipping refinement
Refinery finished.

Three seconds after the log suggests the rule finished successfully, the .err file shows the job being killed for going OOM.

refine_metabat2.err
Building DAG of jobs...
Your conda installation is not configured to use strict channel priorities. This is however crucial for having robust and correct environments (for details, see https://conda-forge.org/docs/user/tipsandtricks.html). Please consider to configure strict priorities by executing 'conda config --set channel_priority strict'.
Using shell: /usr/bin/bash
Provided cores: 64
Rules claiming more threads will be scaled down.
Provided resources: mem_mb=491520, mem_mib=468750, disk_mb=4840, disk_mib=4616
Select jobs to execute...

[Tue Jun 18 08:56:16 2024]
rule refine_metabat2:
    input: data/metabat_bins_2/checkm.out, data/metabat_bins_2/done, data/coverm.cov, /scratch3/rei219/projects/SBM/outputs/metamdbg/S/S.fasta
    output: data/metabat2_refined/done
    log: logs/refine_metabat2.log
    jobid: 0
    benchmark: benchmarks/refine_metabat2.benchmark.txt
    reason: Missing output files: data/metabat2_refined/done
    threads: 64
    resources: mem_mb=491520, mem_mib=468750, disk_mb=4840, disk_mib=4616, tmpdir=/tmp/rei219.39665747, runtime=4320

python -c "from __future__ import print_function; import sys, json; print(json.dumps([sys.version_info.major, sys.version_info.minor]))"
Activating conda environment: ../../../../../../.conda/f66cf05325a05bc6eeb3592400331d65_
python /scratch3/rei219/projects/SBM/outputs/aviary/recover/S_metamdbg/.snakemake/scripts/tmpuokzy9yx.rosella_refine.py
Activating conda environment: ../../../../../../.conda/f66cf05325a05bc6eeb3592400331d65_
Not cleaning up /scratch3/rei219/projects/SBM/outputs/aviary/recover/S_metamdbg/.snakemake/scripts/tmpuokzy9yx.rosella_refine.py
[Tue Jun 18 16:41:05 2024]
Finished job 0.
1 of 1 steps (100%) done
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=39665747.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.

The data/metabat2_refined/final_bins folder has 115 refined bins from iteration 1 and 4 from iteration 2, so it seems to have more or less worked. However, because Snakemake treats the job as failed after it is killed (see cluster.err below, where some outputs of refine_metabat2 are deleted), rerunning the pipeline starts the rule from scratch and it fails again after 6-7 hours.

cluster.err
 Prior to running version 0.9.1, you will need to source: 
 /apps/aviary-genome/0.9.1/aviary_env.sh 
  
 Note: This application uses snakemake and you will need to invoke a profile to 
 run the pipelines on our batch system 
 
 See https://rhysnewell.github.io/aviary/examples and HPC submission 


Loading aviary-genome/0.9.1
  Loading requirement: python/3.9.4 amazon-corretto/21.0.0.35.1 bbmap/39.01
    parallel/20210322 miniconda3/4.9.2 pigz/2.7.0 fastani/1.33 fasttree/2.1.11
    hmmer/3.4.0 gsl/2.6 capnproto-c++/0.9.1 mash/2.3 pplacer/1.1.alpha19
    prodigal/2.6.3 gtdbtk/2.3.2
Warning: '/apps/aviary-genome/0.9.1/conda_pkgs' already in 'pkgs_dirs' list, moving to the top
Warning: '/apps/aviary-genome/0.9.1/conda_envs' already in 'envs_dirs' list, moving to the top
06/18/2024 08:56:09 AM INFO: Time - 08:56:09 18-06-2024
06/18/2024 08:56:09 AM INFO: Command - /apps/aviary-genome/0.9.1/bin/aviary recover --assembly /scratch3/rei219/projects/SBM/outputs/metamdbg/S/S.fasta -1 /scratch3/rei219/projects/SBM/data/processed/fastp/SBM_001_005.filt.R1.fastq.gz -2 /scratch3/rei219/projects/SBM/data/processed/fastp/SBM_001_005.filt.R2.fastq.gz -l /scratch3/rei219/projects/SBM/data/raw/seq/DNA/pacbio/S_hifi.fastq.gz --gtdb-path /scratch3/projects/datasets_bioref/aviary/gtdb --eggnog-db-path /scratch3/projects/datasets_bioref/aviary/eggnog --checkm2-db-path /scratch3/projects/datasets_bioref/aviary/checkm2db --singlem-metapackage-path /scratch3/projects/datasets_bioref/aviary/singlem/S4.3.0.GTDB_r220.metapackage_20240523.smpkg.zb --tmpdir /scratch3/rei219/aviary_tmp --snakemake-profile /scratch3/rei219/.config/snakemake/slurm --conda-prefix /scratch3/rei219/.conda --semibin-model soil --skip-abundances --skip-binner vamb -z hifi -n 64 -t 64 -m 480 -o /scratch3/rei219/projects/SBM/outputs/aviary/recover/S_metamdbg
06/18/2024 08:56:09 AM INFO: Version - 0.9.1
06/18/2024 08:56:09 AM INFO: Configuration file written to /scratch3/rei219/projects/SBM/outputs/aviary/recover/S_metamdbg/config.yaml
06/18/2024 08:56:09 AM INFO: Executing: snakemake --snakefile /apps/aviary-genome/0.9.1/lib/python3.9/site-packages/aviary/modules/Snakefile --directory /scratch3/rei219/projects/SBM/outputs/aviary/recover/S_metamdbg --cores 64 --rerun-incomplete --keep-going  --rerun-triggers mtime --configfile /scratch3/rei219/projects/SBM/outputs/aviary/recover/S_metamdbg/config.yaml --nolock --profile /scratch3/rei219/.config/snakemake/slurm --retries 0 --conda-frontend mamba --resources mem_mb=491520   --use-conda --conda-prefix /scratch3/rei219/.conda   recover_mags
Using profile /scratch3/rei219/.config/snakemake/slurm for setting default command line arguments.
Building DAG of jobs...
Your conda installation is not configured to use strict channel priorities. This is however crucial for having robust and correct environments (for details, see https://conda-forge.org/docs/user/tipsandtricks.html). Please consider to configure strict priorities by executing 'conda config --set channel_priority strict'.
Using shell: /usr/bin/bash
Provided cluster nodes: 100
Provided resources: mem_mb=491520
Job stats:
job                 count
----------------  -------
checkm2                 1
checkm_das_tool         1
das_tool                1
finalise_stats          1
gtdbtk                  1
recover_mags            1
refine_dastool          1
refine_metabat2         1
singlem_appraise        1
total                   9

Select jobs to execute...

[Tue Jun 18 08:56:13 2024]
rule refine_metabat2:
    input: data/metabat_bins_2/checkm.out, data/metabat_bins_2/done, data/coverm.cov, /scratch3/rei219/projects/SBM/outputs/metamdbg/S/S.fasta
    output: data/metabat2_refined/done
    log: logs/refine_metabat2.log
    jobid: 5
    benchmark: benchmarks/refine_metabat2.benchmark.txt
    reason: Missing output files: data/metabat2_refined/done
    threads: 64
    resources: mem_mb=491520, mem_mib=468750, disk_mb=4840, disk_mib=4616, tmpdir=<TBD>, runtime=4320

Submitted job 5 with external jobid '39665747'.
[Tue Jun 18 16:41:10 2024]
Error in rule refine_metabat2:
    jobid: 5
    input: data/metabat_bins_2/checkm.out, data/metabat_bins_2/done, data/coverm.cov, /scratch3/rei219/projects/SBM/outputs/metamdbg/S/S.fasta
    output: data/metabat2_refined/done
    log: logs/refine_metabat2.log (check log file(s) for error details)
    conda-env: /scratch3/rei219/.conda/f66cf05325a05bc6eeb3592400331d65_
    cluster_jobid: 39665747

Error executing rule refine_metabat2 on cluster (jobid: 5, external: 39665747, jobscript: /scratch3/rei219/projects/SBM/outputs/aviary/recover/S_metamdbg/.snakemake/tmp.uv_g70nb/snakejob.refine_metabat2.5.sh). For error details see the cluster log and the log files of the involved rule(s).
Removing output files of failed job refine_metabat2 since they might be corrupted:
data/metabat2_refined/done
Exiting because a job execution failed. Look above for error message
chmod g+w /scratch3/rei219/projects/SBM/outputs/aviary/recover/S_metamdbg/.snakemake/log/2024-06-18T085610.315172.snakemake.log
Complete log: .snakemake/log/2024-06-18T085610.315172.snakemake.log
06/18/2024 04:41:21 PM CRITICAL: Command '['snakemake', '--snakefile', '/apps/aviary-genome/0.9.1/lib/python3.9/site-packages/aviary/modules/Snakefile', '--directory', '/scratch3/rei219/projects/SBM/outputs/aviary/recover/S_metamdbg', '--cores', '64', '--rerun-incomplete', '--keep-going', '--rerun-triggers', 'mtime', '--configfile', '/scratch3/rei219/projects/SBM/outputs/aviary/recover/S_metamdbg/config.yaml', '--nolock', '--profile', '/scratch3/rei219/.config/snakemake/slurm', '--retries', '0', '--conda-frontend', 'mamba', '--resources', 'mem_mb=491520', '--use-conda', '--conda-prefix', '/scratch3/rei219/.conda', 'recover_mags']' returned non-zero exit status 1.

Is it likely that the failed first Rosella iteration and the OOM kill at the end of the job are linked?

Thanks,

Cam

CJREID changed the title from "refine_metabat2/rosella running out of memory" to "refine_metabat2 running out of memory" on Jun 19, 2024
rhysnewell (Owner) commented:

Hey Cameron,

Thanks for using Aviary, and apologies that you've run into this issue. It seems the refinement step crashing on the first iteration is due to running out of memory, and the failure then isn't propagated properly until later in the pipeline, which is interesting. How big are the MAGs being produced by metabat2, and how many reads are you using for binning?

It might be that I underestimated how memory-intensive some refinement processes can get when given a lot of samples, but we shall see.

Cheers,
Rhys

CJREID commented Jun 24, 2024

Hey Rhys,

Thanks for getting back to me!

Metabat2 produced 397 bins, the largest being 387MB, with an average of about 5MB. I used one long-read dataset (PacBio HiFi, ~2M reads, ~16Gbp) and one short-read dataset (originally ~80M reads, ~25Gbp) for binning. I was hoping to include more short-read sets from additional timepoints at some point too. Is that too much data?

Do you think the 115 MAGs I've got out are the complete set, or does the failed iteration mean I've missed some? I ended up just creating the dummy 'done' file in the metabat2_refined directory so the rest of the pipeline would run from there, since I did have some MAGs.
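For anyone else landing here, the workaround amounts to something like this (just a sketch of what I did by hand, not an aviary feature): recreate the sentinel file that Snakemake deleted so downstream rules treat refine_metabat2 as complete.

# Sketch of the manual workaround: recreate the sentinel file that Snakemake
# removed when it marked the job as failed. Run from the aviary output directory.
from pathlib import Path
Path("data/metabat2_refined/done").touch()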

On a side note, I also ran it with the node's maximum memory (512G, which works out to ~502G usable in reality) and it still failed OOM, with MaxRSS below what was requested. I know MaxRSS can be unreliable/inaccurate, but I still thought this was odd.

JobID           JobName  Partition      User  AllocCPUS   NNodes    Elapsed   TotalCPU      State  MaxVMSize     MaxRSS     ReqMem        NodeList 
------------ ---------- ---------- --------- ---------- -------- ---------- ---------- ---------- ---------- ---------- ---------- --------------- 
42212417     smk.refin+       defq    rei219         64        1   09:19:55 1-03:22:25 OUT_OF_ME+                          502116M            c184 
42212417.ba+      batch                              64        1   09:19:55 1-03:22:25 OUT_OF_ME+ 114018829+ 495716879K                       c184 
42212417.ex+     extern                              64        1   09:19:55  00:00.001  COMPLETED    124080K      1040K                       c184 

rhysnewell (Owner) commented:

Oh okay, that's interesting... It definitely should be handling this fine, so something must be going on. Is that 387MB bin an actual bin, or is it the unbinned contigs? And are there any other exceptionally large bins? 5MB is a very standard size for a MAG, but 387MB is huge and I think would definitely cause some issues.

CJREID commented Jun 27, 2024

Ahh, yes, there are quite a few very large bins. There are 33 bins larger than 10MB in the metabat_bins_2 folder. All the metabat1 variants are also producing large bins; none quite as big as the 387MB one, but many over 10MB too.

I just did a quick CheckM run on some of these large bins and, unsurprisingly, they're all highly contaminated (over 100% in some cases; not sure how that's possible), with genome sizes far larger than any known microbial genome.

I'm not really sure what the solution is here. I will delete the excessively large bins and give it another run to see if that at least alleviates the memory issue. Do you think an additional CheckM step before the first Rosella iteration could be a potential feature, to filter excessively large and contaminated bins prior to refinement?

I can't find anything online about Metabat producing excessively large bins, but I am concerned that this is a case of garbage in, garbage out, and that an issue with my data is the ultimate culprit.

rhysnewell (Owner) commented:

Still, 10MB should be absolutely fine; I'm talking more about excessively large bins like the 387MB one. Is it just metabat producing these bins? Where do the contigs in these bins end up when binned by the other binning programs? I would try deleting those super-large bins from the metabat output, plus any of the rosella refine output, and then letting it run again. I know it's annoying, but that might be best here.
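Something along these lines would do it (a rough sketch only; the *.fa glob and the 100MB cutoff are assumptions, so adjust them to whatever your metabat output actually looks like):

# Rough sketch: set aside the exceptionally large metabat2 bins before rerunning
# the refine step. The "*.fa" pattern and the 100 MB cutoff are assumptions.
from pathlib import Path
import shutil

bins_dir = Path("data/metabat_bins_2")       # metabat2 bins, as in the rule input
parked = bins_dir.parent / "oversized_bins"  # park the huge bins outside the folder
parked.mkdir(exist_ok=True)

cutoff_bytes = 100 * 1024 * 1024             # 100 MB

for fasta in bins_dir.glob("*.fa"):
    size = fasta.stat().st_size
    if size > cutoff_bytes:
        print(f"moving {fasta.name} ({size / 1e6:.0f} MB)")
        shutil.move(str(fasta), str(parked / fasta.name))

# Then also delete the old rosella refine output (e.g. data/metabat2_refined/)
# so the rule starts fresh.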

The refine step is there to try to break apart contaminated bins, so adding a step that throws out things that look contaminated kind of defeats the purpose. It's still very odd that it would cause such extreme memory issues. If you have a way of providing me with the assembly, the metabat2 bins, and the coverage files, I'd be happy to take a closer look and try to figure out what is happening.

CJREID commented Jul 1, 2024

Only metabat2 seems to be making giant ones; another sample has a 66MB bin from metabat2. The metabat1 variants make some large ones too (63-80MB), and those rules are notably not running out of memory. I had a quick look at the rosella bins as an example, and the contigs from the 387MB metabat2 bin end up in many different rosella bins as well as in the unbinned fraction, which is also obviously a concern.

I'll try what you've suggested and see what happens. I'll send you a data link via email too.

Cheers!

rhysnewell (Owner) commented:

You can also tell aviary to skip metabat2 altogether for that particular sample; that might be the easiest option for now while I tinker.

CJREID commented Jul 2, 2024

To update: no issues with refine_metabat2 after removing the two mega-bins (~380MB and ~120MB) from that sample and running refine again. There were also no issues when running the recover workflow with just a single set of short reads for binning against the long-read assembly. Metabat2 did produce two large bins (96MB and 43MB) on that run, but refine had no issues. It seems that throwing the kitchen sink at metabat2 (i.e. three sets of short reads and one set of long reads) could be at least part of the reason for these massive contaminated bins. I may end up just running each read set alone and then dereplicating the final bins from the multiple runs to get around this.
