Runs fail because not enough memory for MarkDuplicates | How much memory should I give it? #221

Describe the bug
I have a cohort of whole-genome (WGS) FASTQs from ICGC for which I can't use Spark MarkDuplicates, so I need to set Spark = false. However, when I set it to false, I get the 137 out-of-memory error during the Picard MarkDuplicates step. Any recommendations on how to go about figuring out how much memory to give it?

I tried 50 GB (failed at 1 day 11 hours) and 100 GB (failed at 2 days 10 hours). I'm trying 136 GB right now, but I don't know if it will break anything going that high.

Error:
base_output_dir: /hot/user/alfgonzalez/pipeline/metapipeline-DNA/main/metapipeline-DNA/external/pipeline-align-
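For reference, with Spark off, the memory MarkDuplicates can actually use is bounded by the JVM heap, not just by the scheduler allocation. A minimal standalone sketch, with the heap size, paths, and tmp directory as placeholder assumptions:

```bash
# Sketch only: heap size, paths, and TMP_DIR are placeholder assumptions.
# MarkDuplicates keeps duplicate-marking state in the JVM heap and spills
# to TMP_DIR; keeping -Xmx below the scheduler allocation leaves headroom
# for JVM overhead.
java -Xmx100g -jar picard.jar MarkDuplicates \
    INPUT=aligned.sorted.bam \
    OUTPUT=aligned.duplicates-marked.bam \
    METRICS_FILE=mark_duplicates.metrics \
    TMP_DIR=/scratch/tmp
```

Exit code 137 is a SIGKILL, typically from the kernel's OOM killer, so the process's total resident memory (heap plus JVM overhead) exceeded the allocation, not necessarily the heap alone.

Comments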
Can you remind us why you can't use Spark for the samples?
For this cohort, WGSAIGLO (previously ICGC-BOCA-UK), MarkDuplicatesSpark outputs an empty BAM file with no reads. It seems to be a GATK-specific thing. The current working hypothesis is that it has something to do with the reads or a prefix/suffix in the read names. I haven't had a chance to do an issue write-up, but I'll submit it soon. Essentially, samples go through the whole pipeline without any flags or errors, and it just outputs an empty BAM (a 48K file) with decoys but no actual reads, as confirmed by samtools. Upstream files seem okay based on viewing with samtools and Picard ValidateSamFile. I tried manually running MarkDuplicatesSpark using different versions of GATK (from the official Docker images) on the intermediates going into this step and still get empty BAMs. Hence I'm trying Spark = off, but I'm getting memory allocation errors and am trying to figure those out so I can run these samples.
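A quick way to confirm the empty-BAM symptom described above; the file names are placeholders:

```bash
# Placeholder file names; these are standard samtools/Picard checks.
samtools view -c output.bam          # read count; 0 for the empty BAM described above
samtools view -H output.bam | head   # header (reference/decoy @SQ lines) is still present
picard ValidateSamFile I=output.bam MODE=SUMMARY
```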
@Alfredo-Enrique yes, please do so ASAP, and can you point us to the logs, intermediate files, etc.? I wouldn't recommend Picard because it's single-threaded and very slow for large samples like WGS. Did you get a chance to test Spark with a subset of the failed/empty sample (e.g. just one lane of R1 and R2, or even a subset of those for a quick test), as we discussed on Monday?
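One way to build the quick-test subset suggested here, assuming gzipped paired FASTQs; taking the same number of leading records from R1 and R2 keeps the mates paired:

```bash
# Placeholder file names; 1,000,000 reads = 4,000,000 FASTQ lines per mate.
zcat sample_L001_R1.fastq.gz | head -n 4000000 | gzip > subset_R1.fastq.gz
zcat sample_L001_R2.fastq.gz | head -n 4000000 | gzip > subset_R2.fastq.gz
```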
Issue #223 has been created for the MarkDuplicatesSpark-specific issue. I tested with one pair (R1 and R2) and hit the same issue, so it's not related to having multiple FASTQs or to storage capacity. I'll add this comment to the other issue to include the info in that thread.
Update + question:

Update: Not being able to run with Spark, I set spark = false, ran with all the FASTQ pairs for this specific sample, and removed the memory allocation info in methods.config. It failed with the 137 out-of-memory error again, after 2 days, 16 hours, 24 minutes.

Question: In theory, if I remove the memory specifications from F72.config, it should give the maximum amount of memory during the MarkDuplicates step, right? Meaning I don't need to write explicit values.
I have an ongoing test run in which I explicitly requested close to the maximum memory allocation, to check whether that will work. I'll report back as soon as it's done. Excerpt of F72.config from the current test run:
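As an illustration of what such an explicit allocation could look like in a Nextflow-style config; the withName selector and values here are assumptions, not the pipeline's actual labels:

```groovy
// Hypothetical excerpt; the process selector name and sizes are assumptions.
// Without a memory directive, Nextflow does not request a specific amount;
// what the task actually gets then depends on the executor and container
// limits, rather than automatically being the node's full memory.
process {
    withName: 'run_MarkDuplicates_Picard' {
        cpus = 1
        memory = '136 GB'
    }
}
```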
Hi Alfredo, did the test run work when you gave it close to the maximum memory?
Hi Jaron! Apologies, I forgot to follow up. And nope, it failed. @tyamaguchi-ucla hypothesized that it's probably not a memory issue, since the final file should be around 70 GB, and that it might be a FASTQ header issue, since these are the same files from the #223 Spark empty-BAM issue. However, when I manually run Picard's MarkDuplicates it outputs a file without issue. I haven't been able to debug further since. Question for @jarbet and/or @yashpatel6: have either of you had the opportunity to do a test run of pipeline-align-DNA v8.0.0 or a newer commit with a large file like CPCG and Picard MarkDuplicates? I'm wondering whether the out-of-memory error is dataset-specific on my end or a general issue.
Hi @Alfredo-Enrique, apologies for not getting back to you on this. I have not run pipeline-align-DNA on a large sample using Picard MarkDuplicates. @tyamaguchi-ucla mentioned that the pipeline was run on

So Picard's MarkDuplicates works when you run it manually on the sample, but fails when used in the pipeline with max memory? Have you been able to make any more progress on debugging this recently? If we cannot get Picard's MarkDuplicates to work, then an alternative option worth trying may be to manually fix the header issue and then try using MarkDuplicatesSpark.
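A sketch of that alternative, under the assumption that the problem is a suffix or comment attached to the FASTQ read names; the pattern, file names, and inputs are guesses for illustration only:

```bash
# Assumption: read names carry a problematic suffix (the pattern is a guess).
# Strip everything after the first space on each name line (every 4th line),
# then, after re-aligning the fixed FASTQs, retry duplicate marking with Spark.
zcat sample_R1.fastq.gz \
    | awk 'NR % 4 == 1 {sub(/ .*/, "")} {print}' \
    | gzip > fixed_R1.fastq.gz

gatk MarkDuplicatesSpark \
    -I realigned.bam \
    -O duplicates-marked.bam \
    -M mark_duplicates.metrics
```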