run_MarkDuplicatesSpark_GATK error exit status (3) with CPCG0196-F1 #229
Comments
It looks like both of the logs indicated that 2 TB of scratch wasn't enough, and we know MarkDuplicatesSpark generates quite a lot of intermediate files. I don't think we can do much unless we run MarkDuplicatesSpark at the library level, remove the intermediate files, and then merge with `samtools merge` (see the sketch below).
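A minimal sketch of that per-library approach, assuming two hypothetical queryname-grouped library-level BAMs and a scratch mount at `/scratch`; the file names, paths, and thread counts here are illustrative, not the pipeline's actual values:

```bash
#!/bin/bash
set -euo pipefail

SCRATCH=/scratch/markdup_tmp   # assumed scratch location for Spark intermediates
mkdir -p "${SCRATCH}" dedup

# Deduplicate each library separately so only one library's Spark
# intermediates occupy scratch at a time.
for lib in CPCG0196-F1_lib1 CPCG0196-F1_lib2; do
    gatk MarkDuplicatesSpark \
        -I "${lib}.bam" \
        -O "dedup/${lib}.dedup.bam" \
        -M "dedup/${lib}.metrics.txt" \
        --tmp-dir "${SCRATCH}"
    # Clear this library's intermediate files before starting the next one.
    rm -rf "${SCRATCH:?}"/*
done

# Merge the per-library, duplicate-marked BAMs into one sample-level BAM.
samtools merge -@ 8 dedup/CPCG0196-F1.dedup.bam dedup/CPCG0196-F1_lib*.dedup.bam
samtools index dedup/CPCG0196-F1.dedup.bam
```

Since duplicate marking is scoped by library (the LB read-group field), marking per library and then merging should in principle match a single sample-level run, while peak scratch usage drops to a single library's intermediates.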
Also, it looks like there have been no major updates to MarkDuplicatesSpark since 4.2.4.1 (our current version); the latest release is 4.2.6.1.
Yeah, barring special nodes with expanded disk space or moving MarkDuplicatesSpark to run per library, one at a time, this will be hard to fix.
I think we want to implement #234 in the long run, but we could also try changing the compression level.

Currently testing a compression-level change.
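The exact parameter under test is elided above; as a hedged illustration only, two generic knobs that can shrink on-disk intermediates for a GATK Spark tool are the HTSJDK BAM compression level (via the `samjdk.compression_level` JVM property) and Spark's shuffle/IO compression settings (via `--conf`):

```bash
# Illustrative only: these are generic GATK/Spark options, not necessarily
# the parameter the pipeline itself exposes.
gatk --java-options '-Dsamjdk.compression_level=9' \
    MarkDuplicatesSpark \
    -I CPCG0196-F1.bam \
    -O CPCG0196-F1.dedup.bam \
    --tmp-dir /scratch/markdup_tmp \
    --conf 'spark.io.compression.codec=zstd' \
    --conf 'spark.shuffle.compress=true'
```

Higher compression trades CPU time for disk, so it may lengthen the run even if it keeps scratch usage under 2 TB.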
@jarbet Did changing the compression level help? I'm running into the same issue with a subset of CPCG. It looks like samples with a total FASTQ size greater than ~400 GB will fail with the current Spark configuration, and the FASTQ size distribution of CPCG overlaps with this limit, with roughly a third of the cohort being too large. I was trying to monitor scratch usage, but the intermediate files generated by Spark are assigned to a different user.
Not sure if there's a way to properly map the users so this doesn't happen, but this is probably low priority.
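One way to keep an eye on usage despite the ownership issue: `df` reports filesystem-level totals and doesn't need read access to Spark's temp directories (the `/scratch` mount point is an assumption here):

```bash
# Append a timestamped snapshot of /scratch usage every 60 seconds
# while the pipeline runs; file ownership doesn't matter to df.
while true; do
    printf '%s ' "$(date '+%Y-%m-%dT%H:%M:%S')" >> scratch_usage.log
    df -h --output=used,avail,pcent /scratch | tail -n 1 >> scratch_usage.log
    sleep 60
done
```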
@tyamaguchi-ucla mentioned that it could potentially be possible to have Spark parallelize less, which in theory would reduce data copying and scratch usage. The parameters for this live in F72.config rather than template.config or default.config. @yashpatel6, would reducing the number of CPUs allowed for the MarkDuplicatesSpark process be worth testing? If not, it looks like ~1/3 of CPCG will need to be run without Spark, or will depend on upgrades to our F72 scratch size.
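If it's worth a quick test, something like the following could cap the Spark step's CPUs without editing F72.config; the process selector name and whether the pipeline honors an extra `-c` override are both assumptions here:

```bash
# Hypothetical override config; the withName selector is assumed to match
# the pipeline's actual process name.
cat > lower_spark_cpus.config <<'EOF'
process {
    withName: 'run_MarkDuplicatesSpark_GATK' {
        cpus = 36
    }
}
EOF

# Entry point and config paths are illustrative.
nextflow run pipeline-align-DNA/main.nf \
    -c BWA-MEM2-CPCG0196-F1.config \
    -c lower_spark_cpus.config
```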
If we conclude that the only way around this is to increase scratch size, I can write up a cost-benefit analysis of upgrading scratch vs. running the larger samples with Picard and send it to Paul. The current metapipeline bottlenecks on scratch space are the align-DNA MarkDuplicatesSpark step and the call-gSNP recalibrate/reheader steps, so it might be necessary to expand scratch regardless, unless we can optimize both of these steps.
Describe the bug

Pipeline failed when testing on `CPCG0196-F1`, giving `error exit status (3)` for `run_MarkDuplicatesSpark_GATK`. First noticed here.

Testing info/results:
BWA-MEM2 (failed after 19 hours)

- Test script: /hot/software/pipeline/pipeline-align-DNA/Nextflow/development/unreleased/jarbet-samtools-sort-mem/testing_CPCG0196-F1.sh
- Sample: CPCG0196-F1
- Input CSV: /hot/software/pipeline/pipeline-align-DNA/Nextflow/development/input/csv/CPCG0196-F1.csv
- Config: /hot/software/pipeline/pipeline-align-DNA/Nextflow/development/unreleased/jarbet-samtools-sort-mem/BWA-MEM2-CPCG0196-F1.config
- Nextflow report: /hot/software/pipeline/pipeline-align-DNA/Nextflow/development/unreleased/jarbet-samtools-sort-mem/align-DNA-8.0.0/CPCG0196-F1/log-align-DNA-8.0.0-20220725T174134Z/nextflow-log/report.html
- Log: /hot/software/pipeline/pipeline-align-DNA/Nextflow/development/unreleased/jarbet-samtools-sort-mem/BWA-MEM2-CPCG0196-F1.log
HISAT2 (failed after 22 hours)

- Config: /hot/software/pipeline/pipeline-align-DNA/Nextflow/development/unreleased/jarbet-samtools-sort-mem/HISAT2-CPCG0196-F1.config
- Nextflow report: /hot/software/pipeline/pipeline-align-DNA/Nextflow/development/unreleased/jarbet-samtools-sort-mem/align-DNA-8.0.0/CPCG0196-F1/log-align-DNA-8.0.0-20220725T174652Z/nextflow-log/report.html
- Log: /hot/software/pipeline/pipeline-align-DNA/Nextflow/development/unreleased/jarbet-samtools-sort-mem/HISAT2-CPCG0196-F1.log
Note that `BWA-MEM2` and `HISAT2` give slightly different error messages. Both say the following:

But only `HISAT2` says the following (several times) in regard to `run_MarkDuplicatesSpark_GATK`: