GATK's MarkDuplicatesSpark outputs an empty BAM file for an ICGC cohort of WGS FASTQs #223
Comments
Additional Context: The complete sample has 5-6 pairs of FASTQs. Per Taka's recommendation we tested with one pair (R1 and R2) and still saw the same issue. This suggests that the problem is not related to having multiple FASTQs or running out of disk space. The current input.csv used for testing has two rows corresponding to two pairs. Feel free to remove the second row so that testing goes faster. |
@Alfredo-Enrique have you guys run the command with
|
Just ran it manually to test on the upstream intermediates. Input for this test is a 7GB BAM file made from two FASTQ pairs of this specific cohort. It has already been sorted by the pipeline and is the intermediate right before the MarkDuplicatesSpark step. GATK version used: Output of command above: |
Hmm, can you try two more things and point me to the logs and outputs of the NF runs?
|
Tried it; Result: No difference. Still empty BAM. Main Nextflow log: Outputs Folder: Process Logs: Nextflow log folders: Random detail: I mislabeled the sample name suffix in the config file with regard to the modified parameters. I changed the name of the output folder to reduce confusion, but the logs might still have the incorrect name, so just ignore it. Correct name: samtools view -c of final bam samtools view -c of intermediate picartds_tool_sorted.bam
Tried it; Result: Also no difference. Still empty BAM. Main Nextflow log: Outputs Folder: Process Logs: Nextflow log folders: samtools view -c main_output.bam samtools view -c of the samtoolsorted_intermediate.bam |
My working hypothesis. Problem originates with FASTQ headers. Example of a fastq from this cohort:
Compared to CPCG0196
Granted, they are different Illumina FASTQ versions. However, what stands out to me is the "@0" that's present towards the end of the FASTQ header if you look at the first example. That seems to be absent in the example from Wikipedia, and there's nothing corresponding to it, as it sits between the index number and the number of the pair. Maybe whatever samtools function GATK uses to pull info from the BAM header sees the "@0" and throws the reads out or considers them unmapped. I tried going through the MarkDuplicatesSpark GATK code and samtools_library, but it's 5-6 layers deep of different Java classes and function calls, so I couldn't figure it out. |
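To check how widespread the suspicious "@" is before digging further into GATK, something like the following could be run over the first few thousand lines of a FASTQ. This is only a minimal sketch: the function names are made up for illustration, the example headers are hypothetical (they are not the real headers from this cohort), and it assumes plain 4-line FASTQ records with no wrapped sequence lines.

```python
from typing import Iterable, Iterator, List


def read_names(fastq_lines: Iterable[str]) -> Iterator[str]:
    """Yield the read name (header minus the leading '@') of each 4-line record."""
    for i, line in enumerate(fastq_lines):
        # Only every 4th line is a header; quality lines may also start with '@',
        # so the position in the record is used instead of the first character.
        if i % 4 == 0:
            header = line.rstrip("\n")
            if not header.startswith("@"):
                raise ValueError(f"line {i + 1} is not a FASTQ header: {header!r}")
            yield header[1:]


def names_with_embedded_at(fastq_lines: Iterable[str]) -> List[str]:
    """Return read names that contain an '@' somewhere after the header marker."""
    return [name for name in read_names(fastq_lines) if "@" in name]


# Hypothetical records, invented for illustration only.
example_lines = [
    "@HYPOTHETICAL:1:1101:1000:2000#ACGTAC@0/1\n",
    "ACGTACGT\n",
    "+\n",
    "@IIIIIII\n",  # a quality line that happens to start with '@'
    "@HYPOTHETICAL:1:1101:1000:2001 1:N:0:ACGTAC\n",
    "TTGGCCAA\n",
    "+\n",
    "IIIIIIII\n",
]

print(names_with_embedded_at(example_lines))
```

If every read in the failing cohort is flagged and none in CPCG0196 are, that would at least be consistent with the header hypothesis, though it would not prove it.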
Tried modifying an input FASTQ by removing the unknown characters. It didn't work, even though the workflow ran to completion and output a BAM file. Details: I extracted a single FASTQ read (including quality scores) and the corresponding pair from the read1 and read2 files. Example here:
Then I modified the "1@" as below and made an input CSV to run through the pipeline.
Output: To speed up testing: Here's an input.csv with the single pair unmodified raw: |
Thanks @Alfredo-Enrique. I am testing with this sample now and I'm also getting a BAM with 0 aligned reads when using mark duplicates spark. I'm working through Taka's suggestions above and will let you know if I can get any aligned reads. |
Do we have any other samples/dataset that follow the header format for the failing samples? If so, we can look into testing those to confirm whether it's a header-related issue. If we don't, we could try converting the headers for the failing samples to the format that CPCG0196 uses and use that as a test to confirm whether the header is the problem. |
It's the header. FOR SURE. Good problem solving, gang. Shoutout to @tyamaguchi-ucla for pointing me to this: https://github.com/uclahs-cds/tool-register-dataset/issues/45 Tested with the single paired read from above, and just modified the header to be like the ones from the SRA file in issue 45, and changed the length. It worked and ran to completion with Spark on. One of the pairs of modified FASTQ, same as the example in the issue, just with the header changed:
samtools view -c output shows the 2 reads: |
Do we have a plan for handling these types of FASTQs? Like some sort of processing step to convert headers? @tyamaguchi-ucla |
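If we did go the conversion route, one possible shape for such a step is sketched below. This is an assumption-laden illustration, not a tested pipeline module: `rename_fastq_records`, the `prefix.N/mate` naming scheme, and the example record are all hypothetical, and it assumes 4-line FASTQ records with R1 and R2 listing reads in the same order so that mates end up with matching names.

```python
from typing import Iterable, Iterator


def rename_fastq_records(lines: Iterable[str], prefix: str, mate: int) -> Iterator[str]:
    """Replace each read name with '<prefix>.<n>/<mate>' and blank the '+' line.

    Sequence and quality lines are passed through untouched.
    """
    counter = 0
    for i, line in enumerate(lines):
        position = i % 4
        if position == 0:
            counter += 1
            yield f"@{prefix}.{counter}/{mate}\n"
        elif position == 2:
            # The '+' separator may repeat the old read name; drop it.
            yield "+\n"
        else:
            yield line


# Hypothetical record, invented for illustration only.
original = [
    "@HYPOTHETICAL:1:1101:1000:2000#ACGTAC@0/1\n",
    "ACGTACGT\n",
    "+HYPOTHETICAL:1:1101:1000:2000#ACGTAC@0/1\n",
    "IIIIIIII\n",
]

print("".join(rename_fastq_records(original, prefix="sample", mate=1)), end="")
```

The trade-off is that the original instrument, lane, and tile information in the read names is lost, so a mapping back to the raw FASTQs would need to be kept separately if anyone relies on it.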
Additional update: Taka suggested I try "samtools markdup". I tried it on the sorted intermediate. Unlike GATK MarkDuplicatesSpark, this does work and does give an output. I had to follow a different workflow (fix mate pairs > re-sort by position > samtools markdup), but it ended up working. This was a FASTQ pair of interest without the modified headers. |
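For anyone who wants to repeat the samtools route, the sequence of steps could be sketched roughly as below. The file names and the `markdup_commands` helper are hypothetical, and the commands actually run for this test may have used different options. The leading name sort is included because `samtools fixmate` expects name-grouped input, and `-m` adds the mate score tags that `samtools markdup` uses.

```python
import shlex
from typing import List


def markdup_commands(input_bam: str, out_prefix: str) -> List[List[str]]:
    """Build (but do not run) the samtools commands for a markdup workflow."""
    name_sorted = f"{out_prefix}.namesorted.bam"
    fixmated = f"{out_prefix}.fixmate.bam"
    position_sorted = f"{out_prefix}.positionsorted.bam"
    marked = f"{out_prefix}.markdup.bam"
    return [
        # 1. Group reads by name so mates are adjacent.
        ["samtools", "sort", "-n", "-o", name_sorted, input_bam],
        # 2. Fix mate information and add mate score tags (-m).
        ["samtools", "fixmate", "-m", name_sorted, fixmated],
        # 3. Re-sort by coordinate, which markdup requires.
        ["samtools", "sort", "-o", position_sorted, fixmated],
        # 4. Mark duplicates.
        ["samtools", "markdup", position_sorted, marked],
    ]


# Print the commands for review; pass each list to subprocess.run to execute.
for command in markdup_commands("sorted_intermediate.bam", "sample"):
    print(shlex.join(command))
```

Building the commands as lists keeps them easy to inspect and to hand to `subprocess.run(..., check=True)` one at a time, so a failure in any step stops the chain instead of silently producing an empty BAM.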
Maybe we just add an additional module for a "samtools markdup" workflow that the user can select? That way we don't have to bother with scrubbing FASTQs. Something to note, though: it is quite possible that we run into the same out-of-memory error as when using Picard MarkDuplicates #221 (remember, same cohort). I saw a random comment (I think here) that samtools markdup is modeled after Picard tools. If that's true, we may well see the same memory issue when running the whole sample. |
I have a few suggestions. @Alfredo-Enrique
We could look into it but it'll take some time to implement it and samtools might have some edge cases.
@Alfredo-Enrique what's the total size of the FASTQs? Maybe around 70GB but we haven't confirmed. CPCG0196-F1 is about 400GB in total but it went through the pipeline. |
Answer (2 years ago) from GATK Team:
|
Link to created GATK issue broadinstitute/gatk#8134 |
@tyamaguchi-ucla @jarbet
they also made a suggestion for using the
|
Really quick response! Here's how to check the BAMs. We still have the intermediate BAMs aligned using BWA-MEM2. On one of the F2 nodes, load SAMtools by typing https://samtools.github.io/hts-specs/SAMv1.pdf (if you want to know more about SAM format) It looks like the
I haven't looked into this option before but we can try that. @Alfredo-Enrique do you want to try this out when you get a chance? |
I think this is actually fine because each line contains paired end information in the BAM. (both forward |
Thanks! So the original input was a FASTQ which was converted (with samtools?) into a bam that went into their MarkDuplicatesSpark tool, right? |
We use either BWA-MEM2 or HISAT2 (aligner) to align reads against the reference genome. See https://github.com/uclahs-cds/pipeline-align-DNA |
You can check what tools were used in the BAM header(
|
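As a rough illustration of reading the tool history out of a BAM header: in the SAM format, each program that touched the file is recorded on an `@PG` line with tab-separated tags such as `ID`, `PN` (program name), and `VN` (version). The sketch below parses header text such as the output of `samtools view -H`. The `programs_in_header` name and the example header are hypothetical and not taken from the BAMs in this issue.

```python
from typing import Dict, List, Optional


def programs_in_header(header_text: str) -> List[Dict[str, Optional[str]]]:
    """Extract ID, PN and VN from every @PG line of a SAM header."""
    programs = []
    for line in header_text.splitlines():
        if not line.startswith("@PG"):
            continue
        # Fields after the record type look like 'TAG:value'; split on the
        # first ':' only, because values (e.g. command lines) may contain ':'.
        fields = dict(f.split(":", 1) for f in line.split("\t")[1:] if ":" in f)
        programs.append({tag: fields.get(tag) for tag in ("ID", "PN", "VN")})
    return programs


# Hypothetical header text, invented for illustration only.
example_header = (
    "@HD\tVN:1.6\tSO:coordinate\n"
    "@SQ\tSN:chr1\tLN:1000\n"
    "@PG\tID:bwa-mem2\tPN:bwa-mem2\tVN:2.2.1\tCL:bwa-mem2 mem ref.fa r1.fq r2.fq\n"
    "@PG\tID:samtools\tPN:samtools\tPP:bwa-mem2\tVN:1.15\n"
)

for program in programs_in_header(example_header):
    print(program)
```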
I updated the information for the issue on GATK, and they said they will look into a possible fix. |
Thank you, @madisonjordan! Very promising response! |
Describe the bug
I have an ICGC cohort of WGS FASTQs. When run through the align-DNA pipeline, they produce an empty BAM file with no reads. The pipeline runs to completion without any errors or issues. When looking at the upstream intermediates, the reads are there and the BAM files are of the expected size, all the way up to the MarkDuplicatesSpark step. Here, the output is an empty (47K) BAM file with decoys, but 0 reads. This was confirmed by
samtools view -c
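Since the pipeline finishes without any error even when the BAM is empty, a small guard around that count could make the failure loud. The following is a minimal sketch, not part of the pipeline: `count_reads` and `require_reads` are hypothetical names, and the `run` parameter exists only so the logic can be exercised without samtools installed.

```python
import subprocess
from typing import Callable


def count_reads(bam_path: str, run: Callable = subprocess.run) -> int:
    """Return the number of records reported by `samtools view -c`."""
    result = run(
        ["samtools", "view", "-c", bam_path],
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())


def require_reads(bam_path: str, run: Callable = subprocess.run) -> int:
    """Raise if the BAM has no records, otherwise return the record count."""
    n_reads = count_reads(bam_path, run=run)
    if n_reads == 0:
        raise RuntimeError(f"{bam_path} contains 0 reads")
    return n_reads


# Usage (requires samtools on PATH):
#     require_reads("main_output.bam")
```

Comparing this count between the sorted intermediate and the MarkDuplicatesSpark output would catch the case described here, where the header and decoys are written but no reads are.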
Did some testing. Took the output from the step immediately before MarkDuplicatesSpark, which is sorting, and manually ran MarkDuplicatesSpark with reduced flags. It launched and ran, but it also output an empty BAM file without any errors. Tried different Docker images from various GATK releases with the same result: an empty BAM file.
So there are two possibilities: 1) there's something specific about these files that causes GATK MarkDuplicatesSpark to output an empty file, or 2) GATK is fine and a bug introduced in the preceding steps causes MarkDuplicatesSpark to output an empty BAM.
python /hot/user/alfgonzalez/utility/tool-submit-nf/submit_nextflow_pipeline.py --nextflow_script /hot/user/alfgonzalez/pipeline/metapipeline-DNA/main/metapipeline-DNA/external/pipeline-align-DNA/pipeline/align-DNA.nf --nextflow_config /hot/user/alfgonzalez/pipeline/metapipeline-DNA/main/metapipeline-DNA/external/pipeline-align-DNA/pipeline/config/template.config --partition_type F72 --pipeline_run_name "TEST-align-DNA_V3.0_Metapipeline_8_Enable_true_bug_replication" --email [email protected]
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Will get an empty bam file with no error-outs and with 0 reads.
Screenshots
N/A
Additional context