Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GATK's MarkDuplicatesSpark outputs and empty bam file for an ICGC cohort of WGS FASTQs #223

Open
Alfredo-Enrique opened this issue Jul 23, 2022 · 24 comments
Assignees
Labels
blocked Work on this issue cannot proceed because another issue must be closed first bug Something isn't working

Comments

@Alfredo-Enrique
Copy link
Member

Alfredo-Enrique commented Jul 23, 2022

Describe the bug
I have an ICGC cohort of WGS FASTQs. When run through the align-DNA pipeline they output an empty bam file with no reads. The pipeline goes through completion without any error or issues. When looking at the upstream intermediates, the reads are there and the bam files are of the corresponding size, going all the way to MarkDuplicatesSpark step. Here, the output is an empty (47K) bam file with decoys, but 0 reads. This was confirmed by samtools view -c

Did some testing. Took the output from the immediate step before MarkDuplicatesSpark, which is sorting, and manually ran MarkDuplicatesSpark with reduced flags; it ran and launched, but it also outputted an empty bam file without any error flags. Tried different dockers from various GATK releases and same result, empty bam file.

So possibilities 1) it could be that there's something specific about these files that causes the GATK MarkDuplicatesSpark option to output an empty file, or 2) GATK is fine and there is a bug introduced at the preceding steps that causes MarkDuplicatesSpark to output an empty bam.

  • Pipeline release version : 8.0.0
  • Cluster you are using (SGE/Slurm-Dev/Slurm-Test) : Slurm
  • Node type (F2s/F72s/M64s) : F72
  • Submission method (interactive/submission script) : Python submission script | Also manual runs and same thing
  • Actual submission script (python submission script, "nextflow run ...", etc.) :
    python /hot/user/alfgonzalez/utility/tool-submit-nf/submit_nextflow_pipeline.py --nextflow_script /hot/user/alfgonzalez/pipeline/metapipeline-DNA/main/metapipeline-DNA/external/pipeline-align-DNA/pipeline/align-DNA.nf --nextflow_config /hot/user/alfgonzalez/pipeline/metapipeline-DNA/main/metapipeline-DNA/external/pipeline-align-DNA/pipeline/config/template.config --partition_type F72 --pipeline_run_name "TEST-align-DNA_V3.0_Metapipeline_8_Enable_true_bug_replication" --email [email protected]
  • Sbatch or qsub command and logs if applicable | NA
  • Config files : /hot/user/alfgonzalez/pipeline/metapipeline-DNA/main/metapipeline-DNA/external/pipeline-align-DNA/pipeline/config/template.config
  • input_csv: /hot/user/alfgonzalez/pipeline/metapipeline-DNA/main/metapipeline-DNA/external/pipeline-align-DNA/pipeline/inputs/align-DNA.inputs.csv
  • Path to the working directory : Was set to /scratch
  • Path to output folder: /hot/user/alfgonzalez/pipeline/metapipeline-DNA/main/metapipeline-DNA/external/pipeline-align-DNA/pipeline/test/output_spark_debug/align-DNA-8.0.0
  • Path to intermediates folder: /hot/user/alfgonzalez/pipeline/metapipeline-DNA/main/metapipeline-DNA/external/pipeline-align-DNA/pipeline/test/output_spark_debug/align-DNA-8.0.0/SP195375/BWA-MEM2-2.2.1/intermediate/
  • Any logs produced by the pipeline :
    • Main Nextflow Log: /hot/user/alfgonzalez/pipeline/metapipeline-DNA/main/metapipeline-DNA/external/pipeline-align-DNA/pipeline/TEST-align-DNA_V3.0_Metapipeline_8_Enable_true_bug_replication.log
    • Process Logs folder: /hot/user/alfgonzalez/pipeline/metapipeline-DNA/main/metapipeline-DNA/external/pipeline-align-DNA/pipeline/test/output_spark_debug/align-DNA-8.0.0/SP195375/log-align-DNA-8.0.0-20220718T175029Z/

To Reproduce
Steps to reproduce the behavior:

  1. Clone align-DNA 8.0.0 release into an appropriately named folder where you wish to debug
  2. Modify the template.config with the specified iput.csv above and to match the template.config from my run. Alternatively you can just copy the template.config specified above and modify the output to somewhere where you want to debug.
  3. Use python submission script to launch with desired config from above.
  4. Wait 1-2 hr to finish with no errors, and look at the outputs.

Expected behavior
Will get an empty bam file with no error-outs and with 0 reads.

Screenshots
N/A

Additional context

  • Used BWA-mem2 workflow
  • Manually used picard tools ValidateSAM to check upstream intermediate bam files and everything looks good. Same thing for final output.
  • Ran the same pipeline but changed the inputs to CPCG0196 and a-mini, and they all ran without an issue.
@Alfredo-Enrique
Copy link
Member Author

Additional Context: The complete sample has 5-6 pairs of fastqs. Per Taka's recommendation we tested with one pair (R1 and R2) and still the same issue. This suggests that the problem is not related to having multiple fastqs or running out of disk space. The current input.csv used for testing has two rows corresponding to two pairs. Feel free to remove the second row so that it goes faster when testing.

@tyamaguchi-ucla
Copy link
Contributor

@Alfredo-Enrique have you guys run the command with --VALIDATION_STRINGENCY STRICT instead of LENIENT (our default) to see if it still works? (both markduplicatesSpark and picard SortSam).

--read-validation-stringency LENIENT \

@Alfredo-Enrique
Copy link
Member Author

@Alfredo-Enrique have you guys run the command with --VALIDATION_STRINGENCY STRICT instead of LENIENT (our default) to see if it still works? (both markduplicatesSpark and picard SortSam).

--read-validation-stringency LENIENT \

Just ran it manually to test out on the upstream intermediates. Input for this test is a 7GB bam file made of two fastq pairs of this specific cohort and which has already been sorted by the pipeline and is the intermediate before SparkMarkDuplicates step.

GATK version used: 4.1.9.0 and command used:
gatk --java-options "-Djava.io.tmpdir=/scratch" MarkDuplicatesSpark --input C19CUACXX.1.1-1.sorted.bam --output /scratch/C19CUACXX.1.1.sorted.Spark.Strigency-strict.bam --conf 'spark.local.dir=/scratch' --tmp-dir /scratch --read-validation-stringency STRICT

Output of command above:
47k size file with 0 reads per samtools view -c

image

@tyamaguchi-ucla
Copy link
Contributor

Hmm, can you try two more things and point me to the logs and outputs of the NF runs?

  1. For v8.0.0, Use --VALIDATION_STRINGENCY STRICT and --verbosity DEBUG for both picard SortSam and markduplicatesSpark. Also, add --spark-verbosity DEBUG to markduplicatesSpark

  2. For the main branch, we recently replaced picard SortSam with samtools sort for faster processing. Use the main branch and see if markduplicatesSpark still doesn't produce any reads. Use --VALIDATION_STRINGENCY STRICT and --verbosity DEBUG for markduplicatesSpark. Also, add --spark-verbosity DEBUG to markduplicatesSpark

@Alfredo-Enrique
Copy link
Member Author

Alfredo-Enrique commented Jul 25, 2022

  1. For v8.0.0, Use --VALIDATION_STRINGENCY STRICT and --verbosity DEBUG for both picard SortSam and markduplicatesSpark. Also, add --spark-verbosity DEBUG to markduplicatesSpark

Tried it; Result: No difference. Still empty bam.

Main Nextflow log: /hot/user/alfgonzalez/pipeline/pipeline-align-DNA/pipeline-align-DNA-v8.0.0/pipeline/Debug-align-DNA_v8.0.0_validation_STRICT_verbosity_DEBUG_01.log

Outputs Folder: /hot/user/alfgonzalez/pipeline/pipeline-align-DNA/pipeline-align-DNA-v8.0.0/pipeline/output

Process Logs: /hot/user/alfgonzalez/pipeline/pipeline-align-DNA/pipeline-align-DNA-v8.0.0/pipeline/output/ICGC_BOCA_FR_SP195375_Spark_ON_validation_STRICT_verbosity_DEBUG/align-DNA-8.0.0/ICGC_BOCA_FR_SP195375_Spark_ON_validation_STRICT_verbosity_DEBUG/log-align-DNA-8.0.0-20220725T013312Z/process-log

Nextlfow log folders: /hot/user/alfgonzalez/pipeline/pipeline-align-DNA/pipeline-align-DNA-v8.0.0/pipeline/output/ICGC_BOCA_FR_SP195375_Spark_ON_validation_STRICT_verbosity_DEBUG/align-DNA-8.0.0/ICGC_BOCA_FR_SP195375_Spark_ON_validation_STRICT_verbosity_DEBUG/log-align-DNA-8.0.0-20220725T013312Z/nextflow-log

Random detail, I mislabeled the sample name suffix in the config file regards to details about the modified parameters. I changed the name of the output folder to reduce confusion, but the logs might still have the incorrect name, so just ignore it.

Correct name: ICGC_BOCA_FR_SP195375_Spark_ON_validation_STRICT_verbosity_DEBUG
Incorrect name to ignore: ICGC_BOCA_FR_SP195375_Spark_off_SortSAM_allmem_Markduplicates_allmem

samtools view -c of final bam

image

samtools view -c of intermediate picartds_tool_sorted.bam

image


  1. For the main branch, we recently replaced picard SortSam with samtools sort for faster processing. Use the main branch and see if markduplicatesSpark still doesn't produce any reads. Use --VALIDATION_STRINGENCY STRICT and --verbosity DEBUG for markduplicatesSpark. Also, add --spark-verbosity DEBUG to markduplicatesSpark

Tried it; Result: Also no difference. Still empty bam.

Main Nextflow log: /hot/user/alfgonzalez/pipeline/pipeline-align-DNA/pipeline-align-DNA-220722-acf77d9/Debug-align-DNA_acf77d9_validation_STRICT_verbosity_DEBUG_01.log

Outputs Folder: /hot/user/alfgonzalez/pipeline/pipeline-align-DNA/pipeline-align-DNA-220722-acf77d9/output

Process Logs: /hot/user/alfgonzalez/pipeline/pipeline-align-DNA/pipeline-align-DNA-220722-acf77d9/output/ICGC_BOCA_FR_SP195375_Spark_ON_validation_STRICT_verbosity_DEBUG/align-DNA-8.0.0/ICGC_BOCA_FR_SP195375_Spark_ON_validation_STRICT_verbosity_DEBUG/log-align-DNA-8.0.0-20220725T022148Z/process-log

Nextlfow log folders: /hot/user/alfgonzalez/pipeline/pipeline-align-DNA/pipeline-align-DNA-220722-acf77d9/output/ICGC_BOCA_FR_SP195375_Spark_ON_validation_STRICT_verbosity_DEBUG/align-DNA-8.0.0/ICGC_BOCA_FR_SP195375_Spark_ON_validation_STRICT_verbosity_DEBUG/log-align-DNA-8.0.0-20220725T022148Z/nextflow-log

samtools view - c main_output.bam

image

samtools view -c of the samtoolsorted_intermediate.bam

image

@Alfredo-Enrique
Copy link
Member Author

Alfredo-Enrique commented Jul 25, 2022

My working hypothesis. Problem originates with FASTQ headers.

Example of a fastq from this cohort:

@HWI-ST700660_163:1:1101:1243:1870#1@0/1
NAGGCTTGAAAGAAAAGACTGTGCCCTTAATTGGGTTCTTGTTCTGCATTAATTAAGTTTATTGGGACTTTATTGCTCAAATGTTTTATCCCATACCTGGA
+
#1=DDFFFHHHHHJJJIJJJJJJJJJIJGIIIHJJCGGHJIIIGGGGHHIDGIIJEHFHJIIIEE?;FHHIEHH)=AACEEFFEFFECCAA>AACCC@ABB
@HWI-ST700660_163:1:1101:1225:1890#1@0/1
NTCAAAGCCCTTACAGGAGAAGCTGTTATCACCCCGTTTCCAGGGGGCTGGGAACCCTGGGATATGCCCAGATAGGGCTGGGGGGGTCCTCTGGAGTCCCA
+
#1:DDFFFHHHGHDIIJGIIGHGIJJJCIHHGIIJHHIIBFHIJJIJEIHGHCA=1=?AC?>AC@A:A=CA9:CCCDC95<BB##################
@HWI-ST700660_163:1:1101:1246:1943#1@0/1
CCTTCAGGTTATAGTGTCCCCAGCCCTGGCCCTCCCGCCTGGGTCTTTCTGAGGGTCTGCGCTAGGGCATCAGAATCTGGGCTTGTAAAATTCCTTAGGTG
+
CCCFFFFFHHHHHJIJJIJJJGHIIJGGHJJJJJJJIIJGGII?DFGHGIGFCCHCD@HIEHDEFBC@BCCA>>>;>:@C???B?A>((4>>>C>C#####
@HWI-ST700660_163:1:1101:1141:1992#1@0/1
AAGAGATGGCTTCTTGTGACTATAGCTTTCAAAGTAAAAGTAGTGGCCAAAGAAAACTCTTCTGGCTGTAAAAGTAAAAGTGGACTTTCCTGAAGCTGGGC
+

Compared to CPCG0196

@H801:109:D274GACXX:2:1101:1117:2064 1:N:0:
AAGANGAATATTCCGTTCTGCATTGGCAATGTCAAAGGAGATGATGCTTTAGTAAAAAGATTTCTCGATAAAGCTTTTGAAACTCAATATGTTGTCCTTGA
+
CCCF#2=DHHHHDBCAFFHHGIIIIIIIGIIIEIIIIG>@FGHIIIIIIIICFHHIIDIGE@<FGFGIIIDCGFEAHHHFFFFFEDEE@@>@DCCCDCCCC
@H801:109:D274GACXX:2:1101:1205:2070 1:N:0:
CAAANTCACAGGTTCTGGGGATTAGGCTGTAGACATTGTAGGGTAAGGGGACATTATTATTGTGCTTACCAGCAAGGATAGCCAACGGTTCCCAGCACATA
+
<??D#2=DAD8?C+A<DEEIFA@E<EEEE1CDDID@AE*?@<?9DD>ADD?@BC4=4..@=).)7@C@?)))6@D;?;A:@####################
@H801:109:D274GACXX:2:1101:1235:2080 1:N:0:
TAGANAGAGAGAAATACTAAGAGTAAAAACTTAATTAATGAAATGGAAATCAAATATACAACAAAGTTACTATTTTAAAAGACTGATATTCATAAATCTAA
+
BCCF#2ADDHHGHJJJJJJJIIJAHHIJIEHIHIIJIJIGEGHGIJIJJIEHIJIDGIHEGACGEFFGIIFGHIBCGHEHBEFFEFFFEDEDDDCEDCDC3
@H801:109:D274GACXX:2:1101:1194:2089 1:N:0:
GATTNATTTCACTACTGTAAAACAAGCTGGGCACAGTGACTCACGCCTAGAATCCCAGGACGTTGTGAGGCTGAGACTGGGGGAATGCTTGAGTGCAGAAC
+

Granted, they are different illumina fastq versions.

However, what stands out to me is the "@0" that's present towards the end of the fastq if you look at the first example. That seems to be absent in the example from wikipedia and there's nothing corresponding to it as this is betwen the index numbe and the number of the pair. Maybe whatever samtools function gatk uses to pull info from the bam header sees the "@0" and throws it out or considers them unmapped reads.

I tried going through the MarkDuplicatesSpark GATK code and samtools_library , but it's like 5-6 layers deep of different java scripts and functions called so couldn't figure it out.

@Alfredo-Enrique
Copy link
Member Author

However, what stands out to me is the "@0" that's present towards the end of the fastq if you look at the first example. That seems to be absent in the example from wikipedia and there's nothing corresponding to it as this is betwen the index numbe and the number of the pair. Maybe whatever samtools function gatk uses to pull info from the bam header sees the "@0" and throws it out or considers them unmapped reads.

Tried modifiying an input fastq by removing the unknown characters; Didn't work despite going through workflow to completion and outputting a bam file.

Details:
I used the latest align-DNA main commit from the above examples and which has been modified to run with verbosity commands "DEBUG" and other flags as previously suggested.

I extracted a single fastq read (including Quality Scores) and the corresponding pair from the read1 and read2 files. Example here:

@HWI-ST700660_163:1:1101:1243:1870#1@0/1
NAGGCTTGAAAGAAAAGACTGTGCCCTTAATTGGGTTCTTGTTCTGCATTAATTAAGTTTATTGGGACTTTATTGCTCAAATGTTTTATCCCATACCTGGA
+
#1=DDFFFHHHHHJJJIJJJJJJJJJIJGIIIHJJCGGHJIIIGGGGHHIDGIIJEHFHJIIIEE?;FHHIEHH)=AACEEFFEFFECCAA>AACCC@ABB

Then I modified the "1@" as below and made an input csv through run through the pipeline.

@HWI-ST700660_163:1:1101:1243:1870#0/1
NAGGCTTGAAAGAAAAGACTGTGCCCTTAATTGGGTTCTTGTTCTGCATTAATTAAGTTTATTGGGACTTTATTGCTCAAATGTTTTATCCCATACCTGGA
+
#1=DDFFFHHHHHJJJIJJJJJJJJJIJGIIIHJJCGGHJIIIGGGGHHIDGIIJEHFHJIIIEE?;FHHIEHH)=AACEEFFEFFECCAA>AACCC@ABB

Output:
Intermediate files appropiately have 2 reads as seen at the top of the image. Output files has 0.
image

To speed up testing: Here's an input.csv with the single pair unmodified raw: /hot/user/alfgonzalez/pipeline/pipeline-align-DNA/pipeline-align-DNA-220722-acf77d9/input/align-DNA.input_SP195275_single_paired_read.csv

@jarbet
Copy link
Contributor

jarbet commented Jul 25, 2022

To speed up testing: Here's an input.csv with the single pair unmodified raw: /hot/user/alfgonzalez/pipeline/pipeline-align-DNA/pipeline-align-DNA-220722-acf77d9/input/align-DNA.input_SP195275_single_paired_read.csv

Thanks @Alfredo-Enrique. I am testing with this sample now and I'm also getting a BAM with 0 aligned reads when using mark duplicates spark. I'm working through Taka's suggestions above and will let you know if I can get any aligned reads.

@yashpatel6
Copy link
Contributor

Do we have any other samples/dataset that follow the header format for the failing samples? If so, we can look into testing those to confirm whether it's a header-related issue. If we don't, we could try converting the headers for the failing samples to the format that CPCG0196 uses and use that as a test to confirm whether the header is the problem.

@Alfredo-Enrique
Copy link
Member Author

Alfredo-Enrique commented Jul 25, 2022

It's the header. FOR SURE. Good problem solving gang. Shoutout to @tyamaguchi-ucla for pointing me to this: https://github.com/uclahs-cds/tool-register-dataset/issues/45

Tested with the single paired read from above, and just modified the header to be like ones from SRA file in the issue 45, and changed the lenght. It worked and went through to completion with Spark On

One of the pairs of modified fastq. Same from example in issue, just header changed:

@SRR5456220.1 1 length=101
NAGGCTTGAAAGAAAAGACTGTGCCCTTAATTGGGTTCTTGTTCTGCATTAATTAAGTTTATTGGGACTTTATTGCTCAAATGTTTTATCCCATACCTGGA
+SRR5456220.1 1 length=101
#1=DDFFFHHHHHJJJIJJJJJJJJJIJGIIIHJJCGGHJIIIGGGGHHIDGIIJEHFHJIIIEE?;FHHIEHH)=AACEEFFEFFECCAA>AACCC@ABB

Samtools view -c output shows the 2 reads:

image

@yashpatel6
Copy link
Contributor

Do we have a plan for handling these types of FASTQs? Like some sort of processing step to convert headers? @tyamaguchi-ucla

@Alfredo-Enrique
Copy link
Member Author

Additional update: Taka suggested I tried "samtools markdup". I tried it on the intermediate. Unlike GATK MarkDuplicatesSpark this does work and does give an output. Had to follow a different worfklow of fix mate pairs > resort based on position > samtools mark dup but it ended up working.

Ran it on the sorted intermediates. This is a fastq pair of interest without the modified headers.

image

@Alfredo-Enrique
Copy link
Member Author

Alfredo-Enrique commented Jul 25, 2022

Do we have a plan for handling these types of FASTQs? Like some sort of processing step to convert headers?

Maybe we just add an additional module for a "samtools markdup" specific workflow as specified by the user? That way we don't have to bother with scrubbing fastqs.

Something to note though, It is very well possible though that we might run into the same out of memory error as when using picard mark duplicates #221 (remember same cohort) . I saw a random comment (I think here) that samtools markdup is modeled after picard tools. If that's true, it's quite possible we see the same memory issue when running the whole sample.

@tyamaguchi-ucla
Copy link
Contributor

tyamaguchi-ucla commented Jul 25, 2022

Do we have a plan for handling these types of FASTQs? Like some sort of processing step to convert headers?

I have a few suggestions. @Alfredo-Enrique

  • Create an issue at https://github.com/broadinstitute/gatk with enough details if you haven't (triple check if there are similar issues before creating an issue - Yash and I did a quick search but didn't find any). They may or may not look at it. They may respond and release a hot fix version if we're lucky.
  • Check if SRA (or other groups) has a tool to replace the FASTQ headers because they did that for all their FASTQs. We generally want to avoid tweaking the original FASTQs though..

Maybe we just add an additional module for a "samtools markdup" specific workflow as specified by the user? That way we don't have to bother with scrubbing fastqs.

We could look into it but it'll take some time to implement it and samtools might have some edge cases.

Something to note though, It is very well possible though that we might run into the same out of memory error as when using picard mark duplicates #221 (remember same cohort) . I saw a random comment (I think here) that samtools markdup is modeled after picard tools. If that's true, it's quite possible we see the same memory issue when running the whole sample.

@Alfredo-Enrique what's the total size of the FASTQs? Maybe around 70GB but we haven't confirmed. CPCG0196-F1 is about 400GB in total but it went through the pipeline.

@tyamaguchi-ucla tyamaguchi-ucla added bug Something isn't working blocked Work on this issue cannot proceed because another issue must be closed first labels Jul 25, 2022
@madisonjordan
Copy link

Related GATK community post

Answer (2 years ago) from GATK Team:

You are correct, spaces and @ symbols are not allowed in read names with GATK. FASTQ format is more lenient than SAM/BAM format so following FASTQ format requirements does not guarantee your read names will work with GATK.

Could you print out an example read name from your SAM file to determine what name is being carried from your FASTQ file? Here is more information about what GATK expects in a SAM File.

You have a couple options to fix this issue. You could edit the read names in your FASTQ so that they will be in the expected SAM/BAM read name format which would be helpful for downstream processing. We don't have a GATK tool that will easily do this so you will need to look into outside solutions. Or, you can use the READ_NAME_REGEX option to specify to MarkDuplicates how to view optical duplicates in your file. This option may be more challenging to get working but it is available.

@madisonjordan
Copy link

Link to created GATK issue broadinstitute/gatk#8134

@madisonjordan
Copy link

madisonjordan commented Dec 20, 2022

@tyamaguchi-ucla @jarbet
There's been a response from GATK about this issue which I don't know enough to respond to.

Another thing to check. You say "FASTQ headers contain another @". The input to SparkDuplicates is the bam though. Can you check that that line actually made it into the bam?
I.e. is there a read with the readname "HWI-ST700660_163:1:1101:1243:1870#1@0/1"?

they also made a suggestion for using the --READ_NAME_REGEX option

By default it assumes readnames are structured like this: "MYNAME:1:22:1193:123181". There is an option --READ_NAME_REGEX to provide a custom regex to parse with which might help in your case. You might also try setting that to null to disable optical duplicate marking.

@tyamaguchi-ucla
Copy link
Contributor

Another thing to check. You say "FASTQ headers contain another @". The input to SparkDuplicates is the bam though. Can you check that that line actually made it into the bam?
I.e. is there a read with the readname "HWI-ST700660_163:1:1101:1243:1870#1@0/1"?

Really quick response! Here's how to check the BAMs. We still have the intermediate BAMs aligned using BWA-MEM2 /hot/user/alfgonzalez/pipeline/metapipeline-DNA/main/metapipeline-DNA/external/pipeline-align-DNA/pipeline/test/output_spark_debug/align-DNA-8.0.0/SP195375/BWA-MEM2-2.2.1/intermediate/align-DNA-BWA-MEM2

On one of the F2 nodes, load SAMtools by tying
module load samtools and then type
samtools view C16MNACXX.1.1-1.bam | less (or head) (make sure to use the piping to avoid printing out the whole BAM)

https://samtools.github.io/hts-specs/SAMv1.pdf (if you want to know more about SAM format)

It looks like the /1 was dropped from the readnames like HWI-ST700660_163:1:1101:1225:1890#1@0 but otherwise, the answer seems yes. It looks like the input FASTQs were moved somewhere and I couldn't find the original FASTQs. @Alfredo-Enrique can you point us to the original input FASTQs?

By default it assumes readnames are structured like this: "MYNAME:1:22:1193:123181". There is an option --READ_NAME_REGEX to provide a custom regex to parse with which might help in your case. You might also try setting that to null to disable optical duplicate marking.

I haven't looked into this option before but we can try that. @Alfredo-Enrique do you want to try this out when you get a chance?

@tyamaguchi-ucla
Copy link
Contributor

It looks like the /1 was dropped from the readnames like HWI-ST700660_163:1:1101:1225:1890#1@0 but otherwise, the answer seems yes.

I think this is actually fine because each line contains paired end information in the BAM. (both forward /1 and reverse /2 reads)

@madisonjordan
Copy link

Thanks!

So the original input was a FASTQ which was converted (with samtools?) into a bam that went into their MarkDuplicatesSpark tool, right?
I will add that information along with the new header if that's the case since it seems important for reproducibility from their end.

@tyamaguchi-ucla
Copy link
Contributor

So the original input was a FASTQ which was converted (with samtools?) into a bam that went into their MarkDuplicatesSpark tool, right?

We use either BWA-MEM2 or HISAT2 (aligner) to align reads against the reference genome. See https://github.com/uclahs-cds/pipeline-align-DNA

@tyamaguchi-ucla
Copy link
Contributor

You can check what tools were used in the BAM header(@PG)

samtools view -H C16MNACXX.1.1-1.bam | grep PG

@PG     ID:bwa-mem2     PN:bwa-mem2     VN:2.2.1        CL:bwa-mem2 mem -t 56 -M -R @RG\tID:C16MNACXX.1.Seq1\tCN:HWI-ST700660_163\tLB:C16MNACXX.1.1\tPL:ILLUMINA\tPU:HWI-ST700660_163\tSM:SP195375 genome.fa d2ca6c9ab30faaaf8e3dbf3b7262a738.C16MNACXX_1_1_1.fastq.gz bdd9006c66e8655790451e89050f0004.C16MNACXX_1_1_2.fastq.gz
@PG     ID:samtools     PN:samtools     PP:bwa-mem2     VN:1.12 CL:samtools view -@ 56 -S -b
@PG     ID:samtools.1   PN:samtools     PP:samtools     VN:1.15.1       CL:samtools view -H C16MNACXX.1.1-1.bam

@madisonjordan
Copy link

I updated the information for the issue on GATK and said they will look into a possible fix.

@tyamaguchi-ucla
Copy link
Contributor

Thank you, @madisonjordan! Very promising response!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
blocked Work on this issue cannot proceed because another issue must be closed first bug Something isn't working
Projects
None yet
Development

No branches or pull requests

5 participants