Next generation sequencing and bioinformatic and genomic analysis at CDPHE is not CLIA validated at this time. These workflows and their outputs are not to be used for diagnostic purposes and should only be used for public health action and surveillance purposes. CDPHE is not responsible for the incorrect or inappropriate use of these workflows or their results.
This repo contains CDPHE's workflows (`influenza_assembly.wdl` and `influenza_assembly_summary.wdl`) for the whole genome assembly and analysis of Influenza A and B clinical specimens or grown viral isolates. The workflows are written in WDL and can be run on the Terra.bio platform. They are available on Dockstore: influenza assembly and influenza assembly summary.
Sequencing data is generated using an amplicon sequencing approach with the universal Influenza A and universal Influenza B primers, which bind to highly conserved regions at the terminal 3' and 5' ends of each gene segment. Currently this workflow only accepts 2x150 paired-end Illumina read data.
Briefly, our workflow accomplishes the following:
1. Read Cleaning and Filtering. SeqyClean is used to filter reads on quality and length. FastQC is used to generate metrics on sequencing reads, including the number of reads.
2. Assembly. For assembly we use IRMA (Iterative Refinement Meta-Assembler), developed and written by the CDC. More information about IRMA can be found on the CDC IRMA webpage and MIRA webpage. Additional details regarding the use of IRMA in this workflow can be found under the IRMA heading.
3. Calculate Post Assembly Metrics. The number of reads mapped is determined using the `READ_COUNTS.txt` file output from IRMA and the `concat_assembly_qc_metrics.py` script. The mean sequencing depth across each gene segment is calculated using the bam file generated by IRMA and `samtools coverage`. The percent coverage of each gene segment is calculated using the consensus sequences generated by IRMA and the `calculate_percent_coverage.py` script.
4. Perform Clade Assignment using Nextclade. Clades are assigned to the HA and NA gene segments using Nextclade.
5. Generate Summary Report. The summary report combines all metrics, including type and subtype as determined by IRMA, raw number of reads, the percent coverage, mean depth, and reads mapped for each gene segment, the percent of flu mapped reads, and clade assignments, into a single tsv output file. See the `{project_name}_sequencing_results.csv` output for more details.
6. Data Transfer. Intermediate files and outputs are transferred to Google Cloud Platform buckets.
Numbers 1-5 and 6 occur in `influenza_assembly.wdl`, and numbers 5-6 occur in `influenza_assembly_summary.wdl`.
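As a rough illustration of two of the post-assembly metrics, the sketch below follows the definitions given for the summary output: percent coverage as non-N bases divided by the length of the seed reference segment, and percent mapped reads as mapped reads divided by filtered reads. The function names, the multiplication by 100 for coverage, and the rounding are assumptions made for illustration; this is not the code in `calculate_percent_coverage.py` or `concat_assembly_qc_metrics.py`.

```python
def percent_coverage(consensus: str, reference_length: int) -> float:
    """Percent of the reference segment covered by called (non-N) bases.

    Hypothetical helper: `reference_length` is assumed to be the length of
    the IRMA seed reference for the segment.
    """
    if reference_length <= 0:
        raise ValueError("reference_length must be positive")
    # Count every base that is not an N (case-insensitive).
    called = sum(1 for base in consensus if base.upper() != "N")
    return round(100 * called / reference_length, 2)


def percent_mapped_reads(mapped_reads: int, filtered_reads: int) -> float:
    """Mapped reads / filtered reads x 100 (0.0 when there are no filtered reads)."""
    if filtered_reads == 0:
        return 0.0
    return round(100 * mapped_reads / filtered_reads, 2)
```

For example, a 10-base consensus with four Ns against a 10-base reference gives `percent_coverage("ACGTNNNNAC", 10) == 60.0`.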
- `sample_name`: string (underscores and dashes are ok)
- `project_name`: string
- `fastq_R1`: file
- `fastq_R2`: file
- `contam_fasta`*: file (a copy of this file is in the inputs directory of this repo)
- `out_bucket_path`: string
- python scripts*: `concat_preprocess_qc_metrics_py`, `irma_subtyping_results_py`, `calc_percent_coverage_py`, `concat_assemlby_qc_metrics_py`, `results_summary_py`, `caputre_version_py`, `capture_version_summary_py`

*These files should be added to the Terra workspace data.
- `{project_name}_sequencing_results.csv` column descriptions:
  - `analysis_date` - the date the workflow was run
  - `flu_type` - A, B, or mixed, determined by IRMA
  - `HA_subtype`/`NA_subtype` - for Inf A, H1-H18 and N1-N8
  - `HA_clade`/`HA_subclade` - HA clade and subclade determined by Nextclade
  - `NA_clade` - NA clade determined by Nextclade
  - `completed_segments` - number of segments assembled
  - `assembled_segments` - number of segments assembled with at least 90% coverage at 30x depth
  - `filtered_reads` - the number of reads remaining after filtering on quality and length, determined by IRMA
  - `mapped_reads` - the number of reads mapped (totaled across all gene segments), determined by IRMA
  - `percent_mapped_reads` - mapped reads / filtered reads x 100
  - `{segment}_percent_coverage` - the number of non-N bases divided by the length of the seed reference gene segment used by IRMA. The reference gene segments can be found in the IRMA files.
  - `{segment}_mean_depth` - the mean depth across the gene segment, determined using samtools
  - `{segment}_mapped_reads` - the number of reads mapped to the gene segment, determined by IRMA
- Fasta File Headers: Fasta file headers are renamed from the original IRMA output as follows:*
>{sample_name}_{TYPE}_{SEGMENT-SUBTYPE}
# examples
>{sample_name}_A_HA-H1
>{sample_name}_A_NA-N2
>{sample_name}_A_PB2
>{sample_name}_B_HA
>{sample_name}_B_NA
>{sample_name}_B_NP
*Because the fasta headers are renamed, the header in the consensus file no longer matches the reference fasta header in the bam files. To create a match, change the fasta headers back to the IRMA format. Examples:
>A_HA_H1
>A_NA_N2
>A_PB2
>B_HA
>B_NA
>B_NP
- Output directory structure:
├── gs://{out_bucket_path}
# preprocessing
│ ├── fastqc_raw
│ │ ├── {sample_name}_R1_fastqc.html
│ │ ├── {sample_name}_R1_fastqc.zip
│ │ ├── {sample_name}_R2_fastqc.html
│ │ ├── {sample_name}_R2_fastqc.zip
| ├── fastqc_clean
│ │ ├── {sample_name}_R1_fastqc.html
│ │ ├── {sample_name}_R1_fastqc.zip
│ │ ├── {sample_name}_R2_fastqc.html
│ │ ├── {sample_name}_R2_fastqc.zip
| ├── seqyclean
│ │ ├── {sample_name}_clean_SummaryStatistics.tsv
# irma
| ├── irma_assembly_mutlifasta
| | |──{sample_name}_all_assembled_segments.fasta
| ├── irma_assembly_results
| | |──{sample_name}_irma_assembled_gene_segments.csv
| ├── irma_alignments
| | |──{sample_name}
| | | |──{sample_name}_{flu_type}_{segment-subtype}.bam
| | | |──{sample_name}_A_HA-H1.bam
| | | |──{sample_name}_A_PB1.bam
| ├── irma_assemblies
| | |──{sample_name}
| | | |──{sample_name}_{flu_type}_{segment-subtype}_irma.fasta
| | | |──{sample_name}_A_HA-H1_irma.fasta
| | | |──{sample_name}_A_PB1_irma.fasta
| ├── irma_vcfs
| | |──{sample_name}
| | | |──{sample_name}_{flu_type}_{segment-subtype}.vcf
| | | |──{sample_name}_A_HA-H1.vcf
| | | |──{sample_name}_A_PB1.vcf
| ├── irma_logs
| | |──{sample_name}_READ_COUNTS.txt
| | |──{sample_name}_run_info.txt
# post assembly
| ├── sorted_bams
| | |──{sample_name}
| | | |──{sample_name}_{flu_type}_{segment-subtype}.sorted.bam
| | | |──{sample_name}_A_HA-H1.sorted.bam
| | | |──{sample_name}_A_PB1.sorted.bam
# nextclade
| ├── nextclade_out
| | |──{sample_name} #repeat for each sample
| | | |──{sample_name}_na_nextclade.json
| | | |──{sample_name}_na_nextclade.tsv
| | | |──{sample_name}_na_translation.fasta
| | | |──{sample_name}_ha_nextclade.json
| | | |──{sample_name}_ha_nextclade.tsv
| | | |──{sample_name}_ha_HA1_translation.fasta
| | | |──{sample_name}_ha_HA2_translation.fasta
| | | |──{sample_name}_ha_SigPep_translation.fasta
| | | |──{sample_name}_ha_translation.fasta # only for H5
# summary results
│ ├── summary_results
| | ├── {project_name}_sequencing_results.csv
# version capture
│ ├── version_capture
| | ├── sample level version capture files (file for each sample)
| | |── set level version capture files
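To line a renamed consensus header back up with the reference names in the bam files, the renamed header can be converted to the IRMA format shown in the fasta header examples. The sketch below is one way to do that. It assumes the header always ends in `_{TYPE}_{SEGMENT}` or `_{TYPE}_{SEGMENT-SUBTYPE}`, and the function name `to_irma_header` is hypothetical. Splitting from the right keeps sample names that contain underscores or dashes intact.

```python
def to_irma_header(renamed_header: str) -> str:
    """Convert '{sample_name}_A_HA-H1' style headers to IRMA style 'A_HA_H1'.

    Hypothetical helper; assumes the last two underscore-separated fields
    are the flu type (A or B) and the segment (with optional '-SUBTYPE').
    """
    header = renamed_header.lstrip(">").strip()
    # Split from the right so underscores in the sample name are preserved.
    parts = header.rsplit("_", 2)
    if len(parts) != 3 or parts[1] not in {"A", "B"}:
        raise ValueError(f"unexpected header format: {renamed_header!r}")
    _sample_name, flu_type, segment = parts
    # IRMA separates segment and subtype with an underscore, not a dash.
    return f"{flu_type}_{segment.replace('-', '_')}"
```

For example, `to_irma_header(">sample_01_A_HA-H1")` returns `"A_HA_H1"` and `to_irma_header("sample_01_B_NP")` returns `"B_NP"`.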
This section describes the parameters we adjusted when running IRMA and describes the outputs of key files.
We modify the default config file to include the following parameters:
MIN_CONS_SUPPORT="50"
# Minimum allele coverage depth to call plurality consensus, otherwise calls "N". Setting this value too high can negatively impact final amended consensus. 50 is what is used in CDC's MIRA.
MIN_LEN="70"
# Minimum read length to include reads in read gathering. This value should not be greater than the typical read length. We lowered our minimum length because BaseSpace performs adapter trimming, which reduces the read length from the expected 2x150.
DEL_TYPE="DEL"
# Advanced option. If sites are completely missing during read gathering use the reference seed (REF), delete by ambiguation (NNN), or just remove (DEL). Default is old behavior: Uses "NNN" with BLAT and "DEL" otherwise. Can specify per round with space delimiter.
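To make the effect of `MIN_CONS_SUPPORT` concrete, the sketch below applies the rule as described in the comment above to a single site: take the plurality allele and call it only if its depth reaches the threshold, otherwise call "N". This is a simplified illustration under that reading of the parameter, not IRMA's implementation, which is more involved; the function name and the one-character-per-read input are assumptions.

```python
from collections import Counter


def call_consensus_base(alleles: str, min_cons_support: int = 50) -> str:
    """Call one consensus base from the alleles observed at a site.

    `alleles` is assumed to hold one character per read covering the site.
    Returns the plurality allele if its count is at least
    `min_cons_support`, otherwise "N".
    """
    if not alleles:
        return "N"
    base, count = Counter(alleles.upper()).most_common(1)[0]
    return base if count >= min_cons_support else "N"
```

With the default threshold of 50, a site covered by 60 reads of A and 5 of G is called A, while a site covered by 30 reads of A and 10 of G is called N.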
The following are the IRMA output files that we transfer as intermediate files and use for calculating sequencing metrics.
- Consensus Sequences. We pull the fasta files from the `amended_consensus` directory. These fasta files have the `MIN_CONS_SUPPORT` parameter applied. We change all IUPAC ambiguous bases to N and all periods to N. We remove all dashes, as these represent deletions. We rename the header to, e.g., `{sample_name}_A_HA-H1` or `{sample_name}_A_NP`. Note: bases called in the fasta files in the main IRMA output directory require only 1x depth; as such, these files are not transferred or used.
- Bam Files. The reference used in the bam file is the final consensus sequence.
- VCF Files. The reference used in the VCF file is the final consensus sequence.
- `READ_COUNTS.txt`. Located in the `tables` directory, this file contains the number of initial reads, filtered reads, mapped reads, and alt mapped reads (e.g. reads that mapped to an alternative flu subtype).
- `run_info.txt`. Located in the `logs` directory, this file lists all the assembly parameters used by IRMA, including those we adjusted.
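The consensus clean-up described for the `amended_consensus` fasta files can be sketched as follows: keep A, C, G, T, and N, turn IUPAC ambiguity codes and periods into N, drop dashes, and rebuild the header in the `{sample_name}_{TYPE}_{SEGMENT-SUBTYPE}` form. This is an illustrative sketch only; the function names are hypothetical, and preserving the case of kept bases is an assumption rather than something the workflow is known to do.

```python
KEPT_BASES = set("ACGTNacgtn")


def clean_consensus(sequence: str) -> str:
    """Mask ambiguous bases and periods as N and remove deletion dashes."""
    cleaned = []
    for base in sequence:
        if base == "-":
            continue  # dashes represent deletions and are removed
        cleaned.append(base if base in KEPT_BASES else "N")
    return "".join(cleaned)


def rename_header(sample_name: str, irma_header: str) -> str:
    """Turn an IRMA header such as 'A_HA_H1' into '>{sample_name}_A_HA-H1'."""
    parts = irma_header.lstrip(">").strip().split("_")
    flu_type, segment = parts[0], "-".join(parts[1:])
    return f">{sample_name}_{flu_type}_{segment}"
```

For example, `clean_consensus("ACRT.G-A")` returns `"ACNTNGA"`, and `rename_header("s1", ">A_HA_H1")` returns `">s1_A_HA-H1"`.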