This pipeline is used to estimate the relative abundances of reads originating from specific taxa in a sequenced library. The taxonomic level of interest can be specified using the kraken2
labels:
O
: OrderF
: FamilyG
: GenusS
: SpeciesS1
: One level below species
The NCBI taxonomy is used to define the taxa.
nextflow run BCCDC-PHL/taxon-abundance \
--fastq_input <fastq_input_dir> \
--outdir <output_dir>
Alternatively, a 'samplesheet.csv' file may be provided with fields ID, R1, R2:
ID,R1,R2
sample-01,/path/to/sample-01_R1.fastq.gz,/path/to/sample-01_R2.fastq.gz
sample-02,/path/to/sample-02_R1.fastq.gz,/path/to/sample-02_R2.fastq.gz
...
nextflow run BCCDC-PHL/taxon-abundance \
--samplesheet_input </path/to/samplesheet.csv> \
--outdir </path/to/outdir>
By default, results will be summarized at the species level (S
). Other taxonomic levels can be specified using the --taxonomic_level
flag. For example, to collect results for the genus level (G
):
nextflow run BCCDC-PHL/taxon-abundance \
--fastq_input <fastq_input_dir> \
--taxonomic_level G \
--outdir <output_dir>
Multiple taxonomic levels may be provided in a comma-separated list. For example:
nextflow run BCCDC-PHL/taxon-abundance \
--fastq_input <fastq_input_dir> \
--taxonomic_level G,S \
--outdir <output_dir>
nextflow run BCCDC-PHL/taxon-abundance \
--fastq_input <fastq_input_dir> \
--taxonomic_level S,S1 \
--outdir <output_dir>
Reads can be binned by taxonomic group, and extracted to separate output files using the --extract_reads
flag.
When using this flag, a threshold is applied on the percentage of reads assigned to the taxonomic group, below which
reads are not extracted. The default threshold is 1%. It can be modified using the --extract_reads_threshold
flag.
For example, to extract reads for any taxonomic group present at 0.1% or more:
nextflow run BCCDC-PHL/taxon-abundance \
--fastq_input <fastq_input_dir> \
--extract_reads \
--extract_reads_threshold 0.1 \
--outdir </path/to/outdir>
By default, bracken is used to re-estimate the read abundances for each taxonomic group, at a specific taxonomic level (Genus, Species, etc.).
If desired, bracken can be skipped with the --skip_bracken
flag:
nextflow run BCCDC-PHL/taxon-abundance \
--fastq_input <fastq_input_dir> \
--skip_bracken \
--outdir </path/to/outdir>
When the --skip_bracken
flag is used, abundances will be calculated directly from the kraken2 report. Note that the abundance
estimates directly from kraken2 reports may under-estimate the actual abundances. Detailed rationale for including bracken analysis
can be found in the bracken paper.
By default, separate output files will be created for each sample, in independent output sub-directories under the directory provided with the --outdir
flag.
To generate summary files for all samples, add the --collect_outputs
flag. The following files will be written to the output dir:
collected_<taxonomic_level>_<kraken|bracken>_abundances.csv
collected_<taxonomic_level>_<kraken|bracken>_abundances_top_5.csv
collected_fastp.csv
An alternative prefix (instead of collected
) may be supplied using the --collected_outputs_prefix
flag. For example, with --collected_outputs_prefix test
:
test_<taxonomic_level>_<kraken|bracken>_abundances.csv
test_<taxonomic_level>_<kraken|bracken>_abundances_top_5.csv
test_fastp.csv
An output directory will be created for each sample. Within those directories,
<sample_id>
├── <sample_id>_fastp.csv
├── <sample_id>_fastp.json
├── <sample_id>_kraken2_report.txt
├── <sample_id>_S_bracken_abundances.csv
└── <sample_id>_S_top_5.csv
If the --extract_reads
flag is used, a directory named extracted_reads_by_taxid
is created, with sub-directories named using the NCBI taxonomy ID that those reads were assigned to.
<sample_id>
├── extracted_reads_by_taxid
│ ├── 0
│ │ ├── <sample_id>-taxid-0_R1.fastq.gz
│ │ └── <sample_id>-taxid-0_R2.fastq.gz
│ └── 1768
│ ├── <sample_id>-taxid-1768_R1.fastq.gz
│ └── <sample_id>-taxid-1768_R2.fastq.gz
├── <sample_id>_fastp.csv
├── <sample_id>_fastp.json
├── <sample_id>_kraken2_report.txt
├── <sample_id>_S_bracken_abundances.csv
└── <sample_id>_S_top_5.csv
The main outputs are:
<sample_id>_<taxonomic_level>_top_5.csv
sample_id,taxonomy_level,abundance_1_name,abundance_1_ncbi_taxonomy_id,abundance_1_num_assigned_reads,abundance_1_fraction_total_reads,abundance_2_name,abundance_2_ncbi_taxonomy_id,abundance_2_num_assigned_reads,abundance_2_fraction_total_reads,abundance_3_name,abundance_3_ncbi_taxonomy_id,abundance_3_num_assigned_reads,abundance_3_fraction_total_reads,abundance_4_name,abundance_4_ncbi_taxonomy_id,abundance_4_num_assigned_reads,abundance_4_fraction_total_reads,abundance_5_name,abundance_5_ncbi_taxonomy_id,abundance_5_num_assigned_reads,abundance_5_fraction_total_reads
DRR161190,S,Mycobacterium novum,2492438,2014826,0.96731,Mycolicibacter sinensis,875328,22690,0.01089,Mycolicibacter terrae,1788,7829,0.00376,Mycolicibacter minnesotensis,1118379,2839,0.00136,Mycolicibacter hiberniae,29314,2236,0.00107
DRR161192,S,Mycobacterium gallinarum,39689,1859062,0.99582,Mycolicibacterium rhodesiae,36814,1630,0.00087,Mycolicibacterium gadium,1794,1620,0.00087,Mycobacterium paragordonae,1389713,1255,0.00067,Mycobacterium conspicuum,44010,379,0.00020
DRR161197,S,Mycobacterium simiae,1784,2068577,0.99913,Mycobacterium cookii,1775,211,0.00010,Mycobacterium mantenii,560555,185,0.00009,Mycobacterium heidelbergense,53376,178,0.00009,Mycolicibacterium fallax,1793,146,0.00007
DRR161199,S,Mycobacterium cookii,1775,2347528,0.99881,Mycobacterium conspicuum,44010,249,0.00011,Mycobacterium tuberculosis,1773,157,0.00007,Mycobacterium kubicae,120959,157,0.00007,Klebsiella michiganensis,1134687,153,0.00007
<sample_id>_fastp.csv
sample_id,total_reads_before_filtering,total_reads_after_filtering,total_bases_before_filtering,total_bases_after_filtering,q20_bases_before_filtering,q20_bases_after_filtering,q30_bases_before_filtering,q30_bases_after_filtering,adapter_trimmed_reads,adapter_trimmed_bases
DRR161190,4221278,4196834,422127800,413313356,386783970,380201636,359261515,353924424,282820,6371132
DRR161192,3774078,3756172,377407800,369681342,345579006,339764987,321136725,316409787,248262,5936806
DRR161197,4175044,4158344,417504400,408656988,385097573,378269116,359615124,353971246,315786,7178202
DRR161199,4748928,4728502,474892800,463574412,436236542,427545934,406140193,398995239,389594,9277186
For each pipeline invocation, each sample will produce a provenance.yml
file with the following contents:
- pipeline_name: BCCDC-PHL/taxon-abundance
pipeline_version: 0.1.7
nextflow_session_id: 9b94b531-cf38-463c-9814-c30697d3aada
nextflow_run_name: lonely_ride
timestamp_analysis_start: 2024-05-13T13:45:57.620955-07:00
- input_filename: sample-01_R1.fastq.gz
sha256: 2bc7ac86b9af22533be21970e82b3b0ca36481c040a20ad70dbea65e41bbce58
- input_filename: sample-01_R2.fastq.gz
sha256: 456992cfdbeb6e0be63217d54e1c0df75b0762a0bb1fe0639420a4b72200bb4a
- process_name: fastp
tools:
- tool_name: fastp
tool_version: 0.20.1
- process_name: kraken2
tools:
- tool_name: kraken2
tool_version: 2.1.2
parameters:
- name: confidence
value: 0.0
databases:
- database_name: kraken2_standard
database_version: '20230605'
- process_name: bracken
tools:
- tool_name: bracken
tool_version: 2.6.1
parameters:
- name: read_length
value: 150
- name: taxonomic_level
value: S
The filename of the provenance file includes a timestamp with format YYYYMMDDHHMMSS
to ensure that re-analysis of the same sample will create a unique provenance.yml
file.