Bioinformatics pipeline for 454 and IonTorrent amplicon sequencing data

This page holds an overview and some example commands of how GeneLab processes 454 and IonTorrent amplicon datasets. Exact processing commands for specific datasets that have been released are provided with their processed data in the Open Science Data Repository (OSDR).

Date: January 13, 2021
Revision: -
Document Number: GL-DPPD-7106

Submitted by:
Michael D. Lee (GeneLab Analysis Team)

Approved by:
Sylvain Costes (GeneLab Project Manager)
Samrawit Gebre (GeneLab Deputy Project Manager and Interim GeneLab Configuration Manager)
Amanda Saravia-Butler (GeneLab Data Processing Lead)
Jonathan Galazka (GeneLab Project Scientist)

Software used

Program	Version*	Relevant Links
FastQC	`fastqc -v`	https://www.bioinformatics.babraham.ac.uk/projects/fastqc/
MultiQC	`multiqc -v`	https://multiqc.info/
Cutadapt	`cutadapt --version`	https://cutadapt.readthedocs.io/en/stable/
bbduk	`bbduk.sh --version`	https://jgi.doe.gov/data-and-tools/software-tools/bbtools/bb-tools-user-guide/
vsearch	`vsearch --version`	https://github.com/torognes/vsearch
R	`R --version` (at command line)	https://www.r-project.org/
DECIPHER	`packageVersion("DECIPHER")`	https://bioconductor.org/packages/release/bioc/html/DECIPHER.html
biomformat	`packageVersion("biomformat")`	https://github.com/joey711/biomformat

* Exact versions utilized for a given dataset are available along with the processing commands for each specific dataset (this is due to how the system may need to be updated regularly).

Reference databases used

Program used	Database	Relevant Links
DECIPHER	SILVA SSU r138	http://www2.decipher.codes/Classification/TrainingSets/SILVA_SSU_r138_2019.RData
DECIPHER	UNITE v2020	http://www2.decipher.codes/Classification/TrainingSets/UNITE_v2020_February2020.RData

General processing overview with example commands

Exact processing commands and output files listed in bold below are included with each Amplicon Seq processed dataset in the Open Science Data Repository (OSDR).

1. Raw Data QC

Uses FastQC

fastqc -o raw_fastqc_output *raw.fastq.gz

Parameter Definitions:

-o – the output directory to store results
*raw.fastq.gz – the input reads are specified as a positional argument, and can be given all at once with wildcards like this, or as individual arguments with spaces in between them

Input data:

fastq, compressed or uncompressed

Output data:

fastqc.html (FastQC output html summary)
fastqc.zip (FastQC output data)

1a. Compile Raw Data QC

Uses MultiQC

multiqc -o raw_multiqc_output -n raw_multiqc -z raw_fastqc_output/

Parameter Definitions:

-o – the output directory to store results
-n – the filename prefix of results
-z – specifies to zip the output data directory
raw_fastqc_output/ – the directory holding the output data from the fastqc run, provided as a positional argument

Input data:

raw_fastqc_output/*fastqc.zip (FastQC output data)

Output data:

raw_multiqc_output/raw_multiqc_report.html (multiqc output html summary)
raw_multiqc_output/raw_multiqc_data.zip (zipped directory containing multiqc output data)

2. Trim Primers

Uses cutadapt

The location and orientation of primers in the data is important to understand in deciding how to do this step. cutadapt has many options for primer identification and removal. They are described in detail on their documentation page here: https://cutadapt.readthedocs.io/en/stable/guide.html#adapter-types

The following example command shows how it was done for GLDS-72, which was 454 sequencing of the 16S gene using these primers:

forward: 5’-AGAGTTTGATCCTGGCTCAG-3’
reverse: 5’- GCTGCCTCCCGTAGGAGT-3’

cutadapt -g AGAGTTTGATCCTGGCTCAG -a GCTGCCTCCCGTAGGAGT \
         -o sample-1_trimmed.fastq.gz sample-1_raw.fastq.gz

Parameter Definitions:

-g – specifies the forward primer
-a – specifies the reverse primer
-o – specifies output primer-trimmed reads
sample-1_raw.fastq.gz – positional argument specifying the input reads

Input data:

sample-1_raw.fastq.gz (raw reads)

Output data:

sample-1_trimmed.fastq.gz (primer-trimmed reads)

3. Quality filtering

Uses bbduk

bbduk.sh in=sample-1_trimmed.fastq.gz out=sample-1_filtered.fastq.gz \
         qtrim=r trimq=10 mlf=0.5 minavgquality=20 minlength=50

Parameter Definitions:

in= – the input file
out= – the output file
qtrim= – specifies the direction to trim low-quality bases from
trimq= – the minimum quality to keep while trimming low-quality bases
mlf= – allowed minimum length after trimming is half the initial read size (read filtered out if shorter)
minavgquality= – minimum average quality allowed after trimming (read filtered out if below)
minlength= – allowed minimum length (backup to mlf setting if starting read is shorter than 100 bps)

Input data:

sample-1_trimmed.fastq.gz (primer-trimmed reads)

Output data:

sample-1_filtered.fastq.gz (filtered reads)

4. Filtered Data QC

Uses FastQC

fastqc -o filtered_fastqc_output *filtered.fastq.gz

Parameter Definitions:

-o – the output directory to store results
*filtered.fastq.gz – the input reads are specified as a positional argument, and can be given all at once with wildcards like this, or as individual arguments with spaces in between them

Input data:

*filtered.fastq.gz (filtered reads)

Output data:

*fastqc.html (FastQC output html summary)
*fastqc.zip (FastQC output data)

4a. Compile Filtered Data QC

Uses MultiQC

multiqc -o filtered_multiqc_output -n filtered_multiqc -z filtered_fastqc_output

Parameter Definitions:

-o – the output directory to store results
-n – the filename prefix of results
-z – specifies to zip the output data directory
filtered_fastqc_output – the directory holding the output data from the fastqc run, provided as a positional argument

Input data:

filtered_fastqc_output/*fastqc.zip (FastQC output data)

Output data:

filtered_multiqc_output/filtered_multiqc_report.html (multiqc output html summary)
filtered_multiqc_output/filtered_multiqc_data.zip (zipped directory containing multiqc output data)

5. Generating OTUs and counts per sample

Uses vsearch

5a. Dereplicate individual samples

vsearch --derep_fulllength sample-1_filtered.fastq.gz --strand both \
        --output sample-1_derep.fasta --sizeout --relabel "sample=sample-1;seq_"

Parameter Definitions:

--derep_fulllength – the vsearch command
sample-1_filtered.fastq.gz – input file, provided as a positional argument
--strand both - specifies to check both orientations
--output - designate the name of the output fasta file
--sizeout - incorporates abundance information in the sequence header
--relabel - relabel the headers of the sequences starting with this prefix

Input data:

sample-1_filtered.fastq.gz (filtered reads)

Output data:

sample-1_derep.fasta (dereplicated reads)

5b. Generate OTUs

Combining all individual sample dereplicated sequences for further processing

cat *_derep.fasta > all-samples-seqs.fasta
rm *_derep.fasta

Dereplicating combined sequences

vsearch --derep_fulllength all-samples-seqs.fasta --strand both \
        --output all-samples_derep.fasta --sizein --sizeout

Parameter Definitions:

--derep_fulllength – the vsearch command
all-samples-seqs.fasta – input file, provided as a positional argument
--strand both - specifies to check both orientations
--output - designate the name of the output fasta file
--sizein - take into account abundance information in header
--sizeout - incorporates abundance information in the sequence header

Input data:

all-samples-seqs.fasta (combined, individual-sample-dereplicated reads)

Output data:

all-samples_derep.fasta (combined dereplicated reads)

Clustering to get representative sequences

vsearch --cluster_size all-samples_derep.fasta --id 0.97 --strand both \
        --sizein --sizeout --relabel "OTU_" --centroids rep-seqs.fasta

Parameter Definitions:

--cluster_size – the vsearch command
all-samples_derep.fasta – input file, provided as a positional argument
--id - specifies the percent identity to cluster at
--strand both - specifies to consider both orientations
--sizein - take into account abundance information in header
--sizeout - incorporates abundance information in the sequence header
--relabel - relabel the headers of the sequences starting with this prefix
--centroids - designate the name of the output fasta file holding representative sequences

Input data:

all-samples_derep.fasta (combined dereplicated reads)

Output data:

rep-seqs.fasta (representative sequences)

Removing singletons

vsearch --sortbysize rep-seqs.fasta --minsize 2 --output rep-seqs-no-singletons.fasta

Parameter Definitions:

--sortbysize – the vsearch command
rep-seqs.fasta – input file, provided as a positional argument
--minsize - minimum cluster size to be retained
--output - designate the name of the output fasta file holding filtered representative sequences

Input data:

rep-seqs.fasta (representative sequences)

Output data:

rep-seqs-no-singletons.fasta (representative sequences, with singletons removed)

Chimera check and removal

vsearch --uchime_denovo rep-seqs-no-singletons.fasta --sizein --nonchimeras OTUs.fasta --relabel "OTU_"

Parameter Definitions:

--uchime_denovo – the vsearch command
rep-seqs-no-singletons.fasta – input file, provided as a positional argument
--sizein - take into account abundance information in header
--nonchimeras - designate the name of the output fasta file holding filtered representative sequences
--relabel - relabel the headers of the sequences starting with this prefix

Input data:

rep-seqs-no-singletons.fasta (representative sequences, with singletons removed)

Output data:

OTUs.fasta (representative sequences, with singletons and chimeras removed)

5c. Map reads to OTUs

vsearch --usearch_global all-samples-seqs.fasta -db OTUs.fasta --sizein \
        --id 0.97 --otutabout - | sed 's/^#OTU ID/OTU_ID/' > counts.tsv

Parameter Definitions:

--usearch_global – the vsearch command
all-samples-seqs.fasta – input file, provided as a positional argument (note this is our combined individual-dereplicated samples' fasta of reads generated in step 5b)
--db - specifies the reference OTUs to map to
--id - specifies the minimum percent identity to enable mapping a read to an OTU
--otutabout - designates the output of the count table (here going to stout in order to be modified in the subsequent sed command
| sed ... > counts.tsv - renaming the first column header and writing to counts.tsv

Input data:

all-samples-seqs.fasta (combined, individual-sample-dereplicated reads)
OTUs.fasta (representative sequences, with singletons and chimeras removed)

Output data:

counts.tsv (count table of representative sequences per sample)

6. Generating taxonomy and additional outputs

The following is performed within R.

Uses DECIPHER and biom-format

6a. Assigning taxonomy

Reading in OTU sequences:

dna <- readDNAStringSet("OTUs.fasta")

Downloading the reference R taxonomy object:

download.file(url=“http://www2.decipher.codes/Classification/TrainingSets/SILVA_SSU_r138_2019.RData”,
              destfile=“SILVA_SSU_r138_2019.RData”)

Parameter Definitions:

download.file() – the function we are calling, with the following parameters set within it
url= – specifying the url address of the file to download
destfile= – specifying the path/name of the file after downloading

Loading taxonomy object:

load(“SILVA_SSU_r138_2019.RData”)

Classifying sequences:

tax_info <- IdTaxa(test=dna, trainingSet=trainingSet, strand=“both”, processors=NULL)

Parameter Definitions:

tax_info <- – specifies the variable that will store the results within in our R environment
IdTaxa() – the DECIPHER function we are calling, with the following parameters set within it
test= – specifying the “dna” object created above holding the sequences we want to classify
trainingSet= – specifying the reference database we downloaded and loaded above
strand=“both” – specifying to check taxonomy assignment in both orientations
processors=NULL – determine number of cores available and run in parallel when possible (can also take an integer specifying the number to run)

6b. Generating and writing outputs

Creating table of taxonomy and setting any that are unclassified as "NA", and writing out:

ranks <- c("domain", "phylum", "class", "order", "family", "genus", "species")

tax_tab <- t(sapply(tax_info, function(x) {
  m <- match(ranks, x$rank)
  taxa <- x$taxon[m]
  taxa[startsWith(taxa, "unclassified_")] <- NA
  taxa
}))

colnames(tax_tab) <- ranks
row.names(tax_tab) <- NULL
otu_ids <- names(tax_info)
tax_tab <- data.frame("OTU_ID"=otu_ids, tax_tab, check.names=FALSE)

write.table(tax_tab, "taxonomy.tsv", sep = "\t", quote=F, row.names=FALSE)

Reading in counts table and generating additional outputs:

otu_tab <- read.table("counts.tsv", sep="\t", header=TRUE, check.names=FALSE)

    # generating and writing out biom file format
biom_object <- make_biom(data=otu_tab, observation_metadata=tax_tab)
write_biom(biom_object, "taxonomy-and-counts.biom")

    # making a tsv of combined tax and counts
tax_and_count_tab <- merge(tax_tab, otu_tab)
write.table(tax_and_count_tab, "taxonomy-and-counts.tsv", sep="\t", quote=FALSE, row.names=FALSE)

Input data:

OTUs.fasta (representative sequences, with singletons and chimeras removed)
counts.tsv (count table of representative sequences per sample)

Output data:

taxonomy.tsv (taxonomy table)
taxonomy-and-counts.tsv (table with taxonomy of representative sequences and counts per sample)
taxonomy-and-counts.biom (biom formatted file with taxonomy of representative sequences and counts per sample)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GL-DPPD-7106.md

GL-DPPD-7106.md

Bioinformatics pipeline for 454 and IonTorrent amplicon sequencing data

Table of contents

Software used

Reference databases used

General processing overview with example commands

1. Raw Data QC

1a. Compile Raw Data QC

2. Trim Primers

3. Quality filtering

4. Filtered Data QC

4a. Compile Filtered Data QC

5. Generating OTUs and counts per sample

5a. Dereplicate individual samples

5b. Generate OTUs

Combining all individual sample dereplicated sequences for further processing

Dereplicating combined sequences

Clustering to get representative sequences

Removing singletons

Chimera check and removal

5c. Map reads to OTUs

6. Generating taxonomy and additional outputs

6a. Assigning taxonomy

6b. Generating and writing outputs

Files

GL-DPPD-7106.md

Latest commit

History

GL-DPPD-7106.md

File metadata and controls

Bioinformatics pipeline for 454 and IonTorrent amplicon sequencing data

Table of contents

Software used

Reference databases used

General processing overview with example commands

1. Raw Data QC

1a. Compile Raw Data QC

2. Trim Primers

3. Quality filtering

4. Filtered Data QC

4a. Compile Filtered Data QC

5. Generating OTUs and counts per sample

5a. Dereplicate individual samples

5b. Generate OTUs

Combining all individual sample dereplicated sequences for further processing

Dereplicating combined sequences

Clustering to get representative sequences

Removing singletons

Chimera check and removal

5c. Map reads to OTUs

6. Generating taxonomy and additional outputs

6a. Assigning taxonomy

6b. Generating and writing outputs