This pipeline performs all steps that we consider primary analysis of amplicon raw sequencing data. The pipeline starts with raw sequencing data and generates a taxonomically annotated amplicon sequence variant (ASV) and operational taxonomic unit (OTU) table.
- The documentation was revised. Please let us know if anything is unclear; we are happy to get feedback.
- The primers file is now available in the templates folder.
- `runUSEARCH` now performs USEARCH sequence alignment against the Silva database and a last common ancestor (lca) search for ASVs as well (this can take a few hours depending on data size). Please note that we use a minimum sequence identity of 0.8 for a hit, which will lead to false positives; double-check the actual sequence identity shown in the output.
- We added `split_snake.py`, a snakemake pipeline that separates mixed-orientation reads into separate files. After separating the reads, you will have four files: forward R1, reverse R1, forward R2 and reverse R2. Remember to analyze the R1 and R2 files separately to train the error model properly (run the pipeline twice: once on fw R1 & rv R1 and once on fw R2 & rv R2).
- The pipeline now supports single-end data. Please see the new option in the config file. Two features are not yet supported for single-end reads: estimate parameters and insert_stats.
Documentation was last updated on 2024-08-22.
This section summarizes the individual steps executed in the pipeline.
In the first step we:
- Remove forward and reverse primers from the paired-end reads or primers from single-end reads.
- Remove inserts where either forward or reverse primer is missing (or both).
- Remove multiple copies of primers. There are a few cases in which the primer sequence appears multiple times in a single sequence.
For most cases (excluding blanks) you should see that >90% of sequences survive this step.
If you set `allowUntrimmed` to `True` in the `config.yaml`, untrimmed reads will not be discarded. You might need this option if the primers were not sequenced.
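For illustration, a primer-removal call of this kind could look like the sketch below. The exact command is generated by the pipeline; the file names are hypothetical, and the primer sequences are the 515f_parada/806r_apprill pair from the parameters table further below.

```bash
# Sketch of paired-end primer removal with cutadapt (file names are hypothetical).
# -g/-G remove the forward/reverse primer from R1/R2; -n 2 removes up to two
# primer copies per read (the multiple-copies case mentioned above);
# --discard-untrimmed drops pairs where either primer was not found
# (leave it out to mimic allowUntrimmed: True).
cutadapt \
    -g GTGYCAGCMGCCGCGGTAA \
    -G GGACTACNVGGGTWTCTAAT \
    -n 2 \
    --discard-untrimmed \
    -o trimmed_R1.fastq.gz -p trimmed_R2.fastq.gz \
    raw_R1.fastq.gz raw_R2.fastq.gz
```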
In this step we:
- Trim a predefined number of bases from the end of every sequence; as many as possible, but for paired-end data it is important that the reads can still be merged. These parameters are set in the config file and must be chosen based on the insert size. Trimming too many bases from paired-end reads will cause reads to fail to merge and lead to empty ASV tables. The `estimate_parameters.py` script suggests parameters depending on the primer pair.
- Perform quality control: remove sequences where the number of estimated errors is > X, where X is a predefined setting (defined in the `config.yaml`). See the sketch after this list.
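Under the hood this step corresponds to `dada2::filterAndTrim`. A minimal sketch using the example values from the config shown later (file names are hypothetical):

```bash
# Sketch of the quality-control step via dada2::filterAndTrim (file names hypothetical).
Rscript -e '
  library(dada2)
  filterAndTrim("trimmed_R1.fastq.gz", "filt_R1.fastq.gz",
                "trimmed_R2.fastq.gz", "filt_R2.fastq.gz",
                truncLen = c(161, 121),  # QC_TRUNC_R1 / QC_TRUNC_R2
                minLen   = 111,          # QC_MINLEN
                maxEE    = c(2, 2),      # QC_MAXEE
                maxN     = 0,            # dada2 cannot handle N bases
                compress = TRUE, multithread = TRUE)
'
```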
In this step we try to infer an error model using a predefined number of bases.
For paired-end reads, this step is executed twice: once for the forward reads and once for the reverse reads.
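A minimal sketch of this step via `dada2::learnErrors`, where `nbases` corresponds to `LE_NBASES` in the config; the folder and file naming here is an assumption based on the output structure described later:

```bash
# Sketch of error-model learning with dada2::learnErrors; run once per read direction.
Rscript -e '
  library(dada2)
  filts <- list.files("2filterAndTrim", "_R1.fastq.gz$", full.names = TRUE)
  errF  <- learnErrors(filts, nbases = 1e7, multithread = TRUE)
  saveRDS(errF, "errF.rds")
'
```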
In this step we run the actual `dada2` inference, which creates the ASVs.
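A minimal sketch of the inference call via `dada2::dada`, reusing the error model saved in the previous sketch (file names are hypothetical):

```bash
# Sketch of per-sample ASV inference with dada2::dada.
Rscript -e '
  library(dada2)
  errF  <- readRDS("errF.rds")
  dadaF <- dada(derepFastq("filt_R1.fastq.gz"), err = errF, multithread = TRUE)
  saveRDS(dadaF, "dadaF.rds")
'
```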
This option is for merging paired-end reads. So far, we have been working on paired-end reads rather than full-length inserts. This step merges the forward and reverse reads into a new set of ASVs. If too few inserts merge, check the number of bases trimmed in the quality-control step.
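A minimal sketch via `dada2::mergePairs`, carrying over the objects from the sketches above (`dadaR.rds` is the analogous object from the reverse-read run; all file names are hypothetical):

```bash
# Sketch of merging forward and reverse reads with dada2::mergePairs.
Rscript -e '
  library(dada2)
  mergers <- mergePairs(readRDS("dadaF.rds"), derepFastq("filt_R1.fastq.gz"),
                        readRDS("dadaR.rds"), derepFastq("filt_R2.fastq.gz"),
                        verbose = TRUE)
  saveRDS(makeSequenceTable(mergers), "seqtab.rds")
'
```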
`dada2` will use the merged ASVs (paired-end) or the inferred ASVs (single-end) as input to remove potential bimeras and chimeras. The output file contains the final but unannotated ASVs.
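A minimal sketch via `dada2::removeBimeraDenovo`; the `consensus` method is an assumption (it is the common choice in dada2 workflows):

```bash
# Sketch of bimera/chimera removal with dada2::removeBimeraDenovo.
Rscript -e '
  library(dada2)
  seqtab        <- readRDS("seqtab.rds")
  seqtab_nochim <- removeBimeraDenovo(seqtab, method = "consensus",
                                      multithread = TRUE, verbose = TRUE)
  saveRDS(seqtab_nochim, "seqtab_nochim.rds")
'
```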
The final ASV table is generated alongside taxonomic annotation using IDTAXA, with cutoffs calibrated on the TARA (Oceans + Pacific) datasets.
In addition to the ASV table, we also produce an OTU table in which ASVs are further clustered with `usearch` using a 97% identity cutoff, as sketched below.
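A minimal sketch of such a clustering call; UPARSE's `cluster_otus` uses a fixed 97% identity threshold, and the exact flags the pipeline uses are recorded in the `NAME24-1.command` output file:

```bash
# Sketch of clustering ASVs into OTUs with UPARSE (flags are illustrative).
usearch -cluster_otus NAME24-1.asvs.fasta \
    -otus NAME24-1.otus.fasta \
    -uparseout NAME24-1.otus.uparse \
    -relabel otu
```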
In addition to taxonomic assignment using IDTAXA, we perform USEARCH sequence alignment against the database provided via the `config.yaml` parameter `USEARCH_DB` and search for the last common ancestor (lca) for both ASVs and OTUs.
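A minimal sketch of such an alignment call; the flags are illustrative and may differ from the pipeline's exact invocation, and the output columns of the pipeline's `.tax` files may include extra fields. Note the permissive `-id 0.8` cutoff discussed above:

```bash
# Sketch of aligning ASVs against the SILVA database with usearch_global.
usearch -usearch_global NAME24-1.asvs.fasta \
    -db /nfs/cds/Databases/SILVA/SILVA138/SILVA_138.1_SSURef_NR99_tax_silva.fasta \
    -id 0.8 -strand both -top_hits_only \
    -blast6out NAME24-1.asvs.tax
```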
Perform sequence alignment between amplicon sequence variants and a reference sequence database of defined community members (`REFERENCE_SEQUENCE_FILE`). This step also uses the reference sequences at ASV resolution within `dada2`.
Genoscope produces amplicon sequencing data where half of the reads start with the forward primer and the other half with the reverse primer. If you consistently get only ~50% of the reads through cutadapt, you might need to check the read orientations.
Use the `split_snake.py` pipeline to split the reads into forward and reverse reads, then run the pipeline twice (once for the R1 and once for the R2 reads) to train the error model properly. For the reverse primers you may have to use the reverse complement of the primer sequence. Merge the sequence tables after the inference step (see the sketch below) and run the remainder of the pipeline on that data. Depending on your setup, the data at hand may not be mergeable even though there are forward and reverse reads (e.g. iSeq data with primer pairs that are far apart). In that case, run the single-end version of the pipeline on the sequences carrying the forward primer after splitting them using `split_snake.py` (R1 forward and R2 reverse).
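Merging the two sequence tables can be done with `dada2::mergeSequenceTables`; a minimal sketch, assuming each run saved its post-inference table as an RDS file (paths are hypothetical):

```bash
# Sketch of combining the two runs after inference with dada2::mergeSequenceTables.
Rscript -e '
  library(dada2)
  st_fwd <- readRDS("run_fwd/seqtab.rds")   # from the fw R1 & rv R1 run
  st_rev <- readRDS("run_rev/seqtab.rds")   # from the fw R2 & rv R2 run
  saveRDS(mergeSequenceTables(st_fwd, st_rev), "seqtab_combined.rds")
'
```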
Reads containing non-`ACGT` bases such as `N` cannot be processed by dada2, which means that reads containing the letter `N` will be removed. In some cases an entire cycle of a run was bad, which means that you would need to remove a large fraction of the reads. You could remove the early part of each read if the issue is at the beginning; however, this is not covered by this pipeline.
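If you do want to trim a bad leading cycle outside of this pipeline, `dada2::filterAndTrim` offers `trimLeft`; a minimal sketch (file names and the number of trimmed bases are hypothetical):

```bash
# Sketch of trimming the first bases of each read before QC (not part of this pipeline).
Rscript -e '
  library(dada2)
  filterAndTrim("raw_R1.fastq.gz", "trimmedstart_R1.fastq.gz",
                trimLeft = 5, maxN = 0, compress = TRUE)
'
```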
The pipeline has been written to deal with short-read Illumina sequencing data in the form of paired-end or single-end fastq files. All samples are processed together; therefore, files/data that come from different batches/flowcells/lanes have to be run individually, e.g. as a different `(Sub)Project`.
Run this command in a folder where the pipeline should be installed (Git is required for this step; see the git installation instructions):
git clone https://github.com/SushiLab/IMB_Amplicon_Pipeline.git
There are some tools that need to be installed upfront to run the pipeline. They can be wrapped into a conda environment:
$ cat environment.yaml
name: metab-pipe
channels:
- conda-forge
- bioconda
- defaults
dependencies:
- python=3.9.9
- pip
- snakemake=6.13.1
- cutadapt=3.5
- bbmap=38.93
- r-base=4.1.2
- bioconductor-biostrings=2.62.0
- bioconductor-dada2=1.22.0
- r-tidyverse=1.3.1
- r-data.table=1.14.2
- bioconductor-decipher=2.22.0
- biopython=1.79
- r-patchwork
- r-optparse
Installation can be done using conda (see installation instructions). Navigate to the location of the git repository and run the following commands:
conda env create -f environment.yaml
conda activate metab-pipe
If it isn't already installed (try running `ml USEARCH` on the server), you need to install USEARCH. If that is the case, add the installation path to your `.bashrc` using these commands (change the path in the command to the path of your USEARCH installation):
echo 'export PATH="/path/to/your/USEARCH/installation/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc
In this section, we will cover the structure of the input data.
The pipeline has been written to deal with short read Illumina sequencing data in the format of gzipped paired-end or single-end fastq files.
The file naming of the raw input files can be arbitrary as long as we know which file belongs to which sample. The actual sample names, on the other hand, are stricter.
We have established a certain naming scheme that is used for all (meta)genomic sequencing data of the lab, and we apply the same naming scheme for the metabarcoding data:
SUBPROJECT_SAMPLENAME_DATATYPE
When this scheme is applied to a specific project, it could look like the example below. Note that the sample name itself can contain underscores (`_`) as well, but should not contain any dots (`.`):
NAME24-1_P019S5_METAB
NAME24-1 = Subproject # NAMEyy-N
P019S5 = Samplename
METAB = Datatype
If the data is published and downloaded from ENA/SRA:
XXXXNN-M_SAMEAYYYY_METAB
XXXXNN-M = first four letters of surname, year, number
SAMEAYYYY = biosample. If multiple runs come from a biosample, then use SAMEAYYYY-S001, SAMEAYYYY-S002, ...
METAB = Datatype
The exception to this rule is the Tara datasets, which follow a slightly different scheme. But for those you will only see the resulting ASV/OTU table.
`Snakemake` is a great tool for executing the same commands on plenty of different input files. `Snakemake` is, however, terribly inflexible with file naming/structures. Therefore, we must provide the input files in the same file structure for snakemake to execute the jobs correctly. Again, you can provide the files in any naming/structure you like, but we will map them to the following format using `map.sh`:
For example, raw fastq files will be mapped to the following structure:
NAME24-1/0raw/NAME24-1_P019S5_METAB/NAME24-1_P019S5_METAB_R1.fastq.gz
NAME24-1/0raw/NAME24-1_P019S5_METAB/NAME24-1_P019S5_METAB_R2.fastq.gz
# generic:
SUBPROJECT/0raw/SUBPROJECT_SAMPLENAME_METAB/SUBPROJECT_SAMPLENAME_METAB_R1.fastq.gz
SUBPROJECT/0raw/SUBPROJECT_SAMPLENAME_METAB/SUBPROJECT_SAMPLENAME_METAB_R2.fastq.gz
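The generated `map.sh` essentially creates one directory per sample and links the raw files into it (the output table below notes that `0raw` contains symbolic links). A hypothetical sketch of two of its lines; the source paths are made up:

```bash
# Sketch of what map.sh does for one paired-end sample (source paths hypothetical).
mkdir -p NAME24-1/0raw/NAME24-1_P019S5_METAB
ln -s /path/to/source/P019S5_S1_R1_001.fastq.gz \
    NAME24-1/0raw/NAME24-1_P019S5_METAB/NAME24-1_P019S5_METAB_R1.fastq.gz
ln -s /path/to/source/P019S5_S1_R2_001.fastq.gz \
    NAME24-1/0raw/NAME24-1_P019S5_METAB/NAME24-1_P019S5_METAB_R2.fastq.gz
```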
Templates for these config files can be found in the `templates` folder.

- `config.yaml`: The config file that contains parameters, locations of files and the rules that should be executed in the snakemake pipeline.
- `samples`: The samples file contains a list of all samples that should be included in the analysis.
- `blocklist`: The blocklist file contains a list of all samples that should be excluded from the analysis, e.g. when all reads of a sample are removed during QC.
- `primers`: A set of frequently used primers for which `estimate_parameters.py` was optimized.
- `map.py`: A Python script to map the raw sequencing data to a standardized file and naming structure.
Example `config.yaml` file:
##############################
## FILE LOCATIONS AND NAMES ##
##############################
data_dir: '/path/to/your/data/' # e.g. /path/to/directory/SUNAGAWA/NAME24-1/
sample_file: 'samples'
blocklist: 'blocklist'
REFERENCE_SEQUENCE_FILE: # e.g. '/path/to/your/reference'. Can be left empty if you don't run the defined community analysis
#####################
## USER PARAMETERS ##
#####################
FORWARD_PRIMER_NAME: '515f_parada'
REVERSE_PRIMER_NAME: '806r_caporaso'
FORWARD_PRIMER_SEQUENCE: # can be left empty if you have a primers file that contains the correct primer names
REVERSE_PRIMER_SEQUENCE: # can be left empty if you have a primers file that contains the correct primer names
QC_MINLEN: '111'
QC_TRUNC_R1: '161'
QC_TRUNC_R2: '121'
QC_MAXEE: '2'
LE_NBASES: '1e7' # The minimum number of total bases to use for error rate learning. Samples are read into memory until at least this number of total bases has been reached, or all provided samples have been read in.
##################
## STEPS TO RUN ##
##################
pairedEnd: True
runCutadapt: True
allowUntrimmed: False
runQC: True
runLearnErrors: True
runInference: True
runMergeReads: True
runRemoveBimeras: True
runReadStats: True # Run bbmap: change quality encoding (qin) and run stats to check quality
runASVTax: True
runOTUTax: True
runUSEARCH: True
runDefCom: False
#########################
## END USER PARAMETERS ##
#########################
# Standard parameters for the Sunagawa lab.
silva_training: '/nfs/nas22/fs2202/biol_micro_sunagawa/Projects/PAN/GENERAL_METAB_ANALYSIS_PAN/data/resources/SILVA_SSU_r138_2019.RData'
primers: '/nfs/nas22/fs2202/biol_micro_sunagawa/Projects/PAN/GENERAL_METAB_ANALYSIS_PAN/data/resources/primers'
USEARCH_DB: '/nfs/cds/Databases/SILVA/SILVA138/SILVA_138.1_SSURef_NR99_tax_silva.fasta'
Example `samples` file:
NAME24-1_448_NOVOGENE_METAB
NAME24-1_A_450_SUSHI_METAB
NAME24-1_A_451_SUSHI_METAB
NAME24-1_neg_mini_SUSHI_METAB
Example `blocklist` file. Note that this sample is also in the `samples` file:
NAME24-1_neg_mini_SUSHI_METAB
Navigate to the folder where the analysis is run and copy the templates there. Here, we will copy all templates to a subfolder dedicated to config files. Later on, we will add the correct samples to the `samples` and `blocklist` files.
cd /where/you/run/your/analysis
mkdir configs
cp /where/your/git/repo/was/cloned/to/IMB_Amplicon_Pipeline/templates/* configs/
cd configs
The pipeline is best run in a screen. Here, we start a screen named `amplicon_pipe`:
screen -S amplicon_pipe
Activate the conda environment:
conda activate metab-pipe
Next, edit the `map.py` file:

- Change `source_folder` & `dest_folder` accordingly (remember to add `/0raw/` at the end of `dest_folder`).
- Change `r1_file` & `r2_file` to the file ending that matches your files.
- Change `samplename` such that it creates a unique sample name that matches the names given by the researcher.
- Update `new_sample_name`.
vim map.py
Add `data_dir` to the `config.yaml` file:
vim config.yaml
Here, we will run `map.sh` and double-check that everything ran correctly.
python map.py > map.sh
./map.sh
ls -althr ../0raw/*/*gz | rev | cut -f 1 -d " " | rev | sort | uniq -c | wc -l # check whether the output from this command equals double the amount of samples
Next, we will add the samples that we don't want to analyze to the `blocklist` and add all sample names to the `samples` file.
# vim blocklist # run this line in case you have samples you want to exclude from analysis
ls ../0raw/ | sort > samples # will create the samples file if you follow the same file structure as indicated above
In order to set the appropriate user parameters, we will run `estimate_parameters.py`, which creates a file called `estimated_parameters.txt` as shown in the Parameters section below. Note that estimating parameters currently only works with paired-end reads.
Generating amplicon sequencing data requires primers, which are used to extract specific regions and then to amplify the extracted DNA. Different protocols use different primer sets. The correct set of primers has to be specified in the parameters to successfully run the analysis pipeline.
If you are using one of the following primer pairs: `27f - 534r`, `515f_caporaso - 806r_caporaso`, `515f_parada - 806r_apprill`, `515f_caporaso - 926r`, `515f_parada - 926r` or `799f - 1193r`, then you can specify the path to the `primers` file in the `config.yaml` (see templates) and run the script without specifying `FORWARD_PRIMER_SEQUENCE` and `REVERSE_PRIMER_SEQUENCE`.
Add the appropriate `FORWARD_PRIMER_NAME` and `REVERSE_PRIMER_NAME` to `config.yaml`. If you haven't specified a primers file, add the `FORWARD_PRIMER_SEQUENCE` and `REVERSE_PRIMER_SEQUENCE` as well.
vim config.yaml
After setting the primer pair, we can run `estimate_parameters.py`. The script will take 5000 reads from each sample in your dataset and:

- Estimate parameters for every primer pair provided in the primers file, or only for the primer pair provided by `FORWARD_PRIMER_SEQUENCE` and `REVERSE_PRIMER_SEQUENCE`.
- Estimate the minimal length of R1 and R2 reads required for them to merge. When using your own primer pair, please note that these calculations are optimized for the primers in the primers file and may not work well with your primers.
- Output which minimum read length in quality control may work best for your primers. Please be careful when using this parameter if you are using your own primers (not in the primers file).
- Output how many sequences pass quality control using different values for `maxee`.
- Check whether there is an issue with `N` bases.
You can read more on the parameters here:
`dada2` tries to infer amplicon sequence variants from quality-controlled sequencing data, and it works best when there are no errors in the sequencing data. Base quality usually drops the closer you get to the end of the sequence, which is why we cut off the read towards the end. `dada2` suggests the following: trim as many bases from the end of the read as possible while the forward and reverse reads can still merge. Also, trim more from the reverse read, as the forward read generally has higher quality.
`dada2` can work with sequencing data that has erroneous bases by correcting them in the learnErrors/inference step. However, some sequences may still be of too low quality and need to be removed. For that we apply the `maxee` parameter, which estimates how many bases in a sequence are wrong by looking at all quality values. So a value of `maxee=1` means that at most one expected error per read is tolerated, e.g. one erroneous base in a 100 bp read. `dada2` uses `maxee=2` in its tutorial, and we suggest keeping it this way.
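For intuition: the expected number of errors of a read is the sum of its per-base error probabilities, EE = sum over bases of 10^(-Q/10) for Phred scores Q. A quick sanity check:

```bash
# Expected errors for a 100 bp read at uniform Q30: 100 * 10^(-3) = 0.1,
# comfortably below maxee=2; at uniform Q20 it is already 1.0.
Rscript -e '
  q <- rep(30, 100)
  sum(10^(-q / 10))   # 0.1
'
```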
`dada2` can't deal with sequences that contain the base `N`, which is why we remove sequences containing this base. The presence of `N` in your data is reported by the `estimate_parameters.py` script. Sometimes a run has one bad cycle and the base at that specific position is set to `N` in all sequences, which would remove all of your sequences when running the pipeline. If that `N` is at the beginning of the sequence, you could use `dada2` to remove bases from the beginning, but this is not part of the pipeline.
In the following command, `estimate_parameters.py` will output a file called `estimated_parameters.txt` which contains the information mentioned above:
python /path/to/the/amplicon_pipeline/code/pipeline/estimate_parameters.py config.yaml 16 > estimated_parameters.txt
cat estimated_parameters.txt
In order to set the correct parameters for the analysis, we take a look at `estimated_parameters.txt`. The example below shows the output for a test dataset. It looks like the 515/806 primers were used. The required read length is 161 for R1 (121 for R2) and there seem to be no issues with `N`. Using the default value `2` for the `maxee` parameter will keep ~95% of the sequences. (If `PERC_INSERTS` is <90 for `PARAM=MAXEE` 2.0, i.e. less than 90% of sequences survived, please double-check which files are affected and contact the person who sequenced the data.)
FORWARD_PRIMER | FORWARD_PRIMER_NAME | REVERSE_PRIMER | REVERSE_PRIMER_NAME | RAW_INSERTS | AVERAGE_READ_LENGTH_R1 | AVERAGE_READ_LENGTH_R2 | CUTADAPT_INSERTS | PRIMER_HITS | R1_READS_W_N | R2_READS_W_N | PARAM=TRUNCLENR1 | PARAM=TRUNCLENR2 | PARAM=QCMINLEN | PARAM=MAXEE | QC_INSERTS | PERC_INSERTS | PLANNED_OVERLAP | ESTIMATED_INSERT_SIZE |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
AGAGTTTGATCCTGGCTCAG | 27f | ATTACCGCGGCTGCTGG | 534r | 17369 | 250 | 250 | 0 | 0.00 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
AGMGTTYGATYMYGGCTCAG | 27f_mod | ATTACCGCGGCTGCTGG | 534r | 17369 | 250 | 250 | 0 | 0.00 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
CCTACGGGGNGGCWGCAG | 357f | GACTACHVGGGTATCTAATCC | 758r | 17369 | 250 | 250 | 1 | 0.00 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
GTGCCAGCMGCCGCGGTAA | 515f_caporaso | GGACTACHVGGGTWTCTAAT | 806r_caporaso | 17369 | 250 | 250 | 14661 | 84.00 | 7 | 4 | 161 | 121 | 111 | 0.1 | 2974 | 20 | 15 | 253 |
GTGCCAGCMGCCGCGGTAA | 515f_caporaso | GGACTACHVGGGTWTCTAAT | 806r_caporaso | 17369 | 250 | 250 | 14661 | 84.00 | 7 | 4 | 161 | 121 | 111 | 0.2 | 7176 | 48 | 15 | 253 |
GTGCCAGCMGCCGCGGTAA | 515f_caporaso | GGACTACHVGGGTWTCTAAT | 806r_caporaso | 17369 | 250 | 250 | 14661 | 84.00 | 7 | 4 | 161 | 121 | 111 | 0.5 | 11214 | 76 | 15 | 253 |
GTGCCAGCMGCCGCGGTAA | 515f_caporaso | GGACTACHVGGGTWTCTAAT | 806r_caporaso | 17369 | 250 | 250 | 14661 | 84.00 | 7 | 4 | 161 | 121 | 111 | 1 | 12955 | 88 | 15 | 253 |
GTGCCAGCMGCCGCGGTAA | 515f_caporaso | GGACTACHVGGGTWTCTAAT | 806r_caporaso | 17369 | 250 | 250 | 14661 | 84.00 | 7 | 4 | 161 | 121 | 111 | 1.5 | 13703 | 93 | 15 | 253 |
GTGCCAGCMGCCGCGGTAA | 515f_caporaso | GGACTACHVGGGTWTCTAAT | 806r_caporaso | 17369 | 250 | 250 | 14661 | 84.00 | 7 | 4 | 161 | 121 | 111 | 2 | 14054 | 95 | 15 | 253 |
GTGCCAGCMGCCGCGGTAA | 515f_caporaso | GGACTACHVGGGTWTCTAAT | 806r_caporaso | 17369 | 250 | 250 | 14661 | 84.00 | 7 | 4 | 161 | 121 | 111 | 2.5 | 14277 | 97 | 15 | 253 |
GTGYCAGCMGCCGCGGTAA | 515f_parada | GGACTACNVGGGTWTCTAAT | 806r_apprill | 17369 | 250 | 250 | 15113 | 87.00 | 7 | 4 | 161 | 121 | 111 | 0.1 | 3033 | 20 | 15 | 253 |
GTGYCAGCMGCCGCGGTAA | 515f_parada | GGACTACNVGGGTWTCTAAT | 806r_apprill | 17369 | 250 | 250 | 15113 | 87.00 | 7 | 4 | 161 | 121 | 111 | 0.2 | 7347 | 48 | 15 | 253 |
GTGYCAGCMGCCGCGGTAA | 515f_parada | GGACTACNVGGGTWTCTAAT | 806r_apprill | 17369 | 250 | 250 | 15113 | 87.00 | 7 | 4 | 161 | 121 | 111 | 0.5 | 11506 | 76 | 15 | 253 |
GTGYCAGCMGCCGCGGTAA | 515f_parada | GGACTACNVGGGTWTCTAAT | 806r_apprill | 17369 | 250 | 250 | 15113 | 87.00 | 7 | 4 | 161 | 121 | 111 | 1 | 13327 | 88 | 15 | 253 |
GTGYCAGCMGCCGCGGTAA | 515f_parada | GGACTACNVGGGTWTCTAAT | 806r_apprill | 17369 | 250 | 250 | 15113 | 87.00 | 7 | 4 | 161 | 121 | 111 | 1.5 | 14115 | 93 | 15 | 253 |
GTGYCAGCMGCCGCGGTAA | 515f_parada | GGACTACNVGGGTWTCTAAT | 806r_apprill | 17369 | 250 | 250 | 15113 | 87.00 | 7 | 4 | 161 | 121 | 111 | 2 | 14484 | 95 | 15 | 253 |
GTGYCAGCMGCCGCGGTAA | 515f_parada | GGACTACNVGGGTWTCTAAT | 806r_apprill | 17369 | 250 | 250 | 15113 | 87.00 | 7 | 4 | 161 | 121 | 111 | 2.5 | 14715 | 97 | 15 | 253 |
GTGCCAGCMGCCGCGGTAA | 515f_caporaso | CCGYCAATTYMTTTRAGTTT | 926r | 17369 | 250 | 250 | 0 | 0.00 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
GTGYCAGCMGCCGCGGTAA | 515f_parada | CCGYCAATTYMTTTRAGTTT | 926r | 17369 | 250 | 250 | 0 | 0.00 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Set the parameters in the `config.yaml` according to the output:
vim config.yaml
QC_MINLEN: PARAM=QCMINLEN # if using other primers than in the primers file, please double check whether this estimated parameter suits your data
QC_TRUNC_R1: PARAM=TRUNCLENR1 # if using other primers than in the primers file, please double check whether this estimated parameter suits your data
QC_TRUNC_R2: PARAM=TRUNCLENR2 # if using other primers than in the primers file, please double check whether this estimated parameter suits your data
QC_MAXEE: PARAM=MAXEE
And set the steps you would like to run to `True`, for example:
##################
## STEPS TO RUN ##
##################
pairedEnd: True
runCutadapt: True
allowUntrimmed: False
runQC: True
runLearnErrors: True
runInference: True
runMergeReads: True
runRemoveBimeras: True
runReadStats: True
runASVTax: True
runOTUTax: True
runUSEARCH: True
runDefCom: False
When all the parameters are properly set, we run the snakemake pipeline after loading the USEARCH module (`ml USEARCH`; not needed if you have your own installation):
ml USEARCH
snakemake -s /path/to/amplicon_pipeline/code/pipeline/dada2_snake.py --configfile /path/to/config.yaml -j 16 -k
The number of cores should be adjusted depending on the available resources. There is currently no queue/Euler integration, as these jobs require minutes rather than hours.
When the IMB amplicon pipeline has finished running all possible steps, the folder will contain files and folders similar to this:
Output | File/Folder Description |
---|---|
0raw | Folder with symbolic links to the raw files |
1cutadapt | Folder containing cutadapt and readstats output |
2filterAndTrim | Folder with filter & trim and readstats output |
3learnerrors | Folder with learnerrors output |
4sampleInference | Folder with inference output |
5mergeReads | Folder with merged reads |
6bimeraRemoval | Folder with final but unannotated ASVs |
7taxonomy | Folder containing temporary files for annotation |
8uparsetax | Folder with USEARCH taxonomic annotation and last common ancestor (lca) output for ASVs and OTUs |
configs | Folder with your configs, some scripts and estimated parameters |
NAME24-1.asvs.fasta | Fasta file with ASV sequences |
NAME24-1.asvs.tsv | File containing the ASVs, their taxonomic assignment and their counts for every sample |
NAME24-1.benchmark | File with resource usage of the command to cluster ASVs into OTUs |
NAME24-1.command | File containing the command run for clustering ASVs into OTUs |
NAME24-1.done | Done file: Signals to the pipeline that ASV clustering was completed |
NAME24-1.insert.counts | Output from `create_insert_stats.py`: contains insert counts for each step of the pipeline (1 insert = 2 reads when paired-end) |
NAME24-1.log | File with logs from UPARSE command when clustering ASVs into OTUs |
NAME24-1.otus.fasta | Fasta file with OTU sequences |
NAME24-1.otus.tsv | File containing the OTUs, their taxonomic assignment and their counts for every sample |
NAME24-1.otus.uparse | File with uparse output |
NAME24-1.refassign.log | File with logs from defined community analysis |
NAME24-1.refs.done | Done file: Signals to the pipeline that the defined community sequence alignment step was completed |
NAME24-1.refs.tsv | Output file from defined community analysis: See example below |
Below are a few examples of these files. Please note that they were shortened and adapted from other output files:
asv otu uparse_info seq tax NAME24-1_Sample_1_METAB NAME24-1_Sample_2_METAB NAME24-1_Sample_3_METAB NAME24-1_Sample_4_METAB NAME24-1_Sample_5_METAB
asv_0001;size=130643 otu1 * TACGGAGGGTGCAAGCGTTAATCGGAATTACT rootrank;Root;100|domain;Bacteria;100|phylum;Proteobacteria;100|class;Gammaproteobacteria;71.28|order;Alteromonadales;49.38|family;Marinobacteraceae;49.38|genus;Marinobacter;49.38 33116 161 38968 58390 8
asv_0002;size=1 otu2 dqt=49; TACGTAGGTGGCAAGCGTTGTCCGGATTTATT rootrank;Root;100|domain;Bacteria;100|phylum;Firmicutes;100|class;Bacilli;100|order;Lactobacillales;89.9|family;Listeriaceae;86.82|genus;Listeria;85.51 0 0 0 0 1
asv_0003;size=66756 otu1 dqt=2;top=otu1(99.2%); TACGGAGGGTGCAAGCGTTAATCGGAATT rootrank;Root;100|domain;Bacteria;100|phylum;Proteobacteria;100|class;Gammaproteobacteria;82.79|order;Alteromonadales;79.46|family;Marinobacteraceae;79.46|genus;Marinobacter;79.46 66546 101 0 20 89
>asv_0001;size=130643
TACGGAGGGTGCAAGCGTTAATCGGAATTACT
>asv_0002;size=1
TACGTAGGTGGCAAGCGTTGTCCGGATTTATT
You can read more about interpreting the uparse output in the documentation.
asv_0001;size=130643 otu1 *
asv_0002;size=1 otu2 dqt=49;
This output will be created if you run the defined community analysis. Columns are samples and rows are the defined community members/unassigned ASVs. If some ASVs are assigned to either none or to two or more defined community members (ambiguous), you can see this in the rows below the defined community members.
NAME24-1_Sample_1_METAB NAME24-1_Sample_2_METAB NAME24-1_Sample_3_METAB NAME24-1_Sample_4_METAB NAME24-1_Sample_5_METAB
Defined_community_member_1 30 9411 3785 1 3
Defined_community_member_2 1 78 56 42 14
Defined_community_member_3 60 7 0 900 6093
Defined_community_member_4 5 1 3 70 41
Defined_community_member_5 958 8760 9004 6305 820
asv_001 (ambiguous) 55 0 92 55 70
asv_123 (ambiguous) 0 0 6 9 0
asv_022 (none) 8 9 0 0 0
This file contains insert counts for each step of the pipeline. One insert consists of 2 paired-end reads. This file is only created for paired-end data.
sample 0raw 1cutadapt 2filterAndTrim 4sampleInference_R1 4sampleInference_R2 5mergeReads 6bimeraRemoval 8asv_counts 8otu_counts
NAME24-1_Sample_1_METAB 914802 908413 860250 859721 859781 856405 757325 757325 757010
NAME24-1_Sample_2_METAB 819817 813457 760931 760446 760594 757473 670662 670662 670469
NAME24-1_Sample_3_METAB 887261 879989 827964 827562 827666 824548 720805 720805 720660
NAME24-1_Sample_4_METAB 732524 726682 684555 684003 684128 681105 604588 604588 604321
This file contains the taxonomic annotations with the highest percent identity found using usearch and the SILVA database. If the percent identities of the top matches are identical, you will find all of them in the output (see example). You can read about the details of the format in the documentation. Please always double-check the actual percent identity, as we use 80% as the cutoff, which is not at species level (the output will contain a species-level taxonomic annotation even if the percent identity is only, say, 82%).
asv_0001;size=1234 CP002003.1766641.1768196 Bacteria;Firmicutes;Bacilli;Lactobacillales;Listeriaceae;Listeria;Listeria monocytogenes FSL R2-561 100.0 253 0 0 1 253 1 1556 * *
asv_0001;size=1234 CP002002.2630961.2632498 Bacteria;Firmicutes;Bacilli;Lactobacillales;Listeriaceae;Listeria;Listeria monocytogenes 10403S 100.0 253 0 0 1 253 1 1538 * *
asv_0001;size=1234 CP002003.1872561.1874116 Bacteria;Firmicutes;Bacilli;Lactobacillales;Listeriaceae;Listeria;Listeria monocytogenes FSL R2-561 100.0 253 0 0 1 253 1 1556 * *
This file shows the last common ancestor (lca) of all SILVA top hits for each amplicon sequence variant. It checks all the hits for an ASV in `NAME24-1_asvs.tax` and outputs the lowest rank that all hits have in common. In the example below, the last common ancestor of asv_1 is Bifidobacterium and the percent identity is 100.0. In the example for asv_2, you can see that there was no hit for it in `NAME24-1_asvs.tax`.
asv_1;size=1337 Bacteria;Actinobacteriota;Actinobacteria;Bifidobacteriales;Bifidobacteriaceae;Bifidobacterium; 100.0
asv_2;size=1234 0