This pipeline performs all steps that we consider primary analysis of amplicon raw sequencing data. The pipeline starts with raw sequencing data and generates a taxonomically annotated amplicon sequence variant (ASV) and operational taxonomic unit (OTU) table.
- The documentation was revised. Please let us know if anything is unclear; we are happy to get feedback.
- The primers file is now available in the templates folder.
- `runUSEARCH` now performs USEARCH sequence alignment against the Silva database and a last common ancestor (lca) search for ASVs as well (this can take a few hours depending on data size). Please note that we use a minimum sequence identity of 0.8 for a hit, which will lead to false positives; double-check the actual sequence identity shown in the output.
- We added `split_snake.py`, a snakemake pipeline that separates mixed-orientation reads into separate files. After separating the reads, you will have four files: forward R1, reverse R1, forward R2 and reverse R2. Remember to analyze the R1 and R2 files separately to train the error model properly (run the pipeline twice: once on fw R1 & rv R1 and once on fw R2 & rv R2).
- The pipeline now supports single-end data. Please see the new option in the config file. Two features are not yet supported for single-end reads: estimate parameters and insert_stats.
Documentation was last updated on 2024-08-22.
This section summarizes the individual steps executed in the pipeline.
In the first step we:
- Remove forward and reverse primers from the paired-end reads or primers from single-end reads.
- Remove inserts where either forward or reverse primer is missing (or both).
- Remove multiple copies of primers. There are a few cases in which the primer sequence appears multiple times in a single sequence.
For most cases (excluding blanks) you should see that >90% of sequences survive this step.
If you set `allowUntrimmed` to `True` in the `config.yaml`, untrimmed reads will not be discarded. You might need this option if the primers were not sequenced.
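For illustration, a primer-removal call of this kind could look like the sketch below. The exact command is generated by the pipeline; the file names are hypothetical, and the primer sequences are the 515f_parada/806r_apprill pair from the parameters table further below.

```bash
# Sketch of paired-end primer removal with cutadapt (file names are hypothetical).
# -g/-G remove the forward/reverse primer from R1/R2; -n 2 removes up to two
# primer copies per read (the multiple-copies case mentioned above);
# --discard-untrimmed drops pairs where either primer was not found
# (leave it out to mimic allowUntrimmed: True).
cutadapt \
    -g GTGYCAGCMGCCGCGGTAA \
    -G GGACTACNVGGGTWTCTAAT \
    -n 2 \
    --discard-untrimmed \
    -o trimmed_R1.fastq.gz -p trimmed_R2.fastq.gz \
    raw_R1.fastq.gz raw_R2.fastq.gz
```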
In this step we:
- Trim a predefined number of bases from the end of every sequence; as many as possible, but for paired-end data it is important that the reads can still be merged. These parameters are set in the config file and must be chosen based on the insert size. Trimming too many bases from paired-end reads will cause reads to fail to merge and lead to empty ASV tables. The `estimate_parameters.py` script suggests parameters depending on the primer pair.
- Perform quality control: remove sequences where the number of estimated errors is > X, where X is a predefined setting (defined in the `config.yaml`). See the sketch after this list.
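Under the hood this step corresponds to `dada2::filterAndTrim`. A minimal sketch using the example values from the config shown later (file names are hypothetical):

```bash
# Sketch of the quality-control step via dada2::filterAndTrim (file names hypothetical).
Rscript -e '
  library(dada2)
  filterAndTrim("trimmed_R1.fastq.gz", "filt_R1.fastq.gz",
                "trimmed_R2.fastq.gz", "filt_R2.fastq.gz",
                truncLen = c(161, 121),  # QC_TRUNC_R1 / QC_TRUNC_R2
                minLen   = 111,          # QC_MINLEN
                maxEE    = c(2, 2),      # QC_MAXEE
                maxN     = 0,            # dada2 cannot handle N bases
                compress = TRUE, multithread = TRUE)
'
```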
In this step we try to infer an error model using a predefined number of bases.
For paired-end reads, this step is executed twice: once for the forward reads and once for the reverse reads.
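A minimal sketch of this step via `dada2::learnErrors`, where `nbases` corresponds to `LE_NBASES` in the config; the folder and file naming here is an assumption based on the output structure described later:

```bash
# Sketch of error-model learning with dada2::learnErrors; run once per read direction.
Rscript -e '
  library(dada2)
  filts <- list.files("2filterAndTrim", "_R1.fastq.gz$", full.names = TRUE)
  errF  <- learnErrors(filts, nbases = 1e7, multithread = TRUE)
  saveRDS(errF, "errF.rds")
'
```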
In this step we run the actual `dada2` inference, which creates the ASVs.
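A minimal sketch of the inference call via `dada2::dada`, reusing the error model saved in the previous sketch (file names are hypothetical):

```bash
# Sketch of per-sample ASV inference with dada2::dada.
Rscript -e '
  library(dada2)
  errF  <- readRDS("errF.rds")
  dadaF <- dada(derepFastq("filt_R1.fastq.gz"), err = errF, multithread = TRUE)
  saveRDS(dadaF, "dadaF.rds")
'
```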
This option is for merging paired-end reads. So far, we have been working on paired-end reads rather than full-length inserts. This step merges the forward and reverse reads into a new set of ASVs. If too few inserts merge, check the number of bases trimmed in the quality-control step.
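A minimal sketch via `dada2::mergePairs`, carrying over the objects from the sketches above (`dadaR.rds` is the analogous object from the reverse-read run; all file names are hypothetical):

```bash
# Sketch of merging forward and reverse reads with dada2::mergePairs.
Rscript -e '
  library(dada2)
  mergers <- mergePairs(readRDS("dadaF.rds"), derepFastq("filt_R1.fastq.gz"),
                        readRDS("dadaR.rds"), derepFastq("filt_R2.fastq.gz"),
                        verbose = TRUE)
  saveRDS(makeSequenceTable(mergers), "seqtab.rds")
'
```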
`dada2` will use the merged ASVs (paired-end) or the inferred ASVs (single-end) as input to remove potential bimeras and chimeras. The output file contains the final but unannotated ASVs.
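A minimal sketch via `dada2::removeBimeraDenovo`; the `consensus` method is an assumption (it is the common choice in dada2 workflows):

```bash
# Sketch of bimera/chimera removal with dada2::removeBimeraDenovo.
Rscript -e '
  library(dada2)
  seqtab        <- readRDS("seqtab.rds")
  seqtab_nochim <- removeBimeraDenovo(seqtab, method = "consensus",
                                      multithread = TRUE, verbose = TRUE)
  saveRDS(seqtab_nochim, "seqtab_nochim.rds")
'
```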
The final ASV table is generated alongside taxonomic annotation using IDTAXA, with cutoffs calibrated on the TARA (Oceans + Pacific) datasets.
In addition to the ASV table, we also produce an OTU table in which ASVs are further clustered with `usearch` using a 97% identity cutoff, as sketched below.
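A minimal sketch of such a clustering call; UPARSE's `cluster_otus` uses a fixed 97% identity threshold, and the exact flags the pipeline uses are recorded in the `NAME24-1.command` output file:

```bash
# Sketch of clustering ASVs into OTUs with UPARSE (flags are illustrative).
usearch -cluster_otus NAME24-1.asvs.fasta \
    -otus NAME24-1.otus.fasta \
    -uparseout NAME24-1.otus.uparse \
    -relabel otu
```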
In addition to taxonomic assignment using IDTAXA, we perform USEARCH sequence alignment against the database provided via the `config.yaml` parameter `USEARCH_DB` and search for the last common ancestor (lca) for both ASVs and OTUs.
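A minimal sketch of such an alignment call; the flags are illustrative and may differ from the pipeline's exact invocation, and the output columns of the pipeline's `.tax` files may include extra fields. Note the permissive `-id 0.8` cutoff discussed above:

```bash
# Sketch of aligning ASVs against the SILVA database with usearch_global.
usearch -usearch_global NAME24-1.asvs.fasta \
    -db /nfs/cds/Databases/SILVA/SILVA138/SILVA_138.1_SSURef_NR99_tax_silva.fasta \
    -id 0.8 -strand both -top_hits_only \
    -blast6out NAME24-1.asvs.tax
```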
Perform sequence alignment between amplicon sequence variants and a reference sequence database of defined community members (`REFERENCE_SEQUENCE_FILE`). This step also uses the reference sequences at ASV resolution within `dada2`.
Genoscope produces amplicon sequencing data where half of the reads start with the forward primer and the other half with the reverse primer. If you consistently get only ~50% of the reads through cutadapt, you might need to check the read orientations.
Use the `split_snake.py` pipeline to split the reads into forward and reverse reads, then run the pipeline twice (once for the R1 and once for the R2 reads) to train the error model properly. For the reverse primers you may have to use the reverse complement of the primer sequence. Merge the sequence tables after the inference step (see the sketch below) and run the remainder of the pipeline on that data. Depending on your setup, the data at hand may not be mergeable even though there are forward and reverse reads (e.g. iSeq data with primer pairs that are far apart). In that case, run the single-end version of the pipeline on the sequences carrying the forward primer after splitting them using `split_snake.py` (R1 forward and R2 reverse).
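Merging the two sequence tables can be done with `dada2::mergeSequenceTables`; a minimal sketch, assuming each run saved its post-inference table as an RDS file (paths are hypothetical):

```bash
# Sketch of combining the two runs after inference with dada2::mergeSequenceTables.
Rscript -e '
  library(dada2)
  st_fwd <- readRDS("run_fwd/seqtab.rds")   # from the fw R1 & rv R1 run
  st_rev <- readRDS("run_rev/seqtab.rds")   # from the fw R2 & rv R2 run
  saveRDS(mergeSequenceTables(st_fwd, st_rev), "seqtab_combined.rds")
'
```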
Reads containing non-`ACGT` bases such as `N` cannot be processed by dada2, which means that reads containing the letter `N` will be removed. In some cases an entire cycle of a run was bad, which means that you would need to remove a large fraction of the reads. You could remove the early part of each read if the issue is at the beginning; however, this is not covered by this pipeline.
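If you do want to trim a bad leading cycle outside of this pipeline, `dada2::filterAndTrim` offers `trimLeft`; a minimal sketch (file names and the number of trimmed bases are hypothetical):

```bash
# Sketch of trimming the first bases of each read before QC (not part of this pipeline).
Rscript -e '
  library(dada2)
  filterAndTrim("raw_R1.fastq.gz", "trimmedstart_R1.fastq.gz",
                trimLeft = 5, maxN = 0, compress = TRUE)
'
```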
The pipeline has been written to deal with short-read Illumina sequencing data in the form of paired-end or single-end fastq files. All samples are processed together; therefore, files/data that come from different batches/flowcells/lanes have to be run individually, e.g. as a different `(Sub)Project`.
Run this command in a folder where the pipeline should be installed (Git is required for this step; see the git installation instructions):
git clone https://github.com/SushiLab/IMB_Amplicon_Pipeline.git
There are some tools that need to be installed upfront to run the pipeline. They can be wrapped into a conda environment:
$ cat environment.yaml
name: metab-pipe
channels:
- conda-forge
- bioconda
- defaults
dependencies:
- python=3.9.9
- pip
- snakemake=6.13.1
- cutadapt=3.5
- bbmap=38.93
- r-base=4.1.2
- bioconductor-biostrings=2.62.0
- bioconductor-dada2=1.22.0
- r-tidyverse=1.3.1
- r-data.table=1.14.2
- bioconductor-decipher=2.22.0
- biopython=1.79
- r-patchwork
- r-optparse
Installation can be done using conda (see installation instructions). Navigate to the location of the git repository and run the following commands:
conda env create -f environment.yaml
conda activate metab-pipe
If it isn't already installed (try running `ml USEARCH` on the server), you need to install USEARCH. If that is the case, add the installation path to your `.bashrc` using these commands (change the path in the command to the path of your USEARCH installation):
echo 'export PATH="/path/to/your/USEARCH/installation/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc
In this section, we will cover the structure of the input data.
The pipeline has been written to deal with short read Illumina sequencing data in the format of gzipped paired-end or single-end fastq files.
The file naming of the raw input files can be arbitrary as long as we know which file belongs to which sample. The actual sample names, on the other hand, are stricter.
We have established a certain naming scheme that is used for all (meta)genomic sequencing data of the lab, and we apply the same naming scheme for the metabarcoding data:
SUBPROJECT_SAMPLENAME_DATATYPE
When this scheme is applied to a specific project, it could look like the example below. Note that the sample name itself can contain underscores (`_`) as well, but should not contain any dots (`.`):
NAME24-1_P019S5_METAB
NAME24-1 = Subproject # NAMEyy-N
P019S5 = Samplename
METAB = Datatype
If the data is published and downloaded from ENA/SRA:
XXXXNN-M_SAMEAYYYY_METAB
XXXXNN-M = first four letters of surname, year, number
SAMEAYYYY = biosample. If multiple runs come from a biosample, then use SAMEAYYYY-S001, SAMEAYYYY-S002, ...
METAB = Datatype
The exception to this rule is the Tara datasets, which follow a slightly different scheme. But for those you will only see the resulting ASV/OTU table.
`Snakemake` is a great tool for executing the same commands on plenty of different input files. `Snakemake` is, however, terribly inflexible with file naming/structures. Therefore, we must provide the input files in the same file structure for snakemake to execute the jobs correctly. Again, you can provide the files in any naming/structure you like, but we will map them to the following format using `map.sh`:
For example, raw fastq files will be mapped to the following structure:
NAME24-1/0raw/NAME24-1_P019S5_METAB/NAME24-1_P019S5_METAB_R1.fastq.gz
NAME24-1/0raw/NAME24-1_P019S5_METAB/NAME24-1_P019S5_METAB_R2.fastq.gz
# generic:
SUBPROJECT/0raw/SUBPROJECT_SAMPLENAME_METAB/SUBPROJECT_SAMPLENAME_METAB_R1.fastq.gz
SUBPROJECT/0raw/SUBPROJECT_SAMPLENAME_METAB/SUBPROJECT_SAMPLENAME_METAB_R2.fastq.gz
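The generated `map.sh` essentially creates one directory per sample and links the raw files into it (the output table below notes that `0raw` contains symbolic links). A hypothetical sketch of two of its lines; the source paths are made up:

```bash
# Sketch of what map.sh does for one paired-end sample (source paths hypothetical).
mkdir -p NAME24-1/0raw/NAME24-1_P019S5_METAB
ln -s /path/to/source/P019S5_S1_R1_001.fastq.gz \
    NAME24-1/0raw/NAME24-1_P019S5_METAB/NAME24-1_P019S5_METAB_R1.fastq.gz
ln -s /path/to/source/P019S5_S1_R2_001.fastq.gz \
    NAME24-1/0raw/NAME24-1_P019S5_METAB/NAME24-1_P019S5_METAB_R2.fastq.gz
```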
Templates for these config files can be found in the `templates` folder.

- `config.yaml`: The config file that contains parameters, locations of files and the rules that should be executed in the snakemake pipeline.
- `samples`: The samples file contains a list of all samples that should be included in the analysis.
- `blocklist`: The blocklist file contains a list of all samples that should be excluded from the analysis, e.g. when all reads of a sample are removed during QC.
- `primers`: A set of frequently used primers for which `estimate_parameters.py` was optimized.
- `map.py`: A Python script to map the raw sequencing data to a standardized file and naming structure.
Example `config.yaml` file:
##############################
## FILE LOCATIONS AND NAMES ##
##############################
data_dir: '/path/to/your/data/' # e.g. /path/to/directory/SUNAGAWA/NAME24-1/
sample_file: 'samples'
blocklist: 'blocklist'
REFERENCE_SEQUENCE_FILE: # e.g. '/path/to/your/reference'. Can be left empty if you don't run the defined community analysis
#####################
## USER PARAMETERS ##
#####################
FORWARD_PRIMER_NAME: '515f_parada'
REVERSE_PRIMER_NAME: '806r_caporaso'
FORWARD_PRIMER_SEQUENCE: # can be left empty if you have a primers file that contains the correct primer names
REVERSE_PRIMER_SEQUENCE: # can be left empty if you have a primers file that contains the correct primer names
QC_MINLEN: '111'
QC_TRUNC_R1: '161'
QC_TRUNC_R2: '121'
QC_MAXEE: '2'
LE_NBASES: '1e7' # The minimum number of total bases to use for error rate learning. Samples are read into memory until at least this number of total bases has been reached, or all provided samples have been read in.
##################
## STEPS TO RUN ##
##################
pairedEnd: True
runCutadapt: True
allowUntrimmed: False
runQC: True
runLearnErrors: True
runInference: True
runMergeReads: True
runRemoveBimeras: True
runReadStats: True # Run bbmap: change quality encoding (qin) and run stats to check quality
runASVTax: True
runOTUTax: True
runUSEARCH: True
runDefCom: False
#########################
## END USER PARAMETERS ##
#########################
# Standard parameters for the Sunagawa lab.
silva_training: '/nfs/nas22/fs2202/biol_micro_sunagawa/Projects/PAN/GENERAL_METAB_ANALYSIS_PAN/data/resources/SILVA_SSU_r138_2019.RData'
primers: '/nfs/nas22/fs2202/biol_micro_sunagawa/Projects/PAN/GENERAL_METAB_ANALYSIS_PAN/data/resources/primers'
USEARCH_DB: '/nfs/cds/Databases/SILVA/SILVA138/SILVA_138.1_SSURef_NR99_tax_silva.fasta'
Example `samples` file:
NAME24-1_448_NOVOGENE_METAB
NAME24-1_A_450_SUSHI_METAB
NAME24-1_A_451_SUSHI_METAB
NAME24-1_neg_mini_SUSHI_METAB
Example `blocklist` file. Note that this sample is also in the `samples` file:
NAME24-1_neg_mini_SUSHI_METAB
Navigate to the folder where the analysis is run and copy the templates there. Here, we will copy all templates to a subfolder dedicated to config files. Later on, we will add the correct samples to the `samples` and `blocklist` files.
cd /where/you/run/your/analysis
mkdir configs
cp /where/your/git/repo/was/cloned/to/IMB_Amplicon_Pipeline/templates/* configs/
cd configs
The pipeline is best run in a screen. Here, we start a screen named `amplicon_pipe`:
screen -S amplicon_pipe
Activate the conda environment:
conda activate metab-pipe
Next, edit the `map.py` file:

- Change `source_folder` & `dest_folder` accordingly (remember to add `/0raw/` at the end of `dest_folder`).
- Change `r1_file` & `r2_file` to the file ending that matches your files.
- Change `samplename` such that it creates a unique sample name that matches the names given by the researcher.
- Update `new_sample_name`.
vim map.py
Add `data_dir` to the `config.yaml` file:
vim config.yaml
Here, we will run `map.sh` and double-check that everything ran correctly.
python map.py > map.sh
./map.sh
ls -althr ../0raw/*/*gz | rev | cut -f 1 -d " " | rev | sort | uniq -c | wc -l # check whether the output from this command equals double the amount of samples
Next, we will add the samples that we don't want to analyze to the `blocklist` and add all sample names to the `samples` file.
# vim blocklist # run this line in case you have samples you want to exclude from analysis
ls ../0raw/ | sort > samples # will create the samples file if you follow the same file structure as indicated above
In order to set the appropriate user parameters, we will run `estimate_parameters.py`, which creates a file called `estimated_parameters.txt` as shown in the Parameters section below. Note that estimating parameters currently only works with paired-end reads.
Generating amplicon sequencing data requires primers, which are used to extract specific regions and then to amplify the extracted DNA. Different protocols use different primer sets. The correct set of primers has to be specified in the parameters to successfully run the analysis pipeline.
If you are using one of the following primer pairs: `27f - 534r`, `515f_caporaso - 806r_caporaso`, `515f_parada - 806r_apprill`, `515f_caporaso - 926r`, `515f_parada - 926r` or `799f - 1193r`, then you can specify the path to the `primers` file in the `config.yaml` (see templates) and run the script without specifying `FORWARD_PRIMER_SEQUENCE` and `REVERSE_PRIMER_SEQUENCE`.
Add the appropriate `FORWARD_PRIMER_NAME` and `REVERSE_PRIMER_NAME` to `config.yaml`. If you haven't specified a primers file, add the `FORWARD_PRIMER_SEQUENCE` and `REVERSE_PRIMER_SEQUENCE` as well.
vim config.yaml
After setting the primer pair, we can run `estimate_parameters.py`. The script will take 5000 reads from each sample in your dataset and:

- Estimate parameters for every primer pair provided in the primers file, or only for the primer pair provided by `FORWARD_PRIMER_SEQUENCE` and `REVERSE_PRIMER_SEQUENCE`.
- Estimate the minimal length of R1 and R2 reads required for them to merge. When using your own primer pair, please note that these calculations are optimized for the primers in the primers file and may not work well with your primers.
- Output which minimum read length in quality control may work best for your primers. Please be careful when using this parameter if you are using your own primers (not in the primers file).
- Output how many sequences pass quality control using different values for `maxee`.
- Check whether there is an issue with `N` bases.
You can read more on the parameters here:
`dada2` tries to infer amplicon sequence variants from quality-controlled sequencing data, and it works best when there are no errors in the sequencing data. Base quality usually drops the closer you get to the end of the sequence, which is why we cut off the read towards the end. `dada2` suggests the following: trim as many bases from the end of the read as possible while the forward and reverse reads can still merge. Also, trim more from the reverse read, as the forward read generally has higher quality.
`dada2` can work with sequencing data that has erroneous bases by correcting them in the learnErrors/inference step. However, some sequences may still be of too low quality and need to be removed. For that we apply the `maxee` parameter, which estimates how many bases in a sequence are wrong by looking at all quality values. So a value of `maxee=1` means that at most one expected error per read is tolerated, e.g. one erroneous base in a 100 bp read. `dada2` uses `maxee=2` in its tutorial, and we suggest keeping it this way.
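For intuition: the expected number of errors of a read is the sum of its per-base error probabilities, EE = sum over bases of 10^(-Q/10) for Phred scores Q. A quick sanity check:

```bash
# Expected errors for a 100 bp read at uniform Q30: 100 * 10^(-3) = 0.1,
# comfortably below maxee=2; at uniform Q20 it is already 1.0.
Rscript -e '
  q <- rep(30, 100)
  sum(10^(-q / 10))   # 0.1
'
```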
`dada2` can't deal with sequences that contain the base `N`, which is why we remove sequences containing this base. The presence of `N` in your data is reported by the `estimate_parameters.py` script. Sometimes a run has one bad cycle and the base at that specific position is set to `N` in all sequences, which would remove all of your sequences when running the pipeline. If that `N` is at the beginning of the sequence, you could use `dada2` to remove bases from the beginning, but this is not part of the pipeline.
In the following command, `estimate_parameters.py` will output a file called `estimated_parameters.txt` which contains the information mentioned above:
python /path/to/the/amplicon_pipeline/code/pipeline/estimate_parameters.py config.yaml 16 > estimated_parameters.txt
cat estimated_parameters.txt
In order to set the correct parameters for the analysis, we take a look at `estimated_parameters.txt`. The example below shows the output for a test dataset. It looks like the 515/806 primers were used. The required read length is 161 for R1 (121 for R2) and there seem to be no issues with `N`. Using the default value `2` for the `maxee` parameter will keep ~95% of the sequences. (If `PERC_INSERTS` is <90 for `PARAM=MAXEE` 2.0, i.e. less than 90% of sequences survived, please double-check which files are affected and contact the person who sequenced the data.)
FORWARD_PRIMER | FORWARD_PRIMER_NAME | REVERSE_PRIMER | REVERSE_PRIMER_NAME | RAW_INSERTS | AVERAGE_READ_LENGTH_R1 | AVERAGE_READ_LENGTH_R2 | CUTADAPT_INSERTS | PRIMER_HITS | R1_READS_W_N | R2_READS_W_N | PARAM=TRUNCLENR1 | PARAM=TRUNCLENR2 | PARAM=QCMINLEN | PARAM=MAXEE | QC_INSERTS | PERC_INSERTS | PLANNED_OVERLAP | ESTIMATED_INSERT_SIZE |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
AGAGTTTGATCCTGGCTCAG | 27f | ATTACCGCGGCTGCTGG | 534r | 17369 | 250 | 250 | 0 | 0.00 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
AGMGTTYGATYMYGGCTCAG | 27f_mod | ATTACCGCGGCTGCTGG | 534r | 17369 | 250 | 250 | 0 | 0.00 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
CCTACGGGGNGGCWGCAG | 357f | GACTACHVGGGTATCTAATCC | 758r | 17369 | 250 | 250 | 1 | 0.00 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
GTGCCAGCMGCCGCGGTAA | 515f_caporaso | GGACTACHVGGGTWTCTAAT | 806r_caporaso | 17369 | 250 | 250 | 14661 | 84.00 | 7 | 4 | 161 | 121 | 111 | 0.1 | 2974 | 20 | 15 | 253 |
GTGCCAGCMGCCGCGGTAA | 515f_caporaso | GGACTACHVGGGTWTCTAAT | 806r_caporaso | 17369 | 250 | 250 | 14661 | 84.00 | 7 | 4 | 161 | 121 | 111 | 0.2 | 7176 | 48 | 15 | 253 |
GTGCCAGCMGCCGCGGTAA | 515f_caporaso | GGACTACHVGGGTWTCTAAT | 806r_caporaso | 17369 | 250 | 250 | 14661 | 84.00 | 7 | 4 | 161 | 121 | 111 | 0.5 | 11214 | 76 | 15 | 253 |
GTGCCAGCMGCCGCGGTAA | 515f_caporaso | GGACTACHVGGGTWTCTAAT | 806r_caporaso | 17369 | 250 | 250 | 14661 | 84.00 | 7 | 4 | 161 | 121 | 111 | 1 | 12955 | 88 | 15 | 253 |
GTGCCAGCMGCCGCGGTAA | 515f_caporaso | GGACTACHVGGGTWTCTAAT | 806r_caporaso | 17369 | 250 | 250 | 14661 | 84.00 | 7 | 4 | 161 | 121 | 111 | 1.5 | 13703 | 93 | 15 | 253 |
GTGCCAGCMGCCGCGGTAA | 515f_caporaso | GGACTACHVGGGTWTCTAAT | 806r_caporaso | 17369 | 250 | 250 | 14661 | 84.00 | 7 | 4 | 161 | 121 | 111 | 2 | 14054 | 95 | 15 | 253 |
GTGCCAGCMGCCGCGGTAA | 515f_caporaso | GGACTACHVGGGTWTCTAAT | 806r_caporaso | 17369 | 250 | 250 | 14661 | 84.00 | 7 | 4 | 161 | 121 | 111 | 2.5 | 14277 | 97 | 15 | 253 |
GTGYCAGCMGCCGCGGTAA | 515f_parada | GGACTACNVGGGTWTCTAAT | 806r_apprill | 17369 | 250 | 250 | 15113 | 87.00 | 7 | 4 | 161 | 121 | 111 | 0.1 | 3033 | 20 | 15 | 253 |
GTGYCAGCMGCCGCGGTAA | 515f_parada | GGACTACNVGGGTWTCTAAT | 806r_apprill | 17369 | 250 | 250 | 15113 | 87.00 | 7 | 4 | 161 | 121 | 111 | 0.2 | 7347 | 48 | 15 | 253 |
GTGYCAGCMGCCGCGGTAA | 515f_parada | GGACTACNVGGGTWTCTAAT | 806r_apprill | 17369 | 250 | 250 | 15113 | 87.00 | 7 | 4 | 161 | 121 | 111 | 0.5 | 11506 | 76 | 15 | 253 |
GTGYCAGCMGCCGCGGTAA | 515f_parada | GGACTACNVGGGTWTCTAAT | 806r_apprill | 17369 | 250 | 250 | 15113 | 87.00 | 7 | 4 | 161 | 121 | 111 | 1 | 13327 | 88 | 15 | 253 |
GTGYCAGCMGCCGCGGTAA | 515f_parada | GGACTACNVGGGTWTCTAAT | 806r_apprill | 17369 | 250 | 250 | 15113 | 87.00 | 7 | 4 | 161 | 121 | 111 | 1.5 | 14115 | 93 | 15 | 253 |
GTGYCAGCMGCCGCGGTAA | 515f_parada | GGACTACNVGGGTWTCTAAT | 806r_apprill | 17369 | 250 | 250 | 15113 | 87.00 | 7 | 4 | 161 | 121 | 111 | 2 | 14484 | 95 | 15 | 253 |
GTGYCAGCMGCCGCGGTAA | 515f_parada | GGACTACNVGGGTWTCTAAT | 806r_apprill | 17369 | 250 | 250 | 15113 | 87.00 | 7 | 4 | 161 | 121 | 111 | 2.5 | 14715 | 97 | 15 | 253 |
GTGCCAGCMGCCGCGGTAA | 515f_caporaso | CCGYCAATTYMTTTRAGTTT | 926r | 17369 | 250 | 250 | 0 | 0.00 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
GTGYCAGCMGCCGCGGTAA | 515f_parada | CCGYCAATTYMTTTRAGTTT | 926r | 17369 | 250 | 250 | 0 | 0.00 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Set the parameters in the `config.yaml` according to the output:
vim config.yaml
QC_MINLEN: PARAM=QCMINLEN # if using other primers than in the primers file, please double check whether this estimated parameter suits your data
QC_TRUNC_R1: PARAM=TRUNCLENR1 # if using other primers than in the primers file, please double check whether this estimated parameter suits your data
QC_TRUNC_R2: PARAM=TRUNCLENR2 # if using other primers than in the primers file, please double check whether this estimated parameter suits your data
QC_MAXEE: PARAM=MAXEE
And set the steps you would like to run to `True`, for example:
##################
## STEPS TO RUN ##
##################
pairedEnd: True
runCutadapt: True
allowUntrimmed: False
runQC: True
runLearnErrors: True
runInference: True
runMergeReads: True
runRemoveBimeras: True
runReadStats: True
runASVTax: True
runOTUTax: True
runUSEARCH: True
runDefCom: False
When all the parameters are properly set, we run the snakemake pipeline after loading the USEARCH module (`ml USEARCH`; not needed if you have your own installation):
ml USEARCH
snakemake -s /path/to/amplicon_pipeline/code/pipeline/dada2_snake.py --configfile /path/to/config.yaml -j 16 -k
The number of cores should be adjusted depending on the available resources. There is currently no queue/Euler integration, as these jobs require minutes rather than hours.
When the IMB amplicon pipeline has finished running all possible steps, the folder will contain files and folders similar to this:
Output | File/Folder Description |
---|---|
0raw | Folder with symbolic links to the raw files |
1cutadapt | Folder containing cutadapt and readstats output |
2filterAndTrim | Folder with filter & trim and readstats output |
3learnerrors | Folder with learnerrors output |
4sampleInference | Folder with inference output |
5mergeReads | Folder with merged reads |
6bimeraRemoval | Folder with final but unannotated ASVs |
7taxonomy | Folder containing temporary files for annotation |
8uparsetax | Folder with USEARCH taxonomic annotation and last common ancestor (lca) output for ASVs and OTUs |
configs | Folder with your configs, some scripts and estimated parameters |
NAME24-1.asvs.fasta | Fasta file with ASV sequences |
NAME24-1.asvs.tsv | File containing the ASVs, their taxonomic assignment and their counts for every sample |
NAME24-1.benchmark | File with resource usage of the command to cluster ASVs into OTUs |
NAME24-1.command | File containing the command run for clustering ASVs into OTUs |
NAME24-1.done | Done file: Signals to the pipeline that ASV clustering was completed |
NAME24-1.insert.counts | Output from `create_insert_stats.py`: contains insert counts for each step of the pipeline (1 insert = 2 reads when paired-end) |
NAME24-1.log | File with logs from UPARSE command when clustering ASVs into OTUs |
NAME24-1.otus.fasta | Fasta file with OTU sequences |
NAME24-1.otus.tsv | File containing the OTUs, their taxonomic assignment and their counts for every sample |
NAME24-1.otus.uparse | File with uparse output |
NAME24-1.refassign.log | File with logs from defined community analysis |
NAME24-1.refs.done | Done file: Signals to the pipeline that the defined community sequence alignment step was completed |
NAME24-1.refs.tsv | Output file from defined community analysis: See example below |
Below are a few examples of these files. Please note that they were shortened and adapted from other output files:
asv otu uparse_info seq tax NAME24-1_Sample_1_METAB NAME24-1_Sample_2_METAB NAME24-1_Sample_3_METAB NAME24-1_Sample_4_METAB NAME24-1_Sample_5_METAB
asv_0001;size=130643 otu1 * TACGGAGGGTGCAAGCGTTAATCGGAATTACT rootrank;Root;100|domain;Bacteria;100|phylum;Proteobacteria;100|class;Gammaproteobacteria;71.28|order;Alteromonadales;49.38|family;Marinobacteraceae;49.38|genus;Marinobacter;49.38 33116 161 38968 58390 8
asv_0002;size=1 otu2 dqt=49; TACGTAGGTGGCAAGCGTTGTCCGGATTTATT rootrank;Root;100|domain;Bacteria;100|phylum;Firmicutes;100|class;Bacilli;100|order;Lactobacillales;89.9|family;Listeriaceae;86.82|genus;Listeria;85.51 0 0 0 0 1
asv_0003;size=66756 otu1 dqt=2;top=otu1(99.2%); TACGGAGGGTGCAAGCGTTAATCGGAATT rootrank;Root;100|domain;Bacteria;100|phylum;Proteobacteria;100|class;Gammaproteobacteria;82.79|order;Alteromonadales;79.46|family;Marinobacteraceae;79.46|genus;Marinobacter;79.46 66546 101 0 20 89
>asv_0001;size=130643
TACGGAGGGTGCAAGCGTTAATCGGAATTACT
>asv_0002;size=1
TACGTAGGTGGCAAGCGTTGTCCGGATTTATT
You can read more about interpreting the uparse output in the documentation.
asv_0001;size=130643 otu1 *
asv_0002;size=1 otu2 dqt=49;
This output will be created if you run the defined community analysis. Columns are samples and rows are the defined community members/unassigned ASVs. If some ASVs are assigned to either none or to two or more defined community members (ambiguous), you can see this in the rows below the defined community members.
NAME24-1_Sample_1_METAB NAME24-1_Sample_2_METAB NAME24-1_Sample_3_METAB NAME24-1_Sample_4_METAB NAME24-1_Sample_5_METAB
Defined_community_member_1 30 9411 3785 1 3
Defined_community_member_2 1 78 56 42 14
Defined_community_member_3 60 7 0 900 6093
Defined_community_member_4 5 1 3 70 41
Defined_community_member_5 958 8760 9004 6305 820
asv_001 (ambiguous) 55 0 92 55 70
asv_123 (ambiguous) 0 0 6 9 0
asv_022 (none) 8 9 0 0 0
This file contains insert counts for each step of the pipeline. One insert consists of 2 paired-end reads. This file is only created for paired-end data.
sample 0raw 1cutadapt 2filterAndTrim 4sampleInference_R1 4sampleInference_R2 5mergeReads 6bimeraRemoval 8asv_counts 8otu_counts
NAME24-1_Sample_1_METAB 914802 908413 860250 859721 859781 856405 757325 757325 757010
NAME24-1_Sample_2_METAB 819817 813457 760931 760446 760594 757473 670662 670662 670469
NAME24-1_Sample_3_METAB 887261 879989 827964 827562 827666 824548 720805 720805 720660
NAME24-1_Sample_4_METAB 732524 726682 684555 684003 684128 681105 604588 604588 604321
This file contains the taxonomic annotations with the highest percent identity found using usearch and the SILVA database. If the percent identities of the top matches are identical, you will find all of them in the output (see example). You can read about the details of the format in the documentation. Please always double-check the actual percent identity, as we use 80% as the cutoff, which is not at species level (the output will contain a species-level taxonomic annotation even if the percent identity is only, say, 82%).
asv_0001;size=1234 CP002003.1766641.1768196 Bacteria;Firmicutes;Bacilli;Lactobacillales;Listeriaceae;Listeria;Listeria monocytogenes FSL R2-561 100.0 253 0 0 1 253 1 1556 * *
asv_0001;size=1234 CP002002.2630961.2632498 Bacteria;Firmicutes;Bacilli;Lactobacillales;Listeriaceae;Listeria;Listeria monocytogenes 10403S 100.0 253 0 0 1 253 1 1538 * *
asv_0001;size=1234 CP002003.1872561.1874116 Bacteria;Firmicutes;Bacilli;Lactobacillales;Listeriaceae;Listeria;Listeria monocytogenes FSL R2-561 100.0 253 0 0 1 253 1 1556 * *
This file shows the last common ancestor (lca) of all SILVA top hits for each amplicon sequence variant. It checks all the hits for an ASV in `NAME24-1_asvs.tax` and outputs the lowest rank that all hits have in common. In the example below, the last common ancestor of asv_1 is Bifidobacterium and the percent identity is 100.0. In the example for asv_2, you can see that there was no hit for it in `NAME24-1_asvs.tax`.
asv_1;size=1337 Bacteria;Actinobacteriota;Actinobacteria;Bifidobacteriales;Bifidobacteriaceae;Bifidobacterium; 100.0
asv_2;size=1234 0