Gene catalog pipeline

Description

Requirements

BBmap
samtools
sambamba
iMGMC repository

please export PATH of github clone to you PATH variable:
git clone https://github.com/tillrobin/iMGMC)
PATH=${PWD}/iMGCM/scripts/:${PWD}/iMGCM/PICRUSt:${PATH }

Data-pre-processing

1. Demultiplexing and adapter removing

All reads were processed with Ilumina bcl2fastq Conversion

2. Quality check of the reads

Very low quality should be trimmed. Otherwise you do not need to preform quality trimming or any kind of error correction. If you want to trim you data you can just remove the "untrim" parameter in the next step. Please see bbmap/bbduk documentation for more details.

3. Removal of contaminated host

You can use bbmap and the reference mouse genome from Ensembl.

# for each Sample ($SampleName) with ReadR1 ($Fastq_R1) and ReadR2 ($Fastq_R2) we preform
bbmap.sh -Xmx50g usejni=t unpigz=t threads=10 fast=t \
minratio=0.9 maxindel=3 bwr=0.16 bw=12 fast minhits=2 qtrim=r trimq=10 untrim idtag printunmappedcount kfilter=25 maxsites=1 k=14 \
in=${Fastq_R1} \
in2=${Fastq_R2} \
ref=Mus_musculus.GRCm38.75.dna_rm.toplevel.fa.gz \
statsfile=${SampleName}_rmhost_filtering.stats \
outu=${SampleName}_R1_rmhost.fastq.gz \
outu2=${SampleName}_R2_rmhost.fastq.gz

Database-settings

1. Download iMGMC data

download.sh

2. Index iMGMC catalog

cd iMGCM-data
bbmap.sh ref=iMGMC-GeneID.fasta.gz

3. Export PATH for the data

export iMGCM-data=$PWD

Mapping

For this tutorial we use bbmap for mapping the reads to iMGCM catalog (ORFs: iMGCM-GeneID). You can use an other mapper and process the standard output (sam-files) with bbmap to summarize the counts. Please see bbmap documentation for more details.

# for each Sample ($SampleName) with ReadR1 ($Fastq_R1) and ReadR2 ($Fastq_R2) we preform:
bbmap.sh -Xmx30g unpigz=t threads=${usedCores} minid=0.90 \
ref=${iMGCM-data} nodisk \
statsfile=${SampleName}.statsfile \
scafstats=${SampleName}.scafstats \
covstats=${SampleName}.covstat \
rpkm=${SampleName}.rpkm \
sortscafs=f nzo=f \
in=${SampleName}_R1_rmhost.fastq.gz \
in2=${SampleName}_R2_rmhost.fastq.gz

TPM-normalization

You need to normalize you samples counts for a comparison. Here we use TPM-normalization to create ${SampleID}.TPM form ${SampleName}.covstat :

make-GeneID-TPM-fromCovStats.sh ${SampleName}.covstat

Basic-statistics

To get basic statistics for samples eg. to import in other software like R you need a ${SampleName}.covstat for each sample. All scripts used below you can find here.

Kegg-KO annotations

To add Kegg KO annotations to our "${SampleID}.TPM" file you need the mapping file iMGMC-map-GeneID-KeggKO.tab. GeneID without a KO annotation will be obmitted. For each GeneID only one KO will be added. In addition the summary file "KOsum-${SampleID}.tab" will be generated, including summarized TPM for each KO in the sample.

make-KO-TPM-fromTPM.sh ${SampleName}.TPM

Summary of KO statistics for all samples

The script will conduct all "${SampleName}.TPM" and "KOsum-{SampleName}.tab" files into a table including SampleID and a header (long-format table). These files can be eg. imported to R.

sumup-TPM-KOsum-files.sh

See an example for an gene and KO ordination in the tutorial section.

Mapping GeneID to orginal ContigID and BinID

In this step we summarize the TPM of ORFs (GeneID) to the corresponding ContigID and BinID for each "${SampleID}.TPM" file. You need the mapping file iMGMC-map-Gene-Contig-Bin.tab. First we add the ContigID and BinID information to the "${SampleID}.TPM" -> "GeneID-ContigID-BinID-${SampleID}.tab". In addition two summary files "ContigID-sumTPM-${SampleID}.tab" and "BinID-sumTPM-${SampleID}.tab" will be generated, including summarized TPM form each sample.

make-GeneID-ContigID-BinId-TPM-fromTPM.sh ${SampleName}.TPM

Summary of ContigID and BinID statistics for all samples

The script will conduct all "ContigID-sumTPM-${SampleID}.tab" and "BinID-sumTPM-${SampleID}.tab" files including SampleID and a header (long-format table). These files can be eg. imported to R.

sumup-TPM-ContigID-BinID-files.sh

Results:

SampleID-ContigID-TPM.tab : Sum of gene TPM-counts for each sample for each ContigID
SampleID-ContigID-TPM-min1.tab : Sum of gene TPM-counts for each sample for each ContigID with a minimum of TPM>=1
SampleID-BinID-TPM.tab : Sum of gene TPM-counts for each sample for each BinID
SampleID-BinID-TPM-min1.tab : Sum of gene TPM-counts for each sample for each BinID with a minimum of TPM>=1

Taxonomic profiling of TPM counts

For taxonomic profiling of the reads you can use the taxonomic classification of the ORFs or the Contigs. If you interested in bacterial species you may consider the MAG-pipeline and the CoverM-tutorial.

# Taxomic profiling over different taxonomic levels using GeneID annotations from "${SampleID}.TPM"
sumup-TPM-for-TaxLevels-fromTPM.sh ${SampleName}.TPM


# Taxomic profiling over different taxonomic levels using ContigID annotations from "ContigID-sumTPM-${SampleID}.tab"
sumup-TPM-for-TaxLevels-fromContigTPM.sh ContigID-sumTPM-${SampleID}.tab

The resulting files sumTPM-Taxonomic-${SampleID}.tab contains summarized TPM-counts for each level in the long format. Annother file contains only the top 12 abundant taxa.

TaxLevel	Taxonomy	TPM
Superkingdom	Bacteria	505299
Superkingdom	unclassified	394423
Superkingdom	Eukaryota	99557.4
Superkingdom	NA	475.052
Superkingdom	Viruses	226.726
Superkingdom	Archaea	19.2614
Phylum	unclassified	577044
Phylum	Firmicutes	275744
Phylum	Chordata	54223
Phylum	Chlamydiae	43202.9
Phylum	Nematoda	17881.1

This data can used for downstream processing like ploting in R:

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

genecatalog-pipeline.md

genecatalog-pipeline.md

Gene catalog pipeline

Table of Contents

Description

Requirements

Data-pre-processing

Database-settings

Mapping

TPM-normalization

Basic-statistics

Kegg-KO annotations

Summary of KO statistics for all samples

Mapping GeneID to orginal ContigID and BinID

Summary of ContigID and BinID statistics for all samples

Taxonomic profiling of TPM counts

Files

genecatalog-pipeline.md

Latest commit

History

genecatalog-pipeline.md

File metadata and controls

Gene catalog pipeline

Table of Contents

Description

Requirements

Data-pre-processing

Database-settings

Mapping

TPM-normalization

Basic-statistics

Kegg-KO annotations

Summary of KO statistics for all samples

Mapping GeneID to orginal ContigID and BinID

Summary of ContigID and BinID statistics for all samples

Taxonomic profiling of TPM counts