Normalization of RNA-seq gene expression data. Supported methods:
- Counts per million (CPM)
- Transcripts per kilobase million (TPM)
- Quantile normalization to average distribution
TPM normalization can either accept pre-computed gene lengths as input or compute gene lengths from a gene annotation in GTF format, using the union exon-based approach. The computed gene lengths are identical to the lengths reported by featureCounts (validated for Homo sapiens, Mus musculus, Rattus norvegicus, and Macaca mulatta with Ensembl and UCSC annotations).
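As a rough illustration of the ideas above, the following is a minimal sketch in plain Python of a union exon length, CPM, and TPM for a single sample. The function names (`union_exon_length`, `cpm`, `tpm`) are hypothetical and are not part of the rnanorm API. The sketch assumes exon coordinates are 1-based and inclusive (as in GTF) and that the sample has a non-zero total count.

```python
def union_exon_length(exons):
    """Length of the union of exon intervals for one gene.

    `exons` is a list of (start, end) tuples, 1-based and inclusive,
    as in GTF. Overlapping exons are counted only once.
    """
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(exons):
        if cur_end is None or start > cur_end:
            # No overlap with the current merged interval: close it.
            if cur_end is not None:
                total += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            # Overlap: extend the current merged interval.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start + 1
    return total


def cpm(counts):
    """Counts per million for one sample (list of raw counts)."""
    total = sum(counts)
    return [c / total * 1e6 for c in counts]


def tpm(counts, lengths):
    """Transcripts per kilobase million for one sample.

    `lengths` holds gene lengths in base pairs, in the same order as `counts`.
    """
    # Reads per kilobase of gene length.
    rates = [c / (length / 1000) for c, length in zip(counts, lengths)]
    total = sum(rates)
    return [r / total * 1e6 for r in rates]


# Illustrative values only (not real data).
gene_length = union_exon_length([(100, 200), (150, 300), (400, 450)])
sample_cpm = cpm([10, 30, 60])
sample_tpm = tpm([10, 30, 60], [1000, 3000, 2000])
```

In this sketch the TPM values of a sample always sum to one million, whereas CPM does not correct for gene length at all.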
Quantile normalization is implemented as described on Wikipedia. First, we compute an average distribution by sorting each sample (column) and taking the mean across samples at each sorted position to determine the rank values. Second, we rank the values within each sample (column) and replace each value with the rank value for its rank (the average expression at that rank).
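The two steps can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not the rnanorm implementation: the function name `quantile_normalize` is hypothetical, the matrix is a list of rows (genes) with one column per sample, and ties within a sample are broken by order of appearance, which may differ from how rnanorm handles ties.

```python
def quantile_normalize(matrix):
    """Quantile-normalize a genes x samples matrix (list of rows)."""
    n_rows = len(matrix)
    n_cols = len(matrix[0])
    columns = [[row[j] for row in matrix] for j in range(n_cols)]

    # Step 1: sort each sample and average across samples at each rank.
    sorted_cols = [sorted(col) for col in columns]
    rank_values = [
        sum(col[i] for col in sorted_cols) / n_cols for i in range(n_rows)
    ]

    # Step 2: rank values within each sample and substitute the rank value.
    result = [[0.0] * n_cols for _ in range(n_rows)]
    for j, col in enumerate(columns):
        # Indices of genes ordered by expression in sample j
        # (stable sort, so ties keep their original order).
        order = sorted(range(n_rows), key=lambda i: col[i])
        for rank, i in enumerate(order):
            result[i][j] = rank_values[rank]
    return result


# Illustrative values only: 3 genes (rows) x 2 samples (columns).
normalized = quantile_normalize([[2, 30], [4, 10], [6, 20]])
# normalized == [[6.0, 18.0], [12.0, 6.0], [18.0, 12.0]]
```

After normalization every sample contains the same set of values (here 6, 12 and 18), only assigned to different genes.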
Install the rnanorm Python package:
pip install rnanorm
See the rnanorm command help:
rnanorm --help
Run rnanorm with pre-computed gene lengths:
rnanorm expr.tsv --cpm-output=expr.cpm.tsv --tpm-output=expr.tpm.tsv --gene-lengths=lengths.tsv
Run rnanorm with a genome annotation (gene lengths will be computed on the fly):
rnanorm expr.tsv --cpm-output=expr.cpm.tsv --tpm-output=expr.tpm.tsv --annotation=annot.gtf
For quantile normalization, we suggest using TPM expressions as input:
rnanorm expr.tpm.tsv --quantile-output=expr.quantile.tsv
Install the rnanorm Python package for development:
flit install --deps=all --symlink
Run all tests and linters:
tox