RNAIndel calls coding indels from tumor RNA-Seq data and classifies them as somatic, germline, or artifactual. RNAIndel supports GRCh38 and GRCh37.
New implementation with indelpost, an indel realigner/phaser.
- faster analysis (typically < 20 min with 8 cores)
- somatic complex indel calling in RNA-Seq
- ensemble calling with your own caller (e.g., GATK HaplotypeCaller/MuTect2)
- improved sensitivity for homopolymer indels by error-profile outlier analysis
RNAIndel can be run via Docker or installed locally from PyPI.
We publish our latest Docker builds on GitHub. You can run the latest code base with the following command:
> docker run --rm -v ${PWD}:/data ghcr.io/stjude/rnaindel:latest
If you want a more native feel, you can add an alias to your shell's rc file. Use single quotes so that ${PWD} is expanded when the alias is invoked rather than when it is defined.
> alias rnaindel='docker run --rm -v ${PWD}:/data ghcr.io/stjude/rnaindel:latest'
Note: the first time you execute the docker run command, you will see the output of Docker downloading the image.
RNAIndel depends on python>=3.8.0 and java>=1.8.0.
Installing via the pip command will install the following packages:
- indelpost>=0.0.4
- pysam>=0.15.0
- cython>=0.29.12
- numpy>=1.16.0
- ssw-py>=1.0.1
- pandas>=0.23.0
- scikit-learn>=0.22.0
> pip install indelpost --no-binary indelpost --no-build-isolation
> pip install rnaindel
Test the installation.
> rnaindel -h
usage: rnaindel <subcommand> [<args>]
subcommands are:
SetUp Initialize predicition models
PredictIndels Predict somatic/germline/artifact indels from tumor RNA-Seq data
CalculateFeatures Calculate and report features for training
Train Perform model training
CountOccurrence Count occurrence within cohort to filter false somatic predictions
positional arguments:
subcommand PredictIndels, CalculateFeatures, Train, CountOccurrence
optional arguments:
-h, --help show this help message and exit
--version show program's version number and exit
Download data package (version 3 is not compatible with the previous data package).
#GRCh38
curl -LO https://zenodo.org/records/10552784/files/data_dir_grch38.tar.gz
tar -zxf data_dir_grch38.tar.gz
#GRCh37
curl -LO https://zenodo.org/records/10552784/files/data_dir_grch37.tar.gz
tar -zxf data_dir_grch37.tar.gz
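For scripted setups, the download step above can be sketched in Python. The function names `data_package_url` and `fetch_data_package` are hypothetical helpers, not part of RNAIndel; the URLs are the two Zenodo links listed above.

```python
import tarfile
import urllib.request

# Base URL taken from the curl commands above.
ZENODO_BASE = "https://zenodo.org/records/10552784/files"


def data_package_url(build):
    """Return the data package URL for a genome build (hypothetical helper)."""
    build = build.lower()
    if build not in ("grch38", "grch37"):
        raise ValueError("build must be GRCh38 or GRCh37")
    return f"{ZENODO_BASE}/data_dir_{build}.tar.gz"


def fetch_data_package(build, dest="."):
    """Download and unpack the data package, mirroring `curl -LO` + `tar -zxf`."""
    archive = f"{dest}/data_dir_{build.lower()}.tar.gz"
    urllib.request.urlretrieve(data_package_url(build), archive)
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(dest)
    return archive


# Only the URL is built here; call fetch_data_package("GRCh38") to download.
print(data_package_url("GRCh38"))
```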
RNAIndel has 5 subcommands:
- SetUp: pretrain the model with user's sklearn version
- PredictIndels: analyze RNA-Seq data for indel discovery
- CalculateFeatures: calculate features for training
- Train: train models with user's dataset
- CountOccurrence: annotate over-represented somatic predictions
Subcommands are invoked as follows:
> rnaindel subcommand [subcommand-specific options]
Run this command once after installation. It takes 5 to 10 minutes to complete.
NOTE: this step is not required when using the Docker image.
> rnaindel SetUp -d data_dir
RNAIndel expects a STAR 2-pass mapped BAM file that is sorted by coordinate and processed with MarkDuplicates. Further preprocessing, such as indel realignment, may prevent the desired behavior.
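A quick sanity check of these expectations can be sketched from the BAM header text (for example, the output of `samtools view -H input.bam`). This is a heuristic under stated assumptions, not part of RNAIndel: `check_sam_header` is a hypothetical helper, it assumes the mapper and duplicate-marking tool recorded themselves in `@PG` lines, and it cannot tell whether STAR was run in 2-pass mode.

```python
def check_sam_header(header_text):
    """Heuristically check SAM header text for the preprocessing RNAIndel expects."""
    sort_order = None
    programs = set()
    for line in header_text.splitlines():
        fields = line.split("\t")
        # Header fields after the record type are TAG:VALUE pairs.
        tags = dict(f.split(":", 1) for f in fields[1:] if ":" in f)
        if fields[0] == "@HD":
            sort_order = tags.get("SO")
        elif fields[0] == "@PG":
            # Collect program IDs and names, lower-cased for matching.
            programs.update(
                v.lower() for k, v in tags.items() if k in ("ID", "PN")
            )
    return {
        "coordinate_sorted": sort_order == "coordinate",
        # Assumption: STAR records itself with an ID/PN starting with "STAR".
        "star_mapped": any(p.startswith("star") for p in programs),
        # Assumption: the duplicate-marking step is named like "MarkDuplicates".
        "duplicates_marked": any("markduplicates" in p for p in programs),
    }


example_header = (
    "@HD\tVN:1.4\tSO:coordinate\n"
    "@SQ\tSN:chr1\tLN:248956422\n"
    "@PG\tID:STAR\tPN:STAR\n"
    "@PG\tID:MarkDuplicates\tPN:MarkDuplicates\n"
)
print(check_sam_header(example_header))
```

A `False` value only means the header does not show evidence of that step; the BAM file itself may still need to be inspected.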
This mode uses the built-in caller to analyze simple and complex indels.
> rnaindel PredictIndels -i input.bam -o output.vcf -r ref.fa -d data_dir -p 8  # -p defaults to 1
Indels in the external VCF (supplied by -v) are integrated into the callset by the built-in caller to boost performance.
See demo.
> rnaindel PredictIndels -i input.bam -o output.vcf -r ref.fa -d data_dir -v gatk.vcf.gz -p 8
Somatic predictions from RNA-Seq are validated against DNA-Seq on the fly.
> rnaindel PredictIndels -i input.bam -o output.vcf -r ref.fa -d data_dir -t tumor.dna.bam -n normal.dna.bam -p 8
Leverage all resources for best performance.
> rnaindel PredictIndels -i input.bam -o output.vcf -r ref.fa -d data_dir -v mutect2.vcf.gz -t tumor.dna.bam -n normal.dna.bam -p 8
- -i: input STAR-mapped BAM file (required)
- -o: output VCF file (required)
- -r: reference genome FASTA file (required)
- -d: data directory containing trained models and databases (required)
- -v: VCF file (must be .vcf.gz + index) from user's caller (default: None)
- -p: number of cores (default: 1)
other options:
- -t: Tumor DNA-Seq BAM file (default: None)
- -n: Normal DNA-Seq BAM file (default: None)
- -q: STAR mapping quality MAPQ for unique mappers (default: 255)
- -m: maximum heap space (default: 6000m)
- --region: target genomic region. specify by chrN:start-stop (default: None)
- --pon: user's defined list of non-somatic calls such as PanelOfNormals. Supply as .vcf.gz with index (default: None)
- --include-all-external-calls: set to include all indels in VCF file supplied by -v. (default: False. Use only calls with PASS in FILTER)
- --skip-homopolyer-outlier-analysis: no outlier analysis for homopolymer indels (repeat > 4) performed if set. (default: False)
- --safety-mode: deactivate parallelism at realignment step. may be required to run with -p > 1 on some platforms. (default: False)
- --deactivate-sensitive-mode: deactivate additional realignments for soft-clipped reads. (default: False)
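When running many samples, the PredictIndels options listed above can be assembled programmatically. The following is a minimal sketch, assuming only the documented flags (-i, -o, -r, -d, -v, -t, -n, -p, --region); `build_predict_indels_cmd` is a hypothetical helper, not part of RNAIndel, and its input checks are illustrative.

```python
import re
import shlex


def build_predict_indels_cmd(bam, out_vcf, ref_fa, data_dir, external_vcf=None,
                             tumor_dna_bam=None, normal_dna_bam=None,
                             cores=1, region=None):
    """Assemble an `rnaindel PredictIndels` argument list (hypothetical helper)."""
    cmd = ["rnaindel", "PredictIndels",
           "-i", bam, "-o", out_vcf, "-r", ref_fa, "-d", data_dir]
    if external_vcf is not None:
        # The README requires a .vcf.gz file with an index for -v.
        if not external_vcf.endswith(".vcf.gz"):
            raise ValueError("-v expects a .vcf.gz file with an index")
        cmd += ["-v", external_vcf]
    if tumor_dna_bam is not None:
        cmd += ["-t", tumor_dna_bam]
    if normal_dna_bam is not None:
        cmd += ["-n", normal_dna_bam]
    if region is not None:
        # Documented format: chrN:start-stop
        if not re.fullmatch(r"[^:\s]+:\d+-\d+", region):
            raise ValueError("--region must look like chrN:start-stop")
        cmd += ["--region", region]
    cmd += ["-p", str(cores)]
    return cmd


cmd = build_predict_indels_cmd(
    "input.bam", "output.vcf", "ref.fa", "data_dir",
    external_vcf="mutect2.vcf.gz",
    tumor_dna_bam="tumor.dna.bam", normal_dna_bam="normal.dna.bam",
    cores=8,
)
# Print the command for review before running it.
print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` once RNAIndel is installed; building a list rather than a string avoids shell quoting problems with unusual file names.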
Using pediatric tumor RNA-Seq samples (SJC-DS-1003, n=77), run time and memory consumption were benchmarked for ensemble calling with 8 cores (i.e., -p 8) on a server with a 32-core AMD EPYC 7542 CPU @ 2.90 GHz.
| | Run time (wall) | Max memory |
|---|---|---|
| median | 374 sec | 18.6 GB |
| max | 1388 sec | 23.5 GB |
Users can train RNAIndel with their own training set.
Check occurrence to filter probable false positives.
- kohei.hagiwara[AT]stjude.org
Published in Bioinformatics