A Clojure interface to the Genome Analysis Toolkit (GATK) to analyze variant data in VCF files. It supports scoring for the Archon Genomic X PRIZE competition but is also a general framework for variant file comparison.
Requires Java 1.6 and Leiningen. Fetch the project dependencies with:
$ lein deps
A YAML configuration file specifies the variant files for comparison. The project contains example configuration and associated variant files that demonstrate the features of the library.
An example of scoring a phased diploid genome against a haploid reference genome:
$ lein run :compare config/reference-grading.yaml
An example of assessing variant calls produced by different calling algorithms:
$ lein run :compare config/method-comparison.yaml
A web interface automates the process of preparing configuration files and running a variant comparison:
$ lein run :web config/web-processing.yaml
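To run a GATK walker directly, build a standalone jar and pass the walker name with -T. For example, running the VcfSimpleStats walker on a variant file and writing the output to test.png:
$ lein uberjar
$ java -jar bcbio.variation-0.0.1-SNAPSHOT-standalone.jar -T VcfSimpleStats \
    -R test/data/GRCh37.fa --variant test/data/gatk-calls.vcf --out test.png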
An example of annotating a variant file using VariantAnnotator with the MeanNeighboringBaseQuality annotation:
$ lein uberjar
$ java -jar bcbio.variation-0.0.1-SNAPSHOT-standalone.jar -T VariantAnnotator \
    -A MeanNeighboringBaseQuality -R test/data/GRCh37.fa -I test/data/aligned-reads.bam \
    --variant test/data/gatk-calls.vcf -o annotated-file.vcf
A YAML configuration file defines targets for comparison processing. The two example files, one for reference grading and one for comparison of calling methods, provide starting points. The available options are:
dir:
  base: Base directory to allow use of relative paths (optional).
  out: Working directory to write output.
  prep: Prep directory where files will be pre-processed.
experiments: # one or more experiments
  - sample: Name of current sample.
    ref: Reference genome in FASTA format.
    intervals: Intervals to process in BED format (optional).
    align: Alignments for all calls in BAM format (optional).
    summary-level: Amount of summary information to provide,
                   [full,quick] (default: full).
    approach: Type of comparison to do [compare,grade] (default: compare).
    calls: # two or more calls to compare
      - name: Name of call type.
        file: One or more input files in VCF format.
        align: Alignment for specific call in BAM format (optional).
        ref: Reference genome if different than experiment ref (optional).
        intervals: Genome intervals to process in BED format (optional).
        refcalls: Add reference calls if has alignment info (boolean; default false).
        annotate: Annotate calls with GATK annotations (boolean; default false).
        normalize: Normalize MNPs and indels (boolean; default true).
        prep: Prep with in-order chromosomes and sample names (boolean; default false).
        preclean: Remove problematic characters from input VCFs
                  (boolean; default false).
        remove-refcalls: Remove reference, non-variant calls
                         (boolean; default false).
        make-haploid: Convert a set of diploid calls to haploid variants
                      (boolean; default false).
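As a rough illustration of how these options fit together, the sketch below writes a minimal configuration with one experiment and two call sets. The output file name, the output directories, the sample name, and the second VCF path are placeholders invented for this example; only the GRCh37 reference and gatk-calls.vcf paths come from the commands shown earlier. The sample name would presumably need to match the sample in your own VCF files.

```shell
# Write a minimal comparison configuration.
# Values marked "placeholder" are hypothetical and should be replaced.
mkdir -p config
cat > config/example-compare.yaml <<'EOF'
dir:
  out: ./compare-out            # placeholder working directory
  prep: ./compare-out/prep      # placeholder pre-processing directory
experiments:
  - sample: Test1               # placeholder sample name
    ref: test/data/GRCh37.fa
    summary-level: full
    approach: compare
    calls:
      - name: gatk
        file: test/data/gatk-calls.vcf
      - name: other             # placeholder second call set
        file: path/to/other-calls.vcf
EOF
```

The resulting file can then be passed to the comparison entry point in the same way as the bundled examples, for instance with `lein run :compare config/example-compare.yaml`, once the placeholder paths point at real files.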
The code is freely available under the MIT license.