Toolkit to analyze genomic variation data, built on the GATK with Clojure

grendon/bcbio.variation

 
 


bcbio.variation

A Clojure interface to the Genome Analysis Toolkit (GATK) to analyze variant data in VCF files. It supports scoring for the Archon Genomic X PRIZE competition but is also a general framework for variant file comparison.


Usage

Setup

Requires Java 1.6 and Leiningen.

$ lein deps

Generate summary of concordance between variant calls

A YAML configuration file specifies the variant files for comparison. The project contains example configuration and associated variant files that demonstrate the features of the library.

An example of scoring a phased diploid genome against a haploid reference genome:

$ lein run :compare config/reference-grading.yaml

An example of assessing variant calls produced by different calling algorithms:

$ lein run :compare config/method-comparison.yaml

Web interface

A web interface automates the process of preparing configuration files and running a variant comparison:

$ lein run :web config/web-processing.yaml

Run GATK walker for variant statistics

$ lein uberjar
$ java -jar bcbio.variation-0.0.1-SNAPSHOT-standalone.jar -T VcfSimpleStats \
  -R test/data/GRCh37.fa --variant test/data/gatk-calls.vcf --out test.png

Run custom GATK annotator

$ lein uberjar
$ java -jar bcbio.variation-0.0.1-SNAPSHOT-standalone.jar -T VariantAnnotator \
   -A MeanNeighboringBaseQuality -R test/data/GRCh37.fa -I test/data/aligned-reads.bam \
   --variant test/data/gatk-calls.vcf -o annotated-file.vcf

Configuration file

A YAML configuration file defines targets for comparison processing. The example files for reference grading (config/reference-grading.yaml) and comparison of calling methods (config/method-comparison.yaml) provide starting points. The available options are:

dir:
  base: Base directory to allow use of relative paths (optional).
  out: Working directory to write output.
  prep: Prep directory where files will be pre-processed.
experiments: # one or more experiments
 - sample: Name of current sample.
   ref: Reference genome in FASTA format.
   intervals: Intervals to process in BED format (optional).
   align: Alignments for all calls in BAM format (optional).
   summary-level: Amount of summary information to provide,
                  [full,quick] (default: full).
   approach: Type of comparison to do, [compare,grade] (default: compare).
   calls: # two or more calls to compare
     - name: Name of call type
       file: One or more input files in VCF format
       align: Alignment for specific call in BAM format (optional).
       ref: Reference genome if different than experiment ref (optional)
       intervals: Genome intervals to process in BED format (optional).
       refcalls: Add reference calls if has alignment info (boolean; default false).
       annotate: Annotate calls with GATK annotations (boolean; default false).
       normalize: Normalize MNPs and indels (boolean; default true).
       prep: Prep with in-order chromosomes and sample names (boolean; default false).
       preclean: Remove problematic characters from input VCFs
                 (boolean; default false).
       remove-refcalls: Remove reference, non-variant calls
                        (boolean; default false).
       make-haploid: Convert a set of diploid calls to haploid variants
                     (boolean; default false).
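
As an illustration of how the options above fit together, here is a minimal sketch of a filled-in configuration comparing two call sets. The sample name, call names, and every path are hypothetical placeholders rather than files shipped with the project. Only keys from the option list above are used, and the shipped example configurations may be organized differently.

```yaml
# Hypothetical example: the sample name, call names and all paths are placeholders.
dir:
  base: /path/to/project        # relative paths below resolve against this
  out: work/compare             # output is written here
  prep: work/prep               # pre-processed inputs are written here
experiments:
  - sample: Sample1
    ref: genomes/GRCh37.fa
    intervals: regions/targets.bed
    summary-level: quick
    approach: compare
    calls:                      # two or more call sets to compare
      - name: caller-a
        file: variants/caller-a.vcf
        prep: true
      - name: caller-b
        file: variants/caller-b.vcf
        prep: true
        preclean: true
```

Saved as, say, config/my-comparison.yaml (again a hypothetical name), a file like this would be passed to the `lein run :compare` command shown earlier in place of the shipped example configuration.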

License

The code is freely available under the MIT license.
