bcbio.variation

A Clojure interface to the Genome Analysis Toolkit (GATK) to analyze variant data in VCF files. It supports scoring for the Archon Genomic X PRIZE competition but is also a general framework for variant file comparison.

Usage

Setup

Requires Java 1.6 and Leiningen.

$ lein deps

Generate summary of concordance between variant calls

A YAML configuration file specifies the variant files for comparison. The project contains example configuration files and associated variant files that demonstrate the features of the library.

An example of scoring a phased diploid genome against a haploid reference genome:

$ lein run :compare config/reference-grading.yaml

An example of assessing variant calls produced by different calling algorithms:

$ lein run :compare config/method-comparison.yaml

Web interface

A web interface automates the process of preparing configuration files and running a variant comparison:

$ lein run :web config/web-processing.yaml

Run GATK walker for variant statistics

$ lein uberjar
$ java -jar bcbio.variation-0.0.1-SNAPSHOT-standalone.jar -T VcfSimpleStats \
  -R test/data/GRCh37.fa --variant test/data/gatk-calls.vcf --out test.png

Run custom GATK annotator

$ lein uberjar
$ java -jar bcbio.variation-0.0.1-SNAPSHOT-standalone.jar -T VariantAnnotator \
   -A MeanNeighboringBaseQuality -R test/data/GRCh37.fa -I test/data/aligned-reads.bam \
   --variant test/data/gatk-calls.vcf -o annotated-file.vcf

Configuration file

A YAML configuration file defines targets for comparison processing. The two example files, one for reference grading and one for comparison of calling methods, provide starting points. Details on the available options are below:

dir:
  base: Base directory to allow use of relative paths (optional).
  out: Working directory to write output.
  prep: Prep directory where files will be pre-processed.
experiments: # one or more experiments
 - sample: Name of current sample.
   ref: Reference genome in FASTA format.
   intervals: Intervals to process in BED format (optional).
   align: Alignments for all calls in BAM format (optional).
   summary-level: Amount of summary information to provide
                  [full,quick] (default: full).
   approach: Type of comparison to do [compare,grade] (default: compare).
   calls: # two or more calls to compare
     - name: Name of call type
       file: One or more input files in VCF format
       align: Alignment for specific call in BAM format (optional).
       ref: Reference genome if different than experiment ref (optional)
       intervals: Genome intervals to process in BED format (optional).
       refcalls: Add reference calls if has alignment info (boolean; default false).
       annotate: Annotate calls with GATK annotations (boolean; default false).
       normalize: Normalize MNPs and indels (boolean; default true).
       prep: Prep with in-order chromosomes and sample names (boolean; default false).
       preclean: Remove problematic characters from input VCFs
                 (boolean; default false).
       remove-refcalls: Remove reference, non-variant calls
                        (boolean; default false).
       make-haploid: Convert a set of diploid calls to haploid variants
                     (boolean; default false).
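
As an illustration of how these options fit together, here is a minimal sketch of a configuration comparing two call sets. It is not one of the bundled example files. The sample name, the output and prep directories, the call names, and the second VCF path are placeholders assumed for this example; only the `test/data/` paths appear elsewhere in this README.

```yaml
# Hypothetical minimal configuration; adjust names and paths to your data.
dir:
  base: .                 # relative paths below resolve against this directory
  out: work/out           # placeholder output directory
  prep: work/prep         # placeholder pre-processing directory
experiments:
  - sample: Sample1       # placeholder sample name
    ref: test/data/GRCh37.fa
    approach: compare
    summary-level: full
    calls:
      - name: gatk        # placeholder call name
        file: test/data/gatk-calls.vcf
        align: test/data/aligned-reads.bam
      - name: other       # placeholder call name
        file: path/to/other-calls.vcf   # placeholder second call set
        preclean: true
```

Saved as, say, `config/my-comparison.yaml` (a hypothetical filename), it would be passed to `lein run :compare` in the same way as the bundled example configurations.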

License

The code is freely available under the MIT license.