Merge pull request #34 from Illumina/GT-825
GT-825 v2.4a release
traxexx authored Sep 24, 2019
2 parents cd9a580 + 970796c commit ea342ac
Showing 20 changed files with 121 additions and 842 deletions.
13 changes: 6 additions & 7 deletions README.md
@@ -22,17 +22,16 @@

## <a name='Introduction'></a>Introduction

Accurate genotyping of known variants is a critical for analysis of whole-genome sequencing data.

Paragraph aims to facilitate these tasks by providing:
- an accurate genotyper for Structural Variations in short-read data
- a suite of graph-based tools to align and genotype complex events.
Accurate genotyping of known variants is a critical for the analysis of whole-genome sequencing data. Paragraph aims to facilitate this by providing an accurate genotyper for Structural Variations with short-read data.

Please reference Paragraph using:

- Chen, et al (2019) [Paragraph: A graph-based structural variant genotyper for short-read sequence data](https://www.biorxiv.org/content/10.1101/635011v1). *bioRxiv*. doi: https://doi.org/10.1101/635011
- Chen, et al (2019) [Paragraph: A graph-based structural variant genotyper for short-read sequence data](https://www.biorxiv.org/content/10.1101/635011v2). *bioRxiv*. doi: https://doi.org/10.1101/635011

(Second version uploaded at September 24, 2019)

Genotyping data in this paper can be found at [paper-data/download-instructions.txt](paper-data/download-instructions.txt)

Genotyping calls in this paper can be found at [paper-data/download-instructions.txt](paper-data/download-instructions.txt)

## <a name='Installation'></a>Installation

26 changes: 18 additions & 8 deletions paper-data/download-instructions.txt
@@ -1,15 +1,25 @@
Please use the following S3 link to download the output VCF from Paragraph manuscript:
Please use the following S3 link to download data.

Genotypes of HG002 Long-read ground truth (LRGT) SVs on the Illumina HiSeq X 34.5x bam (VCF format):
https://s3-us-west-1.amazonaws.com/paragraph-paper-data/hg002_sniffles_ccs.paragraph.vcf.gz
The VCF for long read ground truth (with PBSV genotypes):
https://paragraph-paper-data.s3-us-west-1.amazonaws.com/sample3-sw.sorted.pass.vcf.gz

Note that chrX and chrY were excluded in our analysis

HG002 Long-read ground truth (LRGT) SVs on 100 individuals from Polaris (JSON format):
Site only:
https://s3-us-west-1.amazonaws.com/paragraph-paper-data/sniffles_ccs_polaris.filtered.autosome.del_ins.json.gz
The VCF for all SVs (LRGT + Clustered SVs):
https://paragraph-paper-data.s3-us-west-1.amazonaws.com/sample3-sw.sorted.vcf.gz

Genotypes included:
https://s3-us-west-1.amazonaws.com/paragraph-paper-data/sniffles_ccs_polaris.json.gz
In the filter field, we have "PASS" for LRGT SVs and "NEARBY" for clustered SVs. Note that chrX and chrY were excluded in all analysis in our manuscript.

Paragraph genotypes of LRGT and clustered SVs for:
NA24385/HG002:
https://paragraph-paper-data.s3-us-west-1.amazonaws.com/HG002.paragraph.vcf.gz
NA12878:
https://paragraph-paper-data.s3-us-west-1.amazonaws.com/NA12878.paragraph.vcf.gz
NA24361/HG005:
https://paragraph-paper-data.s3-us-west-1.amazonaws.com/HG005.paragraph.vcf.gz

Paragraph genotyping summary of LRGT and clustered SVs in the 100 unrelated individuals in the Polaris population:
https://paragraph-paper-data.s3-us-west-1.amazonaws.com/polaris.summary.csv

Sample name map (S3 ID to regular ID):
https://s3-us-west-1.amazonaws.com/paragraph-paper-data/sample_map.txt
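
The instructions above say that the FILTER field separates LRGT SVs ("PASS") from clustered SVs ("NEARBY") and that chrX and chrY were excluded from the analysis. A minimal Python sketch of how a downloaded VCF could be tallied along those lines follows. It assumes a standard tab-delimited VCF and the chromosome names `chrX` and `chrY`; the function names are hypothetical and are not part of Paragraph.

```python
import gzip
from collections import Counter
from typing import Iterable, Iterator, Tuple

# Assumption: sex chromosomes are named with the "chr" prefix.
EXCLUDED_CHROMS = {"chrX", "chrY"}


def iter_records(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (CHROM, FILTER) for every non-header, non-empty VCF line."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        # VCF columns: CHROM POS ID REF ALT QUAL FILTER INFO ...
        yield fields[0], fields[6]


def count_by_filter(lines: Iterable[str]) -> Counter:
    """Count records per FILTER value, skipping the excluded chromosomes."""
    counts: Counter = Counter()
    for chrom, filter_value in iter_records(lines):
        if chrom in EXCLUDED_CHROMS:
            continue
        counts[filter_value] += 1
    return counts


def count_file(path: str) -> Counter:
    """Same tally for a gzip-compressed VCF on disk."""
    with gzip.open(path, "rt") as handle:
        return count_by_filter(handle)


# Made-up records for illustration only; they are not taken from the paper data.
example = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t100\tsv1\tA\t<DEL>\t.\tPASS\t.\n",
    "chr1\t200\tsv2\tA\t<INS>\t.\tNEARBY\t.\n",
    "chrX\t300\tsv3\tA\t<DEL>\t.\tPASS\t.\n",
]
example_counts = count_by_filter(example)
```

For a downloaded file, `count_file("sample3-sw.sorted.vcf.gz")` would return the same kind of tally.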
6 changes: 3 additions & 3 deletions share/test-data/genotyping_test_2/expected-genotypes.vcf
@@ -1,3 +1,3 @@
chrA 1500 swap1 GCTGCCCCTT GCTAGTAACTT . PASS GRMPY_ID=swaps.vcf@42527ba8a8840f1c955f8e6879b567988bbf858febd25ba5b4555895dbbcfef7:1 GT:OLD_GT:DP:FT:AD:ADF:ADR:PL 0/0:0/0:1100:PASS:604,4:604,4:0,0:0,3364,32458
chrB 1499 swap2 TAGGCCATACG TTCAGGTTGTCTTATGCTTGGCATCGTTCTT . PASS GRMPY_ID=swaps.vcf@42527ba8a8840f1c955f8e6879b567988bbf858febd25ba5b4555895dbbcfef7:2 GT:OLD_GT:DP:FT:AD:ADF:ADR:PL 1/1:1/1:952:PASS:0,596:0,596:0,0:32425,3031,0
chrC 1500 swap3 CGCGTTGTAAGCTACCATATTCAATCTGTGCCAGGGATCGAGCCACAGGCACCGCTCAATCTCGCGGGAGATTGTGCAAAGAGTCTTACCTTTCGTCGACCTCCGCCTCGCTCGTGAATCTTGCGATCGATTGAAAGTCACGGGTAGAGTGATGTTCGGGCGAATCAGACAGGCAGATGCAATGGAGGTTCCCGGATAGT CCGGAGATACCCTCTGTCTCCGCTAACATTTCCCCGCGGACAAAATTTGTCGGCTGGGAGGAATAGGTGCAAACGCATAATATACCCCTCTTACTTTTTGTTAGGGTCTAGTCCGAATCTAAAAAATGACTAAGGACTCTCAGAGTGATGGATATATGCCTCGCGACGCCGATCTGTGCTTATGTCGCAGCTTTGGCATCAAACCAGTTTCACATACCCTGCCTAAAAGATTCCCATACTGCGAAATCGCAAGATTGTACAAGTTGTAGTCTGTGCGCCAGCGTGAGCACGGCACTCGGT . PASS GRMPY_ID=swaps.vcf@42527ba8a8840f1c955f8e6879b567988bbf858febd25ba5b4555895dbbcfef7:3 GT:OLD_GT:DP:FT:AD:ADF:ADR:PL 0/1:0/1:1008:PASS:538,538:538,538:0,0:4132,0,4132
chrA 1500 swap1 GCTGCCCCTT GCTAGTAACTT . PASS GRMPY_ID=swaps.vcf@42527ba8a8840f1c955f8e6879b567988bbf858febd25ba5b4555895dbbcfef7:1 GT:OLD_GT:DP:FT:AD:ADF:ADR:PL 0/0:0/0:1100:PASS:2396,10:2396,10:0,0:0,3364,32458
chrB 1499 swap2 TAGGCCATACG TTCAGGTTGTCTTATGCTTGGCATCGTTCTT . PASS GRMPY_ID=swaps.vcf@42527ba8a8840f1c955f8e6879b567988bbf858febd25ba5b4555895dbbcfef7:2 GT:OLD_GT:DP:FT:AD:ADF:ADR:PL 1/1:1/1:952:PASS:0,1788:0,1788:0,0:32425,3031,0
chrC 1500 swap3 CGCGTTGTAAGCTACCATATTCAATCTGTGCCAGGGATCGAGCCACAGGCACCGCTCAATCTCGCGGGAGATTGTGCAAAGAGTCTTACCTTTCGTCGACCTCCGCCTCGCTCGTGAATCTTGCGATCGATTGAAAGTCACGGGTAGAGTGATGTTCGGGCGAATCAGACAGGCAGATGCAATGGAGGTTCCCGGATAGT CCGGAGATACCCTCTGTCTCCGCTAACATTTCCCCGCGGACAAAATTTGTCGGCTGGGAGGAATAGGTGCAAACGCATAATATACCCCTCTTACTTTTTGTTAGGGTCTAGTCCGAATCTAAAAAATGACTAAGGACTCTCAGAGTGATGGATATATGCCTCGCGACGCCGATCTGTGCTTATGTCGCAGCTTTGGCATCAAACCAGTTTCACATACCCTGCCTAAAAGATTCCCATACTGCGAAATCGCAAGATTGTACAAGTTGTAGTCTGTGCGCCAGCGTGAGCACGGCACTCGGT . PASS GRMPY_ID=swaps.vcf@42527ba8a8840f1c955f8e6879b567988bbf858febd25ba5b4555895dbbcfef7:3 GT:OLD_GT:DP:FT:AD:ADF:ADR:PL 0/1:0/1:1008:PASS:1762,1426:1762,1426:0,0:4132,0,4132
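
The old and new expected genotypes differ only inside the colon-separated sample column. As a rough aid for reading such records, the Python sketch below pairs the FORMAT keys with the sample values and reports which keys changed for `swap1`. The helper name `parse_sample` and the choice of which keys to parse as integer lists are assumptions made for illustration.

```python
from typing import Dict, List, Union

Value = Union[str, List[int]]

# Assumption: these keys hold comma-separated integer lists, as in the records above.
INT_LIST_KEYS = {"AD", "ADF", "ADR", "PL"}


def parse_sample(format_field: str, sample_field: str) -> Dict[str, Value]:
    """Pair FORMAT keys with the values of one sample column."""
    keys = format_field.split(":")
    values = sample_field.split(":")
    if len(keys) != len(values):
        raise ValueError("FORMAT and sample column have different lengths")
    parsed: Dict[str, Value] = {}
    for key, value in zip(keys, values):
        if key in INT_LIST_KEYS and value != ".":
            parsed[key] = [int(x) for x in value.split(",")]
        else:
            # Everything else (and missing values) stays a raw string.
            parsed[key] = value
    return parsed


FORMAT = "GT:OLD_GT:DP:FT:AD:ADF:ADR:PL"
old = parse_sample(FORMAT, "0/0:0/0:1100:PASS:604,4:604,4:0,0:0,3364,32458")
new = parse_sample(FORMAT, "0/0:0/0:1100:PASS:2396,10:2396,10:0,0:0,3364,32458")

# Keys whose values differ between the old and new swap1 record.
changed = sorted(key for key in old if old[key] != new[key])
print(changed)  # ['AD', 'ADF']
```

For `swap1`, the genotype, depth, filter, and likelihoods are unchanged; only the allele depth fields differ.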
100 changes: 42 additions & 58 deletions share/test-data/paragraph/insertions/insertion-test-1.json
@@ -2,89 +2,75 @@
"edges": [
{
"from": "source",
"name": "source_chr22:32-31:TGAGC",
"to": "chr22:32-31:TGAGC"
"name": "source_chr20:149014-149013:CTGAC",
"to": "chr20:149014-149013:CTGAC"
},
{
"from": "source",
"name": "source_ref-chr22:26-30",
"to": "ref-chr22:26-30"
"name": "source_ref-chr20:149008-149013",
"to": "ref-chr20:149008-149013"
},
{
"from": "chr22:32-31:TGAGC",
"name": "chr22:32-31:TGAGC_ref-chr22:32-36",
"from": "chr20:149014-149013:CTGAC",
"name": "chr20:149014-149013:CTGAC_ref-chr20:149014-149018",
"sequences": [
"20482:1"
"HG002_pbsv.INS.49079:1"
],
"to": "ref-chr22:32-36"
"to": "ref-chr20:149014-149018"
},
{
"from": "ref-chr22:26-30",
"name": "ref-chr22:26-30_ref-chr22:31-31",
"from": "ref-chr20:149008-149013",
"name": "ref-chr20:149008-149013_chr20:149014-149013:CCATA",
"sequences": [
"20482:0",
"REF"
"HG002_pbsv.INS.49079:1"
],
"to": "ref-chr22:31-31"
"to": "chr20:149014-149013:CCATA"
},
{
"from": "ref-chr22:31-31",
"name": "ref-chr22:31-31_ref-chr22:32-36",
"from": "ref-chr20:149008-149013",
"name": "ref-chr20:149008-149013_ref-chr20:149014-149018",
"sequences": [
"20482:0",
"HG002_pbsv.INS.49079:0",
"REF"
],
"to": "ref-chr22:32-36"
"to": "ref-chr20:149014-149018"
},
{
"from": "ref-chr22:31-31",
"name": "ref-chr22:31-31_chr22:32-31:AGGGC",
"sequences": [
"20482:1"
],
"to": "chr22:32-31:AGGGC"
},
{
"from": "ref-chr22:32-36",
"name": "ref-chr22:32-36_sink",
"from": "chr20:149014-149013:CCATA",
"name": "chr20:149014-149013:CCATA_sink",
"to": "sink"
},
{
"from": "chr22:32-31:AGGGC",
"name": "chr22:32-31:AGGGC_sink",
"from": "ref-chr20:149014-149018",
"name": "ref-chr20:149014-149018_sink",
"to": "sink"
}
],
"model_name": "Graph from insertions/insertion-test-1.vcf",
"model_name": "Graph from ../paragraph-tools/share/test-data/paragraph/insertions/insertion-test-1.vcf",
"nodes": [
{
"name": "source",
"sequence": "NNNNNNNNNN"
},
{
"name": "chr22:32-31:TGAGC",
"position": "chr22:32-31",
"sequence": "TGAGC"
},
{
"name": "ref-chr22:26-30",
"reference": "chr22:26-30",
"reference_sequence": "AGACC"
"name": "chr20:149014-149013:CTGAC",
"position": "chr20:149014-149013",
"sequence": "CTGAC"
},
{
"name": "ref-chr22:31-31",
"reference": "chr22:31-31",
"reference_sequence": "A"
"name": "ref-chr20:149008-149013",
"reference": "chr20:149008-149013",
"reference_sequence": "GAACAA"
},
{
"name": "ref-chr22:32-36",
"reference": "chr22:32-36",
"reference_sequence": "G"
"name": "chr20:149014-149013:CCATA",
"position": "chr20:149014-149013",
"sequence": "CCATA"
},
{
"name": "chr22:32-31:AGGGC",
"position": "chr22:32-31",
"sequence": "AGGGC"
"name": "ref-chr20:149014-149018",
"reference": "chr20:149014-149018",
"reference_sequence": "CCATA"
},
{
"name": "sink",
@@ -94,38 +80,36 @@
"paths": [
{
"nodes": [
"ref-chr22:26-30",
"ref-chr22:31-31",
"ref-chr22:32-36"
"ref-chr20:149008-149013",
"ref-chr20:149014-149018"
],
"path_id": "REF|1",
"sequence": "REF"
},
{
"nodes": [
"ref-chr22:26-30",
"ref-chr22:31-31",
"chr22:32-31:AGGGC"
"ref-chr20:149008-149013",
"chr20:149014-149013:CCATA"
],
"path_id": "ALT|1",
"sequence": "ALT"
},
{
"nodes": [
"chr22:32-31:TGAGC",
"ref-chr22:32-36"
"chr20:149014-149013:CTGAC",
"ref-chr20:149014-149018"
],
"path_id": "ALT|2",
"sequence": "ALT"
}
],
"sequencenames": [
"20482:0",
"20482:1",
"ALT",
"HG002_pbsv.INS.49079:0",
"HG002_pbsv.INS.49079:1",
"REF"
],
"target_regions": [
"chr22:26-36"
"chr20:149008-149018"
]
}
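
The graph JSON above lists `edges` with `from`/`to` node names and `paths` as ordered node lists. One simple consistency check on a file like this is that every consecutive pair of nodes in a path is joined by an edge. The Python sketch below does that on a trimmed copy of the updated graph. The function `missing_path_edges` is hypothetical, and this is not a description of how Paragraph itself validates graphs.

```python
from typing import Dict, List, Tuple


def missing_path_edges(graph: Dict) -> List[Tuple[str, str, str]]:
    """Return (path_id, from_node, to_node) for each path step lacking an edge."""
    edges = {(edge["from"], edge["to"]) for edge in graph["edges"]}
    missing = []
    for path in graph["paths"]:
        nodes = path["nodes"]
        for start, end in zip(nodes, nodes[1:]):
            if (start, end) not in edges:
                missing.append((path["path_id"], start, end))
    return missing


# Trimmed copy of the updated graph: only the fields the check reads are kept.
LEFT_REF = "ref-chr20:149008-149013"
RIGHT_REF = "ref-chr20:149014-149018"
INS_LEFT = "chr20:149014-149013:CCATA"
INS_RIGHT = "chr20:149014-149013:CTGAC"

graph = {
    "edges": [
        {"from": "source", "to": INS_RIGHT},
        {"from": "source", "to": LEFT_REF},
        {"from": INS_RIGHT, "to": RIGHT_REF},
        {"from": LEFT_REF, "to": INS_LEFT},
        {"from": LEFT_REF, "to": RIGHT_REF},
        {"from": INS_LEFT, "to": "sink"},
        {"from": RIGHT_REF, "to": "sink"},
    ],
    "paths": [
        {"path_id": "REF|1", "nodes": [LEFT_REF, RIGHT_REF]},
        {"path_id": "ALT|1", "nodes": [LEFT_REF, INS_LEFT]},
        {"path_id": "ALT|2", "nodes": [INS_RIGHT, RIGHT_REF]},
    ],
}

ok_result = missing_path_edges(graph)

# Dropping one edge makes the corresponding path step show up as missing.
broken = dict(graph, edges=[e for e in graph["edges"] if e["from"] != INS_RIGHT])
broken_result = missing_path_edges(broken)
```

To run the same check on a file, load it with `json.load` and pass the resulting dictionary to `missing_path_edges`.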
84 changes: 34 additions & 50 deletions share/test-data/paragraph/insertions/insertion-test-1.noas.json
@@ -2,74 +2,60 @@
"edges": [
{
"from": "source",
"name": "source_ref-chr22:26-30",
"to": "ref-chr22:26-30"
"name": "source_ref-chr20:149008-149013",
"to": "ref-chr20:149008-149013"
},
{
"from": "ref-chr22:26-30",
"name": "ref-chr22:26-30_ref-chr22:31-31",
"from": "ref-chr20:149008-149013",
"name": "ref-chr20:149008-149013_chr20:149014-149013:CCATATTTGGGAGGCAATTTTACCTGTTCTCAAGGCCGCATCTCTACCCCATCTCATGCGAATCCTGAC",
"sequences": [
"20482:0",
"REF"
"HG002_pbsv.INS.49079:1"
],
"to": "ref-chr22:31-31"
"to": "chr20:149014-149013:CCATATTTGGGAGGCAATTTTACCTGTTCTCAAGGCCGCATCTCTACCCCATCTCATGCGAATCCTGAC"
},
{
"from": "ref-chr22:31-31",
"name": "ref-chr22:31-31_chr22:32-31:AGGGCAAACATTCAGGACACAGCAGAGTATTGTTGTAATCCTATGTGAGC",
"from": "ref-chr20:149008-149013",
"name": "ref-chr20:149008-149013_ref-chr20:149014-149018",
"sequences": [
"20482:1"
],
"to": "chr22:32-31:AGGGCAAACATTCAGGACACAGCAGAGTATTGTTGTAATCCTATGTGAGC"
},
{
"from": "ref-chr22:31-31",
"name": "ref-chr22:31-31_ref-chr22:32-36",
"sequences": [
"20482:0",
"HG002_pbsv.INS.49079:0",
"REF"
],
"to": "ref-chr22:32-36"
"to": "ref-chr20:149014-149018"
},
{
"from": "chr22:32-31:AGGGCAAACATTCAGGACACAGCAGAGTATTGTTGTAATCCTATGTGAGC",
"name": "chr22:32-31:AGGGCAAACATTCAGGACACAGCAGAGTATTGTTGTAATCCTATGTGAGC_ref-chr22:32-36",
"from": "chr20:149014-149013:CCATATTTGGGAGGCAATTTTACCTGTTCTCAAGGCCGCATCTCTACCCCATCTCATGCGAATCCTGAC",
"name": "chr20:149014-149013:CCATATTTGGGAGGCAATTTTACCTGTTCTCAAGGCCGCATCTCTACCCCATCTCATGCGAATCCTGAC_ref-chr20:149014-149018",
"sequences": [
"20482:1"
"HG002_pbsv.INS.49079:1"
],
"to": "ref-chr22:32-36"
"to": "ref-chr20:149014-149018"
},
{
"from": "ref-chr22:32-36",
"name": "ref-chr22:32-36_sink",
"from": "ref-chr20:149014-149018",
"name": "ref-chr20:149014-149018_sink",
"to": "sink"
}
],
"model_name": "Graph from insertions/insertion-test-1.vcf",
"model_name": "Graph from ../paragraph-tools/share/test-data/paragraph/insertions/insertion-test-1.vcf",
"nodes": [
{
"name": "source",
"sequence": "NNNNNNNNNN"
},
{
"name": "ref-chr22:26-30",
"reference": "chr22:26-30",
"reference_sequence": "AGACC"
},
{
"name": "ref-chr22:31-31",
"reference": "chr22:31-31",
"reference_sequence": "A"
"name": "ref-chr20:149008-149013",
"reference": "chr20:149008-149013",
"reference_sequence": "GAACAA"
},
{
"name": "chr22:32-31:AGGGCAAACATTCAGGACACAGCAGAGTATTGTTGTAATCCTATGTGAGC",
"position": "chr22:32-31",
"sequence": "AGGGCAAACATTCAGGACACAGCAGAGTATTGTTGTAATCCTATGTGAGC"
"name": "chr20:149014-149013:CCATATTTGGGAGGCAATTTTACCTGTTCTCAAGGCCGCATCTCTACCCCATCTCATGCGAATCCTGAC",
"position": "chr20:149014-149013",
"sequence": "CCATATTTGGGAGGCAATTTTACCTGTTCTCAAGGCCGCATCTCTACCCCATCTCATGCGAATCCTGAC"
},
{
"name": "ref-chr22:32-36",
"reference": "chr22:32-36",
"reference_sequence": "G"
"name": "ref-chr20:149014-149018",
"reference": "chr20:149014-149018",
"reference_sequence": "CCATA"
},
{
"name": "sink",
@@ -79,31 +65,29 @@
"paths": [
{
"nodes": [
"ref-chr22:26-30",
"ref-chr22:31-31",
"ref-chr22:32-36"
"ref-chr20:149008-149013",
"ref-chr20:149014-149018"
],
"path_id": "REF|1",
"sequence": "REF"
},
{
"nodes": [
"ref-chr22:26-30",
"ref-chr22:31-31",
"chr22:32-31:AGGGCAAACATTCAGGACACAGCAGAGTATTGTTGTAATCCTATGTGAGC",
"ref-chr22:32-36"
"ref-chr20:149008-149013",
"chr20:149014-149013:CCATATTTGGGAGGCAATTTTACCTGTTCTCAAGGCCGCATCTCTACCCCATCTCATGCGAATCCTGAC",
"ref-chr20:149014-149018"
],
"path_id": "ALT|1",
"sequence": "ALT"
}
],
"sequencenames": [
"20482:0",
"20482:1",
"ALT",
"HG002_pbsv.INS.49079:0",
"HG002_pbsv.INS.49079:1",
"REF"
],
"target_regions": [
"chr22:26-36"
"chr20:149008-149018"
]
}
2 changes: 0 additions & 2 deletions share/test-data/paragraph/insertions/insertion-test-1.ref.fa

This file was deleted.

This file was deleted.
