A computational workflow to identify Neoantigens from Structural Variations.
- Support BEDPE format as input
- Fix bugs related to NetMHCpan 4.1 (NetMHCpan 4.0 will no longer be supported by NeoSV.)
- Add a new parameter, erc, which enables users to filter neoantigens by EL (eluted ligand) rank
Neoantigens are considered ideal targets for immunotherapies because they are tumor-specific and not subject to immune tolerance. Previous studies have focused on single nucleotide variations (SNVs) and insertions and deletions (indels), leaving neoantigens from structural variations (SVs) poorly characterized.
We developed a Python package, NeoSV, to annotate the effect of SVs on proteins and predict potential neoantigens created by SVs. We have successfully applied NeoSV to thousands of tumor genomes from the Pan-Cancer Analysis of Whole Genomes (PCAWG) and constructed a comprehensive repertoire of SV-derived neoantigens. For more details, please read our paper:
Shi, Y., Jing, B. & Xi, R. Comprehensive analysis of neoantigens derived from structural variation across whole genomes from 2528 tumors. Genome Biol 24, 169 (2023)
- Python (>3.6). NeoSV should work with all versions of Python 3, but has only been tested on Python > 3.6.
- NetMHCpan (4.1). After you sign up and receive the download link, the accompanying guidance explains how to configure NetMHCpan.
- PyPI: if you already have Python and pip, you can install NeoSV directly via
pip install neosv
- Source code: we have noticed that pip sometimes does not install the neosv executable. In such cases, download the package, install biopython and pyensembl with pip, and then install NeoSV using
python setup.py install
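To check whether the pip installation placed the neosv executable on your PATH, a minimal sketch using only the Python standard library may help. The helper name `find_neosv` is hypothetical and is not part of NeoSV.

```python
import shutil
from typing import Optional


def find_neosv(executable: str = "neosv") -> Optional[str]:
    """Return the full path of the executable if it is on PATH, else None.

    `find_neosv` is an illustrative helper, not a NeoSV function.
    """
    return shutil.which(executable)


if __name__ == "__main__":
    path = find_neosv()
    if path is None:
        print("neosv not found on PATH; consider installing from source")
    else:
        print(f"neosv found at {path}")
```

If the function returns None after `pip install neosv`, the source installation described above is the fallback.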
NeoSV requires 3 types of inputs:
- Variant file: a file in VCF or BEDPE format that lists all SVs you want to analyze. Template files: test.sv.vcf and test.sv.bedpe
- HLA file: a file listing the HLA alleles, one per line. This usually includes six HLA alleles for an individual. Alleles should be in 4-digit format, like HLA-A*02:01. Template file: test.hla.txt
- Reference file: NeoSV uses pyensembl for SV annotation, so a pyensembl reference is needed. There are 3 ways to prepare it:
  - Pre-download by pyensembl (recommended): When you install NeoSV using pip or conda, pyensembl is installed automatically. You can then download the reference:
    export PYENSEMBL_CACHE_DIR=/custom/cache/dir  # specify the location for storing the reference
    pyensembl install --release <list of Ensembl release numbers> --species <species-name>  # download; for hg19 use release 75, for hg38 use release 96
  - Automatic download by NeoSV: If NeoSV does not detect a valid reference in --pyensembl-cache-dir, it will automatically download one to that folder. Please make sure your system has internet access, since some high-performance computing nodes have no network connection.
  - Manual preparation: This is useful if your data is not from human or mouse, in which case you need to prepare the reference yourself. A FASTA file and a GTF file are enough. For more details, please see the guidance. In addition, you need to confirm that the MHC alleles of that species are supported by NetMHCpan.
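Because the HLA file described above must list one allele per line in 4-digit format, it can be worth validating it before a long run. The following is a minimal sketch, not NeoSV's own parser: the function name `read_hla_alleles` and the regular expression are assumptions chosen to match examples such as HLA-A*02:01.

```python
import re
from typing import Iterable, List

# Assumed pattern for a 4-digit HLA allele such as HLA-A*02:01.
# This is an illustrative check, not the rule NeoSV or NetMHCpan applies.
HLA_PATTERN = re.compile(r"^HLA-[A-Z]+\*\d{2}:\d{2,3}$")


def read_hla_alleles(lines: Iterable[str]) -> List[str]:
    """Collect alleles from an iterable of lines, raising on malformed entries."""
    alleles = []
    for line in lines:
        allele = line.strip()
        if not allele:
            continue  # skip blank lines
        if not HLA_PATTERN.match(allele):
            raise ValueError(f"not in 4-digit format: {allele!r}")
        alleles.append(allele)
    return alleles
```

A file object can be passed directly, for example `with open("test.hla.txt") as fh: alleles = read_hla_alleles(fh)`.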
- Quick start: suppose you have a variant file named test.sv.vcf and an HLA file named test.hla.txt. Your pyensembl reference is human (Homo sapiens) release 75 and is located at /pyensembl/. A typical NeoSV command is then:
neosv -sf test.sv.vcf -hf test.hla.txt -np /path/to/netmhcpan -o test -p test -r 75
- Below is a detailed description of each parameter:

| Argument | Description |
| --- | --- |
| -h, --help | Show the help message. |
| -sf, --sv-file | Structural variants in VCF or BEDPE format. NeoSV automatically identifies the format from the file suffix. |
| -hf, --hla-file | HLA alleles (resolution: 4 digit), with one allele per line. |
| -np, --netmhc-path | Absolute path to the NetMHCpan executable. Skip this argument if NetMHCpan has been added to your PATH. |
| -o, --out | Folder for all result files. A new folder will be created if it does not exist. |
| -p, --prefix | Prefix added to all output files. |
| -r, --release | The Ensembl release to use. Valid release versions can be found here. The Ensembl releases for hg19/GRCh37 and hg38/GRCh38 are 75 and 96, respectively. |
| -gf, --gtf-file | GTF file for the reference, only needed when you want to prepare the Ensembl reference yourself. |
| -cf, --cdna-file | cDNA file for the reference, only needed when you want to prepare the Ensembl reference yourself. |
| -pd, --pyensembl-cache-dir | Directory for pyensembl cache files. If not specified, the platform-specific cache folder will be used. |
| -l, --epitope-lengths | Lengths of neoepitopes for which MHC binding is predicted. Default: 8-11. |
| -ic, --ic50-cutoff | Remove neoepitopes with IC50 (nM) above this value. Default: 500. |
| -brc, --ba-rank-cutoff | Remove neoepitopes with BA rank above this value. Default: 2. |
| -erc, --el-rank-cutoff | Remove neoepitopes with EL rank above this value. Default: 2. |
| -ct, --complete-transcript | Only complete transcripts will be considered for SV annotation. Default: True. |
| --anno-only | Only annotate SVs without predicting neoantigens. If this argument is added, --hla-file is not required, and you will only get the annotation result. |
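When NeoSV is called from a Python pipeline, assembling the argument list programmatically avoids shell-quoting problems. The sketch below only builds the command shown in the quick start; `build_neosv_command` is a hypothetical helper, and it uses only the short flags documented in the table above.

```python
from typing import List, Optional


def build_neosv_command(
    sv_file: str,
    hla_file: str,
    out_dir: str,
    prefix: str,
    release: int,
    netmhc_path: Optional[str] = None,
) -> List[str]:
    """Assemble a neosv argument list (hypothetical helper, not part of NeoSV)."""
    cmd = [
        "neosv",
        "-sf", sv_file,
        "-hf", hla_file,
        "-o", out_dir,
        "-p", prefix,
        "-r", str(release),
    ]
    # -np can be skipped when NetMHCpan is already on PATH.
    if netmhc_path is not None:
        cmd += ["-np", netmhc_path]
    return cmd


if __name__ == "__main__":
    print(" ".join(build_neosv_command(
        "test.sv.vcf", "test.hla.txt", "test", "test", 75, "/path/to/netmhcpan"
    )))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once neosv and NetMHCpan are installed.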
Several files will be generated in the output directory. The files you will most likely be interested in are those with the suffixes neoantigen.filtered.txt and anno.filtered.txt.
- {prefix}.neoantigen.filtered.txt stores all information on the candidate neoantigens:

| Column index | Column name | Content |
| --- | --- | --- |
| 1 | chrom1 | Chromosome of the 1st breakpoint |
| 2 | pos1 | Genomic position of the 1st breakpoint |
| 3 | gene1 | Gene name of the 1st breakpoint |
| 4 | transcript_id1 | Ensembl transcript ID of the 1st breakpoint |
| 5 | chrom2 | Chromosome of the 2nd breakpoint |
| 6 | pos2 | Genomic position of the 2nd breakpoint |
| 7 | gene2 | Gene name of the 2nd breakpoint |
| 8 | transcript_id2 | Ensembl transcript ID of the 2nd breakpoint |
| 9 | svpattern | |
| 10 | svtype | SV type according to the orientation of the junction reads. Values: DUP, DEL, TRA, t2tINV, or h2hINV. |
| 11 | frameshift | The effect on the open reading frame. Values: In-frame, Stop-gain, Stop-loss, Start-loss. |
| 12 | neoantigen | Amino acid sequence of the neoantigen |
| 13 | allele | HLA allele that binds to the neoantigen |
| 14 | affinity | Binding affinity (nM) provided by NetMHCpan |
| 15 | BA_rank | BA rank of the binding provided by NetMHCpan |
| 16 | EL_rank | EL rank of the binding provided by NetMHCpan. Since NetMHCpan 4.0, EL rank has been the most recommended feature for filtering neoantigens. |

- {prefix}.anno.filtered.txt stores all annotations of the SVs:

| Column index | Column name | Content |
| --- | --- | --- |
| 1 | chrom1 | Chromosome of the 1st breakpoint |
| 2 | pos1 | Genomic position of the 1st breakpoint |
| 3 | function1 | Location of the 1st breakpoint relative to a gene. Values: Intergenic, Intron, Exon. |
| 4 | gene1 | Gene name of the 1st breakpoint |
| 5 | transcript_id1 | Ensembl transcript ID of the 1st breakpoint |
| 6 | strand1 | Coding strand of the 1st gene. Values: +, -, None (if intergenic). |
| 7 | transcript_retain1 | The retained part of the transcript. I/i indicates an intron and E/e indicates an exon. Upper case means an intact exon/intron, while lower case means the exon/intron is truncated by this SV. |
| 8 | chrom2 | Chromosome of the 2nd breakpoint |
| 9 | pos2 | Genomic position of the 2nd breakpoint |
| 10 | function2 | Location of the 2nd breakpoint relative to a gene. Values: Intergenic, Intron, Exon. |
| 11 | gene2 | Gene name of the 2nd breakpoint |
| 12 | transcript_id2 | Ensembl transcript ID of the 2nd breakpoint |
| 13 | strand2 | Coding strand of the 2nd gene. Values: +, -, None (if intergenic). |
| 14 | transcript_retain2 | |
| 15 | svpattern | |
| 16 | svtype | SV type according to the orientation of the junction reads. Values: DUP, DEL, TRA, t2tINV, or h2hINV. |
| 17 | fusion | Whether this SV can lead to a functional gene fusion. Note that the fusion is not restricted to two-gene fusions. |

- {prefix}.net.in.txt stores the peptides fed to NetMHCpan.
- {prefix}.net.out.txt stores the raw output from NetMHCpan.
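For downstream analysis, the neoantigen table can be re-filtered with a stricter EL rank cutoff than the one used during the run. The sketch below rests on two assumptions that are not confirmed here: that {prefix}.neoantigen.filtered.txt is tab-delimited and that its first row is a header using the column names listed above. `filter_by_el_rank` is a hypothetical helper, not a NeoSV function.

```python
import csv
from typing import Dict, Iterable, List


def filter_by_el_rank(lines: Iterable[str], cutoff: float = 2.0) -> List[Dict[str, str]]:
    """Keep rows whose EL_rank is at or below the cutoff.

    Assumes tab-delimited input with a header row containing an
    EL_rank column; verify this against your own output file.
    """
    reader = csv.DictReader(lines, delimiter="\t")
    return [row for row in reader if float(row["EL_rank"]) <= cutoff]
```

A typical call would be `with open("test/test.neoantigen.filtered.txt", newline="") as fh: strong = filter_by_el_rank(fh, cutoff=0.5)`, where the path follows the -o and -p values from the quick start.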
NeoSV is licensed under the terms of the MIT license.