Detection of GFF3 format errors
The GFF3 format is flexible and easy to use for most biologists, but this flexibility also allows many errors to be introduced. This QC program aims to detect over 50 types of formatting errors. Please note that the program suite currently works best for protein-coding genes in 'canonical' Sequence Ontology style, but we will add new functionality for non-coding features in the near future.
Errors are detected by reviewing three types of feature sets in a GFF3 file, and thus are grouped into three categories (Error category – feature type):
- Intra-model errors (Ema) – multiple features within a model
- Inter-model errors (Emr) – multiple features across models
- Single feature errors (Esf) – each single feature
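
Because every error code begins with its category prefix, a report can be grouped by category with a simple lookup. The sketch below is illustrative only: `CATEGORY_BY_PREFIX` and `error_category` are hypothetical names and are not part of gff-QC.py.

```python
# Hypothetical lookup table built from the three prefixes listed above.
CATEGORY_BY_PREFIX = {
    "Ema": "Intra-model",
    "Emr": "Inter-model",
    "Esf": "Single feature",
}


def error_category(error_code: str) -> str:
    """Return the category name for an error code such as 'Ema0003'."""
    prefix = error_code[:3]
    if prefix not in CATEGORY_BY_PREFIX:
        raise ValueError(f"Unrecognized error code: {error_code!r}")
    return CATEGORY_BY_PREFIX[prefix]


print(error_category("Esf0022"))
```
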
Below we list all errors currently considered by gff-QC.py, including the error code, the error tag (a brief explanation of the error), and a note on whether the error is fully implemented in gff-QC.py.
View the gff-QC.py readme for instructions on how to run the program.
The error category 'Intra-model' collects formatting errors that can be found by jointly considering multiple features within a gene model, such as gene, mRNA, exon, and CDS features. Errors in this category are given an 'Error_Code' starting with 'Ema'.
Error_Code | Error_Tag | Note |
---|---|---|
Ema0001 | Parent feature start and end coordinates exceed those of child features | Done |
Ema0002 | Protein sequence contains internal stop codons | Done |
Ema0003 | This feature is not contained within the parent feature coordinates | Done |
Ema0004 | Incomplete gene feature that should contain at least one mRNA, exon, and CDS | Done |
Ema0005 | Pseudogene has invalid child feature type | Done |
Ema0006 | Wrong phase | Done |
Ema0007 | CDS and parent feature on different strands | Done |
Ema0008 | Warning for distinct isoforms that do not share any regions | Done |
Ema0009 | Incorrectly merged gene parent? Isoforms that do not share coding sequences are found | Done |
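
To illustrate what "jointly considering multiple features within a gene model" can look like, here is a minimal sketch of a containment check in the spirit of Ema0003. The `Feature` record and the `children_outside_parent` function are hypothetical and are not taken from gff-QC.py, which may implement this check differently.

```python
from typing import List, NamedTuple


class Feature(NamedTuple):
    # Hypothetical minimal record; not the data model used by gff-QC.py.
    feature_id: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def children_outside_parent(parent: Feature, children: List[Feature]) -> List[Feature]:
    """Return the children whose coordinates are not contained in the parent's."""
    return [c for c in children if c.start < parent.start or c.end > parent.end]


mrna = Feature("mRNA1", 100, 500)
exons = [Feature("exon1", 100, 200), Feature("exon2", 450, 520)]

for exon in children_outside_parent(mrna, exons):
    # exon2 ends at 520, past the mRNA end at 500, so it is reported.
    print(f"Ema0003\t{exon.feature_id}\tThis feature is not contained within the parent feature coordinates")
```
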
The error category 'Inter-model' collects formatting errors that can be found by comparing multiple gene models. Errors in this category are given an 'Error_Code' starting with 'Emr'.
Error_Code | Error_Tag | Note |
---|---|---|
Emr0001 | Duplicate transcript found | Done |
Emr0002 | Incorrectly split gene parent? | Done |
Emr0003 | Duplicate ID | Done |
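
As an example of a check that compares features across the whole file, the sketch below counts `ID` attribute values and reports any that occur more than once, in the spirit of Emr0003. The function names are hypothetical, and the logic is an assumption rather than the gff-QC.py implementation. In particular, GFF3 allows a discontinuous feature (such as a multi-line CDS) to share one ID across several lines, and this sketch makes no attempt to exempt that case.

```python
from collections import Counter
from typing import Iterable, List, Optional


def attribute_value(attributes: str, tag: str) -> Optional[str]:
    """Return the value of `tag` from a GFF3 column-9 string, or None if absent."""
    for pair in attributes.split(";"):
        key, sep, value = pair.partition("=")
        if sep and key.strip() == tag:
            return value.strip()
    return None


def duplicate_ids(lines: Iterable[str]) -> List[str]:
    """Return a sorted list of ID values that appear on more than one feature line."""
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments, and directives
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9:
            continue  # malformed lines belong to the single-feature checks
        feature_id = attribute_value(fields[8], "ID")
        if feature_id is not None:
            counts[feature_id] += 1
    return sorted(i for i, n in counts.items() if n > 1)


example = [
    "##gff-version 3",
    "chr1\t.\tgene\t100\t500\t.\t+\t.\tID=gene1",
    "chr1\t.\tmRNA\t100\t500\t.\t+\t.\tID=mRNA1;Parent=gene1",
    "chr1\t.\tgene\t900\t1200\t.\t-\t.\tID=gene1",
]
print(duplicate_ids(example))
```
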
The error category 'Single Feature' collects formatting errors that can be found by searching the GFF3 file line by line. Errors in this category are given an 'Error_Code' starting with 'Esf'.
Error_Code | Error_Tag | Note |
---|---|---|
Esf0001 | Feature type may need to be changed to pseudogene | Done |
Esf0002 | Start/Stop is not a valid 1-based integer coordinate | Done |
Esf0003 | strand information missing | Done |
Esf0004 | Seqid not found in any ##sequence-region | Done |
Esf0005 | Start is less than the ##sequence-region start | Done |
Esf0006 | End is greater than the ##sequence-region end | Done |
Esf0007 | Seqid not found in the embedded ##FASTA | Done |
Esf0008 | End is greater than the embedded ##FASTA sequence length | Done |
Esf0009 | Found Ns in a feature using the embedded ##FASTA | Done |
Esf0010 | Seqid not found in the external FASTA file | Done |
Esf0011 | End is greater than the external FASTA sequence length | Done |
Esf0012 | Found Ns in a feature using the external FASTA | Done |
Esf0013 | White chars not allowed at the start of a line | Done |
Esf0014 | "##gff-version" missing from the first line | Done |
Esf0015 | Expecting certain fields in the feature | Done |
Esf0016 | ##sequence-region seqid may only appear once | Done |
Esf0017 | Start/End is not a valid integer | Done |
Esf0018 | Start is not less than or equal to end | Done |
Esf0019 | Version is not "3" | Done |
Esf0020 | Version is not a valid integer | Done |
Esf0021 | Unknown directive | Done |
Esf0022 | Features should contain 9 fields | Done |
Esf0023 | escape certain characters | Done |
Esf0024 | Score is not a valid floating point number | Done |
Esf0025 | Strand has illegal characters | Done |
Esf0026 | Phase is not 0, 1, or 2, or not a valid integer | Done |
Esf0027 | Phase is required for all CDS features | Done |
Esf0028 | Attributes must escape the percent (%) sign and any control characters | Done |
Esf0029 | Attributes must contain one and only one equal (=) sign | Done |
Esf0030 | Empty attribute tag | Done |
Esf0031 | Empty attribute value | Done |
Esf0032 | Found multiple attribute tags | Done |
Esf0033 | Found ", " in an attribute, possible unescaped | Done |
Esf0034 | attribute has identical values (count, value) | Done |
Esf0035 | attribute has unresolved forward reference | Done |
Esf0036 | Value of an attribute contains unescaped "," | Done |
Esf0037 | Target attribute should have 3 or 4 values | Done |
Esf0038 | Start/End value of Target attribute is not a valid integer coordinate | Done |
Esf0039 | Strand value of Target attribute has illegal characters | Done |
Esf0040 | Value of Is_circular attribute is not "true" | Done |
Esf0041 | Unknown reserved (uppercase) attribute | Done |
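
Single-feature checks can be run on each line in isolation. The sketch below covers a handful of the codes from the table above (Esf0017, Esf0018, Esf0022, Esf0025, Esf0026, Esf0027). It is a simplified illustration: `check_feature_line` is a hypothetical function, and which codes gff-QC.py actually reports for a given line, and in what order, is an assumption here.

```python
from typing import List, Tuple

VALID_STRANDS = {"+", "-", ".", "?"}
VALID_PHASES = {"0", "1", "2"}


def is_integer(text: str) -> bool:
    """True for a non-empty string made only of ASCII digits."""
    return text.isascii() and text.isdigit()


def check_feature_line(line: str) -> List[Tuple[str, str]]:
    """Return (error_code, error_tag) pairs found on one GFF3 feature line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        # Without nine columns the remaining checks cannot be applied.
        return [("Esf0022", "Features should contain 9 fields")]

    _seqid, _source, ftype, start, end, _score, strand, phase, _attrs = fields
    errors = []

    if not (is_integer(start) and is_integer(end)):
        errors.append(("Esf0017", "Start/End is not a valid integer"))
    elif int(start) > int(end):
        errors.append(("Esf0018", "Start is not less than or equal to end"))

    if strand not in VALID_STRANDS:
        errors.append(("Esf0025", "Strand has illegal characters"))

    if ftype == "CDS" and phase == ".":
        errors.append(("Esf0027", "Phase is required for all CDS features"))
    elif phase != "." and phase not in VALID_PHASES:
        errors.append(("Esf0026", "Phase is not 0, 1, or 2, or not a valid integer"))

    return errors


bad_line = "chr1\t.\tCDS\t300\t200\t.\tx\t.\tID=cds2;Parent=mRNA1"
for code, tag in check_feature_line(bad_line):
    print(f"{code}\t{tag}")
```

Returning early on a wrong column count keeps the later checks from misreading shifted columns; a real validator may choose to report more per line.
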