Under the recommendations for the type field, you say:
Best practice: Top-level feature types can include gene and pseudogene. Optionally, include a so_term_name attribute in column 9 to specify the child (type) of gene - e.g. protein_coding_gene, ncRNA_gene, miRNA_gene and snoRNA_gene (http://purl.obolibrary.org/obo/SO_0000704). Transcript features should include the appropriate SO term in column 3 (e.g. mRNA, snoRNA, etc).
I agree with all of this, but I think that the recommendation should be extended further to regularize non-transcribed features.
Right now non-transcribed features can be all over the map, and as a result become hard to parse. In the NCBI annotation of GRCh38, a wide array of top-level non-gene features are used. Additionally, I have not seen any spec define a collection of non-transcribed features (analogous to isoforms of a gene).
In the specification I built under the BioCantor repo, I attempted to regularize top-level features by calling any grouping of non-transcribed features a biological_region (which I chose based on SO:0001411), and then deviated from SO by calling any interval in that grouping a feature_interval. I also chose to call each "joined" interval of a non-transcribed feature (analogous to an exon) a subregion.
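To make the proposed hierarchy concrete, here is a minimal sketch of the three levels described above. The class and field names are hypothetical and chosen only for illustration; this is not the BioCantor API, and the coordinates are made up.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Subregion:
    """One contiguous piece of a non-transcribed feature (the exon analogue)."""
    start: int
    end: int


@dataclass
class FeatureInterval:
    """One interval in the grouping, built from one or more subregions."""
    name: str
    subregions: List[Subregion] = field(default_factory=list)

    def span(self) -> Tuple[int, int]:
        # Outer bounds of the joined subregions.
        return (
            min(s.start for s in self.subregions),
            max(s.end for s in self.subregions),
        )


@dataclass
class BiologicalRegion:
    """Top-level grouping of non-transcribed features (the gene analogue)."""
    name: str
    feature_intervals: List[FeatureInterval] = field(default_factory=list)


# Hypothetical example: one region holding a single two-part interval.
region = BiologicalRegion(
    name="region1",
    feature_intervals=[
        FeatureInterval("interval1", [Subregion(100, 200), Subregion(300, 400)])
    ],
)
```

Under this sketch, a parser only needs to recognize one top-level type for non-transcribed features and can walk the same three levels it already walks for gene, transcript, and exon.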
Hi @ifiddes thank you for your comment.
Currently, the focus of these recommendations is on protein-coding genes. The point here is a general recommendation to just use “gene” and “pseudogene” in column 3 for genes, and to provide additional granularity of gene types in column 9, as opposed to putting protein_coding_gene in column 3. Properly parsing the broader scope of SO types that can be represented in GFF3 requires using the SO hierarchy. While I understand the challenges posed by using a wide range of terms in column 3, I believe calling everything “biological_region” would be a huge generalization, and would force ad hoc processing of non-standard attributes in column 9 to make use of the rich annotation.
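As an illustration of that column 3 / column 9 split, here is a minimal sketch of how a consumer might recover the specific gene type. The function names and the example line are hypothetical; only the so_term_name attribute comes from the recommendation itself.

```python
from urllib.parse import unquote


def parse_attributes(column9):
    """Split a GFF3 column 9 string into a dict of tag -> value."""
    attrs = {}
    for entry in column9.strip().split(";"):
        if not entry:
            continue
        key, _, value = entry.partition("=")
        # GFF3 percent-encodes reserved characters in attributes.
        attrs[unquote(key)] = unquote(value)
    return attrs


def specific_type(gff3_line):
    """Return so_term_name for gene/pseudogene rows, else the column 3 type."""
    cols = gff3_line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError("expected 9 tab-separated columns")
    feature_type = cols[2]
    if feature_type in ("gene", "pseudogene"):
        # Fall back to the generic type when the optional attribute is absent.
        return parse_attributes(cols[8]).get("so_term_name", feature_type)
    return feature_type


# Made-up example row, not taken from any real annotation.
EXAMPLE_LINE = (
    "chr1\texample\tgene\t1000\t5000\t.\t+\t.\t"
    "ID=gene1;so_term_name=protein_coding_gene"
)
```

A tool that only cares about "is this a gene?" can keep reading column 3, while one that needs the finer type reads the attribute and falls back to the generic term when it is missing.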