Modeling non-gene features as top-level `type` fields #8

ifiddes · 2022-02-23T15:43:35Z

Under the recommendations for the type field, you say:

> Best practice: Top-level feature types can include gene and pseudogene. Optionally, include a so_term_name attribute in column 9 to specify the child (type) of gene - e.g. protein_coding_gene, ncRNA_gene, miRNA_gene and snoRNA_gene (http://purl.obolibrary.org/obo/SO_0000704). Transcript features should include the appropriate SO term in column 3 (e.g. mRNA, snoRNA, etc).

I agree with all of this, but I think that the recommendation should be extended further to regularize non-transcribed features.

Right now non-transcribed features can be all over the map, and as a result become hard to parse. In the NCBI annotation of GRCh38, a wide array of top-level non-gene features are used. Additionally, I have not seen any spec define a collection of non-transcribed features (analogous to isoforms of a gene).

In the specification I built under the BioCantor repo, I attempted to regularize top-level features by calling any grouping of non-transcribed features a biological_region (which I chose based on SO:0001411), and then deviated from SO by calling any interval in that grouping a feature_interval. I also chose to call a "joined" interval of a non-transcribed feature (analogous to an exon) a subregion.
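As a rough illustration of that three-level layout, here is a minimal Python sketch that writes out the rows. The IDs, coordinates, `example` source, and the `gff3_row` helper are all made up for this comment, and this is not meant to reproduce BioCantor's actual output.

```python
# Hypothetical illustration only: IDs, coordinates, and the "example"
# source are invented; this does not reproduce BioCantor's real output.
def gff3_row(seqid, ftype, start, end, strand, attrs):
    """Format one 9-column GFF3 row; attrs is a dict rendered into column 9."""
    col9 = ";".join(f"{key}={value}" for key, value in attrs.items())
    return "\t".join(
        [seqid, "example", ftype, str(start), str(end), ".", strand, ".", col9]
    )


rows = [
    # Top-level grouping of non-transcribed features (analogous to a gene).
    gff3_row("chr1", "biological_region", 100, 500, "+", {"ID": "region1"}),
    # One interval within the grouping.
    gff3_row("chr1", "feature_interval", 100, 500, "+",
             {"ID": "interval1", "Parent": "region1"}),
    # The joined pieces of that interval (analogous to exons).
    gff3_row("chr1", "subregion", 100, 200, "+",
             {"ID": "sub1", "Parent": "interval1"}),
    gff3_row("chr1", "subregion", 400, 500, "+",
             {"ID": "sub2", "Parent": "interval1"}),
]

print("\n".join(rows))
```

Each level points at the one above it through `Parent`, so a parser can walk the hierarchy the same way it would for gene, transcript, and exon.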

vkkodali · 2022-03-15T18:29:08Z

Hi @ifiddes thank you for your comment.
Currently, the focus of these recommendations is on protein-coding genes. The point here is a general recommendation to just use “gene” and “pseudogene” in column 3 for genes, and provide additional granularity of gene types in column 9, as opposed to saying protein_coding_gene in column 3. Properly parsing the broader scope of SO types that can be represented in GFF3 requires using the SO hierarchy. While I understand the challenges posed by using a wide range of terms in column 3, I believe calling everything “biological_region” would be a huge generalization, and would force ad hoc processing of non-standard attributes in column 9 to make use of the rich annotation.
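To make the column 3 / column 9 split concrete, here is a minimal Python sketch of how a parser might read it. The function names and the example row are hypothetical; the only thing taken from the recommendation is the `so_term_name` attribute and the use of “gene” and “pseudogene” in column 3.

```python
from urllib.parse import unquote


def parse_attributes(col9):
    """Split a GFF3 column 9 string into a dict of key -> decoded value."""
    attrs = {}
    for pair in col9.strip().split(";"):
        if pair:
            key, _, value = pair.partition("=")
            attrs[key] = unquote(value)  # GFF3 percent-encodes reserved characters
    return attrs


def gene_subtype(line):
    """Return the finer-grained gene type for a gene/pseudogene row, else None.

    Falls back to the column 3 value when so_term_name is absent.
    """
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9 or cols[2] not in ("gene", "pseudogene"):
        return None
    return parse_attributes(cols[8]).get("so_term_name", cols[2])


# Hypothetical example row; ID and coordinates are made up.
example = "chr1\texample\tgene\t100\t900\t.\t+\t.\tID=gene1;so_term_name=protein_coding_gene"
print(gene_subtype(example))  # protein_coding_gene
```

A consumer that only cares about "is this a gene" can stop at column 3, while one that needs the specific type reads the optional attribute.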

mpoelchau assigned vkkodali Feb 24, 2022
