Modeling non-gene features as top-level type fields #8

Open
ifiddes opened this issue Feb 23, 2022 · 1 comment

ifiddes commented Feb 23, 2022

Under the recommendations for the type field, you say:

Best practice: Top-level feature types can include gene and pseudogene. Optionally, include a so_term_name attribute in column 9 to specify the child (type) of gene - e.g. protein_coding_gene, ncRNA_gene, miRNA_gene and snoRNA_gene (http://purl.obolibrary.org/obo/SO_0000704). Transcript features should include the appropriate SO term in column 3 (e.g. mRNA, snoRNA, etc).
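To make the quoted recommendation concrete, here is a minimal Python sketch of how a parser might read it: the coarse type comes from column 3, and the finer gene type comes from an optional `so_term_name` attribute in column 9. The function names and the example line are hypothetical and written for illustration; they are not taken from the specification or from any existing library.

```python
from urllib.parse import unquote


def parse_attributes(column9):
    """Split a GFF3 column 9 string into a dict of key/value pairs."""
    attrs = {}
    for field in column9.strip().split(";"):
        if not field:
            continue
        key, _, value = field.partition("=")
        # GFF3 percent-encodes reserved characters in attribute values.
        attrs[unquote(key)] = unquote(value)
    return attrs


def gene_type(gff3_line):
    """Return the most specific gene type available for a gene record.

    Returns None for records that are not top-level gene/pseudogene features.
    """
    cols = gff3_line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError("expected 9 tab-separated columns")
    feature_type = cols[2]
    if feature_type not in ("gene", "pseudogene"):
        return None
    # Fall back to the column 3 type when so_term_name is absent.
    return parse_attributes(cols[8]).get("so_term_name", feature_type)


# Hypothetical record, invented for this example.
example_line = "\t".join(
    ["chr1", "example", "gene", "1000", "5000", ".", "+", ".",
     "ID=gene0001;so_term_name=protein_coding_gene"]
)
print(gene_type(example_line))
```

Because `so_term_name` is optional, the sketch falls back to the column 3 value, so files that omit the attribute still parse.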

I agree with all of this, but I think that the recommendation should be extended further to regularize non-transcribed features.

Right now non-transcribed features can be all over the map, and as a result they become hard to parse. In the NCBI annotation of GRCh38, a wide array of top-level non-gene feature types is used. Additionally, I have not seen any spec define a collection of non-transcribed features (analogous to the isoforms of a gene).

In the specification I built under the BioCantor repo, I attempted to regularize top-level features by calling any grouping of non-transcribed features a biological_region (which I chose based on SO:0001411), and then deviated from SO by calling any interval in that grouping a feature_interval. I also chose to call a "joined" interval of a non-transcribed feature (analogous to an exon) a subregion.

@vkkodali
Collaborator

Hi @ifiddes, thank you for your comment.
Currently, the focus of these recommendations is on protein-coding genes. The point here is a general recommendation to use just “gene” and “pseudogene” in column 3 for genes and to provide additional granularity of gene types in column 9, as opposed to putting protein_coding_gene in column 3. Properly parsing the broader scope of SO types that can be represented in GFF3 requires using the SO hierarchy. While I understand the challenges posed by using a wide range of terms in column 3, I believe calling everything “biological_region” would be a huge generalization, and it would force ad hoc processing of non-standard attributes in column 9 to make use of the rich annotation.
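
As a rough illustration of what "using the SO hierarchy" can look like in a parser, the Python sketch below walks `is_a` links upward to decide whether a column 3 term is a kind of some broader term. The `IS_A` table is a small hand-written fragment assumed for this example, and it keeps only one parent per term; a real parser would load the relationships from a Sequence Ontology release, where terms can have several parents.

```python
# Assumption: an illustrative, hand-written fragment of SO is_a links.
# Load the real relationships from an SO release in practice.
IS_A = {
    "protein_coding_gene": "gene",
    "ncRNA_gene": "gene",
    "gene": "biological_region",
    "pseudogene": "biological_region",
    "biological_region": "region",
}


def is_a(term, ancestor):
    """Return True if `term` equals `ancestor` or descends from it."""
    seen = set()
    while term is not None and term not in seen:
        if term == ancestor:
            return True
        seen.add(term)  # guard against accidental cycles in the table
        term = IS_A.get(term)
    return False


print(is_a("protein_coding_gene", "gene"))
print(is_a("pseudogene", "gene"))
```

With a lookup like this, a parser can accept specific terms in column 3 and still group them by a broader ancestor, without collapsing everything into one type.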
