Update of Arabidopsis gene-GO file #16

Open
nicomaper opened this issue Oct 10, 2023 · 8 comments

@nicomaper
Collaborator

The current Arabidopsis gene-GO file is missing gene-GO pairs that are supported by ‘high throughput’ (HTP) evidence codes. It should be updated using the gene-GO file from PLAZA 5.0.

@hdbeukel
Collaborator

hdbeukel commented Oct 10, 2023

This is the reprocessed TAIR10 annotation file (BP, curated and experimental annotations only, extended to parental terms), now including high-throughput experimental annotations: ath_BP_cur_exp_extended_tair10.txt (392,957 annotations)

The respective annotations as processed from PLAZA: ath_BP_cur_exp_extended_plaza.txt (394,411 annotations)

As you can see, they differ a bit. As expected, PLAZA contains annotations that are missing in TAIR10, but the reverse is also true. Ignoring the specific evidence types, PLAZA and TAIR10 have 358,530 annotations (~90%) in common. The number of annotations present in one set but not in the other is summarised in the table below.

| # Specific annotations | ATXXX ids | non-ATXXX ids |
|---|---|---|
| PLAZA | 35,881 | 0 |
| TAIR10 | 25,443 | 8,984 |
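
For reference, a comparison like the one in the table can be sketched with plain set operations on (gene, GO term) pairs. This is only a minimal sketch and not the script that produced the numbers above: the function name, the regular expression for ATxGxxxxx locus ids and the toy pairs are all assumptions made for illustration.

```python
import re

# Assumed pattern for Arabidopsis locus ids such as AT1G01010
# (chromosomes 1-5, plus C and M for the organelle genomes).
AT_LOCUS = re.compile(r"^AT[1-5CM]G\d{5}$")


def compare_annotations(plaza, tair):
    """Compare two sets of (gene, GO term) pairs, ignoring evidence codes."""
    tair_only = tair - plaza
    # Split the TAIR-only pairs by whether the gene id looks like a locus id.
    tair_only_at = {pair for pair in tair_only if AT_LOCUS.match(pair[0])}
    return {
        "common": plaza & tair,
        "plaza_only": plaza - tair,
        "tair_only_at": tair_only_at,
        "tair_only_other": tair_only - tair_only_at,
    }


# Toy pairs for illustration only, not taken from either annotation file.
plaza = {("AT1G01010", "GO:0006355"), ("AT1G01020", "GO:0006629")}
tair = {
    ("AT1G01010", "GO:0006355"),
    ("AT2G01008", "GO:0008152"),
    ("P12345", "GO:0008152"),
}
result = compare_annotations(plaza, tair)
for name, pairs in result.items():
    print(name, len(pairs))
```

Evidence codes are dropped before the comparison, so two annotations of the same gene to the same term count as one pair regardless of how they were derived.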

We reasoned that leaving out the ~9k annotations with non-ATXXX ids that are unique to TAIR10 is desirable, but what about the >25k annotations with ATXXX gene ids that are unique to TAIR10? Should we include these in addition to the PLAZA annotations?

@nicomaper
Collaborator Author

Alright, but maybe first we should find out why they are not in PLAZA, because maybe there is a reason for that. Perhaps it is just that the TAIR annotation has been updated after the PLAZA release, in which case I would be in favor of adding them, but maybe there was another reason (quality, etc.). Knowing that would be important to make a decision on whether to include them or not.

@hdbeukel
Collaborator

OK, so we decided to include all PLAZA annotations plus the ATxGxxx gene annotations from TAIR10 that are not in PLAZA. As the PLAZA v5 data was generated about three years ago, the missing annotations are likely new annotations.

This would be the new annotation file for Arabidopsis: ath_go_gene_file.txt. @nicomaper can you check it before I make a pull request?

Data has been extended to parental terms and filtered for:

  • BP only
  • Experimental and curator/authored evidence codes only

In case of duplicate annotations (same gene, same GO term) only the one with the highest priority (most relevant) evidence code has been retained (exp > cur).
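
The duplicate handling described above could look roughly like the sketch below. It is not the actual processing code: the function names are hypothetical, and the assignment of evidence codes to the "exp" and "cur" groups is an assumption based on the usual GO evidence code categories.

```python
# Assumed grouping of evidence codes (not confirmed by the processing script).
EXPERIMENTAL = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP", "HDA", "HEP"}
CURATED = {"IC", "NAS", "TAS"}


def priority(code):
    """Lower value means higher priority (exp > cur)."""
    if code in EXPERIMENTAL:
        return 0
    if code in CURATED:
        return 1
    raise ValueError(f"unexpected evidence code: {code}")


def deduplicate(annotations):
    """Keep one (gene, term, code) record per (gene, term) pair."""
    best = {}
    for gene, term, code in annotations:
        key = (gene, term)
        # Replace the stored code only if the new one has strictly higher priority.
        if key not in best or priority(code) < priority(best[key]):
            best[key] = code
    return [(gene, term, code) for (gene, term), code in best.items()]


# Toy records for illustration only.
records = [
    ("G1", "GO:1", "TAS"),
    ("G1", "GO:1", "IDA"),
    ("G2", "GO:1", "NAS"),
]
print(deduplicate(records))
```

When two records for the same pair fall in the same group, this sketch simply keeps the one seen first.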

@hdbeukel
Collaborator

As discussed, I will reprocess the file to remove GO terms with over 1,000 annotated genes, to avoid testing for enrichment of very general terms.

@hdbeukel
Collaborator

@nicomaper here is the file after filtering to retain only GO terms with fewer than 1,000 annotated genes: ath_go_gene_file.txt.
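
A minimal sketch of such a size filter, assuming the annotations are available as (gene, GO term) pairs after propagation. The function name and the `max_genes` parameter are hypothetical and only used for illustration.

```python
from collections import defaultdict


def filter_general_terms(pairs, max_genes=1000):
    """Drop all pairs whose GO term has max_genes or more annotated genes."""
    genes_per_term = defaultdict(set)
    for gene, term in pairs:
        genes_per_term[term].add(gene)
    # A term is retained only if it annotates fewer than max_genes distinct genes.
    return [(g, t) for g, t in pairs if len(genes_per_term[t]) < max_genes]


# Toy data with a tiny threshold, for illustration only.
pairs = [("G1", "GO:big"), ("G2", "GO:big"), ("G1", "GO:small")]
print(filter_general_terms(pairs, max_genes=2))
```

Genes are counted as distinct ids per term, so a gene listed twice for the same term does not inflate the count.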

@hdbeukel
Collaborator

hdbeukel commented Oct 17, 2023

Obsolete ids have now been removed as well. If the GO tree provides a replaced_by for an obsolete id, the obsolete id has been replaced with that id; otherwise the annotation has been discarded.
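
The replace-or-discard rule can be sketched as follows. The function name and both lookup structures are assumptions: here `obsolete` is a set of obsolete GO ids and `replaced_by` maps an obsolete id to its replacement, both assumed to have been extracted from the GO tree beforehand.

```python
def resolve_obsolete(pairs, obsolete, replaced_by):
    """Replace obsolete GO ids where a replacement exists, else drop the pair."""
    resolved = []
    for gene, term in pairs:
        if term in obsolete:
            term = replaced_by.get(term)
            if term is None:
                continue  # obsolete without replaced_by: discard the annotation
        resolved.append((gene, term))
    return resolved


# Toy ids for illustration only.
pairs = [("g1", "GO:old1"), ("g2", "GO:old2"), ("g3", "GO:ok")]
obsolete = {"GO:old1", "GO:old2"}
replaced_by = {"GO:old1": "GO:new1"}
print(resolve_obsolete(pairs, obsolete, replaced_by))
```

This sketch does not follow chains of replacements, and a replacement can create a duplicate of an existing pair, so duplicates would have to be removed afterwards.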

Final go-gene file: ath_go_gene_file.txt. Includes PLAZA 5 annotations + TAIR10 ATXGXXX annotations not found in PLAZA.

Final applied filtering:

  • BP only
  • Experimental and curated evidence:
    • in order of increasing number of annotations: EXP, HDA, IC, IPI, HEP, NAS, IEP, TAS, IGI, IDA, IMP
  • Discarded/replaced obsolete ids
  • Replaced alternate ids with corresponding primary id
  • Propagated to parental terms (extended)
  • Removed duplicate annotations
  • Very general annotations were discarded (GO terms with at least 1,000 annotated genes, after propagation)
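
To illustrate the propagation step in the list above, here is a minimal sketch that extends each annotation to all ancestor terms. The `parents` mapping (term to direct parent terms) is an assumption, as is the choice of which GO relations count as parent links; the function names are hypothetical.

```python
def ancestors(term, parents, cache):
    """Return all ancestors of a term in an acyclic parent graph."""
    if term not in cache:
        found = set()
        for parent in parents.get(term, ()):
            found.add(parent)
            found |= ancestors(parent, parents, cache)
        cache[term] = found
    return cache[term]


def propagate(pairs, parents):
    """Extend (gene, term) pairs to every ancestor of each annotated term."""
    cache = {}
    extended = set()
    for gene, term in pairs:
        extended.add((gene, term))
        for ancestor in ancestors(term, parents, cache):
            extended.add((gene, ancestor))
    return extended


# Toy hierarchy for illustration only: C -> B -> A.
parents = {"C": ["B"], "B": ["A"]}
print(sorted(propagate({("g", "C")}, parents)))
```

Collecting the result in a set removes duplicates that arise when a gene is annotated to both a term and one of its ancestors.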

@hdbeukel
Collaborator

After further discussion we decided to keep all GO terms (except the BP root) in the annotation file, updated file: ath_go_gene_file.txt.

Other properties have not changed (see above).

We will further investigate how to exclude generic terms from enrichment testing when performing the actual analysis. New options will be added to enricher for this.
