Update of Arabidopsis gene-GO file #16

nicomaper · 2023-10-10T13:01:56Z

The current Arabidopsis gene-GO file is missing gene-GO pairs supported by the ‘high throughput’ evidence code (HTP). It should be updated using the gene-GO file from PLAZA 5.0.

hdbeukel · 2023-10-10T19:15:37Z

There are actually even more evidence types missing (see GO website):
- Inferred from High Throughput Experiment (HTP)
- Inferred from High Throughput Direct Assay (HDA)
- Inferred from High Throughput Mutant Phenotype (HMP)
- Inferred from High Throughput Genetic Interaction (HGI)
- Inferred from High Throughput Expression Pattern (HEP)

hdbeukel · 2023-10-10T19:33:28Z

This is the reprocessed TAIR10 annotation file (BP, curated and experimental annotations only, extended to parental terms), now including high-throughput experimental annotations: ath_BP_cur_exp_extended_tair10.txt (392.957 annotations)

The respective annotations as processed from PLAZA: ath_BP_cur_exp_extended_plaza.txt (394.411 annotations)

As you can see, they differ a bit. As expected, PLAZA contains annotations that were missing in TAIR10, but the reverse is also true. Ignoring the specific evidence types, there are 358.530 (~90%) annotations in common between PLAZA and TAIR10. The number of annotations present in one set but not in the other is summarised in the table below.

| Number of specific annotations | ATXXX ids | non-ATXXX ids |
| --- | --- | --- |
| PLAZA | 35.881 | 0 |
| TAIR10 | 25.443 | 8.984 |

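A comparison like the one in the table can be sketched as follows. This is a minimal illustration, not the script that produced the numbers above: the function name `compare`, the regular expression for ATxGxxxxx locus ids, and the toy gene-GO pairs are all assumptions made for the example.

```python
import re

# Assumed pattern for Arabidopsis locus ids such as AT1G01010
# (chromosomes 1-5 plus the C and M organelle genomes).
AT_ID = re.compile(r"^AT[1-5CM]G\d{5}$", re.IGNORECASE)


def compare(plaza, tair):
    """Compare two sets of (gene, go_term) pairs, evidence codes already dropped."""

    def split(pairs):
        # Count pairs whose gene id looks like an AT locus id vs. all others.
        at = {p for p in pairs if AT_ID.match(p[0])}
        return len(at), len(pairs) - len(at)

    return {
        "common": len(plaza & tair),
        "PLAZA": split(plaza - tair),    # (AT ids, non-AT ids) only in PLAZA
        "TAIR10": split(tair - plaza),   # (AT ids, non-AT ids) only in TAIR10
    }


# Toy data, invented for illustration only.
plaza = {("AT1G01010", "GO:0006355"), ("AT1G01020", "GO:0008150")}
tair = {
    ("AT1G01010", "GO:0006355"),
    ("AT2G01008", "GO:0009987"),
    ("locus:2200935", "GO:0009987"),
}
result = compare(plaza, tair)
print(result)
```

Working on plain sets of pairs keeps the comparison independent of the evidence code attached to each annotation, which matches the "ignoring the specific evidence types" condition above.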
We argued that leaving out the ~9k non-ATXXX ids unique to TAIR10 is desirable, but what about the >25k ATXXX gene annotations that are unique to TAIR10? Should we include these in addition to the PLAZA annotations?

nicomaper · 2023-10-11T07:40:26Z

Alright, but maybe first we should find out why they are not in PLAZA, because maybe there is a reason for that. Perhaps it is just that the TAIR annotation has been updated after the PLAZA release, in which case I would be in favor of adding them, but maybe there was another reason (quality, etc.). Knowing that would be important to make a decision on whether to include them or not.

hdbeukel · 2023-10-13T09:25:35Z

OK, so we decided to include all PLAZA annotations plus the ATxGxxx gene annotations from TAIR10 that were not in PLAZA. As the PLAZA v5 data was generated about three years ago, the missing annotations are likely new ones.

This would be the new annotation file for Arabidopsis: ath_go_gene_file.txt. @nicomaper can you check it before I make a pull request?

Data has been extended to parental terms and filtered for:

BP only
Experimental and curator/authored evidence codes only

In case of duplicate annotations (same gene, same GO term) only the one with the highest priority (most relevant) evidence code has been retained (exp > cur).
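The duplicate handling described above could look roughly like this. It is a sketch under assumptions: the grouping of evidence codes into experimental (including high-throughput) and curator/author classes follows the GO evidence code categories, but the exact grouping used for the file is not spelled out in this thread, and the names `priority` and `deduplicate` are hypothetical.

```python
# Assumed grouping of GO evidence codes; the thread only states "exp > cur".
EXPERIMENTAL = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP",
                "HTP", "HDA", "HMP", "HGI", "HEP"}
CURATED = {"TAS", "NAS", "IC"}


def priority(code):
    """Lower value means higher priority (experimental before curated)."""
    if code in EXPERIMENTAL:
        return 0
    if code in CURATED:
        return 1
    raise ValueError(f"unexpected evidence code: {code}")


def deduplicate(annotations):
    """Keep one evidence code per (gene, GO term), preferring experimental."""
    best = {}
    for gene, term, code in annotations:
        key = (gene, term)
        if key not in best or priority(code) < priority(best[key]):
            best[key] = code
    return [(gene, term, code) for (gene, term), code in best.items()]


# Toy data, invented for illustration only.
annotations = [
    ("AT1G01010", "GO:0006355", "TAS"),
    ("AT1G01010", "GO:0006355", "IDA"),
    ("AT1G01020", "GO:0006355", "IC"),
]
deduplicated = deduplicate(annotations)
print(deduplicated)
```

When two codes fall in the same class, this sketch keeps whichever was seen first; a finer ordering within each class would need an explicit ranking.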

hdbeukel · 2023-10-13T19:01:44Z

As discussed I will reprocess the file to remove GO terms with over 1.000 annotated genes, to avoid testing for enrichment of very general terms.

hdbeukel · 2023-10-16T08:55:19Z

@nicomaper here is the file after filtering to retain only GO terms with fewer than 1.000 annotated genes: ath_go_gene_file.txt.

hdbeukel · 2023-10-17T07:32:09Z

Obsolete ids have now also been removed. If the GO tree provided a replaced_by id, the obsolete id has been replaced with that id; otherwise it has been discarded.
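The replace-or-discard rule can be sketched as below. This is illustrative only: `resolve_obsolete`, the `obsolete` set and the `replaced_by` mapping are hypothetical names, the GO ids are made-up placeholders, and the sketch assumes the obsolete flags and replaced_by links have already been read from the GO tree.

```python
def resolve_obsolete(annotations, obsolete, replaced_by):
    """Replace obsolete GO ids where a replacement is known, else drop them.

    annotations: iterable of (gene, go_id) pairs
    obsolete:    set of obsolete GO ids
    replaced_by: dict mapping an obsolete GO id to its replacement id
    """
    resolved = []
    for gene, term in annotations:
        if term in obsolete:
            term = replaced_by.get(term)
            if term is None:
                continue  # no replacement listed: discard the annotation
        resolved.append((gene, term))
    return resolved


# Placeholder ids, invented for illustration only.
obsolete = {"GO:9000001", "GO:9000002"}
replaced_by = {"GO:9000001", }
replaced_by = {"GO:9000001": "GO:9000010"}
annotations = [
    ("AT1G01010", "GO:9000001"),  # obsolete, has a replacement
    ("AT1G01020", "GO:9000002"),  # obsolete, no replacement
    ("AT1G01030", "GO:9000003"),  # not obsolete
]
resolved = resolve_obsolete(annotations, obsolete, replaced_by)
print(resolved)
```

The sketch does not follow chains in which a replacement id is itself obsolete, and replacing ids can create new duplicates, so a deduplication pass would still be needed afterwards.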

Final go-gene file: ath_go_gene_file.txt. Includes PLAZA 5 annotations + TAIR10 ATXGXXX annotations not found in PLAZA.

Final applied filtering:

- BP only
- Experimental and curated evidence:
  - in order of increasing number of annotations: EXP, HDA, IC, IPI, HEP, NAS, IEP, TAS, IGI, IDA, IMP
- Discarded/replaced obsolete ids
- Replaced alternate ids with corresponding primary id
- Propagated to parental terms (extended)
- Removed duplicate annotations
- Very general annotations were discarded (GO terms with at least 1.000 annotated genes, after propagation)
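The propagation and general-term filter from the list above might be sketched like this. All names (`ancestors`, `propagate`, `drop_general`) and the toy term ids are hypothetical, the `parents` mapping is assumed to hold the direct parents of each term, and which relationship types are followed when building it is not stated in this thread.

```python
from collections import defaultdict


def ancestors(term, parents):
    """Return all ancestors of a term, given a term -> direct parents mapping."""
    seen = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def propagate(annotations, parents):
    """Extend (gene, term) pairs to parental terms; returns term -> set of genes."""
    genes_by_term = defaultdict(set)
    for gene, term in annotations:
        genes_by_term[term].add(gene)
        for ancestor in ancestors(term, parents):
            genes_by_term[ancestor].add(gene)
    return genes_by_term


def drop_general(genes_by_term, threshold):
    """Discard terms annotated to at least `threshold` genes after propagation."""
    return {t: g for t, g in genes_by_term.items() if len(g) < threshold}


# Toy hierarchy and annotations, invented for illustration only.
parents = {"GO:C": {"GO:B"}, "GO:B": {"GO:A"}, "GO:D": {"GO:A"}}
annotations = [("g1", "GO:C"), ("g2", "GO:B"), ("g3", "GO:D")]
extended = propagate(annotations, parents)
filtered = drop_general(extended, threshold=3)
print(sorted(filtered))
```

Storing genes per term as sets removes duplicate annotations as a side effect of propagation. The threshold here uses the "at least" wording of the list; a real run would pass 1000 instead of the toy value.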

hdbeukel · 2023-10-23T17:34:20Z

After further discussion we decided to keep all GO terms (except the BP root) in the annotation file. Updated file: ath_go_gene_file.txt.

Other properties have not changed (see above).

We will further investigate how to exclude generic terms from enrichment testing when performing the actual analysis. New options will be added to enricher for this.

nicomaper added the data update label Oct 10, 2023

nicomaper assigned hdbeukel and nicomaper Oct 10, 2023

nicomaper mentioned this issue Nov 22, 2023

GO enrichment integral update #18

Open
