Update of Arabidopsis gene-GO file #16
Comments
This is the reprocessed TAIR10 annotation file (BP, curated and experimental annotations only, extended to parental terms), now including high-throughput experimental annotations: ath_BP_cur_exp_extended_tair10.txt (392,957 annotations).
The corresponding annotations as processed from PLAZA: ath_BP_cur_exp_extended_plaza.txt (394,411 annotations).
As you can see, the two sets differ a bit. As expected, PLAZA contains annotations that are missing in TAIR10, but the reverse is also true. Ignoring the specific evidence types, 358,530 annotations (~90%) are shared between PLAZA and TAIR10. The number of annotations present in one set but not in the other is summarised in the table below.
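The comparison described above (shared and set-specific annotations, ignoring evidence types) could be sketched roughly as follows. This is only an illustration, not the script that produced the numbers: the function name `compare_annotations` and the `(gene, go_term, evidence)` tuple layout are assumptions, since the actual file format is not shown in this thread.

```python
def compare_annotations(tair, plaza):
    """Compare two annotation lists, ignoring the evidence type.

    Each annotation is assumed to be a (gene, go_term, evidence) tuple;
    this layout is hypothetical and not taken from the actual files.
    """
    # Reduce each annotation to its (gene, GO term) pair.
    tair_pairs = {(gene, term) for gene, term, _evidence in tair}
    plaza_pairs = {(gene, term) for gene, term, _evidence in plaza}
    return {
        "common": tair_pairs & plaza_pairs,
        "tair_only": tair_pairs - plaza_pairs,
        "plaza_only": plaza_pairs - tair_pairs,
    }


# Toy data, not real annotations.
tair = [("AT1G01010", "GO:0006355", "cur"), ("AT1G01020", "GO:0009651", "exp")]
plaza = [("AT1G01010", "GO:0006355", "exp"), ("AT1G01030", "GO:0009651", "cur")]
result = compare_annotations(tair, plaza)
for name, pairs in result.items():
    print(name, len(pairs))
```

Because the pairs are compared as sets, the same gene-GO pair with different evidence types in the two sources counts as shared.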
We argued that leaving out the ~9k non-ATXXX ids that are unique to TAIR10 was desirable, but what about the >25k ATXXX gene annotations that are unique to TAIR10? Should we include these in addition to the PLAZA annotations?
Alright, but maybe we should first find out why they are not in PLAZA, because there may be a reason for that. Perhaps the TAIR annotation was simply updated after the PLAZA release, in which case I would be in favor of adding them, but there may have been another reason (quality, etc.). Knowing that would be important for deciding whether to include them or not.
Ok, so we decided to include all PLAZA annotations plus the ATxGxxx gene annotations from TAIR10 that were not in PLAZA. As the PLAZA v5 data was generated about three years ago, the missing annotations are likely newer annotations. This would be the new annotation file for Arabidopsis: ath_go_gene_file.txt. @nicomaper can you check it before I make a pull request? Data has been extended to parental terms and filtered for:
In case of duplicate annotations (same gene, same GO term), only the one with the highest-priority (most relevant) evidence code has been retained (exp > cur).
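The merge and duplicate handling described above might look something like the sketch below. It is an illustration under stated assumptions, not the code used to build ath_go_gene_file.txt: the `(gene, go_term, evidence)` tuples, the `merge_annotations` name, the `exp`/`cur` evidence labels and the regular expression used to recognise ATxGxxx identifiers are all hypothetical choices.

```python
import re

# Assumed pattern for ATxGxxx gene ids (e.g. AT1G01010, ATCG00020).
AGI_PATTERN = re.compile(r"^AT[1-5CM]G\d{5}$")

# Higher number = higher priority, following "exp > cur".
EVIDENCE_PRIORITY = {"exp": 2, "cur": 1}


def merge_annotations(plaza, tair):
    """Keep all PLAZA annotations, add TAIR10 ATxGxxx annotations that are
    not in PLAZA, and keep one evidence code per (gene, GO term) pair."""
    plaza_pairs = {(gene, term) for gene, term, _evidence in plaza}
    extra = [
        (gene, term, evidence)
        for gene, term, evidence in tair
        if (gene, term) not in plaza_pairs and AGI_PATTERN.match(gene)
    ]

    best = {}
    for gene, term, evidence in list(plaza) + extra:
        key = (gene, term)
        current = best.get(key)
        # Retain the evidence code with the highest priority.
        if current is None or EVIDENCE_PRIORITY[evidence] > EVIDENCE_PRIORITY[current]:
            best[key] = evidence
    return sorted((gene, term, evidence) for (gene, term), evidence in best.items())
```

With this approach, a TAIR10 annotation on a non-ATxGxxx id is dropped, and a pair annotated with both `cur` and `exp` ends up with `exp` only.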
As discussed, I will reprocess the file to remove GO terms with over 1,000 annotated genes, to avoid testing for enrichment of very general terms.
@nicomaper after filtering the file to retain only GO terms with fewer than 1,000 annotated genes: ath_go_gene_file.txt.
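A size filter of this kind could be sketched as below. This is a minimal example under assumptions: annotations are again hypothetical `(gene, go_term, evidence)` tuples, `filter_large_terms` is an illustrative name, and the handling of a term with exactly `max_genes` genes (kept here) is a guess, because the thread does not say which side of the boundary it falls on.

```python
from collections import defaultdict


def filter_large_terms(annotations, max_genes=1000):
    """Drop annotations for GO terms annotated to more than max_genes genes."""
    # Count distinct genes per GO term.
    genes_per_term = defaultdict(set)
    for gene, term, _evidence in annotations:
        genes_per_term[term].add(gene)

    # Keep annotations whose term is small enough (boundary is an assumption).
    return [
        (gene, term, evidence)
        for gene, term, evidence in annotations
        if len(genes_per_term[term]) <= max_genes
    ]
```

Counting genes after extension to parental terms matters here, since the general terms only become large once the annotations of their child terms are propagated up.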
Now also removed obsolete ids. If the GO tree provided a
Final go-gene file: ath_go_gene_file.txt. Includes PLAZA 5 annotations + TAIR10 ATXGXXX annotations not found in PLAZA. Final applied filtering:
After further discussion we decided to keep all GO terms (except the BP root) in the annotation file. Updated file: ath_go_gene_file.txt. Other properties have not changed (see above). We will further investigate how to exclude generic terms from enrichment testing when performing the actual analysis; new options will be added to enricher for this.
The current Arabidopsis gene-GO file is missing gene-GO pairs that come from ‘high throughput’ (HTP) evidence codes. It should be updated using the gene-GO file from PLAZA 5.0.