generate reports on differences in two gpad2.0 files #540

Open
2 tasks
sierra-moxon opened this issue Mar 17, 2021 · 5 comments

@sierra-moxon
Member

  • command-line tool that takes two gpad files and produces a "diff" report between them.
  • high-level summary statistics generated, e.g., N genes had M new annotations.
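
A minimal sketch of what such a diff could look like, comparing the two files purely at the row level. It assumes the GPAD 2.0 layout in which column 1 is the annotated entity and column 4 is the GO term, and that header lines start with `!`; the function names and the example rows are made up for illustration and are not the tool's actual code.

```python
import io
from collections import Counter

# Assumed 0-based GPAD 2.0 column positions; check these against the spec.
SUBJECT_COL = 0  # annotated entity (gene or gene product)
TERM_COL = 3     # GO term

def read_annotations(handle):
    """Return the set of annotation rows as tuples, skipping '!' header lines."""
    rows = set()
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("!"):
            continue
        rows.add(tuple(line.split("\t")))
    return rows

def diff_summary(old_handle, new_handle):
    """Summarize row-level differences between two GPAD files."""
    old = read_annotations(old_handle)
    new = read_annotations(new_handle)
    added = new - old
    removed = old - new
    added_per_gene = Counter(row[SUBJECT_COL] for row in added)
    return {
        "added": len(added),
        "removed": len(removed),
        "genes_with_new_annotations": len(added_per_gene),
        "added_per_gene": dict(added_per_gene),
    }

# Made-up example rows, for illustration only.
ROW_1 = "MGI:MGI:1\t\tRO:0002327\tGO:0000001\tPMID:1\tECO:0000314\t\t\t2021-03-17\tMGI\t\t\n"
ROW_2 = "MGI:MGI:1\t\tRO:0002327\tGO:0000002\tPMID:1\tECO:0000314\t\t\t2021-03-17\tMGI\t\t\n"
ROW_3 = "MGI:MGI:2\t\tRO:0002327\tGO:0000001\tPMID:2\tECO:0000314\t\t\t2021-03-17\tMGI\t\t\n"

old_gpad = io.StringIO("!gpa-version: 2.0\n" + ROW_1)
new_gpad = io.StringIO("!gpa-version: 2.0\n" + ROW_1 + ROW_2 + ROW_3)
summary = diff_summary(old_gpad, new_gpad)
print(summary)
```

A set difference over whole rows is the simplest possible comparison: any change to any column shows up as one removed row plus one added row, so it says nothing about whether two annotations mean the same thing.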
@sierra-moxon sierra-moxon self-assigned this Mar 17, 2021
@kltm
Member

kltm commented Mar 17, 2021

Tagging @ukemi @vanaukenk

@sierra-moxon sierra-moxon changed the title generate reports on differences in two gpad files generate reports on differences in two gpad2.0 files Mar 17, 2021
@ukemi

ukemi commented Mar 17, 2021

But realize that annotations_in != annotations_out for all groups. In some cases, incoming annotations will be split or deepened, depending on the final procedure for creating GPADs from Noctua. For example, if MGI has an annotation to organ development with the extension results_in_development_of lung, I believe this will currently be deepened to lung development. If an MGI annotation has two pipe-delimited extensions, it will be split into two separate annotations. We need to talk about what to do with pipe-delimited 'with' fields. A better comparison, though much harder to do, would be to check that the incoming GPAD file is semantically equivalent to the outgoing GPAD.
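
The splitting case described above can be handled before comparison by normalizing both files the same way. The sketch below expands a row whose extension column holds several pipe-delimited groups into one row per group. It assumes the extensions sit in column 11 of GPAD 2.0 (0-based index 10); `split_extensions` and `normalize` are hypothetical helper names. It does not attempt the deepening case, which would need ontology reasoning, and it leaves pipe-delimited 'with' fields untouched since their handling is still an open question.

```python
# Assumed 0-based position of the annotation extensions column in GPAD 2.0.
EXTENSION_COL = 10

def split_extensions(row, ext_col=EXTENSION_COL):
    """Expand one row into one row per pipe-delimited extension group."""
    if len(row) <= ext_col or "|" not in row[ext_col]:
        return [tuple(row)]
    return [
        tuple(row[:ext_col]) + (group,) + tuple(row[ext_col + 1:])
        for group in row[ext_col].split("|")
    ]

def normalize(rows):
    """Apply the split to every row so both files are compared in the same form."""
    normalized = set()
    for row in rows:
        normalized.update(split_extensions(row))
    return normalized
```

With both inputs passed through `normalize`, an incoming row with two extension groups and the two outgoing rows produced from it compare as equal instead of being reported as one removal and two additions.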

@sierra-moxon
Member Author

Note another gpaddiff tool was developed: https://github.com/geneontology/gocamgen/tree/master/gpaddiff (thanks for the pointer @dustine32!)

@sierra-moxon
Member Author

The current iteration of this tool compares at the file level and attempts to compare at the semantic annotation level as well. It will be good to go over the results in an import meeting so we can see if it's on the right track! :)

@kltm
Member

kltm commented Nov 12, 2021

@sierra-moxon This is really coming along! I'm running it again and having a bit of fun.

Minor question: one of the group_by_column arguments is "evidence_code"; is this actually mapping back to evidence codes, or is it just evidence (which I think might make more sense in a world with GPADs)?
I think "subject" and "object" might be a bit odd for more casual users, both as input parameters and in the output. I might suggest GO term / bioentity or similar.

I would also advocate for a "cli" or "machine" output mode for those interested in using the results in automated processes (raises hand) and quick exploration of differences. It would be more actual results and less "reporting" (the counts report may be usable for this), so it would be easier to pipeline into grep or jq; it would also be nice to select one of the outputs for STDOUT (fitting with a lot of what I do).
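
One way such a machine mode could look is one JSON object per changed annotation, written to STDOUT as JSON Lines. This is only a sketch: the `emit_jsonl` name, the record fields, and the assumed GPAD 2.0 column positions (entity in column 1, relation in column 3, GO term in column 4, evidence in column 6) are illustrative and are not the tool's actual output format.

```python
import json
import sys

def emit_jsonl(added, removed, out=None):
    """Write one JSON object per changed annotation row, one per line."""
    out = out or sys.stdout
    for change, rows in (("added", added), ("removed", removed)):
        for row in sorted(rows):  # sorted for stable, diff-friendly output
            record = {
                "change": change,
                "bioentity": row[0],  # assumed column 1
                "relation": row[2],   # assumed column 3
                "go_term": row[3],    # assumed column 4
                "evidence": row[5],   # assumed column 6
            }
            out.write(json.dumps(record) + "\n")
```

Because each line is a complete JSON object, the stream can be filtered with grep or with a jq expression such as `select(.change == "added")` without parsing a formatted report first.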
