generate reports on differences in two gpad2.0 files #540

sierra-moxon · 2021-03-17T17:56:27Z

A command-line tool that takes two GPAD files and produces a "diff" report between them.
It should generate high-level summary statistics, e.g.: N genes had M new annotations.
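A minimal sketch of the kind of file-level summary described above, not the actual tool. It assumes tab-separated rows, header lines starting with `!`, and the annotated gene in the first column; `read_annotations` and `summarize_new` are hypothetical names, and the toy rows are made up and truncated to two columns.

```python
from collections import Counter


def read_annotations(lines):
    """Return the set of annotation rows, each as a tuple of tab-separated fields."""
    rows = set()
    for line in lines:
        line = line.rstrip("\n")
        # Assumption: header/comment lines start with "!" and carry no annotations.
        if not line or line.startswith("!"):
            continue
        rows.add(tuple(line.split("\t")))
    return rows


def summarize_new(old_lines, new_lines):
    """Return (number of genes with new annotations, number of new annotations)."""
    added = read_annotations(new_lines) - read_annotations(old_lines)
    # Assumption: the first column identifies the annotated gene.
    per_gene = Counter(row[0] for row in added)
    return len(per_gene), len(added)


# Made-up toy rows; real GPAD rows have more columns.
old_file = ["!header\n", "GENE:1\tGO:A\n"]
new_file = ["!header\n", "GENE:1\tGO:A\n", "GENE:1\tGO:B\n", "GENE:2\tGO:A\n", "GENE:2\tGO:C\n"]

genes, annotations = summarize_new(old_file, new_file)
print(f"{genes} genes had {annotations} new annotations")
```

Comparing whole rows as tuples means any change to any column counts as one removed plus one added annotation, which is the strictest form of file-level comparison.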

kltm · 2021-03-17T17:58:25Z

ukemi · 2021-03-17T18:05:13Z

But realize that annotations_in != annotations_out for all groups. In some cases, incoming annotations will be split or deepened, depending on the final procedure for creating GPADs from Noctua. For example: if MGI has an annotation to organ development with the extension results_in_development_of lung, I believe this will currently be deepened to lung development. If an MGI annotation has two pipe-delimited extensions, it will be split into two separate annotations. We need to talk about what to do with pipe-delimited 'with' fields. A better comparison, though much harder to do, would be to check that the incoming GPAD file is semantically equivalent to the outgoing GPAD.
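To make the splitting case concrete, here is a hedged sketch of one possible normalization step: expand each row with pipe-delimited extensions into one row per extension before comparing, so that a split annotation on the output side does not show up as a difference. The function name `split_on_pipes`, the column position, and the toy row are all hypothetical; deepening is not handled here because it needs ontology reasoning.

```python
def split_on_pipes(row, ext_col):
    """Expand one row into one row per pipe-delimited value in column ext_col."""
    parts = row[ext_col].split("|")
    return [row[:ext_col] + (part,) + row[ext_col + 1:] for part in parts]


def normalize(rows, ext_col):
    """Return the set of rows after splitting every row on its extension column."""
    expanded = set()
    for row in rows:
        expanded.update(split_on_pipes(row, ext_col))
    return expanded


# Made-up three-column rows: (gene, term, extensions). Real rows have more columns.
incoming = {("GENE:1", "GO:A", "rel_1(X:1)|rel_2(X:2)")}
outgoing = {("GENE:1", "GO:A", "rel_1(X:1)"), ("GENE:1", "GO:A", "rel_2(X:2)")}

# After normalization the two sides compare as equal.
print(normalize(incoming, ext_col=2) == normalize(outgoing, ext_col=2))
```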

sierra-moxon · 2021-03-22T20:05:35Z

Note another gpaddiff tool was developed: https://github.com/geneontology/gocamgen/tree/master/gpaddiff (thanks for the pointer @dustine32!)

sierra-moxon · 2021-11-12T15:57:43Z

The current iteration of this tool compares at the file level and attempts to compare at the semantic annotation level as well. It will be good to go over the results in an import meeting so we can see if it's on the right track! :)

kltm · 2021-11-12T21:25:15Z

@sierra-moxon This is really coming along! I'm running it again and having a bit of fun.

Minor question: one of the group_by_column arguments is "evidence_code"; is this actually mapping back to evidence codes, or is it just evidence (which I think might make more sense in a world with GPADs)?
I think "subject" and "object" might be a bit odd for more casual users, both as input parameters and in the output. I might suggest GO term / bioentity or similar.

I would also advocate for a "cli" or "machine" output mode for those interested in using the results in automated processes (raises hand) and quick exploration of differences. It would be more actual results and less "reporting" (the counts report may be usable for this), so it would be easier to pipeline into grep or jq; it would also be nice to select one of the outputs for STDOUT (fitting with a lot of what I do).
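As a rough illustration of what such a "machine" mode could look like (an assumption about the shape, not something the tool currently does): one JSON object per differing annotation, one per line, written to STDOUT so it can be piped into grep or jq. The function name `emit_jsonl` and the `status`/`fields` keys are hypothetical.

```python
import json
import sys


def emit_jsonl(added, removed, out=sys.stdout):
    """Write one JSON object per differing row, one per line."""
    for status, rows in (("added", added), ("removed", removed)):
        # Sort for stable output, which makes runs easy to compare.
        for row in sorted(rows):
            out.write(json.dumps({"status": status, "fields": list(row)}) + "\n")


# Made-up toy rows standing in for the two sides of a diff.
added = {("GENE:1", "GO:B"), ("GENE:2", "GO:C")}
removed = {("GENE:3", "GO:A")}
emit_jsonl(added, removed)
```

With output in this shape, something like `jq 'select(.status == "added")'` would pull out only the new annotations.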

sierra-moxon self-assigned this Mar 17, 2021

sierra-moxon changed the title ~~generate reports on differences in two gpad files~~ generate reports on differences in two gpad2.0 files Mar 17, 2021

sierra-moxon mentioned this issue Mar 17, 2021

Add functionality to support examining input and output annotation differences (GPAD 2.0 diffs?) geneontology/go-annotation#3687

Closed
