GPAD diff tool: calculates summary stats with optional group-by columns, and reports exact matches, close matches, and lines missing from one file compared to another. #594
Merged
Just a draft of how I am thinking about GPAD diff tools -- looking for feedback!
The tool does two kinds of diffing:
1. It returns a report with basic file statistics (e.g., the number of lines in each file, and counts grouped by optional columns of the file).
2. Since we know we have to compare GPAD 1.2 to GPAD 2.0, and the columns won't be the same, it also compares the GoAnnotation objects themselves (controlled by the -ed click parameter). This part of the code checks the 'degree' of matching between two annotation objects: even though we compare GoAnnotation objects, the metadata of those objects might still prevent two of them from being identical (not to mention that one GoAnnotation in the source might correspond to two GoAnnotations in the target).
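To make the 'degree' of matching more concrete for feedback, here is a rough sketch of the idea, not the code in this PR. The `Annotation` dataclass is a hypothetical stand-in (it is not ontobio's GoAnnotation), and the rule that "close" means "same core fields, different references" is an assumption chosen purely for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable


@dataclass(frozen=True)
class Annotation:
    """Hypothetical, simplified annotation; NOT the real GoAnnotation class."""
    subject: str                 # e.g. the annotated gene product ID
    term: str                    # e.g. the GO term ID
    evidence: str                # e.g. the evidence code
    references: FrozenSet[str] = frozenset()  # metadata that may differ


def match_degree(source: Annotation, targets: Iterable[Annotation]) -> str:
    """Classify a source annotation as 'exact', 'close', or 'missing'.

    Assumption for this sketch: 'exact' means every field is equal,
    'close' means the core fields (subject, term, evidence) are equal
    but the references differ, and 'missing' means no target shares
    the core fields.
    """
    core = (source.subject, source.term, source.evidence)
    found_close = False
    for target in targets:
        if (target.subject, target.term, target.evidence) != core:
            continue
        if target.references == source.references:
            return "exact"
        found_close = True
    return "close" if found_close else "missing"
```

Classifying each source annotation against the whole target list, rather than pairing rows one-to-one, leaves room for the case where one source annotation corresponds to more than one target annotation.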
How to run it (test files attached):
-gp1, -gp2 = the first and second GPAD files, respectively. gp1 is the source and gp2 is the target, so reports are based on whether or not an annotation from gp1 is present in gp2. At the moment, the stats assume GPAD 1.2...
-cb = the count-by column. The column name specified here is the same as in the GPAD file itself. You can pass as many group-by columns from the GPAD as needed; each one produces its own counts in a separate output stanza.
-ed = exclude_details. If False (the default), the tool runs the detailed comparison of GoAnnotations. If True, it reports only file statistics (e.g., the count of rows per evidence code, the total number of rows in each GPAD file, etc.).
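For the statistics side, the group-by counting behind -cb can be pictured with the minimal sketch below. It is an illustration under assumptions, not this PR's implementation: the `count_by` helper is hypothetical, and apart from 'DB_Object_ID' and 'Evidence_type' the column names in `ASSUMED_COLUMNS` are placeholders for a 12-column tab-separated GPAD 1.2 row.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

# Assumed column names for a GPAD 1.2 row. Only 'DB_Object_ID' and
# 'Evidence_type' come from the description; the rest are placeholders.
ASSUMED_COLUMNS = [
    "DB", "DB_Object_ID", "Qualifier", "GO_ID", "DB_Reference",
    "Evidence_type", "With_or_From", "Interacting_taxon_ID",
    "Date", "Assigned_by", "Annotation_Extension", "Annotation_Properties",
]


def count_by(lines: Iterable[str],
             group_by: List[str]) -> Tuple[int, Dict[str, Counter]]:
    """Return the total row count and one Counter per group-by column."""
    counts: Dict[str, Counter] = {col: Counter() for col in group_by}
    total = 0
    for line in lines:
        # Skip header/comment lines (starting with '!') and blank lines.
        if line.startswith("!") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        row = dict(zip(ASSUMED_COLUMNS, fields))
        total += 1
        for col in group_by:
            counts[col][row.get(col, "")] += 1
    return total, counts
```

Calling `count_by(open_file, ["DB_Object_ID", "Evidence_type"])` on each input would give the per-file totals plus one stanza of counts per column, which can then be compared between the source and target files.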
Sample output (the group-by columns are 'DB_Object_ID' and 'Evidence_type')
test1.gpad.txt
test2.gpad.txt