gpad diff tool, calculate summary stats and group by columns, reports on exact and close matches as well as missing lines in one file compared to another. #594

Merged (77 commits) on Dec 1, 2021

@sierra-moxon sierra-moxon commented Oct 13, 2021

Just a draft of how I am thinking about GPAD diff tools -- looking for feedback!

The tool does two kinds of diffing:

  1. Return a report with just basic file statistics (i.e., the number of lines in the file, counts grouped by optional columns of the file, etc.)

  2. Because we have to compare GPAD 1.2 with GPAD 2.0, and the columns differ between the two, we also need to compare the GoAnnotation objects themselves (controlled by the -ed click parameter). This part of the code checks the 'degree' of matching between two annotation objects: even when we compare GoAnnotation objects, differences in their metadata can keep two objects from being identical (not to mention that one GoAnnotation in the source might correspond to two GoAnnotations in the target).
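To make the 'degree of matching' idea concrete, here is a minimal sketch, not the code in this PR. It assumes a simplified, hypothetical `Annotation` record in place of the real GoAnnotation class, and it assumes that "close" means "same core fields, different metadata"; the actual tool may draw that line differently. The GO term and reference values are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    # Hypothetical, simplified stand-in for a GoAnnotation object.
    subject_id: str
    go_term: str
    evidence_type: str
    reference: str
    date: str = ""         # treated as metadata in this sketch
    assigned_by: str = ""  # treated as metadata in this sketch


def core_key(a):
    """Fields that must agree for two annotations to count as 'close' (an assumption)."""
    return (a.subject_id, a.go_term, a.evidence_type, a.reference)


def match_degree(source, target):
    """Return 'exact', 'close', or 'none' for a single source/target pair."""
    if source == target:
        return "exact"
    if core_key(source) == core_key(target):
        return "close"
    return "none"


def summarize(sources, targets):
    """For each source annotation, record the best match found in the target."""
    counts = {"exact": 0, "close": 0, "none": 0}
    for s in sources:
        degrees = {match_degree(s, t) for t in targets}
        if "exact" in degrees:
            counts["exact"] += 1
        elif "close" in degrees:
            counts["close"] += 1
        else:
            counts["none"] += 1
    return counts


a = Annotation("ZDB-GENE-081113-6", "GO:0000001", "ISS", "PMID:1", date="2020-01-01")
b = Annotation("ZDB-GENE-081113-6", "GO:0000001", "ISS", "PMID:1", date="2021-06-30")
c = Annotation("ZDB-GENE-081113-6", "GO:0000002", "ISS", "PMID:1", date="2020-01-01")
print(summarize([a, c], [b]))
```

Keeping the best match per source annotation means a source that corresponds to two target annotations is still counted once.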

How to run it (test files attached):

python gpaddiffer.py -file1 noctua_zfin.gpad1 -file2 noctua_zfin.gpad2  -o test.out -cb Evidence_type -ed False -file_type gpad

-gp1, -gp2 (written as -file1 and -file2 in the command above) = the first and second GPAD files, respectively. gp1 is the source and gp2 is the target, so reports are based on whether or not an annotation from gp1 is in gp2. At the moment, the stats assume GPAD 1.2...
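The source/target direction matters because the check is asymmetric. A rough sketch of that idea at the line level, assuming plain string comparison rather than the GoAnnotation comparison the tool performs (`missing_from_target` is a hypothetical helper name, not part of the PR):

```python
def missing_from_target(source_lines, target_lines):
    """Return source lines that do not appear in the target, in source order."""
    target = {line.strip() for line in target_lines}
    missing = []
    for line in source_lines:
        stripped = line.strip()
        # Skip blank lines; report anything the target does not contain.
        if stripped and stripped not in target:
            missing.append(stripped)
    return missing


src = ["a\n", "b\n", "c\n"]
tgt = ["b\n", "d\n"]
print(missing_from_target(src, tgt))  # lines only in the source
print(missing_from_target(tgt, src))  # swapping the files changes the report
```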

-cb = the count-by column. The column name specified here is the same as in the GPAD file itself. You can give as many group-by columns from the GPAD as needed; each one provides its counts in a separate output stanza.
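As an illustration of per-column counts, here is a small standard-library sketch. The sample output in this description looks like pandas Series, so the tool itself presumably groups with pandas; this version swaps in `collections.Counter` to stay dependency-free. The three-column layout and the pairing of gene IDs with evidence codes are invented for the example, and `count_by` is a hypothetical name.

```python
import csv
from collections import Counter
from io import StringIO

# Hypothetical, shortened column list; real GPAD files have more columns.
COLUMNS = ["DB", "DB_Object_ID", "Evidence_type"]


def count_by(handle, columns, group_by):
    """Count rows per value for each requested group-by column."""
    counters = {name: Counter() for name in group_by}
    total = 0
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines and '!' header/comment lines.
        if not row or row[0].startswith("!"):
            continue
        total += 1
        record = dict(zip(columns, row))
        for name in group_by:
            counters[name][record[name]] += 1
    return total, counters


sample = StringIO(
    "!gpa-version: 1.2\n"
    "ZFIN\tZDB-GENE-030616-542\tIEA\n"
    "ZFIN\tZDB-GENE-050522-375\tISS\n"
    "ZFIN\tZDB-GENE-081113-6\tIEA\n"
    "ZFIN\tZDB-GENE-081113-6\tISS\n"
)
total, counters = count_by(sample, COLUMNS, ["DB_Object_ID", "Evidence_type"])
print(total)
for name, counter in counters.items():
    print(name, dict(counter))  # one stanza per group-by column
```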

-ed = exclude_details. If False (the default), the tool also runs the detailed comparison of GoAnnotations; otherwise it only reports file statistics (i.e., the count of rows per evidence code, the total number of rows from each GPAD file, etc.).

Sample output; the group-by columns are 'DB_Object_ID' and 'Evidence_type':

Starting comparison 

test1.gpad
{'filename': 'test1.gpad', 'total_rows': 4, 'grouper': 'Evidence_type', 'grouped_reports': [DB_Object_ID
ZDB-GENE-030616-542    1
ZDB-GENE-050522-375    1
ZDB-GENE-081113-6      2
Name: DB_Object_ID, dtype: int64, Evidence_type
IEA    2
ISS    2
Name: Evidence_type, dtype: int64]}
test2.gpad
{'filename': 'test2.gpad', 'total_rows': 4, 'grouper': 'Evidence_type', 'grouped_reports': [DB_Object_ID
ZDB-GENE-030616-542    1
ZDB-GENE-050522-375    1
ZDB-GENE-071119-2      1
ZDB-GENE-081113-6      1
Name: DB_Object_ID, dtype: int64, Evidence_type
IEA    2
ISS    2
Name: Evidence_type, dtype: int64]}

total number of exact matches = 3
total number of close matches = 0
total number of lines processed = 4

test1.gpad.txt
test2.gpad.txt

@sierra-moxon sierra-moxon merged commit 7257dfa into master Dec 1, 2021