gpad diff tool, calculate summary stats and group by columns, reports on exact and close matches as well as missing lines in one file compared to another. #594

sierra-moxon · 2021-10-13T23:35:02Z

Just a draft of how I am thinking about GPAD diff tools -- looking for feedback!

The tool does two kinds of diffing:

Return a report with basic file statistics (i.e., the number of lines in the file, plus counts grouped by optional columns of the file).
Compare the GoAnnotation objects themselves (controlled by the -ed click parameter). We have to compare GPAD 1.2 to GPAD 2.0, and the columns won't be the same, so comparing columns alone is not enough. This part of the code checks the 'degree' of matching between two annotation objects, because even the metadata of those objects might prevent them from being identical (not to mention that one GoAnnotation in the source might be two GoAnnotations in the target).
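The 'degree of matching' idea can be sketched roughly as follows. This is only an illustration, not the code in this PR: the `Annotation` record, its field names, the `CORE_FIELDS` tuple, and the `match_degree` function are all hypothetical stand-ins for the GoAnnotation comparison, and the choice of which fields count as "core" versus metadata is an assumption.

```python
from dataclasses import dataclass


# Hypothetical, simplified stand-in for an annotation object (not the real class).
@dataclass(frozen=True)
class Annotation:
    subject: str
    relation: str
    term: str
    evidence: str
    assigned_by: str = ""
    date: str = ""


# Assumption: these fields decide whether two annotations say the same thing;
# the remaining fields are treated as metadata.
CORE_FIELDS = ("subject", "relation", "term", "evidence")


def match_degree(a: Annotation, b: Annotation) -> str:
    """Return 'exact', 'close', or 'none' for a pair of annotations."""
    if a == b:
        return "exact"
    if all(getattr(a, name) == getattr(b, name) for name in CORE_FIELDS):
        return "close"
    return "none"


# Made-up records: same annotation, with only the date string differing.
source = Annotation("EX:gene1", "enables", "GO:0003674", "ISS", "EX", "20211013")
identical = Annotation("EX:gene1", "enables", "GO:0003674", "ISS", "EX", "20211013")
reformatted = Annotation("EX:gene1", "enables", "GO:0003674", "ISS", "EX", "2021-10-13")
different = Annotation("EX:gene1", "enables", "GO:0003674", "IEA", "EX", "20211013")

print(match_degree(source, identical))
print(match_degree(source, reformatted))
print(match_degree(source, different))
```

Splitting the fields into core and metadata is what lets two files in different format versions still produce a "close" result when only the metadata differs.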

How to run it (test files attached):

python gpaddiffer.py -file1 noctua_zfin.gpad1 -file2 noctua_zfin.gpad2  -o test.out -cb Evidence_type -ed False -file_type gpad

-gp1, -gp2 = the first and second GPAD files, respectively. gp1 is the source and gp2 is the target, so reports are based on whether an annotation from gp1 is present in gp2. At the moment, the statistics assume GPAD 1.2.

-cb = the count-by column. The column name specified here is the same as in the GPAD file itself. You can specify as many group-by columns from the GPAD as needed; each one provides counts per column in a separate output stanza.

-ed = exclude_details. If False, the tool runs the detailed comparison of GoAnnotations; otherwise it reports only file statistics (i.e., the count of rows per evidence code, the total number of rows in each GPAD file, etc.). False is the default.
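As a rough picture of what the count-by option produces, here is a minimal sketch using only the Python standard library (the sample output in this PR appears to come from pandas instead). The `COLUMNS` list is a shortened, hypothetical column layout, the rows are made up, and `count_by` is an illustrative name, not a function from this PR.

```python
from collections import Counter

# Hypothetical column subset; a real GPAD file has more columns than this.
COLUMNS = ["DB", "DB_Object_ID", "Qualifier", "GO_ID", "Evidence_type"]

# Made-up tab-separated rows; lines starting with "!" are header/comment lines.
SAMPLE = (
    "! made-up header line\n"
    "EX\tgene1\tenables\tGO:0003674\tIEA\n"
    "EX\tgene2\tenables\tGO:0003674\tIEA\n"
    "EX\tgene3\tenables\tGO:0003674\tISS\n"
    "EX\tgene3\tinvolved_in\tGO:0008150\tISS\n"
)


def count_by(text, group_columns):
    """Count data rows overall and per value of each requested column."""
    rows = [
        dict(zip(COLUMNS, line.split("\t")))
        for line in text.splitlines()
        if line and not line.startswith("!")
    ]
    return {
        "total_rows": len(rows),
        # One Counter per group-by column, i.e. one output stanza each.
        "grouped_reports": {
            col: Counter(row[col] for row in rows) for col in group_columns
        },
    }


result = count_by(SAMPLE, ["DB_Object_ID", "Evidence_type"])
print(result["total_rows"])
for column, counts in result["grouped_reports"].items():
    print(column, dict(counts))
```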

Sample output; the group-by columns are 'DB_Object_ID' and 'Evidence_type':

Starting comparison 

test1.gpad
{'filename': 'test1.gpad', 'total_rows': 4, 'grouper': 'Evidence_type', 'grouped_reports': [DB_Object_ID
ZDB-GENE-030616-542    1
ZDB-GENE-050522-375    1
ZDB-GENE-081113-6      2
Name: DB_Object_ID, dtype: int64, Evidence_type
IEA    2
ISS    2
Name: Evidence_type, dtype: int64]}
test2.gpad
{'filename': 'test2.gpad', 'total_rows': 4, 'grouper': 'Evidence_type', 'grouped_reports': [DB_Object_ID
ZDB-GENE-030616-542    1
ZDB-GENE-050522-375    1
ZDB-GENE-071119-2      1
ZDB-GENE-081113-6      1
Name: DB_Object_ID, dtype: int64, Evidence_type
IEA    2
ISS    2
Name: Evidence_type, dtype: int64]}

total number of exact matches = 3
total number of close matches = 0
total number of lines processed = 4
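The three totals, together with the commit notes about reporting close-only lines as warnings and unmatched lines as errors, suggest a tallying loop along the following lines. This is a sketch under stated assumptions, not the PR's implementation: annotations are reduced to plain tuples, "close" is assumed to mean "agrees on everything except the last (metadata) element", and `diff_report` and the data are hypothetical.

```python
def diff_report(source, target):
    """Classify every source annotation by whether the target contains it."""
    target_exact = set(target)
    # Assumption: a 'close' match ignores the last tuple element (metadata).
    target_core = {ann[:-1] for ann in target}

    report = {"exact": 0, "close": 0, "processed": 0, "messages": []}
    for ann in source:
        report["processed"] += 1
        if ann in target_exact:
            report["exact"] += 1
        elif ann[:-1] in target_core:
            report["close"] += 1
            report["messages"].append(("WARNING", ann))  # only a close match
        else:
            report["messages"].append(("ERROR", ann))  # no match at all
    return report


# Made-up (subject, term, evidence, date) tuples.
source = [
    ("gene1", "GO:0003674", "IEA", "20211013"),
    ("gene2", "GO:0003674", "IEA", "20211013"),
    ("gene3", "GO:0003674", "ISS", "20211013"),
    ("gene3", "GO:0008150", "ISS", "20211013"),
]
target = [
    ("gene1", "GO:0003674", "IEA", "20211013"),
    ("gene2", "GO:0003674", "IEA", "20211013"),
    ("gene3", "GO:0003674", "ISS", "20211013"),
    ("gene4", "GO:0008150", "ISS", "20211013"),
]

report = diff_report(source, target)
print("total number of exact matches =", report["exact"])
print("total number of close matches =", report["close"])
print("total number of lines processed =", report["processed"])
```

Because the loop only walks the source, an annotation that exists only in the target (like the made-up `gene4` row) is never reported, which matches the source-to-target direction described for the two input files.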

test1.gpad.txt
test2.gpad.txt

sierra-moxon added 30 commits October 11, 2021 14:12

gpad diff stub from gocamgen

b4f5b18

add pandasql and gpad differ progress

4bdfe16

tweak output a bit, allow any column to be grouped

fbb4633

replace evidence codes with three letter terms, fix output

4cfb280

replace evidence codes with three letter terms, fix output

94eb158

incorporate Ben's changes to compare GoAssociation objects themselves

667a574

incorporate Ben's changes to compare GoAssociation objects themselves

a470f90

add click parameters to include/exclude details

8ea0f68

fix header

60a2592

add match score work into stats

327b1df

tweak output of stats for 1.2 GPAD

983ab3f

remove unused dependency

8334f6d

remove unused dependency

d83e32f

remove pandasql from requirements

87cacc2

add GAF support for diffs

d74f7df

add gaf support

800ce2d

fix match stats

28efe04

add report object with close match details

edb1eda

add report structure to report lines with only close as warning, no match as error

9f5247c

add report structure to report lines with only close as warning, no match as error

6784a33

add report structure to report lines with only close as warning, no match as error

09703e5

add reporting

37474f6

fix tests

7d90676

fix tests

7c649dc

tweaking reports

a61a82e

tweaking reports

2d306e0

tweaking reports

7784f5a

tweaking reports

0f6af92

report to files.

118898a

write out

00f0a16

sierra-moxon added 29 commits November 10, 2021 14:06

remove old comparison code

9e5e9ac

add normalize of relation

a082fb5

fix reporting

2561123

fix string formatting bug

1d29ed9

fix string formatting bug

1ab3362

only show grouped by rows where they don't match eachother

32fdb2f

fixing formatting

679b90f

normalize identifiers in subject gpad 1.2

a59a0a1

rename file

6146ab7

tweak reporting format

df263de

formating results

e2d4a41

formating results

624c4cb

formating results

533cf23

formating results

5479b6f

formating results

8e74f21

formating results

93dce7d

formating results

faa51ed

add restrict_to_decreases

f72af2c

add restrict to decreases processing

db276ee

add gpad 1.1 version

93a79f1

restrict group by choices to evidence_code, subject, object

25c54e5

restrict group by choices to evidence_code, subject, object

87ce9dd

add function outputs

638fc21

add function outputs

c30c939

add some doc strings

e08c4dc

stash changes

405ea4d

Merge branch 'master' into gpad_differ

89d67ef

commit to run tests again

824890b

Merge branch 'master' into gpad_differ

fb37f9e

sierra-moxon merged commit 7257dfa into master Dec 1, 2021
