Improve prefix checking #359

gaurav · 2024-10-16T04:40:50Z

We currently generate a prefix-based report that describes the prefix composition of each clique. For example, the final rows in reports/Gene.txt are:

frozenset({('RGD', 1), ('NCBIGENE', 1)})	22410
frozenset({('RGD', 1), ('ENSEMBL', 1), ('NCBIGENE', 1)})	24805
frozenset({('MGI', 1), ('NCBIGENE', 1)})	25522
frozenset({('NCBIGENE', 1), ('WORMBASE', 1)})	28785
frozenset({('MGI', 1), ('ENSEMBL', 1), ('NCBIGENE', 1)})	32732
frozenset({('ENSEMBL', 1)})	3364683
frozenset({('ENSEMBL', 1), ('NCBIGENE', 1)})	11464913
frozenset({('NCBIGENE', 1)})	44080138

Unfortunately, this format is hard to compare between different runs, and it doesn't directly tell us e.g. how many NCBIGene identifiers we have in total or whether we ever have a clique with multiple NCBIGene identifiers.

This issue proposes another way of getting this information:

We create a JSON file as a dictionary with every prefix in it. For every prefix, we determine:
- The total number of CURIEs in the system with that prefix (total_curies).
- The total number of unique CURIEs in the system with that prefix (total_unique_curies -- if not identical to total_curies, it means some duplication is going on).
- The total number of cliques containing this prefix (total_cliques_containing_prefix -- if equal to total_unique_curies, then every clique containing this prefix has exactly one identifier with it).
- The files this prefix is present in.
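
A minimal sketch of how these per-prefix counts could be computed, not the code in PR #363. It assumes a hypothetical input shape: an iterable of (filename, list of CURIEs) pairs, one pair per clique. The `files` key and the helper names are placeholders; only the three `total_*` names come from the proposal above.

```python
import json
from collections import defaultdict


def prefix_of(curie):
    """Return the part of a CURIE before the first colon."""
    return curie.split(":", 1)[0]


def build_prefix_report(cliques):
    """Build the per-prefix dictionary described above.

    `cliques` is assumed to be an iterable of (filename, [curie, ...])
    pairs, one pair per clique. This input shape is hypothetical.
    """
    totals = defaultdict(int)         # prefix -> number of CURIEs seen
    unique = defaultdict(set)         # prefix -> distinct CURIEs seen
    clique_counts = defaultdict(int)  # prefix -> cliques containing it
    files = defaultdict(set)          # prefix -> files it appears in

    for filename, curies in cliques:
        prefixes_in_clique = set()
        for curie in curies:
            prefix = prefix_of(curie)
            totals[prefix] += 1
            unique[prefix].add(curie)
            files[prefix].add(filename)
            prefixes_in_clique.add(prefix)
        # Count each clique once per prefix, however many CURIEs it holds.
        for prefix in prefixes_in_clique:
            clique_counts[prefix] += 1

    return {
        prefix: {
            "total_curies": totals[prefix],
            "total_unique_curies": len(unique[prefix]),
            "total_cliques_containing_prefix": clique_counts[prefix],
            "files": sorted(files[prefix]),
        }
        for prefix in sorted(totals)
    }


if __name__ == "__main__":
    example = [
        ("Gene.txt", ["NCBIGene:1", "ENSEMBL:ENSG1"]),
        ("Gene.txt", ["NCBIGene:2", "NCBIGene:3"]),
    ]
    print(json.dumps(build_prefix_report(example), indent=2))
```

Keeping a set of every unique CURIE in memory is the simplest approach but may be too expensive at the scale of hundreds of millions of CURIEs; a real implementation might need to process one prefix or one file at a time.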

We should then be able to come up with a script that can compare this file between two runs and let us know how things are changing.
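
One way such a comparison script might look, as a rough sketch. It assumes both reports are dictionaries keyed by prefix with the three count fields named above; `compare_prefix_reports` and `load_report` are hypothetical names, and a prefix missing from one run is treated as having counts of zero.

```python
import json

COUNT_FIELDS = (
    "total_curies",
    "total_unique_curies",
    "total_cliques_containing_prefix",
)


def load_report(path):
    """Load a prefix report JSON file into a dictionary."""
    with open(path) as f:
        return json.load(f)


def compare_prefix_reports(old, new):
    """Return {prefix: {field: new - old}} for every count that changed."""
    changes = {}
    for prefix in sorted(set(old) | set(new)):
        old_entry = old.get(prefix, {})
        new_entry = new.get(prefix, {})
        deltas = {}
        for field in COUNT_FIELDS:
            delta = new_entry.get(field, 0) - old_entry.get(field, 0)
            if delta != 0:
                deltas[field] = delta
        if deltas:
            changes[prefix] = deltas
    return changes


if __name__ == "__main__":
    old = {"NCBIGene": {"total_curies": 10, "total_unique_curies": 10,
                        "total_cliques_containing_prefix": 10}}
    new = {"NCBIGene": {"total_curies": 12, "total_unique_curies": 11,
                        "total_cliques_containing_prefix": 10}}
    for prefix, deltas in compare_prefix_reports(old, new).items():
        for field, delta in deltas.items():
            print(f"{prefix}\t{field}\t{delta:+d}")
```

Reporting only the non-zero deltas keeps the output short when most prefixes are unchanged between runs.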

gaurav · 2024-10-17T21:02:35Z

Here's where we're at as of 664f0d2 (in PR #363): prefix_report.json

The clique count doesn't line up with Babel 1.8 (it is VERY close: NodeNorm Dev has 476,991,762 cliques while we report 477,004,080 -- a difference of 12,318 cliques, which seems suspiciously small), so there might be some kind of bug in how this is generated. I will continue to poke.

gaurav · 2024-10-18T04:44:01Z

Fixed some bugs and here's where we're at as of ba62bd9 (in PR #363): prefix_report.json; we still have a clique count of 477,004,080 but now the CURIE count is close to the right answer too (NodeNorm Dev has 664,316,676 CURIEs while we report 664,529,929, a difference of 213,253).

gaurav · 2024-10-18T22:46:42Z

Ooo, for every for_clique/by_file entry, we should count how often we get the LEADER and how often we get a SECONDARY ID with the same prefix. Tricky, but should be doable.
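
A rough sketch of that counting, under two loudly stated assumptions: that the first CURIE in each clique is the leader, and that "secondary" means any other CURIE in the clique, counted under its own prefix. The function name and the input shape (a list of CURIEs per clique) are hypothetical, and the per-file breakdown is left out for brevity.

```python
from collections import defaultdict


def count_leader_and_secondary(cliques):
    """Count, per prefix, how often a CURIE is a leader or a secondary ID.

    Assumes each clique is a list of CURIEs whose first element is the
    leader. That ordering is an assumption, not something confirmed here.
    """
    counts = defaultdict(lambda: {"leader": 0, "secondary": 0})
    for curies in cliques:
        for index, curie in enumerate(curies):
            prefix = curie.split(":", 1)[0]
            role = "leader" if index == 0 else "secondary"
            counts[prefix][role] += 1
    # Convert to plain dicts so the result serializes cleanly to JSON.
    return {prefix: dict(roles) for prefix, roles in counts.items()}


if __name__ == "__main__":
    example = [
        ["NCBIGene:1", "ENSEMBL:ENSG1"],
        ["NCBIGene:2", "NCBIGene:3"],
        ["ENSEMBL:ENSG2"],
    ]
    print(count_leader_and_secondary(example))
```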

gaurav mentioned this issue Oct 17, 2024

Check duplicate prefixes #363

Draft

gaurav added this to the Babel November 2024 milestone Oct 21, 2024
