Improve prefix checking #359

Open
gaurav opened this issue Oct 16, 2024 · 3 comments

Comments

@gaurav
Collaborator

gaurav commented Oct 16, 2024

We currently generate a prefix-based report that describes the prefix composition of each clique. For example, the final rows in reports/Gene.txt are:

frozenset({('RGD', 1), ('NCBIGENE', 1)})	22410
frozenset({('RGD', 1), ('ENSEMBL', 1), ('NCBIGENE', 1)})	24805
frozenset({('MGI', 1), ('NCBIGENE', 1)})	25522
frozenset({('NCBIGENE', 1), ('WORMBASE', 1)})	28785
frozenset({('MGI', 1), ('ENSEMBL', 1), ('NCBIGENE', 1)})	32732
frozenset({('ENSEMBL', 1)})	3364683
frozenset({('ENSEMBL', 1), ('NCBIGENE', 1)})	11464913
frozenset({('NCBIGENE', 1)})	44080138

Unfortunately, this isn't easy to compare between different runs, and it doesn't directly tell us e.g. how many NCBIGene identifiers we have in total, or whether we ever have a clique with multiple NCBIGene identifiers.
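To illustrate why the current format is awkward, here is a rough sketch of what it takes to get per-prefix totals out of it. It assumes each row is a `frozenset(...)` repr, a tab, and the number of cliques with that composition, and that each `(prefix, n)` pair means "n identifiers with this prefix per clique". Both are my reading of the rows above, and the function names are hypothetical.

```python
import ast
from collections import Counter


def parse_composition_line(line):
    """Parse one 'frozenset({...})<TAB>count' row into (composition, clique_count)."""
    composition_text, count_text = line.rstrip("\n").rsplit("\t", 1)
    # ast.literal_eval cannot evaluate the frozenset(...) call itself, so strip
    # the wrapper and evaluate the inner set literal instead.
    inner = composition_text.strip()[len("frozenset("):-1]
    pairs = ast.literal_eval(inner)
    return dict(pairs), int(count_text)


def summarize(lines):
    """Return (identifiers per prefix, cliques per prefix) as two Counters."""
    identifiers = Counter()
    cliques = Counter()
    for line in lines:
        if not line.strip():
            continue
        composition, clique_count = parse_composition_line(line)
        for prefix, per_clique in composition.items():
            # Assumption: per_clique is the number of identifiers with this
            # prefix in each clique of this composition.
            identifiers[prefix] += per_clique * clique_count
            cliques[prefix] += clique_count
    return identifiers, cliques
```

Even with this, the totals only cover the rows you feed in, and the output still isn't something you can diff between runs.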

This issue proposes another way of getting this information:

  • We create a JSON file containing a dictionary with every prefix in it.
  • For every prefix, we determine:
      • The total number of CURIEs in the system with that prefix (total_curies).
      • The total number of unique CURIEs in the system with that prefix (total_unique_curies -- if not identical to total_curies, it means some duplication is going on).
      • The total number of cliques containing this prefix (total_cliques_containing_prefix -- if equal to total_unique_curies, then every clique containing this prefix has exactly one identifier with it).
      • The files this prefix is present in.
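A minimal sketch of how such a report could be built, not the actual implementation in Babel. It assumes the cliques are available as `(filename, [curie, ...])` pairs, which is a hypothetical input shape, and `build_prefix_report` is a made-up name. The four output keys mirror the fields listed above.

```python
from collections import defaultdict


def curie_prefix(curie):
    """Return the part of a CURIE before the first colon."""
    return curie.split(":", 1)[0]


def build_prefix_report(cliques):
    """cliques: iterable of (filename, [curie, ...]) pairs (hypothetical shape)."""
    total_curies = defaultdict(int)
    unique_curies = defaultdict(set)
    cliques_containing = defaultdict(int)
    files = defaultdict(set)

    for filename, curies in cliques:
        prefixes_in_clique = set()
        for curie in curies:
            prefix = curie_prefix(curie)
            total_curies[prefix] += 1
            unique_curies[prefix].add(curie)
            files[prefix].add(filename)
            prefixes_in_clique.add(prefix)
        # Count each clique once per prefix, however many CURIEs share it.
        for prefix in prefixes_in_clique:
            cliques_containing[prefix] += 1

    return {
        prefix: {
            "total_curies": total_curies[prefix],
            "total_unique_curies": len(unique_curies[prefix]),
            "total_cliques_containing_prefix": cliques_containing[prefix],
            "files": sorted(files[prefix]),
        }
        for prefix in sorted(total_curies)
    }
```

The result can be written out with `json.dump(report, f, indent=2, sort_keys=True)`. Note that keeping a set of every unique CURIE in memory is probably too expensive at our scale, so a real version would likely need a database or an on-disk structure for that part.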

We should then be able to come up with a script that can compare this file between two runs and let us know how things are changing.
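Something like the following could serve as a starting point for the comparison script. It assumes each report is a dictionary keyed by prefix with the three numeric fields and a `files` list described above; the real prefix_report.json layout may differ, and `compare_prefix_reports` is a hypothetical name.

```python
NUMERIC_FIELDS = (
    "total_curies",
    "total_unique_curies",
    "total_cliques_containing_prefix",
)


def compare_prefix_reports(old, new):
    """Return only the prefixes that changed, with per-field deltas (new - old)."""
    changes = {}
    for prefix in sorted(set(old) | set(new)):
        old_entry = old.get(prefix, {})
        new_entry = new.get(prefix, {})
        entry = {}

        # A missing prefix is treated as having zero counts.
        for field in NUMERIC_FIELDS:
            delta = new_entry.get(field, 0) - old_entry.get(field, 0)
            if delta != 0:
                entry[field] = delta

        old_files = set(old_entry.get("files", []))
        new_files = set(new_entry.get("files", []))
        if new_files - old_files:
            entry["files_added"] = sorted(new_files - old_files)
        if old_files - new_files:
            entry["files_removed"] = sorted(old_files - new_files)

        if prefix not in old:
            entry["status"] = "added"
        elif prefix not in new:
            entry["status"] = "removed"

        if entry:
            changes[prefix] = entry
    return changes
```

Usage would be to `json.load` the report from each run and print the result of `compare_prefix_reports(old, new)`; an empty dictionary means nothing changed.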

@gaurav
Collaborator Author

gaurav commented Oct 17, 2024

Here's where we're at as of 664f0d2 (in PR #363): prefix_report.json

The clique count doesn't line up with Babel 1.8. It is VERY close: NodeNorm Dev has 476,991,762 cliques while we report 477,004,080, a difference of 12,318 cliques, which seems suspiciously small. So there might be some kind of bug in how this is generated. I will continue to poke.

@gaurav
Collaborator Author

gaurav commented Oct 18, 2024

Fixed some bugs and here's where we're at as of ba62bd9 (in PR #363): prefix_report.json; we still have a clique count of 477,004,080 but now the CURIE count is close to the right answer too (NodeNorm Dev has 664,316,676 CURIEs while we report 664,529,929, a difference of 213,253).

@gaurav
Collaborator Author

gaurav commented Oct 18, 2024

  • Ooo, for every for_clique/by_file entry, we should count how often we get the LEADER and how often we get a SECONDARY ID with the same prefix. Tricky, but should be doable.
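One possible reading of this, sketched under assumptions: for each prefix and file, count how many CURIEs with that prefix appear as the clique leader and how many appear as secondary identifiers. The sketch assumes the first identifier in each clique is the leader and takes `(filename, [curie, ...])` pairs as input; both the input shape and the function name are hypothetical, and the output does not try to match the actual for_clique/by_file structure.

```python
from collections import defaultdict


def count_leader_and_secondary(cliques):
    """cliques: iterable of (filename, [curie, ...]); the first CURIE is taken as leader."""
    counts = defaultdict(
        lambda: defaultdict(lambda: {"leader": 0, "secondary": 0})
    )
    for filename, curies in cliques:
        for index, curie in enumerate(curies):
            prefix = curie.split(":", 1)[0]
            # Assumption: position 0 is the leader, everything else is secondary.
            role = "leader" if index == 0 else "secondary"
            counts[prefix][filename][role] += 1
    # Convert nested defaultdicts to plain dicts so the result is JSON-friendly.
    return {
        prefix: {filename: dict(roles) for filename, roles in by_file.items()}
        for prefix, by_file in counts.items()
    }
```

If the intent is instead "how often a secondary ID shares the leader's prefix", the inner loop would compare each secondary prefix against the prefix of `curies[0]`.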

gaurav added this to the Babel November 2024 milestone on Oct 21, 2024