We currently generate a prefix-based report that describes the prefix composition of each clique -- for example, the final rows in reports/Gene.txt are:
Unfortunately, this report isn't easy to compare between different runs, and it doesn't tell us, for example, how many NCBIGene identifiers we have in total or whether we ever have a clique with multiple NCBIGene identifiers.
This issue proposes another way of getting this information:
We create a JSON file containing a dictionary with an entry for every prefix.
For every prefix, we determine:
- The total number of CURIEs in the system with that prefix (total_curies).
- The total number of unique CURIEs in the system with that prefix (total_unique_curies -- if this differs from total_curies, some CURIEs appear more than once).
- The total number of cliques containing this prefix (total_cliques_containing_prefix -- if this equals total_unique_curies, every clique containing this prefix has exactly one identifier with it).
- The files this prefix is present in.
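The per-prefix counts listed above could be computed along these lines. This is only a minimal sketch, not the code in the PR: it assumes the cliques have already been loaded into memory as lists of CURIE strings grouped by file name, and `build_prefix_report`, `prefix_of` and the `files` key are hypothetical names chosen for illustration.

```python
from collections import defaultdict


def prefix_of(curie):
    """Return the part of a CURIE before the first colon."""
    return curie.split(":", 1)[0]


def build_prefix_report(cliques_by_file):
    """Summarise prefix usage.

    cliques_by_file (assumed shape): {filename: [[curie, curie, ...], ...]}
    """
    total_curies = defaultdict(int)
    unique_curies = defaultdict(set)
    cliques_containing = defaultdict(int)
    files = defaultdict(set)

    for filename, cliques in cliques_by_file.items():
        for clique in cliques:
            prefixes_in_clique = set()
            for curie in clique:
                prefix = prefix_of(curie)
                total_curies[prefix] += 1
                unique_curies[prefix].add(curie)
                prefixes_in_clique.add(prefix)
            # Count each clique once per prefix, however many CURIEs it holds.
            for prefix in prefixes_in_clique:
                cliques_containing[prefix] += 1
                files[prefix].add(filename)

    return {
        prefix: {
            "total_curies": total_curies[prefix],
            "total_unique_curies": len(unique_curies[prefix]),
            "total_cliques_containing_prefix": cliques_containing[prefix],
            "files": sorted(files[prefix]),  # hypothetical key name
        }
        for prefix in sorted(total_curies)
    }


example = {
    "Gene.txt": [["NCBIGene:1", "ENSEMBL:A"], ["NCBIGene:2", "NCBIGene:3"]],
    "Protein.txt": [["NCBIGene:1"]],
}
report = build_prefix_report(example)
```

Keeping a set of every unique CURIE in memory is fine for a toy example, but with hundreds of millions of CURIEs a real implementation would probably need something more economical.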
We should then be able to come up with a script that can compare this file between two runs and let us know how things are changing.
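A comparison script might look something like the sketch below. It assumes both reports are dictionaries keyed by prefix with the three count fields named in this issue; `compare_prefix_reports` is a hypothetical name, and a prefix missing from one run is treated as having counts of zero.

```python
COUNT_FIELDS = (
    "total_curies",
    "total_unique_curies",
    "total_cliques_containing_prefix",
)


def compare_prefix_reports(old, new):
    """Return {prefix: {field: new_value - old_value}} for counts that changed."""
    changes = {}
    for prefix in sorted(set(old) | set(new)):
        before = old.get(prefix, {})
        after = new.get(prefix, {})
        deltas = {}
        for field in COUNT_FIELDS:
            # A prefix absent from one run counts as zero in that run.
            delta = after.get(field, 0) - before.get(field, 0)
            if delta != 0:
                deltas[field] = delta
        if deltas:
            changes[prefix] = deltas
    return changes


old_report = {
    "NCBIGene": {
        "total_curies": 4,
        "total_unique_curies": 3,
        "total_cliques_containing_prefix": 3,
    },
}
new_report = {
    "NCBIGene": {
        "total_curies": 5,
        "total_unique_curies": 3,
        "total_cliques_containing_prefix": 3,
    },
    "ENSEMBL": {
        "total_curies": 1,
        "total_unique_curies": 1,
        "total_cliques_containing_prefix": 1,
    },
}
changes = compare_prefix_reports(old_report, new_report)
```

In practice the two dictionaries would be read with `json.load` from the report files of the two runs being compared.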
The clique count doesn't line up with Babel 1.8. It is VERY close: NodeNorm Dev has 476,991,762 cliques while we report 477,004,080, a difference of 12,318 cliques, which seems suspiciously small, so there might be some kind of bug in how this is generated. I will continue to poke.
Fixed some bugs and here's where we're at as of ba62bd9 (in PR #363): prefix_report.json; we still have a clique count of 477,004,080 but now the CURIE count is close to the right answer too (NodeNorm Dev has 664,316,676 CURIEs while we report 664,529,929, a difference of 213,253).
Ooo, for every for_clique/by_file entry, we should count how often we get the LEADER and how often we get a SECONDARY ID with the same prefix. Tricky, but should be doable.
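One possible reading of this, sketched below: for each prefix, count how many cliques it leads and how many times it appears as a non-leading identifier. This assumes the first CURIE in each clique is the leader, which may not match how the report code actually tracks leaders, and `count_leader_and_secondary` is a hypothetical name.

```python
from collections import defaultdict


def count_leader_and_secondary(cliques):
    """Count, per prefix, leader and secondary occurrences.

    Assumes each clique is a list of CURIEs whose first entry is the leader.
    """
    counts = defaultdict(lambda: {"leader": 0, "secondary": 0})
    for clique in cliques:
        for position, curie in enumerate(clique):
            prefix = curie.split(":", 1)[0]
            role = "leader" if position == 0 else "secondary"
            counts[prefix][role] += 1
    return dict(counts)


cliques = [
    ["NCBIGene:1", "ENSEMBL:A", "ENSEMBL:C"],
    ["NCBIGene:2"],
]
roles = count_leader_and_secondary(cliques)
```

Running this once per file would give the per-file breakdown.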