I changed Babel's preferred name lookup algorithm a while ago (TranslatorSRI/Babel#330), but I didn't change NodeNorm's preferred name lookup at the same time. This PR updates the prefix boost order to match Babel's and updates the algorithm to match Babel's as closely as possible.
Babel's algorithm does something significantly different from what NodeNorm's algorithm tries to do. When generating conflated synonym files, Babel uses the preferred name algorithm to find the best name for each unconflated clique, then picks the first preferred name when conflating multiple cliques. By the time we get to the create_node() code in NodeNorm, however, we've lost track of what the subcliques are, so instead we just run the "preferred label" algorithm on all the labels within the conflated clique and hope for the best.
This PR modifies NodeNorm to try to replicate Babel's algorithm: although we lose track of the subcliques, when we know that we're dealing with a conflation, we can walk through all the identifiers one-by-one and try to find a subclique with at least one non-empty label. We use a set() to ensure that this is as efficient as possible.
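A minimal sketch of the walk described above, not the actual code in normalizer.py. The names `PREFIX_BOOST_ORDER`, `preferred_label`, `conflated_preferred_label`, and the `subclique_of` lookup are all hypothetical stand-ins, and the boost order shown is illustrative only; the real order is whatever Babel defines.

```python
from typing import Dict, Iterable, List, Optional

# Illustrative boost order only (an assumption); the real list is defined by Babel.
PREFIX_BOOST_ORDER = ["CHEBI", "UNII", "DRUGBANK"]


def prefix_of(curie: str) -> str:
    """Return the prefix of a CURIE such as 'CHEBI:1234'."""
    return curie.split(":", 1)[0]


def preferred_label(labels_by_id: Dict[str, str],
                    boost_order: List[str]) -> Optional[str]:
    """Return the first non-empty label, trying boosted prefixes first."""
    rank = {prefix: i for i, prefix in enumerate(boost_order)}
    # sorted() is stable, so identifiers with the same rank keep their order.
    ordered = sorted(labels_by_id,
                     key=lambda c: rank.get(prefix_of(c), len(rank)))
    for curie in ordered:
        label = labels_by_id[curie]
        if label and label.strip():
            return label
    return None


def conflated_preferred_label(conflated_ids: Iterable[str],
                              subclique_of: Dict[str, List[str]],
                              labels_by_id: Dict[str, str],
                              boost_order: List[str]) -> Optional[str]:
    """Walk the identifiers in order and return the preferred label of the
    first subclique that has at least one non-empty label.

    `subclique_of` is a hypothetical lookup from an identifier to the
    members of its unconflated clique.
    """
    seen = set()  # identifiers whose subclique has already been checked
    for curie in conflated_ids:
        if curie in seen:
            continue
        members = subclique_of.get(curie, [curie])
        seen.update(members)
        name = preferred_label(
            {m: labels_by_id.get(m, "") for m in members}, boost_order)
        if name:
            return name
    return None
```

The `seen` set means each subclique is examined at most once, so the worst case (no subclique has a label) costs one pass over the identifiers rather than one pass per identifier.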
Ultimately, we should get rid of even this simplified code (#299) and just read the preferred name that Babel calculates for every clique, which is now present in the NodeNorm output files. Until then, I don't think we'll hit the worst-case performance of this identifier-by-identifier walk very often.
NodeNorm has always calculated the preferred name for a clique itself, even though we need to calculate this in Babel so we can use it in NameRes:
NodeNormalization/node_normalizer/normalizer.py
Lines 707 to 736 in 13ba01e
This means we need to manually keep this in sync with Babel.
As we redesign NodeNorm's database to add clique-level properties (#302), we should see if we can use the preferred name from the compendium files instead of having to recalculate them here.