Updated NodeNorm preferred label to match Babel's #300

Merged: 24 commits merged into master from sync-nodenorm-label-with-babel on Nov 8, 2024

@gaurav (Contributor) commented Nov 4, 2024

I changed Babel's preferred name lookup algorithm a while ago (TranslatorSRI/Babel#330), but I didn't change NodeNorm's preferred name lookup at the same time. This PR updates the prefix boost order to match Babel's and updates the algorithm to match Babel's as closely as possible.

Babel's algorithm does something significantly different from what NodeNorm's algorithm tries to do. When generating conflated synonym files, Babel uses the preferred name algorithm to find the best name for each unconflated clique, then picks the first preferred name when conflating multiple cliques. However, by the time we get to the create_node() code in NodeNorm, we've lost track of what the subcliques are, so until now we've just run the "preferred label" algorithm on all the labels within the conflated clique and hoped for the best.
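To illustrate why the two approaches can disagree, here is a minimal sketch, not the actual Babel or NodeNorm code. It assumes a simplified preferred-name rule (take the first non-empty label from a boosted prefix, otherwise the first non-empty label); the function names, the `(CURIE, label)` tuple shape, and the example identifiers are all hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

LabeledId = Tuple[str, str]  # hypothetical shape: (CURIE, label); label may be ""


def preferred_name(clique: Sequence[LabeledId],
                   boost_prefixes: Sequence[str]) -> Optional[str]:
    """Assumed simplification: boosted prefixes win, in boost order."""
    for prefix in boost_prefixes:
        for curie, label in clique:
            if label and curie.split(":", 1)[0] == prefix:
                return label
    # No boosted prefix had a label: fall back to the first non-empty label.
    for _, label in clique:
        if label:
            return label
    return None


def name_per_clique_then_first(cliques: Sequence[Sequence[LabeledId]],
                               boost_prefixes: Sequence[str]) -> Optional[str]:
    """Babel-style: name each unconflated clique, then take the first name."""
    for clique in cliques:
        name = preferred_name(clique, boost_prefixes)
        if name:
            return name
    return None


def name_from_pooled_labels(cliques: Sequence[Sequence[LabeledId]],
                            boost_prefixes: Sequence[str]) -> Optional[str]:
    """Pooled style: run the algorithm once over every label in the conflation."""
    pooled: List[LabeledId] = [item for clique in cliques for item in clique]
    return preferred_name(pooled, boost_prefixes)


# Two hypothetical subcliques; only the second uses the boosted prefix "B".
example_cliques = [[("A:1", "alpha")], [("B:2", "beta")]]
per_clique = name_per_clique_then_first(example_cliques, ["B"])  # "alpha"
pooled = name_from_pooled_labels(example_cliques, ["B"])         # "beta"
```

Under these assumptions the pooled approach lets a boosted prefix in a later subclique override the name of the first subclique, which is the kind of mismatch this PR is trying to remove.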

This PR modifies NodeNorm to try to replicate Babel's algorithm: although we lose track of the subcliques, when we know that we're dealing with a conflation, we can walk through all the identifiers one-by-one and try to find a subclique with at least one non-empty label. We use a set() to ensure that this is as efficient as possible.
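The walk described above could look roughly like the following sketch. It is an illustration under stated assumptions rather than the code in this PR: `lookup_subclique` is a hypothetical callable that returns the members of the unconflated clique containing a given CURIE, and the `{"id": ..., "label": ...}` dictionaries are an assumed shape. The set records identifiers whose subclique has already been checked, so each label-less subclique is looked up only once.

```python
from typing import Callable, Dict, List, Sequence, Set

Identifier = Dict[str, str]  # hypothetical shape: {"id": CURIE, "label": label or ""}


def first_labeled_subclique(
    conflated_ids: Sequence[str],
    lookup_subclique: Callable[[str], List[Identifier]],
) -> List[Identifier]:
    """Return the first subclique (in identifier order) with a non-empty label."""
    already_checked: Set[str] = set()
    for curie in conflated_ids:
        if curie in already_checked:
            # This identifier belongs to a subclique we already ruled out.
            continue
        members = lookup_subclique(curie)
        if any(member.get("label") for member in members):
            return members
        # No labels here: remember every member so we skip them later.
        already_checked.update(member["id"] for member in members)
    return []  # worst case: every subclique was looked up and none had a label


# Hypothetical data: EX:1 and EX:2 share a label-less subclique, EX:3 has a label.
unlabeled = [{"id": "EX:1", "label": ""}, {"id": "EX:2", "label": ""}]
subcliques = {
    "EX:1": unlabeled,
    "EX:2": unlabeled,
    "EX:3": [{"id": "EX:3", "label": "example name"}],
}
members = first_labeled_subclique(["EX:1", "EX:2", "EX:3"], lambda c: subcliques[c])
```

The preferred label algorithm would then run on `members` only, instead of on every label in the conflated clique.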

Ultimately, we should get rid of even this simplified code (#299) and just read the preferred name calculated by Babel for every clique, which is now present in the NodeNorm output files. In the meantime, I don't think the identifier-by-identifier walk will hit its worst-case performance very often.

@gaurav requested a review from cbizon November 4, 2024 22:38
@gaurav removed the request for review from cbizon November 7, 2024 15:56
@gaurav requested a review from cbizon November 7, 2024 16:25
@gaurav merged commit 73b56dc into master Nov 8, 2024
@gaurav deleted the sync-nodenorm-label-with-babel branch November 8, 2024 05:35