Updated NodeNorm preferred label to match Babel's #300
Merged
We use the order in TranslatorSRI/Babel#330
This reverts commit 6e8d1fa.
This will bring it in line with Babel.
I changed Babel's preferred name lookup algorithm a while ago (TranslatorSRI/Babel#330), but I didn't change NodeNorm's preferred name lookup at the same time. This PR updates the prefix boost order to match Babel's and updates the algorithm to match Babel's as closely as possible.
Babel's algorithm does something significantly different from what NodeNorm's algorithm tries to do. When generating conflated synonym files, Babel uses the preferred name algorithm to find the best name for each unconflated clique, then picks the first preferred name when conflating multiple cliques. However, by the time we get to the create_node() code in NodeNorm, we've lost track of what the subcliques are, so instead we just run the "preferred label" algorithm on all the labels within the conflated clique and hope for the best.
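To make the difference concrete, here is a minimal sketch of the two strategies. It is not the actual Babel or NodeNorm code: `preferred_label()` is a simplified stand-in that only applies a prefix boost order and skips empty labels, and the function names, the `(CURIE, label)` tuple shape, and the example prefixes are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

# Hypothetical shape: (CURIE, label). The label may be an empty string.
Identifier = Tuple[str, str]


def preferred_label(identifiers: Sequence[Identifier],
                    boost_prefixes: Sequence[str]) -> str:
    """Simplified stand-in for the preferred name algorithm.

    Identifiers whose prefix appears earlier in boost_prefixes win;
    everything else keeps its original order. Empty labels are skipped.
    """
    def rank(identifier: Identifier) -> int:
        prefix = identifier[0].split(":", 1)[0]
        if prefix in boost_prefixes:
            return boost_prefixes.index(prefix)
        return len(boost_prefixes)

    # sorted() is stable, so ties keep their original order.
    for _, label in sorted(identifiers, key=rank):
        if label:
            return label
    return ""


def name_per_clique_then_first(subcliques: List[List[Identifier]],
                               boost_prefixes: Sequence[str]) -> str:
    """Babel-style: name each unconflated clique, then take the first name."""
    for clique in subcliques:
        name = preferred_label(clique, boost_prefixes)
        if name:
            return name
    return ""


def name_from_pooled_labels(subcliques: List[List[Identifier]],
                            boost_prefixes: Sequence[str]) -> str:
    """Old NodeNorm-style: pool every identifier and run the algorithm once."""
    pooled = [identifier for clique in subcliques for identifier in clique]
    return preferred_label(pooled, boost_prefixes)
```

With two subcliques `[("CHEBI:1", "aspirin")]` and `[("DRUGBANK:1", "acetylsalicylic acid")]` and a made-up boost order of `["DRUGBANK", "CHEBI"]`, the per-clique strategy returns the first clique's name, while the pooled strategy lets the boosted prefix from the second clique win.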
This PR modifies NodeNorm to try to replicate Babel's algorithm. Although we lose track of the subcliques, when we know that we're dealing with a conflation, we can walk through all the identifiers one by one and try to find a subclique with at least one non-empty label. We use a set() to keep this as efficient as possible.
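The walk can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: `first_labeled_subclique()` and `lookup_clique` are hypothetical names, the `"i"`/`"l"` dictionary keys are assumed for the identifier and label fields, and using the set to skip identifiers whose subclique has already been fetched is my guess at how the set() keeps the walk cheap.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical shape: {"i": CURIE, "l": label}. The label may be missing or empty.
Identifier = Dict[str, str]


def first_labeled_subclique(
    conflated_ids: List[Identifier],
    lookup_clique: Callable[[str], List[Identifier]],
) -> Optional[List[Identifier]]:
    """Return the first subclique that has at least one non-empty label.

    lookup_clique is a hypothetical callback that returns the unconflated
    clique containing the given CURIE.
    """
    checked = set()  # CURIEs whose subclique has already been examined
    for identifier in conflated_ids:
        curie = identifier["i"]
        if curie in checked:
            # This CURIE belongs to a subclique we already fetched.
            continue
        clique = lookup_clique(curie)
        checked.add(curie)
        checked.update(member["i"] for member in clique)
        if any(member.get("l") for member in clique):
            return clique
    # Worst case: every subclique was fetched and none had a label.
    return None
```

In this sketch the number of lookups is bounded by the number of subcliques rather than the number of identifiers, and the worst case only occurs when the labeled subclique comes last or does not exist.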
I don't think we'll hit the worst-case performance very often. Ultimately, we should get rid of even this simplified code (#299) and just read the preferred name calculated by Babel for every clique, which is now present in the NodeNorm output files.