Remove NodeNorm's code for determining the preferred name for a clique #299

Open
gaurav opened this issue Nov 4, 2024 · 0 comments
gaurav commented Nov 4, 2024

NodeNorm has always calculated the preferred name for a clique itself, even though we need to calculate this in Babel so we can use it in NameRes:

# As per https://github.com/TranslatorSRI/Babel/issues/158, we select the first label from any
# identifier _except_ where one of the types is in preferred_name_boost_prefixes, in which case
# we prefer the prefixes listed there.
labels = list(filter(lambda x: len(x) > 0, [eid['l'] for eid in eids if 'l' in eid]))
# Note that types[canonical_id] goes from most specific to least specific, so we
# need to reverse it in order to apply preferred_name_boost_prefixes for the most
# specific type.
for typ in types[canonical_id][::-1]:
    if typ in config['preferred_name_boost_prefixes']:
        # This is the most specific matching type, so we use this.
        labels = map(lambda identifier: identifier.get('l', ''),
                     sort_identifiers_with_boosted_prefixes(
                         eids,
                         config['preferred_name_boost_prefixes'][typ]
                     ))
        break
# Filter out unsuitable labels.
labels = [l for l in labels if
          l and  # Ignore blank or empty names.
          not l.startswith('CHEMBL')  # Some CHEMBL names are just the identifier again.
          ]
# Note that the id will be from the equivalent ids, not the canonical_id. This is to handle conflation
if len(labels) > 0:
    node = {"id": {"identifier": eids[0]['i'], "label": labels[0]}}
else:
    # Sometimes, nothing has a label :(
    node = {"id": {"identifier": eids[0]['i']}}

This means we need to manually keep this in sync with Babel.
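
To make the behaviour that has to stay in sync easier to see, here is a minimal, self-contained sketch of the same selection logic. It is not the NodeNorm code: `sort_by_boosted_prefixes` is a hypothetical stand-in for `sort_identifiers_with_boosted_prefixes`, the function name `create_node_sketch`, the boost configuration and the sample identifiers are illustrative assumptions, and the sketch takes the types already ordered most specific first, so it does not reverse the list.

```python
def sort_by_boosted_prefixes(identifiers, boosted_prefixes):
    """Hypothetical stand-in: boosted prefixes first, in the order listed."""
    def rank(identifier):
        prefix = identifier["i"].split(":", 1)[0]
        if prefix in boosted_prefixes:
            return boosted_prefixes.index(prefix)
        return len(boosted_prefixes)
    # sorted() is stable, so identifiers with equal rank keep their order.
    return sorted(identifiers, key=rank)


def create_node_sketch(eids, types_most_specific_first, boost_config):
    """Pick a preferred label for a clique, roughly as the snippet above does."""
    # Default: labels in identifier order, skipping missing or empty ones.
    labels = [eid["l"] for eid in eids if eid.get("l")]
    for typ in types_most_specific_first:
        if typ in boost_config:
            # Most specific type with a boost list: reorder the identifiers.
            ordered = sort_by_boosted_prefixes(eids, boost_config[typ])
            labels = [eid.get("l", "") for eid in ordered]
            break
    # Drop empty labels and CHEMBL labels that just repeat the identifier.
    labels = [label for label in labels
              if label and not label.startswith("CHEMBL")]
    node = {"id": {"identifier": eids[0]["i"]}}
    if labels:
        node["id"]["label"] = labels[0]
    return node


# Illustrative data only (assumed boost order, not Babel's real configuration).
BOOST = {"biolink:SmallMolecule": ["DRUGBANK", "CHEBI"]}
EIDS = [
    {"i": "CHEBI:15365", "l": "acetylsalicylic acid"},
    {"i": "CHEMBL.COMPOUND:CHEMBL25", "l": "CHEMBL25"},
    {"i": "DRUGBANK:DB00945", "l": "Aspirin"},
    {"i": "PUBCHEM.COMPOUND:2244"},
]
boosted = create_node_sketch(
    EIDS, ["biolink:SmallMolecule", "biolink:ChemicalEntity"], BOOST)
unboosted = create_node_sketch(EIDS, ["biolink:ChemicalEntity"], BOOST)
```

With the boost list applied, the DRUGBANK label wins; without it, the first usable label in identifier order is used. Any change to Babel's boost order or filtering rules would have to be mirrored in logic like this by hand.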

As we redesign NodeNorm's database to add clique-level properties (#302), we should see if we can use the preferred name from the compendium files instead of having to recalculate them here.
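
As a rough sketch of what that could look like, the function below builds the node straight from one parsed compendium record instead of recomputing the label. The record layout is an assumption for illustration: it presumes one JSON object per line with an `identifiers` list (using the same `i` and `l` keys as above) and a clique-level `preferred_name` field; the function name `node_from_compendium_line` is hypothetical.

```python
import json


def node_from_compendium_line(line):
    """Build a node from one compendium line (assumed JSON Lines layout)."""
    record = json.loads(line)
    # Assumption: the first identifier in the record is the one to report.
    node = {"id": {"identifier": record["identifiers"][0]["i"]}}
    # Assumption: Babel stores its chosen name under "preferred_name".
    preferred_name = record.get("preferred_name")
    if preferred_name:
        node["id"]["label"] = preferred_name
    return node


# Illustrative record only, not taken from a real compendium file.
sample_line = json.dumps({
    "type": "biolink:SmallMolecule",
    "identifiers": [
        {"i": "CHEBI:15365", "l": "acetylsalicylic acid"},
        {"i": "DRUGBANK:DB00945", "l": "Aspirin"},
    ],
    "preferred_name": "Aspirin",
})
node = node_from_compendium_line(sample_line)
```

The appeal of this shape is that NodeNorm no longer needs its own copy of the boost configuration: whatever Babel decided is simply passed through, and a record without a preferred name yields a node without a label.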

gaurav added this to the NodeNorm database redesign milestone Nov 4, 2024
gaurav added a commit that referenced this issue Nov 8, 2024
…abel

I changed Babel's preferred name lookup algorithm a while ago (TranslatorSRI/Babel#330), but I didn't change NodeNorm's preferred name lookup at the same time. This PR updates the prefix boost order to match Babel's and updates the algorithm to match Babel's as closely as possible.

Babel's algorithm does something significantly different from what NodeNorm's algorithm tries to do. When generating conflated synonym files, Babel uses the preferred name algorithm to find the best name for each unconflated clique, then picks the first preferred name when conflating multiple cliques. By the time we get to the create_node() code in NodeNorm, however, we've lost track of what the subcliques are, so instead we just run the "preferred label" algorithm on all the labels within the conflated clique and hope for the best.

This PR modifies NodeNorm to try to replicate Babel's algorithm: although we lose track of the subcliques, when we know that we're dealing with a conflation, we can walk through all the identifiers one-by-one and try to find a subclique with at least one non-empty label. We use a set() to ensure that this is as efficient as possible.
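
One way to picture that walk is the sketch below. It is not the code from the PR: `subclique_of` is a hypothetical lookup from each identifier to the leading identifier of its unconflated clique, `first_usable_label` is a simplified stand-in for the preferred-label algorithm, and the `EXAMPLE:` identifiers and labels are made up.

```python
def first_usable_label(members):
    """Simplified stand-in for the preferred-label algorithm."""
    for member in members:
        label = member.get("l", "")
        if label and not label.startswith("CHEMBL"):
            return label
    return None


def preferred_label_for_conflation(eids, subclique_of):
    """Return the label of the first subclique that has a usable label."""
    seen = set()  # subcliques already examined, so each is tried only once
    for eid in eids:
        leader = subclique_of.get(eid["i"], eid["i"])
        if leader in seen:
            continue
        seen.add(leader)
        members = [e for e in eids
                   if subclique_of.get(e["i"], e["i"]) == leader]
        label = first_usable_label(members)
        if label:
            return label  # stop early: later subcliques are never inspected
    return None


# Made-up conflated clique: a gene subclique followed by a protein subclique.
subclique_of = {
    "EXAMPLE:G1": "EXAMPLE:G1", "EXAMPLE:G2": "EXAMPLE:G1",
    "EXAMPLE:P1": "EXAMPLE:P1", "EXAMPLE:P2": "EXAMPLE:P1",
}
eids = [
    {"i": "EXAMPLE:G1"},
    {"i": "EXAMPLE:G2", "l": ""},
    {"i": "EXAMPLE:P1"},
    {"i": "EXAMPLE:P2", "l": "example protein"},
]
label = preferred_label_for_conflation(eids, subclique_of)
```

Here the gene subclique has no usable label, so the walk falls through to the protein subclique. In this sketch the worst case is a clique where no subclique has a label, since every identifier is then visited before giving up.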

I don't think we'll hit the worst-case performance of this walk very often. Ultimately, we should get rid of even this simplified code (#299) and just read the preferred name calculated by Babel for every clique, which is now present in the NodeNorm output files.