Remove NodeNorm's code for determining the preferred name for a clique #299

gaurav · 2024-11-04T22:35:32Z

NodeNorm has always calculated the preferred name for a clique itself, even though we need to calculate this in Babel so we can use it in NameRes:

NodeNormalization/node_normalizer/normalizer.py

Lines 707 to 736 in 13ba01e

    # As per https://github.com/TranslatorSRI/Babel/issues/158, we select the first label from any
    # identifier _except_ where one of the types is in preferred_name_boost_prefixes, in which case
    # we prefer the prefixes listed there.
    labels = list(filter(lambda x: len(x) > 0, [eid['l'] for eid in eids if 'l' in eid]))
    # Note that types[canonical_id] goes from most specific to least specific, so we
    # need to reverse it in order to apply preferred_name_boost_prefixes for the most
    # specific type.
    for typ in types[canonical_id][::-1]:
        if typ in config['preferred_name_boost_prefixes']:
            # This is the most specific matching type, so we use this.
            labels = map(lambda identifier: identifier.get('l', ''),
                                  sort_identifiers_with_boosted_prefixes(
                                      eids,
                                      config['preferred_name_boost_prefixes'][typ]
                                  ))
            break
    # Filter out unsuitable labels.
    labels = [l for l in labels if
              l and                               # Ignore blank or empty names.
              not l.startswith('CHEMBL')          # Some CHEMBL names are just the identifier again.
              ]
    # Note that the id will be from the equivalent ids, not the canonical_id.  This is to handle conflation
    if len(labels) > 0:
        node = {"id": {"identifier": eids[0]['i'], "label": labels[0]}}
    else:
        # Sometimes, nothing has a label :(
        node = {"id": {"identifier": eids[0]['i']}}

This means we need to manually keep this in sync with Babel.
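To make the duplicated logic easier to compare against Babel, here is a minimal, self-contained sketch of the selection steps in the snippet above. The implementation of `sort_identifiers_with_boosted_prefixes()` is not shown in this issue, so `sort_by_boosted_prefixes()` below is a hypothetical stand-in (a stable sort by position in the boost list), and the identifiers and boost configuration are illustrative only.

```python
def sort_by_boosted_prefixes(identifiers, boosted_prefixes):
    """Hypothetical stand-in for sort_identifiers_with_boosted_prefixes().

    Identifiers whose CURIE prefix appears in boosted_prefixes come first, in
    boost order; everything else keeps its original relative order.
    """
    def rank(identifier):
        prefix = identifier['i'].split(':', 1)[0]
        try:
            return boosted_prefixes.index(prefix)
        except ValueError:
            return len(boosted_prefixes)  # Unboosted prefixes sort last.

    return sorted(identifiers, key=rank)  # sorted() is stable.


def pick_preferred_label(eids, types, boost_config):
    """Return the preferred label for a clique, or None if nothing is usable."""
    # Default: non-empty labels in identifier order.
    labels = [eid['l'] for eid in eids if eid.get('l')]

    # Same traversal as the snippet above: walk the reversed type list and
    # use the boost list of the first type that has one.
    for typ in types[::-1]:
        if typ in boost_config:
            ordered = sort_by_boosted_prefixes(eids, boost_config[typ])
            labels = [identifier.get('l', '') for identifier in ordered]
            break

    # Drop blank labels and CHEMBL names that just repeat the identifier.
    labels = [label for label in labels
              if label and not label.startswith('CHEMBL')]
    return labels[0] if labels else None


# Illustrative data only.
eids = [
    {'i': 'CHEMBL.COMPOUND:CHEMBL25', 'l': 'CHEMBL25'},
    {'i': 'CHEBI:15365', 'l': 'acetylsalicylic acid'},
    {'i': 'PUBCHEM.COMPOUND:2244', 'l': 'Aspirin'},
]
types = ['biolink:SmallMolecule', 'biolink:ChemicalEntity']
boost_config = {'biolink:ChemicalEntity': ['PUBCHEM.COMPOUND', 'CHEBI']}

boosted = pick_preferred_label(eids, types, boost_config)
unboosted = pick_preferred_label(eids, types, {})
```

With a boost list in play, the chosen label depends on prefix order rather than identifier order, which is exactly the part that has to be kept in sync with Babel by hand.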

As we redesign NodeNorm's database to add clique-level properties (#302), we should see if we can use the preferred name from the compendium files instead of having to recalculate it here.
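As a rough sketch of what reading the name directly could look like, the following assumes each compendium line is a JSON object with an `identifiers` list (using the same `i`/`l` keys as above) and a `preferred_name` field. The field name and file layout are assumptions for illustration and should be checked against the actual Babel output.

```python
import json


def read_preferred_names(lines):
    """Map each clique's first identifier to its precomputed preferred name.

    Assumes one JSON object per line with 'identifiers' and 'preferred_name'
    keys (assumed field names, not confirmed by this issue).
    """
    names = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # Skip blank lines.
        clique = json.loads(line)
        identifiers = clique.get('identifiers', [])
        if not identifiers:
            continue  # Nothing to key the clique on.
        clique_id = identifiers[0]['i']
        # Store None when the name is missing or empty.
        names[clique_id] = clique.get('preferred_name') or None
    return names


# Illustrative records only.
example_lines = [
    '{"type": "biolink:SmallMolecule", "identifiers": '
    '[{"i": "CHEBI:15365", "l": "acetylsalicylic acid"}], '
    '"preferred_name": "Aspirin"}',
    '',
    '{"identifiers": [{"i": "EX:2"}]}',
]
preferred_names = read_preferred_names(example_lines)
```

With something like this in the loader, NodeNorm would store the name at load time and never need the boost-prefix configuration at query time.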

…abel I changed Babel's preferred name lookup algorithm a while ago (TranslatorSRI/Babel#330), but I didn't change NodeNorm's preferred name lookup at the same time. This PR updates the prefix boost order to match Babel's and updates the algorithm to match Babel's as closely as possible.

Babel's algorithm does something significantly different from what NodeNorm's algorithm tries to do: when generating conflated synonym files, Babel uses the preferred name algorithm to find the best name for each unconflated clique, then picks the first preferred name when conflating multiple cliques. However, by the time we get to the create_node() code in NodeNorm, we've lost track of what the subcliques are, so instead we just run the "preferred label" algorithm on all the labels within the conflated clique and hope for the best.

This PR modifies NodeNorm to try to replicate Babel's algorithm: although we lose track of the subcliques, when we know that we're dealing with a conflation, we can walk through all the identifiers one-by-one and try to find a subclique with at least one non-empty label. We use a set() to ensure that this is as efficient as possible, and I don't think we'll hit the worst-case performance very often.

Ultimately, we should get rid of even this simplified code (#299) and just read the preferred name calculated by Babel for every clique, which is now present in the NodeNorm output files.
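The walk over a conflated clique described in the PR text above might look roughly like the sketch below. It is not the code from the PR: `subclique_members` is a hypothetical mapping from a subclique's first identifier to the CURIEs in that subclique, and `first_non_empty_label()` is a simplified placeholder for the real preferred-label algorithm.

```python
def first_non_empty_label(members):
    """Simplified placeholder for the preferred-label algorithm."""
    for member in members:
        if member.get('l'):
            return member['l']
    return None


def label_for_conflated_clique(eids, subclique_members,
                               pick_label=first_non_empty_label):
    """Return the label of the first subclique that has a usable label.

    eids: identifier dicts for the whole conflated clique, in order.
    subclique_members: hypothetical mapping from a subclique's first
        identifier to the list of CURIEs in that subclique.
    """
    by_curie = {eid['i']: eid for eid in eids}
    visited = set()  # CURIEs already covered by an earlier subclique.

    for eid in eids:
        curie = eid['i']
        if curie in visited:
            continue  # Set membership makes this skip cheap.
        # Treat an unknown identifier as a one-member subclique.
        member_curies = subclique_members.get(curie, [curie])
        visited.update(member_curies)
        members = [by_curie[c] for c in member_curies if c in by_curie]
        label = pick_label(members)
        if label:
            return label
    return None


# Illustrative data: the first subclique has no usable label.
conflated = [
    {'i': 'A:1'},
    {'i': 'A:2', 'l': ''},
    {'i': 'B:1', 'l': 'second subclique'},
    {'i': 'B:2', 'l': 'another name'},
]
members_by_leader = {'A:1': ['A:1', 'A:2'], 'B:1': ['B:1', 'B:2']}
conflated_label = label_for_conflated_clique(conflated, members_by_leader)
```

In this sketch the worst case is a clique where no subclique has a label, so every identifier is visited once before returning nothing.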

gaurav added this to the NodeNorm database redesign milestone Nov 4, 2024

gaurav mentioned this issue Nov 4, 2024

Updated NodeNorm preferred label to match Babel's #300

Merged
