Medgen updates #7
df9d980 to f208076
@@ -0,0 +1,115 @@
"""Mapping status between Medgen and Mondo"""
Repurposed some of my GARD code to create these attached artefacts.
Feels a bit duplicative, but this is probably fine for now. However, I'm likely to do more of this for other ingests as well.
I did simplify some of it. Most of the code is the same between Medgen and GARD, but some things are special to each.
medgen_terms_mapping_status.tsv.zip
obsoleted_medgen_terms_in_mondo.txt
Some counts:
# tot_medgen_only = len(existing_overlap_df[existing_overlap_df['status'] == 'medgen']) # n=66,224
# tot_mondo_only = len(existing_overlap_df[existing_overlap_df['status'] == 'mondo']) # n=2,362
# tot_both_only = len(existing_overlap_df[existing_overlap_df['status'] == 'both']) # n=14,263
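A minimal sketch of how counts like these could be derived, assuming the status of each term is simply whether its ID appears on the Medgen side, the Mondo side, or both. The function name and the toy IDs are hypothetical and use plain Python rather than the DataFrame code in this PR:

```python
from collections import Counter


def mapping_status(medgen_ids, mondo_xref_ids):
    """Label each ID 'medgen', 'mondo', or 'both' depending on where it appears."""
    medgen_ids, mondo_xref_ids = set(medgen_ids), set(mondo_xref_ids)
    status = {}
    for term_id in medgen_ids | mondo_xref_ids:
        if term_id in medgen_ids and term_id in mondo_xref_ids:
            status[term_id] = 'both'
        elif term_id in medgen_ids:
            status[term_id] = 'medgen'
        else:
            status[term_id] = 'mondo'
    return status


# Toy, made-up IDs purely for illustration
status = mapping_status(
    medgen_ids=['UMLS:C0000001', 'UMLS:C0000002', 'UMLS:C0000003'],
    mondo_xref_ids=['UMLS:C0000002', 'UMLS:C0000003', 'UMLS:C0000004'],
)
counts = Counter(status.values())
print(counts['medgen'], counts['mondo'], counts['both'])
```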
mondo_df['prefix'] = mondo_df['object_id'].apply(lambda x: x.split(':')[0])
mondo_df = mondo_df[mondo_df['prefix'].isin(MEDGEN_PREFIXES)]  # n=16,627
del mondo_df['prefix']
# preds = list(mondo_df['predicate_id'].unique())  # only skos:exactMatch
exactMatch thoughts
I imagined we want to do the same thing here as with GARD where we only care about exact matches.
However, a couple of things I've discovered so far:
- Our prior Mondo->Medgen mappings were only of skos:exactMatch
- All of the mappings coming out of Chris' pipeline are only of oboInOwl:hasDbXref or owl:equivalentClass
Replace all owl:equivalentClass with skos:exactMatch in the Medgen ingest. owl:equivalentClass is no longer relevant anywhere across our pipelines.
I believe I've figured out how to do this. In OBO, this:
equivalent_to: <CURIE>
Should be changed to this:
property_value: exactMatch <CURIE>
I just need to change the Perl that creates the OBO like this. Then the pipeline will generate the correct OWL.
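For illustration, here is the same line-level rewrite sketched in Python rather than in the Perl the pipeline actually uses. The function name, the regular expression, and the sample stanza are assumptions, and it only handles `equivalent_to:` lines whose value is a single CURIE:

```python
import re

# Matches an OBO tag-value line such as "equivalent_to: MONDO:0000001",
# keeping anything after the CURIE (e.g. a trailing "! comment") intact.
EQUIVALENT_TO = re.compile(r'^equivalent_to:\s*(\S+)(.*)$')


def equivalent_to_exact_match(obo_text):
    """Rewrite 'equivalent_to: <CURIE>' lines as 'property_value: exactMatch <CURIE>'."""
    out = []
    for line in obo_text.splitlines():
        match = EQUIVALENT_TO.match(line)
        if match:
            line = f'property_value: exactMatch {match.group(1)}{match.group(2)}'
        out.append(line)
    return '\n'.join(out)


# Made-up stanza purely for illustration
stanza = '\n'.join([
    '[Term]',
    'id: UMLS:C0000001',
    'name: example term',
    'equivalent_to: MONDO:0000001',
])
converted = equivalent_to_exact_match(stanza)
print(converted)
```

All other lines pass through unchanged, so the rewrite can be applied to a whole OBO file at once.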
…hey started with 'C' and a number.
- Update: prefixes: In addition to new classes above, renamed UMLS prefix with Medgen for all other classes (which happen to all start with 'CN:')
- Update: prefixes: Renamed prior MEDGEN: xref prefixes to Medgen_UID: These IDs don't start with C (CUI; Concept Unique Identifier) or CN (Common Name?). These are internal Medgen UIDs that are duplicative and not for clinical or analytical use.
- Rename: bin/ -> src/
- Add: output/: For both release outputs and non-release.
- Rename: release/ -> output/release/
- Add: mondo_mapping_status.py: For generating artefacts related to the reporting and management of mappings between Mondo and Medgen.
- Add: Python dependency requirements files.
- Add: run.sh: For running commands in ODK
- Add: config/medgen.sssom-metadata.yml
a89f466 to 05e8ecd
…riples as a function, (ii) updated namespacing of classes based on what type of MedGen/UMLS identifier they are.
- Update: Namespaces MedGen, MedGen_UI (removed), MedGenCUI
- Bugfix: SSSOM metadata yaml had a typo preventing conversion
- Bugfix: Makefile: (i) needed to rename a dependency, (ii) needed to run 'analyze' step after 'stage'
- Update: Makefile: Simplified some goals
- Bugfix: For UMLS CUIs (e.g. starts with C then #s), we chose to create duplicate classes with namespaces UMLS and MedGen. However, I just now made it so that all references (e.g. xrefs) are also duplicated, e.g. MedGen:1 maps to MedGen:2 and UMLS:2.
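A rough sketch of the reference duplication described in the last bugfix, assuming a CUI is recognised as "C" followed only by digits and that the namespaces are spelled `MedGen` and `UMLS` as in the commit message. The function name and the regular expression are hypothetical, not taken from the ingest code:

```python
import re

# Assumption: a UMLS CUI is "C" followed only by digits; IDs such as "CN..." do not match.
CUI_PATTERN = re.compile(r'^C\d+$')


def expand_reference(local_id):
    """Return the CURIEs a reference to `local_id` should be written as."""
    if CUI_PATTERN.match(local_id):
        # CUIs get duplicate references, one per namespace
        return [f'MedGen:{local_id}', f'UMLS:{local_id}']
    return [f'MedGen:{local_id}']


print(expand_reference('C0000002'))
print(expand_reference('CN000001'))
```

With this shape, a class that references a CUI ends up pointing at both the MedGen and the UMLS copy of the target, while non-CUI identifiers keep a single reference.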
@matentzn I'm going to merge this one for now. A handful of misc work was done on this ingest, but we're going to be pausing it for now, holding off to see if the MedGen team is able to handle a lot of the work that this ingest would otherwise be doing. For future work on this ingest, when/if ever needed, I'll open up new PRs. I did wrap up a couple things from our last meeting though. Had to change a conditional block, and updated the namespaces, e.g. MedGen -> MEDGEN.
Updates
a4eff72a96a4a018fb46a1222f25c968312cecb9
f461d52abf3c7eb4981c43ab0a63aca653a333db