-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Airr2akc refactor #18
base: main
Are you sure you want to change the base?
Conversation
…l dependency for writing output instead of direct printing
… cell expression, release version, alignment score etc), is invalid, convert to 'float'
- auto-detect classes from input schema file - single ak_airr output file containing all slots, classes, enums - reports errors when duplicate slots/enums are not identical - creates valid LinkML output, but: some issues need to be resolved in the input to be able to add all relevant info to the slots (e.g., 'required', 'nullable' and 'identifier' fields, which contradict when the same slot is re-used across different classes)
…and 'description' don't need to match.
# Conflicts: # project/jsonld/ak_schema.jsonld # project/linkml/ak_schema.yaml # project/owl/ak_schema.owl.ttl # src/ak_schema/datamodel/ak_schema.py
|
attempt to reopen PR to trigger automatic testing |
# Conflicts: # project/jsonld/ak_schema.jsonld # project/owl/ak_schema.owl.ttl # src/ak_schema/datamodel/ak_schema.py
… to have the same slot names with different content
Note: in the current solution, all slots are prefixed with the class name, because this is the only way it's possible to specify different values for 'required' (being required for some classes and not for others) |
Hi @LonnekeScheffer , I've added submodules for v1.5 and v2.0 of airr-standards so now those specs can be referenced directly instead of committing a local copy. Once you pull these changes, you will need to do an extra command to initialize the submodules (the README has it: Also I note that you were working from the v2.0 schema, and for now we want to use v1.5, so I changed the Makefile to reflect that. I did not regenerate the output. I will let you do that so you can verify everything is working properly. |
Regarding |
Hi @LonnekeScheffer , I've added the last bit to be done on the top post, then we can merge this and do a release. |
@bcorrie I got a little closer by 1) fixed bug with Dockerfile to install jq, 2) creating studies directory, 3) git clone https://github.com/sfu-ireceptor/config.git then changing branch to AKC, and 4) generating ak_schema.py
|
I'm able to run on main branch successfully, and verified all the latest code is on this branch, so it must be some difference in the schema causing the problem, but sure why. My guess is a name conflict... |
ok, that was part of it, conversion script had a "Repertoire" class which was conflicting with Repertoire brought over from AIRR schema. So now the error has something to do with an enum
|
@schristley @LonnekeScheffer you are doing this on the airrakc_refactor branch, after all of the specimen processing code has been merged in, is that correct? As far as I am aware, the branches that I was working on have been merged in... I can take a look at it. I can see why there might be conflicts in classes etc if all of the AIRR objects are now generating python classes (e.g. The Repertoire case above). In addition, if the Consolidated Spreadsheet is not up to date that may also cause some problems in the case where a slot is incorrectly being identified as referring to an external class. |
Yes
I turned on debugging and can see it gets the error when trying to create a NucleicAcidProcessing. The old NucleicAcidProcessing class is commented out so it is now the AIRR one, which has a few more fields in it, but template_class and it's enum seem to be same, so not sure why it is complaining about it. Also this study has the template_class set to RNA for all of the repertoires, so it's not like there is a null...
|
Another potential issue is that the converted AIRR classes aren't placed within the inheritance hierarchy, i.e. they don't have |
I am not sure how well it will handle the Enums. Let me check. If you look in the consolidated spreadsheet, there is a column that talks about "translation" - this is where an AIRR field actually requires some processing to convert the AIRR data to LinkML. Enum's are not likely to be 1-1, in the sense that in the AIRR world a string (e.g. "contains_schema_rearrangement") will need to be converted to an instance of a LinkML Enum (e.g. KeywordsStudyEnum) with value "contains_schema_rearrangement". Currently, the converter will just try to copy the string to slot or if the slot is denoted as an Enum in the consolidated spreadsheet it will probably create the class (which is what it seems to be doing) but then when it tries to assign the string it is failing ("ValueError: Unknown TemplateClassEnum enumeration code: "). So it looks to me like the mechanism to set a value for a LinkML enum code is not correct. Do you want me to investigate and make commits on this branch? |
Same is true of AIRR Ontologies. Right now, the converter is actually creating a YAML object that is the equivalent of the AIRR Ontology object (with and I think we should consider whether or not we just keep the ontology ID. By using a simple Enum for an ontology object we are discarding the Both the ADC and IEDB keep the
In the ADC there is a label element for all Ontology based fields. |
Yes please, I'm still investigating as well. From the error message, it suggests to me that it is trying to set a blank value for the template_class field? From what I can tell, there is no enum for template_class on the main branch. Here is the slot:
but now there is an enum attached to template_class
|
Just the ID needs to be brought over. The labels will be available, just not through the schema. Instead we will have database tables with ID and labels that come from our ak-ontology repository (not integrated completely yet). |
@bcorrie ok, I can manually reproduce in python. The issue seems to be that when creating an initial blank NucleicAcidProcessing, it wants a value for template class. Here is were I simulate your conversion code by creating an empty object
however if I create it with a template_class, then it is ok
|
I'm not sure why it is complaining about template_class specifically, it is not required. Plus NucleicAcidProcessing has other fields which are enums, like library_generation_method, and those seem fine as blank |
That is fine for our user interface and if using the DB, but if anyone is using the AKC using the query API or outputs data into a file using that API, they won't have the ontology labels. So by having the ontology labels in the AKC repository we are making our lives easier (in particular for our user interface), but we aren't making the lives of the consumer of the data in the AKC easier. Not necessarily against what you are suggesting, but playing devils advocate from the point of view of the end user of the AKC API. |
I have seen similar weird behaviour where some LinkML classes require things that others don't. If you look at the files generated by adc_convert (e.g. examples/adc/ipa1.ireceptor.org-PRJNA248411-AKC.json) you will see there is an empty investigation with an empty akc_id.
I have no idea why this is created and I can't seem to avoid its creation when I create an AIRRKnowledgeCommons object. |
Well, I guess I didn't explain myself well, the labels will be in AKC repository. They will be accessible from the query API. |
It seems to me that it would be reasonable for an Enum to require a value, since Enums probably shouldn't be created without a valid enum value I would assume??? So that behavior wasn't too surprising for me, but my code would need to make sure that happens. So maybe the question is why can you create a |
@bcorrie ok, I think I have it partially figured out. When creating an instance like this:
The blank string parameter is actually being assigned to the first slot in the class, which happens to be template_class! So the blank string, which is treated differently from
I replaced every instance in |
That makes a bit more sense now too. Before with
|
Some of the classes need to be changed. I didn't realize this before but the AIRR NucleicAcidProcessing slots were split out across multiple classes. This isn't necessary as they all belong under LibraryPreparationProcessing. Likewise ReceptorRepertoireSequencingAssay should be slots from AIRR SequencingRun. |
This is getting a bit off topic, as we are talking data modeling - maybe we need a new issue for this... My understanding was that we were trying to model these processes better than what we have in the AIRR Standard by using the modelling diagram that we have in the Miro board and trying to create a more precise and flexible model of how samples and data are processed. It is certainly less work (and therefore translation is easier) to just use the AIRR classes, but I thought we were trying to improve the model??? That is what I tried to do when I split these out in #12 . There are different OBI processes (cell processing, sequencing preparation, and library preparation - see the LinkML definitions for these classes) for each of these types of processing. Now we don't have to model these if we don't want, but to me the mapping was pretty natural and having them separate could be beneficial. So that is why the AIRR slots are spread out over multiple classes in the AKC LinkML. There is a 1:1 mapping (for the most part) of the actual slots, there is just not a 1:1 mapping of ADC class to AKC class. For the I don't think we have anything modelled in LinkML after the We can certainly use the AIRR classes for this directly if we want, in fact that is what I have as a place holder in the "Consolidate Mapping" sheet. But I would not have thought that those slots would be part of |
Fixing the initialization of the classes has eliminated this. |
While changing the initialization works for most classes, it doesn't work for
@bcorrie I'm not sure of the best way to resolve this in the convert script. Essentially we need some flag or info that lets us know how to init and/or we handle As a quick hack to test the rest of the conversion, can change the
|
It's worthwhile to have these discussions, though I'd prefer to do them in schema KG or other appropriate meeting versus writing lots of words. We are talking about semantics so it's better to talk it out. I committed a change to ak_specimens.yaml to the data model that I'm thinking. I also changed the name of |
@bcorrie if you put the |
needed to change the name in convert scripts too, with the |
hi @jamesaoverton , your IEDB conversion sample runs ok with the changes on this branch, except for the conversion to TTL. There is this error, which isn't very informative. Maybe something isn't conformant to the grammatical rules for TTL. Can you please take a look? FYI- you will want to do a
|
Major refactoring of airr2akc script: resolves issues raised in airr-knowledge/issues#28