comments on graph-based genome annotation model #7
Comments
Hi Chris, I'm sitting in Durban airport digesting this and the document you shared (https://docs.google.com/document/d/1iZbtUurhUuqsM2oxnfklXa18yGlNZwEYI8npWRukGyY/edit) (I hope it is ok if I mention that here). A few link / info requests: What is turtle?
Turtle is an RDF triple language used by FALDO (https://github.com/JervenBolleman/FALDO/): https://www.w3.org/TeamSubmission/turtle/ Datomic is a graph datastore (I believe): http://docs.datomic.com/schema.html I'm sure that @cmungall will have more to say about using Datomic versus Neo4j. Judging from this Stack Overflow answer (http://stackoverflow.com/a/17898956/1739366), the appeal seems to be a better query language and optimized performance for large read transactions.
Yes, Turtle is just one concrete form of RDF. RDF dbs and graph dbs have a lot to offer one another; the difference between them is often cultural rather than technical.
Some brief comments on:
https://github.com/SANBI-SA/combat_tb_model/blob/master/docs/genome_annotation_model.md
Very clearly documented, thank you @thobalose and @pvanheus. The overall strategy makes lots of sense. Chado was designed as a graph database, but layered on relational technology. As a result there are maybe a few design decisions that could be revisited.
Dbxrefs
There is less need for a primary dbxref node. For SciGraph/Monarch, we use a property ID, and require that this is a CURIE.
For secondary dbxrefs, sometimes we just treat these as properties decorated on the node, in other cases as nodes in their own right. In the latter case, we don't really think of the type as being dbxref: if it's a UniProt dbxref then it's a protein object. The Chado modeling of dbxrefs somewhat reflects the original MOD use case and the split between 'in-house' entities and 'the others'. When making a database for more integrative use cases this split is less useful.
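To make the "ID property that must be a CURIE" idea concrete, here is a minimal sketch in plain Python. The prefix map, the `expand_curie` and `make_node` helpers, and the dict-based node are all assumptions made for illustration; this is not how SciGraph/Monarch implement it.

```python
import re

# Hypothetical prefix map; a real deployment would load a curated one.
PREFIXES = {
    "UniProtKB": "http://purl.uniprot.org/uniprot/",
    "SO": "http://purl.obolibrary.org/obo/SO_",
}

# A CURIE is "prefix:local_id"; this pattern is a deliberately loose check.
CURIE_RE = re.compile(r"^([A-Za-z][\w.-]*):(\S+)$")


def expand_curie(curie, prefixes=PREFIXES):
    """Expand a CURIE to a full IRI, or raise ValueError if it is not usable."""
    match = CURIE_RE.match(curie)
    if match is None:
        raise ValueError(f"not a CURIE: {curie!r}")
    prefix, local_id = match.groups()
    if prefix not in prefixes:
        raise ValueError(f"unknown prefix: {prefix!r}")
    return prefixes[prefix] + local_id


def make_node(curie, **properties):
    """Build a node whose 'id' property is a validated CURIE.

    There is no separate primary-dbxref node: the ID lives on the node itself.
    """
    expand_curie(curie)  # validation only; the compact form is what gets stored
    return {"id": curie, **properties}


# A UniProt entry modeled as a protein node in its own right,
# rather than as an object of type "dbxref".
protein = make_node("UniProtKB:P04637", type="protein")
```

Secondary xrefs that do not merit their own node could simply be stored as a list-valued property on the same dict.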
The use of dbxrefs in chado can also lead to a kind of 'fake' referential integrity checking. Some rough thoughts in this doc:
https://docs.google.com/document/d/1fmXtC1oAk_5T5IB6tgilYnVgcV1wCpfi8vj9J8Ht6fU/edit
As we merge from multiple sources, we're interested in interpreting xrefs as stricter relationships that allow us to merge equivalence cliques. @jnguyenx will fill this out soon: https://github.com/SciGraph/SciGraph/wiki/Post-processors#clique-merge
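As a rough illustration of what merging equivalence cliques means, the sketch below treats each xref as an equivalence edge and groups IDs with a union-find structure. The `merge_cliques` function and the rule of picking the lexicographically smallest ID as leader are assumptions for the example; this is not the SciGraph post-processor.

```python
from collections import defaultdict


def merge_cliques(xref_pairs):
    """Group IDs connected by equivalence xrefs into cliques.

    Returns a dict mapping each clique leader to the set of its members.
    The leader is the lexicographically smallest ID (an arbitrary choice).
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Always attach the larger root under the smaller one,
            # so the root of a set is its smallest member.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    for a, b in xref_pairs:
        union(a, b)

    cliques = defaultdict(set)
    for node in list(parent):
        cliques[find(node)].add(node)
    return dict(cliques)


# Hypothetical IDs from three sources that all describe the same entity.
pairs = [("A:1", "B:1"), ("B:1", "C:1"), ("D:1", "E:1")]
cliques = merge_cliques(pairs)
```

Note that this only makes sense if the xrefs really are equivalence statements; a looser "see also" xref would wrongly collapse distinct entities.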
Feature Locations
One limitation of Chado (and GFF3 and subsumed models) is that the start and end of a feature must be on the same reference. This was something of a compromise between query tractability and normalization. In Monarch we use the FALDO model. It's designed as an RDF schema, so it works perfectly well in Neo4j.
Bolleman, J. T., Mungall, C. J., Strozzi, F., Baran, J., Dumontier, M., Bonnal, R. J. P., … Cock, P. J. A. (2016). FALDO: a semantic standard for describing the location of nucleotide and protein feature annotation. Journal of Biomedical Semantics. http://jbiomedsem.biomedcentral.com/articles/10.1186/s13326-016-0067-z
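For a feel of how FALDO lifts the same-reference restriction, here is a small sketch that emits FALDO-style triples as plain Python tuples. The FALDO terms (`location`, `Region`, `begin`, `end`, `ExactPosition`, `position`, `reference`, `ForwardStrandPosition`) come from the FALDO vocabulary; the helper functions, the `ex:` identifiers and the ID-naming scheme are made up for illustration.

```python
FALDO = "http://biohackathon.org/resource/faldo#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def position_triples(pos_id, coordinate, reference, strand_class):
    """Triples for one exact position; each position names its own reference."""
    return [
        (pos_id, RDF_TYPE, FALDO + "ExactPosition"),
        (pos_id, RDF_TYPE, FALDO + strand_class),
        (pos_id, FALDO + "position", coordinate),
        (pos_id, FALDO + "reference", reference),
    ]


def region_triples(feature_id, begin, end, strand_class="ForwardStrandPosition"):
    """Triples locating a feature on a region.

    `begin` and `end` are (coordinate, reference) pairs, so the two ends
    may sit on different references.
    """
    region_id = feature_id + "/region"
    begin_id, end_id = region_id + "/begin", region_id + "/end"
    triples = [
        (feature_id, FALDO + "location", region_id),
        (region_id, RDF_TYPE, FALDO + "Region"),
        (region_id, FALDO + "begin", begin_id),
        (region_id, FALDO + "end", end_id),
    ]
    triples += position_triples(begin_id, begin[0], begin[1], strand_class)
    triples += position_triples(end_id, end[0], end[1], strand_class)
    return triples


# Hypothetical feature that starts on one contig and ends on another.
triples = region_triples("ex:feature1", (100, "ex:contigA"), (50, "ex:contigB"))
```

Each tuple maps naturally onto either an RDF triple or a Neo4j relationship/property, which is why the same model can be used in both kinds of store.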
And for variant modeling, many groups like GA4GH are taking the approach of graph models with nucleotides as nodes. If everything you have can be mapped to a linear reference this buys you less; I'm not sure about your use case here. Of course, both models can live in the same instance so long as there is a well-defined mapping.
Ontologies
Not much to add here, you have this down correctly. For mapping to more expressive formalisms like OWL there are subtleties, but I suggest you take advantage of existing mappings. For example: https://github.com/SciGraph/SciGraph/wiki/Neo4jMapping
The proposed obographs JSON exchange for ontologies and ontology fragments may be of use. You might want to target this format for loading.
APIs
Thanks for your useful notes, will check py2neo out (useful for us @kshefchek?). It seems GMOD is very heterogeneous in APIs, but in general anything targeted to Chado should in theory be mechanically mappable to this Neo4j model. It may be useful to gather like-minded GMOD folks together to explore approaches.
@nathandunn is keen to do this for Apollo but bandwidth is low...
Constraints
Many of Chado's referential integrity checks are fakish. You can't have a dangling surrogate key, but you can always have stub objects at the end. The original idea was to use axioms in SO to constrain, but that was under some naive assumptions regarding the suitability of an expressive open-world formalism (OWL) to do closed-world constraint checking.
However, this topic is huge in some segments of the semweb community at the moment. There are promising developments like ShEx/SHACL. Crucially, while these are developed within an RDF framework, they can be made to work for Neo4j. In Monarch we do a lot of pre-processing and data munging in Turtle, and then just load the Turtle into Neo4j at the end. We're planning on targeting this upstream layer for constraint checking etc.
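To illustrate the kind of closed-world check that a surrogate key alone does not give you, here is a toy sketch in plain Python. It flags nodes that are referenced by an edge but are only stubs, i.e. missing required properties. The `find_stub_nodes` function, the `type` requirement and the edge labels are assumptions for the example; this is not ShEx or SHACL, just the general idea of such a check.

```python
def find_stub_nodes(nodes, edges, required=("type",)):
    """Return {node_id: [missing property names]} for every referenced node
    that is absent or lacks a required property.

    `nodes` maps node ID to a property dict; `edges` is a list of
    (subject, predicate, object) tuples.
    """
    referenced = {s for s, _, _ in edges} | {o for _, _, o in edges}
    problems = {}
    for node_id in sorted(referenced):
        props = nodes.get(node_id, {})  # an absent node counts as an empty stub
        missing = [key for key in required if key not in props]
        if missing:
            problems[node_id] = missing
    return problems


# Hypothetical data: one real node, one stub with no properties,
# and one node that is referenced but was never loaded.
nodes = {
    "gene1": {"type": "SO:0000704"},
    "UniProtKB:P04637": {},
}
edges = [
    ("gene1", "encodes", "UniProtKB:P04637"),
    ("gene1", "part_of", "chr1"),
]
problems = find_stub_nodes(nodes, edges)
```

Running a check like this on the upstream layer, before loading, means the stubs are caught once rather than in every downstream query.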
If we could provide some use cases I can feed them to some of these groups.
Also worth mentioning is WormBase's Datomic schema (Datomic has schemas).