comments on graph-based genome annotation model #7

cmungall · 2016-09-12T19:52:37Z

Some brief comments on:

https://github.com/SANBI-SA/combat_tb_model/blob/master/docs/genome_annotation_model.md

Very clearly documented, thank you @thobalose and @pvanheus. The overall strategy makes lots of sense. Chado was designed as a graph database, but layered on relational technology. As a result there are maybe a few design decisions that could be revisited.

Dbxrefs

There is less need for a primary dbxref node. For SciGraph/Monarch, we use a property ID, and require that this is a CURIE.
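To make the CURIE convention concrete, here is a minimal Python sketch. It is not SciGraph code: the prefix map, `expand_curie`, and `make_node` are hypothetical names made up for illustration, and a real deployment would load a curated prefix map instead of hard-coding one.

```python
import re

# Illustrative prefix map (an assumption for this sketch, not a curated registry).
PREFIXES = {
    "UniProtKB": "http://purl.uniprot.org/uniprot/",
    "SO": "http://purl.obolibrary.org/obo/SO_",
}

# A CURIE is "prefix:local_id"; the prefix must start with a letter.
CURIE_RE = re.compile(r"^(?P<prefix>[A-Za-z][\w.-]*):(?P<local>\S+)$")


def expand_curie(curie, prefixes=PREFIXES):
    """Return the full IRI for a CURIE.

    Raises ValueError if the string is not a CURIE or its prefix is unregistered.
    """
    match = CURIE_RE.match(curie)
    if match is None:
        raise ValueError(f"not a CURIE: {curie!r}")
    prefix, local = match.group("prefix"), match.group("local")
    if prefix not in prefixes:
        raise ValueError(f"unknown prefix: {prefix!r}")
    return prefixes[prefix] + local


def make_node(curie, label, **props):
    """Build a node as a plain dict whose `id` property is a validated CURIE."""
    expand_curie(curie)  # validation only; raises on bad input
    return {"id": curie, "label": label, **props}


gene_type = expand_curie("SO:0000704")
protein = make_node("UniProtKB:P9WNK7", "Protein")
```

With the ID carried on the node itself, there is no separate primary dbxref node to join through.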

For secondary dbxrefs, sometimes we just treat these as properties decorated on the node, in other cases as nodes in their own right. In the latter case, we don't really think of the type as being dbxref; if it's a UniProt dbxref then it's a protein object. The Chado modeling of dbxrefs somewhat reflects the original MOD use case and the split between 'in-house' entities and 'the others'. When making a database for more integrative use cases this split is less useful.

The use of dbxrefs in Chado can also lead to a kind of 'fake' referential integrity checking. Some rough thoughts in this doc:

https://docs.google.com/document/d/1fmXtC1oAk_5T5IB6tgilYnVgcV1wCpfi8vj9J8Ht6fU/edit

As we merge from multiple sources, we're interested in interpreting xrefs as stricter relationships that allow us to merge equivalence cliques. @jnguyenx will fill this out soon: https://github.com/SciGraph/SciGraph/wiki/Post-processors#clique-merge
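As a rough illustration of what merging equivalence cliques means, the sketch below groups identifiers with a union-find structure. It is not the SciGraph clique-merge post-processor; `merge_cliques` and the example identifiers are made up, and it assumes every xref passed in has already been judged to be a true equivalence.

```python
def merge_cliques(xrefs):
    """Group identifiers into equivalence cliques.

    `xrefs` is an iterable of (id_a, id_b) pairs that are assumed to mean
    "these two identifiers denote the same entity".
    Returns a sorted list of sorted member lists.
    """
    parent = {}

    def find(x):
        # Register unseen identifiers as their own root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in xrefs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # union the two cliques

    cliques = {}
    for node in list(parent):
        cliques.setdefault(find(node), set()).add(node)
    return sorted(sorted(members) for members in cliques.values())


# Hypothetical identifiers from three sources.
pairs = [("A:1", "B:1"), ("B:1", "C:1"), ("D:2", "E:2")]
cliques = merge_cliques(pairs)
```

Each clique can then be collapsed to a single node, with one member chosen as the leader and the rest kept as equivalent IDs. How the leader is picked is a policy decision that this sketch leaves out.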

Feature Locations

One limitation of Chado (and GFF3 and subsumed models) is that the start and end of a feature must be on the same reference. This was something of a compromise between query tractability and normalization. In Monarch we use the FALDO model. It's designed as an RDF schema, so it works perfectly well in Neo4j.

Bolleman, J. T., Mungall, C. J., Strozzi, F., Baran, J., Dumontier, M., Bonnal, R. J. P., … Cock, P. J. A. (2016). FALDO: a semantic standard for describing the location of nucleotide and protein feature annotation. http://jbiomedsem.biomedcentral.com/articles/10.1186/s13326-016-0067-z
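A small Python sketch of the difference: in a FALDO-style model the begin and end are separate position objects, each carrying its own reference, whereas a Chado-style featureloc has one source feature for both. The class and function names here are hypothetical, and the conversion assumes 1-based inclusive positions on input and interbase (0-based, half-open) `fmin`/`fmax` on output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Position:
    """A single point on a reference sequence (FALDO-style)."""
    reference: str
    position: int  # assumed 1-based
    strand: str = "+"


@dataclass(frozen=True)
class Region:
    """A region whose begin and end may sit on different references."""
    begin: Position
    end: Position


def to_featureloc(region):
    """Flatten a Region into a Chado-like featureloc dict.

    Raises ValueError when begin and end are on different references,
    which is the case a single-reference model cannot express.
    """
    if region.begin.reference != region.end.reference:
        raise ValueError("begin and end are on different references")
    low = min(region.begin.position, region.end.position)
    high = max(region.begin.position, region.end.position)
    return {
        "srcfeature": region.begin.reference,
        "fmin": low - 1,  # 1-based inclusive -> interbase
        "fmax": high,
        "strand": 1 if region.begin.strand == "+" else -1,
    }


simple = Region(Position("chr1", 100), Position("chr1", 200))
spanning = Region(Position("contigA", 950), Position("contigB", 40))
loc = to_featureloc(simple)  # works; to_featureloc(spanning) would raise
```

The `spanning` region is representable in the graph as-is, but has no lossless single-row featureloc equivalent.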

And for variant modeling, many groups like GA4GH are taking the approach of graph models with nucleotides as nodes. If everything you have can be mapped to a linear reference this buys you less; I'm not sure about your use case here. Of course, both models can live in the same instance so long as there is a well-defined mapping.

Ontologies

Not much to add here, you have this down correctly. For mapping to more expressive formalisms like OWL there are subtleties, but I suggest you take advantage of existing mappings. For example: https://github.com/SciGraph/SciGraph/wiki/Neo4jMapping

The proposed obographs JSON exchange for ontologies and ontology fragments may be of use. You might want to target this format for loading.

APIs

Thanks for your useful notes, will check py2neo out (useful for us @kshefchek?). It seems GMOD is very heterogeneous in APIs, but in general anything targeted to Chado should in theory be mechanically mappable to this Neo4j model. It may be useful to gather like-minded GMOD folks together to explore approaches.

@nathandunn is keen to do this for Apollo but his bandwidth is low...

Constraints

Many of Chado's referential integrity checks are fakish. You can't have a dangling surrogate key, but you can always have stub objects at the end. The original idea was to use axioms in SO to constrain, but that was under some naive assumptions regarding the suitability of an expressive open-world formalism (OWL) to do closed-world constraint checking.

However, this topic is huge in some segments of the semweb community at the moment. There are promising developments like ShEx/SHACL. Crucially, while these are developed within an RDF framework, they can be made to work for Neo4j. In Monarch we do a lot of pre-processing and data munging in Turtle, and then just load the Turtle into Neo4j at the end. We're planning on targeting this upstream layer for constraint checking etc.
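To illustrate what closed-world checking buys over key-level integrity, here is a toy Python check over an in-memory graph. It is neither ShEx nor SHACL, only a sketch of the idea; `check_graph`, the `required` shape table, and the example data are all hypothetical.

```python
def check_graph(nodes, edges, required):
    """Return a list of closed-world violations.

    nodes:    dict of node id -> property dict (must include "label")
    edges:    list of (source id, relation, target id)
    required: dict of label -> set of property names every such node must have
    """
    violations = []

    # A stub node satisfies a foreign key but carries no real content;
    # requiring properties per label exposes it.
    for node_id, props in nodes.items():
        missing = required.get(props["label"], set()) - set(props)
        for name in sorted(missing):
            violations.append(f"{node_id}: missing required property {name!r}")

    # Closed world: an edge endpoint that is not in the graph is an error,
    # not an entity we simply have not heard about yet.
    for src, rel, dst in edges:
        for endpoint in (src, dst):
            if endpoint not in nodes:
                violations.append(f"{src} -[{rel}]-> {dst}: unknown node {endpoint}")

    return violations


nodes = {
    "gene:1": {"label": "Gene", "name": "katG"},
    "prot:1": {"label": "Protein"},  # stub: exists, but has no content
}
edges = [("gene:1", "encodes", "prot:1"), ("gene:1", "located_on", "chr:1")]
required = {"Gene": {"name"}, "Protein": {"name"}}

problems = check_graph(nodes, edges, required)
```

Running this kind of check on the upstream data, before the load, keeps the constraints independent of the database that ends up holding the graph.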

If we could provide some use cases I can feed them to some of these groups.

Also worth mentioning is WormBase's Datomic schema (Datomic has schemas).

pvanheus · 2016-09-20T15:05:53Z

Hi Chris, I'm sitting in Durban airport digesting this and the document you shared (https://docs.google.com/document/d/1iZbtUurhUuqsM2oxnfklXa18yGlNZwEYI8npWRukGyY/edit) (I hope it is ok if I mention that here).

A few link / info requests:

What is turtle?
What is the datomic schema?

nathandunn · 2016-09-20T15:34:05Z

Turtle is an RDF triple language used by FALDO (https://github.com/JervenBolleman/FALDO/):

https://www.w3.org/TeamSubmission/turtle/

Datomic is a graph datastore (I believe): http://docs.datomic.com/schema.html

I’m sure that @cmungall will have more to say about using Datomic versus Neo4j. Looking on Stack Overflow (http://stackoverflow.com/a/17898956/1739366), it looks like it has a better query language and optimized performance for large read transactions.

cmungall · 2016-09-20T22:39:44Z

Yes, Turtle is just one concrete form of RDF.

RDF dbs and graph dbs have a lot to offer one another; the difference between them is often cultural rather than technical.

cmungall changed the title ~~commons on graph-based genome annotation model~~ comments on graph-based genome annotation model Sep 12, 2016
