comments on graph-based genome annotation model #7

Open
cmungall opened this issue Sep 12, 2016 · 3 comments

cmungall commented Sep 12, 2016

Some brief comments on:

https://github.com/SANBI-SA/combat_tb_model/blob/master/docs/genome_annotation_model.md

Very clearly documented, thank you @thobalose and @pvanheus. The overall strategy makes lots of sense. Chado was designed as a graph database, but layered on relational technology. As a result there are maybe a few design decisions that could be revisited.

Dbxrefs

There is less need for a primary dbxref node. For SciGraph/Monarch, we use a property ID, and require that this is a CURIE.
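As a rough illustration of the "ID property must be a CURIE" rule, here is a minimal Python sketch that validates and expands a CURIE. The prefix map, the regular expression and the function name `expand_curie` are assumptions made for this example; they are not taken from SciGraph or Monarch.

```python
import re

# Hypothetical prefix map: prefixes and base IRIs are illustrative only.
PREFIXES = {
    "UniProtKB": "http://purl.uniprot.org/uniprot/",
    "SO": "http://purl.obolibrary.org/obo/SO_",
}

# A deliberately simple CURIE pattern: prefix, colon, non-empty local part.
CURIE_RE = re.compile(r"^(?P<prefix>[A-Za-z][\w.-]*):(?P<local>\S+)$")


def expand_curie(curie, prefixes=PREFIXES):
    """Return the full IRI for a CURIE, or raise ValueError if it is not usable."""
    match = CURIE_RE.match(curie)
    if match is None:
        raise ValueError(f"not a CURIE: {curie!r}")
    prefix = match.group("prefix")
    if prefix not in prefixes:
        raise ValueError(f"unknown prefix: {prefix!r}")
    return prefixes[prefix] + match.group("local")
```

A loader could call something like this on every node's ID property and reject nodes whose ID does not resolve against the agreed prefix map.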

For secondary dbxrefs, sometimes we just treat these as properties decorated on the node, in other cases as nodes in their own right. In the latter case, we don't really think of the type as being dbxref: if it's a UniProt dbxref then it's a protein object. The Chado modeling of dbxrefs somewhat reflects the original MOD use case and the split between 'in-house' entities and 'the others'. When making a database for more integrative use cases this split is less useful.

The use of dbxrefs in chado can also lead to a kind of 'fake' referential integrity checking. Some rough thoughts in this doc:

https://docs.google.com/document/d/1fmXtC1oAk_5T5IB6tgilYnVgcV1wCpfi8vj9J8Ht6fU/edit

As we merge from multiple sources, we're interested in interpreting xrefs as stricter relationships that allow us to merge equivalence cliques. @jnguyenx will fill this out soon: https://github.com/SciGraph/SciGraph/wiki/Post-processors#clique-merge
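To make the idea of merging equivalence cliques concrete, here is a small Python sketch using union-find. It assumes every xref pair passed in has already been judged to be a true equivalence and treats equivalence as transitive; it is not the SciGraph post-processor, and the function name `merge_cliques` is hypothetical.

```python
from collections import defaultdict


def merge_cliques(xrefs):
    """Group IDs linked by equivalence xrefs into sorted lists of members."""
    parent = {}

    def find(x):
        # Register unseen IDs as their own root, then walk up to the root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in xrefs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    cliques = defaultdict(set)
    for node in list(parent):
        cliques[find(node)].add(node)
    return sorted(sorted(members) for members in cliques.values())
```

Each resulting group could then be collapsed into a single node, with the other member IDs kept as properties on it.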

Feature Locations

One limitation of Chado (and GFF3 and subsumed models) is that the start and end of a feature must be on the same reference. This was something of a compromise between query tractability and normalization. In Monarch we use the FALDO model. It's designed as an RDF schema, so it works perfectly well in Neo4j.

Bolleman, J. T., Mungall, C. J., Strozzi, F., Baran, J., Dumontier, M., Bonnal, R. J. P., … Cock, P. J. A. (2016). FALDO: a semantic standard for describing the location of nucleotide and protein feature annotation. http://jbiomedsem.biomedcentral.com/articles/10.1186/s13326-016-0067-z
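The key difference from a Chado-style location can be sketched in a few lines of Python: in a FALDO-like model the begin and end are separate position objects, each carrying its own reference. The class and field names below only loosely mirror FALDO terms (region, begin, end, position, reference) and the reference IDs are made up; this is a simplification, not the FALDO schema itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Position:
    reference: str  # ID of the sequence this coordinate is on
    position: int   # coordinate on that sequence (assumed 1-based here)


@dataclass(frozen=True)
class Region:
    begin: Position
    end: Position

    def on_single_reference(self):
        """True if the region could also be stored as a Chado-style location."""
        return self.begin.reference == self.end.reference


# A feature whose start and end sit on different (hypothetical) references.
spanning = Region(Position("contig:A", 9500), Position("contig:B", 120))
```

In a property graph this would map naturally to a region node with two outgoing edges to position nodes, each of which points at its reference sequence.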

And for variant modeling, many groups like GA4GH are taking the approach of graph models with nucleotides as nodes. If everything you have can be mapped to a linear reference this buys you less; I'm not sure about your use case here. Of course, both models can live in the same instance so long as there is a well-defined mapping.

Ontologies

Not much to add here, you have this down correctly. For mapping to more expressive formalisms like OWL there are subtleties, but I suggest you take advantage of existing mappings. For example: https://github.com/SciGraph/SciGraph/wiki/Neo4jMapping

The proposed obographs JSON exchange for ontologies and ontology fragments may be of use. You might want to target this format for loading.

APIs

Thanks for your useful notes, will check py2neo out (useful for us @kshefchek?). It seems GMOD is very heterogeneous in APIs, but in general anything targeted to Chado should in theory be mechanically mappable to this Neo4j model. It may be useful to gather like-minded GMOD folks together to explore approaches.

@nathandunn is keen to do this for Apollo but his bandwidth is low...

Constraints

Many of Chado's referential integrity checks are fakish. You can't have a dangling surrogate key, but you can always have stub objects at the end. The original idea was to use axioms in SO to constrain, but that was under some naive assumptions regarding the suitability of an expressive open-world formalism (OWL) to do closed-world constraint checking.

However, this topic is huge in some segments of the semweb community at the moment. There are promising developments like ShEx/SHACL. Crucially, while these are developed within an RDF framework, they can be made to work for Neo4j. In Monarch we do a lot of pre-processing and data munging in Turtle, and then just load the Turtle into Neo4j at the end. We're planning on targeting this upstream layer for constraint checking etc.
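For a sense of what closed-world checking means in practice, here is a toy Python sketch that flags nodes missing required properties, which is also how a stub object would show up. The labels, property names and the `REQUIRED` table are invented for illustration; this is neither ShEx nor SHACL syntax, just the kind of check those languages let you declare.

```python
# Hypothetical "shapes": for each node label, the properties that must be present.
REQUIRED = {
    "Gene": {"id", "name"},
    "Transcript": {"id", "part_of"},
}


def find_violations(nodes, required=REQUIRED):
    """Return (node id, missing properties) for every node that breaks its shape."""
    violations = []
    for node in nodes:
        needed = required.get(node.get("label"), set())
        missing = sorted(needed - set(node))
        if missing:
            violations.append((node.get("id"), missing))
    return violations
```

Running a check like this on the pre-load data, rather than inside the database, matches the "constrain the upstream layer" approach described above.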

If we could provide some use cases I can feed them to some of these groups.

Also worth mentioning is WormBase's Datomic schema (Datomic has schemas).

@cmungall changed the title from "commons on graph-based genome annotation model" to "comments on graph-based genome annotation model" on Sep 12, 2016

@pvanheus

Hi Chris, I'm sitting in Durban airport digesting this and the document you shared (https://docs.google.com/document/d/1iZbtUurhUuqsM2oxnfklXa18yGlNZwEYI8npWRukGyY/edit) (I hope it is ok if I mention that here).

A few link / info requests:

What is turtle?
What is the datomic schema?

@nathandunn

Turtle is an RDF triple language used by FALDO (https://github.com/JervenBolleman/FALDO/):

https://www.w3.org/TeamSubmission/turtle/

Datomic is a graph datastore (I believe): http://docs.datomic.com/schema.html

I’m sure that @cmungall will have more to say about using Datomic versus Neo4j. Judging from Stack Overflow (http://stackoverflow.com/a/17898956/1739366), it looks to have a better query language and optimized performance for large read transactions.

(Sent by email in reply to pvanheus's comment of Sep 20, 2016, 8:05 AM.)

@cmungall

Yes, Turtle is just one concrete form of RDF.

RDF dbs and graph dbs have a lot to offer one another; the difference between them is often cultural rather than technical.
