Open Tree of Life APIs v3

Mark T. Holder edited this page Sep 14, 2015 · 34 revisions

This page describes only a proposal. The current version of the APIs is the one described on this page. This is intended as a cleaned-up version of this Google Drive doc.

Calls

Data structures

Conflict API response

Conflict API response node fields

The object for each node can contain the following fields:

  • supported_by - list of input node references that support this node
  • conflicts_with - list of input node references that conflict with this node
  • resolves - list of input node references. Each of these nodes could be resolved in a manner that would result in a tree that displays a node corresponding to this node. (E.g., if you took the target tree and created the tree induced by the leaf set of a tree listed in the resolves list, you would get a tree with a polytomy that does not display this node. However, a resolution of that polytomy would display this node.)
  • partial_path_of - list of input ancestor-descendant path identifiers. Each tree in this list is compatible with the grouping edge below this node. However, in the tree created by taking the target tree and producing the induced tree for the listed tree, this edge is part of a path with "knuckles" that are not caused by monotypic taxonomic nodes.
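
As an illustration of how a client might consume these fields, here is a minimal Python sketch. The helper name summarize_conflict_node and the reference strings in the example node are hypothetical placeholders (the format of an input node reference is not specified here). Because each field is optional, a missing field is treated as an empty list.

```python
CONFLICT_FIELDS = ("supported_by", "conflicts_with", "resolves", "partial_path_of")


def summarize_conflict_node(node):
    """Count the references listed under each conflict field of one node object."""
    return {field: len(node.get(field, [])) for field in CONFLICT_FIELDS}


# Hypothetical node object; the reference strings are placeholders only.
example_node = {
    "supported_by": ["refA", "refB"],
    "conflicts_with": ["refC"],
}
print(summarize_conflict_node(example_node))
```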

Non-http interfaces

Synthetic tree

We plan on developing a treemachineLITE tool that will act as the server for the tree_of_life/* calls mentioned above. Unlike previous versions of the API, this tool will accept a specification of the tree structure and synthesis-related metadata. The previous version used a "graph of life" neo4j database as its input; that (for all intents and purposes) limited us to constructing the tree using neo4j.

We do not anticipate building an http-interface that would allow clients to push a new version of the synthetic tree. However, we do want to document the interface so that other people can produce trees and use the treemachineLITE software to serve tree_of_life queries.

Format for the specification of a synthetic tree.

We plan to write treemachineLITE to read in a newick representation of the tree and a JSON file with any additional information about a node.

v1.0 synth format = one newick, one JSON

The tree structure is to be described in one file using the rules described below.

The JSON file for additional data contains the fields described below with the node IDs of the newick used as keys for node and edge data.
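
A tool that reads both files could cross-check that every key used for node and edge data is actually a node ID in the newick. The sketch below is one possible way to do that, not part of the proposal: the function name unknown_annotation_ids is hypothetical, and it assumes the newick follows the completely simple id-labelled rules described below (no branch lengths or comments), so that every alphanumeric run in the string is a node label.

```python
import re


def unknown_annotation_ids(newick, additional_data):
    """Return keys of "nodes"/"edges" that are not node labels in the newick."""
    # Assumes a newick whose only alphanumeric runs are node labels.
    labels = set(re.findall(r"[a-zA-Z0-9]+", newick))
    unknown = set()
    for section in ("nodes", "edges"):
        unknown.update(set(additional_data.get(section, {})) - labels)
    return sorted(unknown)


# Hypothetical inputs for illustration only.
example_newick = "((ott1,ott2)n1,ott3)root;"
example_data = {
    "nodes": {"n1": {}, "n9": {}},
    "edges": {"ott1": {"length": 1.0}},
}
print(unknown_annotation_ids(example_newick, example_data))
```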

In v1.0 of the synthetic tree format the node labels are either OTT Ids or (if not OTT Ids) arbitrary strings with no meaning.

v2.0 synth format = same information, but multiple files are supported

This is a slight tweak designed to make it easier to construct the full tree from smaller analyses.

The only differences between this format and v1.0 are that:

  • the newick input can consist of multiple newicks. One newick will have the root of the tree labelled with an ID that occurs in no other newick. In all other newicks, the root of the newick will have a label that is a tip ID in the "ancestral" newick. This indicates the grafting point for the newicks. No other IDs are allowed to be re-used across files.
  • node and edge information in the JSON can be specified in multiple files. A parser simply takes the union of the information. It will be illegal to have conflicting information about the same node or edge in different JSON files. Typically a node or edge would only be described in one file, but it is also permitted to have some complementary information about the same node/edge spread across multiple files.
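
The two rules above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the treemachineLITE implementation: the function names graft and merge_annotations are hypothetical, the newicks are assumed to contain no whitespace, branch lengths, or comments, and "conflicting information" is interpreted as the same field carrying different values for the same node or edge.

```python
import re


def graft(ancestral, subtree):
    """Replace the tip of `ancestral` that carries the root label of `subtree`."""
    subtree = subtree.strip().rstrip(";")
    root_label = subtree[subtree.rfind(")") + 1:]
    # In a newick without whitespace, a tip label follows "(" or "," and
    # precedes "," or ")"; internal labels follow ")" and so never match.
    tip = re.compile(r"(?<=[(,])" + re.escape(root_label) + r"(?=[,)])")
    grafted, count = tip.subn(lambda _match: subtree, ancestral)
    if count != 1:
        raise ValueError(f"{root_label!r} is not a unique tip of the ancestral newick")
    return grafted


def merge_annotations(documents):
    """Take the union of "nodes" and "edges" data, rejecting conflicting values."""
    merged = {"nodes": {}, "edges": {}}
    for doc in documents:
        for section in ("nodes", "edges"):
            for node_id, fields in doc.get(section, {}).items():
                target = merged[section].setdefault(node_id, {})
                for key, value in fields.items():
                    if key in target and target[key] != value:
                        raise ValueError(f"conflicting {key!r} for {section} entry {node_id!r}")
                    target[key] = value
    return merged


# Hypothetical inputs for illustration only.
print(graft("(A,B)root;", "(x,y)B;"))
print(merge_annotations([
    {"edges": {"A": {"length": 1.0}}},
    {"edges": {"x": {"length": 2.0}}},
]))
```

Complementary fields for the same node in different files are merged, while two files that give different values for the same field cause a ValueError.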

v3.0 synth format

We are planning to implement a registry for node/path IDs. After that is implemented, we will be able to make the restriction that every node ID in the newick corresponds to the ID of this node in the registry.

For named taxa, this will be the OTT ID as in v1 and v2. So this change is just a change in the semantics of the labels of nodes that are not in OTT (from "meaningless label" to "registered ID").

Completely simple id-labelled newick

Comment on why we aren't planning to use an existing format

Newick is a commonly supported, terse format for expressing tree structure, but it is weak in terms of expressing other information because the meaning of the node labels and branch length information is not specified by the standard. The New Hampshire Extended convention and the metacomments used by BEAST and associated tools rectify this via "hot comments". While these solutions work, they increase the size of the newick. For large trees, this makes handling the tree representation cumbersome, and is particularly galling for client code that is only interested in the tree representation.

NeXML is very nice for representing rich data, but also results in a very large representation for a tree of over 2 million leaves. The richness of NeXML's annotations relies heavily on the fact that the fundamental entities of the format have IDs that can serve as the target of an annotation.

Rules for the completely simple id-labelled newick format.

The format obeys the rules of the standard newick format, but adds the following restrictions:

  1. Every node must have a label (hence "complete").
  2. Each label fits the regex [a-zA-Z0-9]+. In other words, only numbers and the roman alphabet are allowed (hence "simple").
  3. Each label is unique (hence "id-labelled") in the context of the newick (not necessarily a globally unique ID).
  4. Branch lengths are not included in the newick representation (and therefore, colons do not appear in the tree representation).

The unique IDs can be used in accompanying data structures to uniquely refer to any node in the tree. The mandatory ID expands the size of the newicks somewhat, but requiring simple node labels makes it much easier to implement a validating parser.
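
To illustrate why these restrictions make validation easy, here is a minimal sketch of a validating parser in Python. It is not the treemachineLITE parser: the function name parse_simple_newick and the returned child-to-parent mapping are choices made for this example, and the input is assumed to contain no whitespace inside the tree. The recursion is fine for small trees, but a very deep tree would need an iterative version.

```python
import re

LABEL = re.compile(r"[a-zA-Z0-9]+")


def parse_simple_newick(text):
    """Return {node_id: parent_id} (the root maps to None) or raise ValueError."""
    text = text.strip()
    parents = {}
    pos = 0

    def parse_node():
        nonlocal pos
        children = []
        if pos < len(text) and text[pos] == "(":
            pos += 1
            children.append(parse_node())
            while pos < len(text) and text[pos] == ",":
                pos += 1
                children.append(parse_node())
            if pos >= len(text) or text[pos] != ")":
                raise ValueError(f"expected ')' at offset {pos}")
            pos += 1
        match = LABEL.match(text, pos)
        if match is None:
            raise ValueError(f"missing label at offset {pos}")  # rule 1 and rule 2
        label = match.group()
        if label in parents:
            raise ValueError(f"duplicate label {label!r}")  # rule 3
        parents[label] = None
        for child in children:
            parents[child] = label
        pos = match.end()
        return label

    parse_node()
    if text[pos:] != ";":
        # Anything else here (such as ":" for a branch length) breaks rule 4.
        raise ValueError(f"expected ';' at offset {pos}")
    return parents


print(parse_simple_newick("((ott1,ott2)n1,ott3)root;"))
```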

Synthetic tree additional data

"tree level" fields

These fields specify information about the synthetic tree's construction (and are used in the tree stats and tree_of_life/about calls):

  • date_completed - the date the synthetic tree's construction was completed.
  • tree_id - a unique identifier for this version of the synthetic tree
  • taxonomy_version - the identifier for the version of the taxonomy that was used.
  • num_tips - the number of leaves in the tree
  • run_time - an estimate of the time taken to build the tree
  • num_source_trees - the number of input trees not counting the taxonomy
  • num_source_studies - the number of studies that contributed trees to the num_source_trees
  • root_taxon_name - (e.g. "cellular organisms")
  • root_ott_id
  • sources - list of strings. Each element is a reference to a source tree, where the source_id_map is used to provide the additional data for each source. The list is ordered if the order of the trees affects the supertree.
  • source_id_map - object with string keys that map to objects describing the source (see below)
  • generated_by - list of objects that describes the software tools and versions used to build the tree (see below)

The other top-level fields hold information on the nodes and edges:

  • nodes - object with node_id's as keys used to describe the nodes in the tree (see below)
  • edges - object with node_id's as keys used to describe the edges in the tree (see below)

With the exception of nodes and edges, this information occurs in only one JSON file, at the highest level of the JSON, even in the v2 format that supports multiple files.

source_id_map objects

  • study_id
  • tree_id
  • git_sha
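
A consumer of the JSON might want to confirm that every entry in sources can be looked up in source_id_map. The following Python sketch shows one way to do so. It assumes that the strings in sources are used directly as keys of source_id_map, which is an interpretation of the field descriptions above; the function name missing_source_entries and all values in the example are hypothetical.

```python
def missing_source_entries(tree_info):
    """Return the entries of "sources" that have no entry in "source_id_map"."""
    source_id_map = tree_info.get("source_id_map", {})
    return [source for source in tree_info.get("sources", []) if source not in source_id_map]


# Hypothetical tree-level data; every value is a placeholder.
example_tree_info = {
    "sources": ["srcA", "srcB"],
    "source_id_map": {
        "srcA": {"study_id": "studyA", "tree_id": "treeA", "git_sha": "abc123"},
    },
}
print(missing_source_entries(example_tree_info))
```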

generated_by objects

  • name - name of the software
  • version - version string
  • git_sha - the version identifier for the source code
  • url - link to the tool
  • invocation - list of strings describing the command line arguments, possibly containing placeholders like "<STUDY_LIST>". Intended to document how to run the tool.

node fields

Optionally, each node could contain the fields returned by the conflict API when run across the synthetic tree inputs.

edge fields

  • length