Onyx Peer HTTP Query provides a built-in HTTP server that services replica and cluster queries directed at Onyx nodes, from any language. One use case is a health check for your Onyx nodes, since it makes it easy to determine a node's view of the cluster.
To use it, add onyx-peer-http-query to your dependencies:
[org.onyxplatform/onyx-peer-http-query ""]
Require onyx.http-query in your peer bootup namespace:
(:require [onyx.http-query])
And add the following lines to Onyx's peer-config:
:onyx.query/server? true
:onyx.query.server/port 8080
In addition, you can optionally set the IP to listen on with:
:onyx.query.server/ip ""
JMX selectors can, and should, be whitelisted/queried via the peer-config, e.g.:
:onyx.query.server/metrics-selectors ["org.onyxplatform:*" "com.amazonaws.management:*"]
The default behaviour is:
:onyx.query.server/metrics-selectors ["*:*"]
Individual metrics tags can be blacklisted via the peer-config:
:onyx.query.server/metrics-blacklist [#"blacklisted_tag1" #"blacklistregex.*"]
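To make the effect of a blacklist concrete, here is a minimal Python sketch of regex-based tag filtering. It is not Onyx's implementation: it assumes a tag is dropped when any pattern matches anywhere in its name (whether Onyx requires a full match is not stated here), and the tag names in the usage note are made up.

```python
import re

# Patterns mirroring the :onyx.query.server/metrics-blacklist example above.
BLACKLIST = [re.compile(r"blacklisted_tag1"), re.compile(r"blacklistregex.*")]


def filter_tags(tags, blacklist=BLACKLIST):
    """Keep only the tags that no blacklist pattern matches.

    Assumption: a partial match (re.search) is enough to drop a tag.
    """
    return [t for t in tags if not any(p.search(t) for p in blacklist)]
```

For example, `filter_tags(["blacklisted_tag1", "blacklistregex_foo", "task_throughput"])` keeps only the hypothetical `task_throughput` tag.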
Then query it to get a view of that node's understanding of the cluster:
$ http --json http://localhost:8080/replica/peers
HTTP/1.1 200 OK
Content-Length: 197
Content-Type: application/json
Date: Tue, 23 Feb 2016 03:35:08 GMT
Server: Jetty(9.2.10.v20150310)
"as-of-entry": 12,
"as-of-timestamp": 1456108757818,
"result": [
"status": "success"
Note as-of-entry and as-of-timestamp. By comparing as-of-entry across nodes, you can discover whether a node is lagging behind the cluster.
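The comparison can be sketched in a few lines of Python. This is an illustrative helper, not part of the library: it assumes you have already fetched and JSON-decoded one response per node, and the node names and the `max_lag` tolerance are hypothetical.

```python
def lagging_nodes(responses, max_lag=0):
    """Return {node: entries_behind} for nodes behind the most advanced node.

    `responses` maps a node name to its decoded JSON body, which is expected
    to contain an "as-of-entry" field as in the example response above.
    """
    if not responses:
        return {}
    # The most advanced log entry seen by any node is the reference point.
    latest = max(body["as-of-entry"] for body in responses.values())
    return {
        node: latest - body["as-of-entry"]
        for node, body in responses.items()
        if latest - body["as-of-entry"] > max_lag
    }
```

With `{"node-a": {"as-of-entry": 12}, "node-b": {"as-of-entry": 9}}` the helper reports `node-b` as three entries behind. A small non-zero `max_lag` avoids flagging nodes that are only momentarily behind.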
Further API endpoints are described below.
The Replica Query Server has a number of endpoints for accessing information about a running Onyx cluster. For each endpoint, we list the HTTP method, the URI, the docstring for the route, and any parameters it takes in its query string.
{"threshold" java.lang.Long}
A single health check call to check whether the following statuses are healthy: /network/media-driver/active, /peergroup/heartbeat, and /peergroup/stuckpeers. Considers the peer group dead if timeout is greater than ?threshold=VALUE. Returns status 200 if healthy, 500 if unhealthy. Use this route for failure monitoring, automatic rebooting, etc.
Returns the number of milliseconds since the peer group last heartbeated.
Returns the number of milliseconds that a peer has been stuck while being shut down, indicating a stuck thread.
A health check call to check whether the peer group has heartbeated more recently than a threshold. Considers the peer group dead if timeout is greater than ?threshold=VALUE. Returns status 200 if healthy, 500 if unhealthy. Use this route for failure monitoring, automatic rebooting, etc.
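A monitoring script only needs to build the URL with the threshold parameter and branch on the status code. The sketch below shows that in Python under stated assumptions: the route is passed in by the caller (the path used in the usage note is a placeholder, not a documented URI), and any status other than 200 or 500 is treated as unknown.

```python
from urllib.parse import urlencode


def health_url(base_url, route, threshold):
    """Build a health-check URL carrying the ?threshold=VALUE parameter."""
    query = urlencode({"threshold": int(threshold)})
    return f"{base_url.rstrip('/')}/{route.lstrip('/')}?{query}"


def interpret_status(status_code):
    """Map the documented status codes onto a health verdict."""
    if status_code == 200:
        return "healthy"
    if status_code == 500:
        return "unhealthy"
    # Anything else (timeouts, proxies, 404s) is outside the documented contract.
    return "unknown"
```

For instance, `health_url("http://localhost:8080", "/placeholder/health", 5000)` yields `http://localhost:8080/placeholder/health?threshold=5000`; substitute the real health route for the placeholder path.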
Returns a map describing the media driver status. e.g.
{:active true,
:driver-timeout-ms 10000,
:log "INFO: Aeron directory /var/folders/c5/2t4q99_53mz_c1h9hk12gn7h0000gn/T/aeron-lucas exists
INFO: Aeron CnC file /var/folders/c5/2t4q99_53mz_c1h9hk12gn7h0000gn/T/aeron-lucas/cnc.dat exists
INFO: Aeron toDriver consumer heartbeat is 687 ms old"}
Returns a boolean for whether the media driver is active and has heartbeated within driver-timeout-ms milliseconds.
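The relationship between the status map and the boolean can be illustrated with a client-side check. This Python sketch is an assumption-laden reconstruction, not how Onyx computes the value: it assumes the map is delivered with string keys `"active"`, `"driver-timeout-ms"`, and `"log"`, and that the log contains a line of the form shown in the example above.

```python
import re

# Matches the heartbeat age reported in the example log output.
HEARTBEAT_AGE = re.compile(r"heartbeat is (\d+) ms old")


def driver_looks_alive(status):
    """Return True if the driver is active and its heartbeat is recent enough."""
    if not status.get("active"):
        return False
    match = HEARTBEAT_AGE.search(status.get("log", ""))
    if match is None:
        # No heartbeat age in the log: be conservative and report not alive.
        return False
    return int(match.group(1)) <= status["driver-timeout-ms"]
```

Applied to the example map above (heartbeat 687 ms old, timeout 10000 ms), the check passes.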
Returns any numeric JMX metrics contained in this VM, converted to Prometheus tags.
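If you want to inspect these metrics without a Prometheus server, a small parser is enough. The Python sketch below assumes the response body uses the Prometheus text exposition format with one `name value` pair per line and no trailing timestamps; the metric names in the usage note are invented for illustration.

```python
def parse_metrics(text):
    """Parse 'name value' lines into a dict, skipping comments and blanks."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or a # HELP / # TYPE comment
        # The value is the last whitespace-separated token on the line.
        name, _, value = line.rpartition(" ")
        if not name:
            continue
        try:
            metrics[name.strip()] = float(value)
        except ValueError:
            continue  # not numeric: ignore rather than fail
    return metrics
```

Given the body `"# HELP demo\nfoo_bar 1.5\nbaz{job=\"j1\"} 3\n"`, the parser returns `foo_bar` and `baz{job="j1"}` mapped to `1.5` and `3.0`.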
{"job-id" java.lang.String
 "task-id" java.lang.String
 "slot-id" java.lang.Long
 "window-id" java.lang.String
 "allocation-version" java.lang.Long
 ;; optional
 "start-time" java.lang.Long
 ;; optional
 "end-time" java.lang.Long
 ;; optional
 "groups" [Any]}
Retrieves a task's window state for a particular job. You must supply the :allocation-version for the job. The allocation version can be looked up via /replica/allocation-version, or by subscribing to the log and looking up [:allocation-version job-id].
If groups is supplied, only the state for the groups supplied will be retrieved.
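Because this route takes several parameters, it helps to validate them before sending the request. The following Python sketch builds only the query string; it assumes that every parameter except start-time, end-time, and groups is required, and it leaves groups out because its wire encoding is not described here.

```python
from urllib.parse import urlencode

REQUIRED = ("job-id", "task-id", "slot-id", "window-id", "allocation-version")
OPTIONAL = ("start-time", "end-time")  # "groups" omitted: encoding not documented here


def window_state_query(params):
    """Return a URL-encoded query string for a window state request."""
    missing = [key for key in REQUIRED if key not in params]
    if missing:
        raise ValueError("missing required parameters: " + ", ".join(missing))
    # Emit parameters in a stable order so requests are easy to compare in logs.
    ordered = [(key, params[key]) for key in REQUIRED + OPTIONAL if key in params]
    return urlencode(ordered)
```

Calling it with a job id, task id, slot id, window id, and allocation version produces a string such as `job-id=j1&task-id=in&slot-id=0&window-id=w1&allocation-version=7`, ready to append to the route's URI after a `?`.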
{"job-id" java.lang.String}
Given a job id, returns the catalog for this job.
{"job-id" java.lang.String}
Given a job id, returns the flow conditions for this job.
{"job-id" java.lang.String}
Given a job id, returns the lifecycles for this job.
{"job-id" java.lang.String, "task-id" java.lang.String}
Given a job id and task id, returns the catalog entry for this task.
{"job-id" java.lang.String}
Given a job id, returns the triggers for this job.
{"job-id" java.lang.String}
Given a job id, returns the windows for this job.
{"job-id" java.lang.String}
Given a job id, returns the workflow for this job.
{"job-id" java.lang.String}
Given a job id, returns the exception that killed this job, if one exists.
Dereferences the replica as an immutable value.
Lists all the job ids that have been completed.
Returns a map of job id -> task id -> peer ids, denoting which peers are assigned to which tasks.
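The nested shape of this map is easy to summarise on the client. Here is a small Python sketch that counts peers per task; it assumes the decoded response is a dict of job id to task id to a list of peer ids, and the ids in the usage note are made up.

```python
def peer_counts(allocations):
    """Collapse job id -> task id -> peer ids into job id -> task id -> count."""
    return {
        job_id: {task_id: len(peer_ids) for task_id, peer_ids in tasks.items()}
        for job_id, tasks in allocations.items()
    }
```

For `{"job-1": {"in": ["p1"], "out": ["p2", "p3"]}}` this gives `{"job-1": {"in": 1, "out": 2}}`, which is handy for spotting tasks that have fewer peers than expected.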
Returns the job scheduler for this tenancy of the cluster.
Lists all non-killed, non-completed job ids.
Lists all the job ids that have been killed.
{"peer-id" java.lang.String}
Given a peer id, returns the Aeron hostname and port that this peer advertises to the rest of the cluster.
{"peer-id" java.lang.String}
Given a peer id, returns its current execution state (e.g. :idle or :active).
Lists all the peer ids.
Given a job id, returns a map of task id -> peer ids, denoting which peers are assigned to which tasks for this job only.
{"job-id" java.lang.String}
Given a job id, returns the replica-version at which the job last rescheduled. This is important because the replica-version forms part of the vector clock that is used to determine ordering/validity of messages in the cluster, along with the barrier epoch.
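One practical use of this value is spotting reschedules from the outside: if you poll it periodically, a change between two polls means the job was rescheduled in between. The Python sketch below illustrates that idea only; the polling itself and the sample values are hypothetical.

```python
def reschedule_points(observed_versions):
    """Return the poll indexes at which the replica-version changed.

    `observed_versions` is the list of replica-versions seen in successive
    polls for a single job, oldest first.
    """
    return [
        i
        for i in range(1, len(observed_versions))
        if observed_versions[i] != observed_versions[i - 1]
    ]
```

For the observations `[4, 4, 7, 7, 9]` the function returns `[2, 4]`, meaning the job rescheduled before the third and fifth polls.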
{"job-id" java.lang.String}
Given a job id, returns the task scheduler for this job.
{"job-id" java.lang.String}
Given a job id, returns all the task ids for this job.
Copyright © 2016 Distributed Masonry Inc.
Distributed under the Eclipse Public License, either version 1.0 or (at your option) any later version.