Onyx Peer HTTP Query provides a built-in HTTP server that services replica and cluster queries directed at Onyx nodes. Because the interface is plain HTTP, it can be queried from any language. One use case is a health check for your Onyx nodes, since it makes it easy to determine each node's view of the cluster.
To use it, add onyx-peer-http-query to your dependencies:
[org.onyxplatform/onyx-peer-http-query "0.14.5.0"]
Require onyx.http-query in your peer bootup namespace:
(:require [onyx.http-query])
Then add the following lines to Onyx's peer-config:
:onyx.query/server? true
:onyx.query.server/port 8080
In addition, you can optionally set the IP address to listen on:
:onyx.query.server/ip "127.0.0.1"
The JMX selectors to query can, and should, be whitelisted via the peer-config, e.g.:
:onyx.query.server/metrics-selectors ["org.onyxplatform:*" "com.amazonaws.management:*"]
The default behaviour is:
:onyx.query.server/metrics-selectors ["*:*"]
Individual metrics tags can be blacklisted via the peer-config:
:onyx.query.server/metrics-blacklist [#"blacklisted_tag1" #"blacklistregex.*"]
Then query it to get a view of that node's understanding of the cluster:
$ http --json http://localhost:8080/replica/peers
HTTP/1.1 200 OK
Content-Length: 197
Content-Type: application/json
Date: Tue, 23 Feb 2016 03:35:08 GMT
Server: Jetty(9.2.10.v20150310)
{
"as-of-entry": 12,
"as-of-timestamp": 1456108757818,
"result": [
"e52df81d-38c9-44e6-9e3d-177d3e83292b",
"fd4725f9-3429-49eb-840d-6c3e29cecc41",
"fc933dda-7260-4547-93fc-241a02ca599a"
],
"status": "success"
}
Note as-of-entry and as-of-timestamp. By comparing as-of-entry between nodes, you can discover whether a node is lagging behind the cluster.
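The comparison of as-of-entry between nodes can be sketched as follows. This is a minimal illustration using only the Python standard library, not part of the library itself. The helper names (fetch_peers_view, lagging_nodes) and the max_lag parameter are assumptions introduced here, and the response shape is taken from the sample output above.

```python
import json
from urllib.request import urlopen


def fetch_peers_view(base_url, timeout=5.0):
    """Fetch /replica/peers from one node and return the decoded JSON body."""
    with urlopen(f"{base_url}/replica/peers", timeout=timeout) as resp:
        return json.load(resp)


def lagging_nodes(views, max_lag=0):
    """Return {node: entries_behind} for nodes behind the most advanced node.

    `views` maps a node name (hypothetical, chosen by the caller) to its
    decoded response, which must carry the "as-of-entry" field.
    """
    newest = max(view["as-of-entry"] for view in views.values())
    return {
        node: newest - view["as-of-entry"]
        for node, view in views.items()
        if newest - view["as-of-entry"] > max_lag
    }
```

In practice, views would be built by calling fetch_peers_view once per node address. A small non-zero max_lag may be a reasonable choice, since nodes polled at slightly different moments can differ by a few entries without being unhealthy.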
Further API endpoints are described below.
The Replica Query Server has a number of endpoints for accessing the information about a running Onyx cluster. Below we display the HTTP method, the URI, the docstring for the route, and any associated parameters that it takes in its query string.
/health
/peergroup/heartbeat
/peergroup/stuckpeers
/peergroup/health
/network/media-driver/active
/metrics
/state
/job/catalog
/job/flow-conditions
/job/lifecycles
/job/task
/job/triggers
/job/windows
/job/workflow
/job/exception
/replica
/replica/completed-jobs
/replica/job-allocations
/replica/job-scheduler
/replica/jobs
/replica/killed-jobs
/replica/peer-site
/replica/peer-state
/replica/peers
/replica/task-allocations
/replica/allocation-version
/replica/task-scheduler
/replica/tasks
[:get]
/health
{"threshold" java.lang.Long}
A single health check call that checks whether the following statuses are healthy: /network/media-driver/active, /peergroup/heartbeat, and /peergroup/stuckpeers. Considers the peer group dead if the timeout is greater than the value supplied as ?threshold=VALUE. Returns status 200 if healthy, 500 if unhealthy. Use this route for failure monitoring, automatic rebooting, etc.
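A monitoring script could wrap this route roughly as shown here. This is a sketch using only the Python standard library. The function names are hypothetical, and treating the threshold as milliseconds is an assumption based on the heartbeat routes reporting milliseconds.

```python
from urllib.error import HTTPError, URLError
from urllib.parse import urlencode
from urllib.request import urlopen


def health_url(base_url, threshold_ms):
    """Build the /health URL with the documented ?threshold=VALUE parameter."""
    return f"{base_url}/health?{urlencode({'threshold': threshold_ms})}"


def is_healthy(base_url, threshold_ms, timeout=5.0):
    """Return True only if the node answers /health with status 200."""
    try:
        with urlopen(health_url(base_url, threshold_ms), timeout=timeout) as resp:
            return resp.status == 200
    except HTTPError:
        # urlopen raises for error statuses, including the documented 500.
        return False
    except (URLError, OSError):
        # The node is unreachable or timed out; treat that as unhealthy too.
        return False
```

Treating an unreachable node as unhealthy is a design choice for this sketch. A supervisor that needs to tell "down" from "unhealthy" would want to return a richer result than a boolean.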
[:get]
/peergroup/heartbeat
{}
Returns the number of milliseconds since the peer group last heartbeated.
[:get]
/peergroup/stuckpeers
{}
Returns the number of milliseconds that a peer has been stuck while being shut down, indicating a stuck thread.
[:get]
/peergroup/health
{}
A health check call that checks whether the peer group has heartbeated more recently than a threshold. Considers the peer group dead if the timeout is greater than the value supplied as ?threshold=VALUE. Returns status 200 if healthy, 500 if unhealthy. Use this route for failure monitoring, automatic rebooting, etc.
[:get]
/network/media-driver
{}
Returns a map describing the media driver status, e.g.:
{:active true,
:driver-timeout-ms 10000,
:log "INFO: Aeron directory /var/folders/c5/2t4q99_53mz_c1h9hk12gn7h0000gn/T/aeron-lucas exists
INFO: Aeron CnC file /var/folders/c5/2t4q99_53mz_c1h9hk12gn7h0000gn/T/aeron-lucas/cnc.dat exists
INFO: Aeron toDriver consumer heartbeat is 687 ms old"}
[:get]
/network/media-driver/active
{}
Returns a boolean for whether the media driver is active and has heartbeated within driver-timeout-ms milliseconds.
[:get]
/metrics
{}
Returns any numeric JMX metrics contained in this VM, converted to Prometheus tags.
[:get]
/state
{"job-id" java.lang.String
 "task-id" java.lang.String
 "slot-id" java.lang.Long
 "window-id" java.lang.String
 "allocation-version" java.lang.Long
 ;; optional
 "start-time" java.lang.Long
 ;; optional
 "end-time" java.lang.Long
 ;; optional
 "groups" [Any]}
Retrieves a task's window state for a particular job. You must supply the allocation-version for the job. It can be looked up via the /replica/allocation-version endpoint, or by subscribing to the log and looking up [:allocation-version job-id].
If groups is supplied, only the state for the groups supplied will be retrieved.
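Building a /state request from the schema above could look like the sketch below. It uses only the Python standard library, and the function and constant names are hypothetical. The groups parameter is deliberately left out, because this document does not describe how a list of groups is encoded in the query string.

```python
from urllib.parse import urlencode

# Parameter names taken from the query params schema for /state.
REQUIRED = ("job-id", "task-id", "slot-id", "window-id", "allocation-version")
OPTIONAL = ("start-time", "end-time")


def state_url(base_url, params):
    """Build a /state URL, checking that every required parameter is present."""
    missing = [key for key in REQUIRED if key not in params]
    if missing:
        raise ValueError(f"missing required parameters: {missing}")
    unknown = [key for key in params if key not in REQUIRED + OPTIONAL]
    if unknown:
        raise ValueError(f"unsupported parameters: {unknown}")
    # Emit parameters in schema order so the resulting URL is deterministic.
    query = [(key, params[key]) for key in REQUIRED + OPTIONAL if key in params]
    return f"{base_url}/state?{urlencode(query)}"
```

The allocation-version value would normally come from a prior call to /replica/allocation-version for the same job id.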
[:get]
/job/catalog
{"job-id" java.lang.String}
Given a job id, returns catalog for this job.
[:get]
/job/flow-conditions
{"job-id" java.lang.String}
Given a job id, returns flow conditions for this job.
[:get]
/job/lifecycles
{"job-id" java.lang.String}
Given a job id, returns lifecycles for this job.
[:get]
/job/task
{"job-id" java.lang.String, "task-id" java.lang.String}
Given a job id and task id, returns catalog entry for this task.
[:get]
/job/triggers
{"job-id" java.lang.String}
Given a job id, returns triggers for this job.
[:get]
/job/windows
{"job-id" java.lang.String}
Given a job id, returns windows for this job.
[:get]
/job/workflow
{"job-id" java.lang.String}
Given a job id, returns workflow for this job.
[:get]
/job/exception
{"job-id" java.lang.String}
Given a job id, returns the exception that killed this job, if one exists.
[:get]
/replica
{}
Dereferences the replica as an immutable value.
[:get]
/replica/completed-jobs
{}
Lists all the job ids that have been completed.
[:get]
/replica/job-allocations
{}
Returns a map of job id -> task id -> peer ids, denoting which peers are assigned to which tasks.
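The nested map is keyed by job, which is awkward when the question is "what is this peer doing?". The sketch below inverts it in plain Python. The function name is hypothetical, and the input is assumed to be the decoded result value of the response, on the assumption that this route uses the same JSON envelope as the /replica/peers example.

```python
def peer_assignments(job_allocations):
    """Invert job id -> task id -> peer ids into peer id -> [(job id, task id)]."""
    assignments = {}
    for job_id, tasks in job_allocations.items():
        for task_id, peer_ids in tasks.items():
            for peer_id in peer_ids:
                assignments.setdefault(peer_id, []).append((job_id, task_id))
    return assignments
```

Peers that appear in /replica/peers but not in the inverted map are not currently assigned to any task.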
[:get]
/replica/job-scheduler
{}
Returns the job scheduler for this tenancy of the cluster.
[:get]
/replica/jobs
{}
Lists all non-killed, non-completed job ids.
[:get]
/replica/killed-jobs
{}
Lists all the job ids that have been killed.
[:get]
/replica/peer-site
{"peer-id" java.lang.String}
Given a peer id, returns the Aeron hostname and port that this peer advertises to the rest of the cluster.
[:get]
/replica/peer-state
{"peer-id" java.lang.String}
Given a peer id, returns its current execution state (e.g. :idle, :active, etc).
[:get]
/replica/peers
{}
Lists all the peer ids.
[:get]
/replica/task-allocations
{"job-id" java.lang.String}
Given a job id, returns a map of task id -> peer ids, denoting which peers are assigned to which tasks for this job only.
[:get]
/replica/allocation-version
{"job-id" java.lang.String}
Given a job id, returns the replica-version at which the job last rescheduled. This is important because the replica-version forms part of the vector clock that is used to determine ordering/validity of messages in the cluster, along with the barrier epoch.
[:get]
/replica/task-scheduler
{"job-id" java.lang.String}
Given a job id, returns the task scheduler for this job.
[:get]
/replica/tasks
{"job-id" java.lang.String}
Given a job id, returns all the task ids for this job.
Copyright © 2016 Distributed Masonry Inc.
Distributed under the Eclipse Public License either version 1.0 or (at your option) any later version.