HTTP liveness probe #17771
Comments
Hi, thanks for your feedback. The recommended approach to do health checking is currently through SQL interface with The gRPC health check is also a standard interface, but I'm afraid it's not correctly implemented yet.
It's not trivial to define "liveness" here. For example:
@fuyufjh I totally agree with what you mentioned, but what requirements should be met for a cluster to be considered healthy?
It's clear how to define whether a cluster is healthy. However, this issue is about how to identify whether a component or a Pod (e.g. Compute node, Frontend node, Meta node, Compactor node, etc.) is healthy, right? That is ambiguous...
May I ask whether you ran into any real problems that you want to solve with a liveness probe? According to the use cases you mentioned:
I think this can be handled by Kubernetes (risingwave-operator).
May I ask how you want to do HA? Are you replicating data between 2 RisingWave clusters? A health check looks like the very last step. BTW, for monitoring cluster health, perhaps you can also use the Grafana dashboard and Prometheus Alertmanager, which should provide more information about the cluster.
@xxchan thank you for your responses.
The work that has been done with the exposed metrics via Prometheus and the corresponding dashboards is great and I will rely on it to check the "status" of the cluster. What I was trying to say is that it is a very common practice to have an HTTP endpoint with the health status of a service that can be directly used for monitoring/alerting.
Is your feature request related to a problem? Please describe.
Having an HTTP liveness probe provides a de-facto way to identify the health status of an application (e.g. https://github.com/influxdata/telegraf/blob/master/plugins/outputs/health/README.md).
In the cloud world, this would allow:
Describe the solution you'd like
A simple HTTP endpoint that would return 200 if RisingWave works as expected, and 503 when it does not.
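To make the request concrete, here is a minimal sketch of the kind of endpoint I have in mind, written as a standalone Python sidecar rather than anything that exists in RisingWave today. The `/healthz` path, port 8080, the JSON bodies and the `is_healthy()` function are all assumptions for illustration; the real check would need to be whatever the maintainers consider a valid health signal.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def probe_response(healthy):
    """Map a health verdict to an (HTTP status code, JSON body) pair."""
    if healthy:
        return 200, json.dumps({"status": "ok"}).encode()
    return 503, json.dumps({"status": "unavailable"}).encode()


def is_healthy():
    """Hypothetical placeholder: swap in a real check against the service."""
    return True


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only the (assumed) /healthz path is served; anything else is a 404.
        if self.path != "/healthz":
            self.send_error(404)
            return
        code, body = probe_response(is_healthy())
        self.send_response(code)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8080):
    """Start the probe server; blocks until interrupted."""
    HTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()
```

Calling `serve()` starts the listener, after which a monitoring system or load balancer could poll `GET /healthz` and alert on anything other than 200.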
Describe alternatives you've considered
The current alternative requires:
{
"status": "SERVING"
}
Additional context
No response