
HTTP liveness probe #17771

Closed
doup123 opened this issue Jul 22, 2024 · 6 comments

Comments

@doup123

doup123 commented Jul 22, 2024

Is your feature request related to a problem? Please describe.

Having an HTTP liveness probe provides a de-facto way to identify the health status of an application (e.g. https://github.com/influxdata/telegraf/blob/master/plugins/outputs/health/README.md).
In the cloud world, this would allow:

  1. To enable a liveness probe and restart any failed risingwave containers.
  2. To detect a malfunctioning risingwave cluster and redirect traffic to an operating cluster in high-availability scenarios (e.g. one cluster is upgraded to a newer version that hits a bug and breaks the application).

Describe the solution you'd like

A simple HTTP endpoint that returns 200 when risingwave works as expected and 503 when it does not.
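
As an illustration only, a minimal sketch of how such an endpoint could be consumed from a probe script; the /health path and the port are hypothetical placeholders, since no such HTTP endpoint exists in risingwave today:

    # Hypothetical endpoint: the path and port are placeholders, not an existing RisingWave API.
    if curl -sf -o /dev/null http://localhost:5691/health; then
        echo "risingwave healthy"      # endpoint answered 200
    else
        echo "risingwave unhealthy"    # endpoint answered 503 or was unreachable
    fi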

Describe alternatives you've considered

The current alternative requires:

  1. Downloading the proto definition: wget https://raw.githubusercontent.com/risingwavelabs/risingwave/main/proto/health.proto
  2. Installing grpcurl.
  3. Executing grpcurl -plaintext -d '{}' -import-path . -proto health.proto localhost:5690 health.Health/Check, which returns:

    {
      "status": "SERVING"
    }
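
Wrapped into a single probe-style command, a minimal sketch (assuming grpcurl is installed and health.proto sits in the current directory, as in the steps above):

    # Exit with 0 only if the meta node reports SERVING, so the command can back a probe script.
    grpcurl -plaintext -d '{}' -import-path . -proto health.proto localhost:5690 health.Health/Check \
        | grep -q '"SERVING"'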

Additional context

No response

@github-actions github-actions bot added this to the release-1.11 milestone Jul 22, 2024
@BugenZhao
Member

Hi, thanks for your feedback.

The recommended approach to health checking is currently through the SQL interface with pg_is_in_recovery() or rw_recovery_status(), which will be available in the upcoming v1.11 release and is also adopted by RisingWave Cloud.
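
For example, a minimal sketch of such a check from the command line, assuming the default frontend port 4566, user root, and database dev (adjust to your deployment); if pg_is_in_recovery() behaves like its PostgreSQL counterpart, it returns t while the cluster is recovering and f when it is serving normally:

    # Query the frontend over the Postgres wire protocol and print the recovery flag.
    psql -h localhost -p 4566 -U root -d dev -tAc "SELECT pg_is_in_recovery();"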

The gRPC health check is also a standard interface, but I'm afraid it's not correctly implemented yet.

@fuyufjh
Member

fuyufjh commented Aug 19, 2024

It's not trivial to define "liveness" here. For example:

  • If the cluster (i.e. Meta Service) is bootstrapping, what is the status of a compute node?
  • If the cluster (i.e. Meta Service) is under recovery, what is the status of a compute node? (Remember that the CN can serve batch queries now)
  • If the compute node has lost its heartbeat with the Meta service, what is the status of the compute node? (IIUC, the CN can still serve batch queries.)

@doup123
Author

doup123 commented Aug 23, 2024

@fuyufjh I totally agree with what you mention, but what would be the requirements for a cluster to be considered healthy?
IMHO, if all of the components are healthy (meaning they are able to perform their tasks), then the cluster could be considered healthy.
I am sure you can define the conditions that each component should satisfy better than I can.

@fuyufjh
Member

fuyufjh commented Sep 4, 2024

@fuyufjh I totally agree with what you mention, but what would be the requirements for a cluster to be considered healthy? IMHO, if all of the components are healthy (meaning they are able to perform their tasks), then the cluster could be considered healthy. I am sure you can define the conditions that each component should satisfy better than I can.

It's clear how to define whether a cluster is healthy. However, this issue is about how to identify whether a single component or Pod (e.g. compute node, frontend node, meta node, compactor node, etc.) is healthy, right? That will be ambiguous...

@xxchan
Member

xxchan commented Sep 5, 2024

May I ask whether you have met any real problems that you want to solve with a liveness probe?

According to the use cases you mentioned:

To enable a liveness probe and restart any failed risingwave containers

I think this can be handled by Kubernetes (risingwave-operator).

To detect a malfunctioning risingwave cluster and redirect traffic to an operating cluster in high-availability scenarios (e.g. one cluster is upgraded to a newer version that hits a bug and breaks the application).

May I ask how you want to do HA? Are you replicating data between two RisingWave clusters? A health check looks like the very last step...


BTW, for monitoring cluster health, perhaps you can also use the Grafana dashboards and Prometheus Alertmanager, which should provide more information about the cluster.

@doup123
Author

doup123 commented Sep 5, 2024

@xxchan thank you for your responses.

I think this can be handled by Kubernetes (risingwave-operator).

You are probably correct on this.

May I ask how you want to do HA? Are you replicating data between two RisingWave clusters? A health check looks like the very last step...

Considering the scenario where multiple clusters run anycasted via K8s in different geolocations, IMHO there should be a way to check whether a cluster is operational, in order to send traffic to it or withdraw traffic from it accordingly.

BTW, for monitoring cluster health, perhaps you can also use the Grafana dashboards and Prometheus Alertmanager, which should provide more information about the cluster.

The work that has been done on the metrics exposed via Prometheus and the corresponding dashboards is great, and I will rely on it to check the "status" of the cluster. What I was trying to say is that it is very common practice to have an HTTP endpoint exposing the health status of a service that can be used directly for monitoring/alerting.

@doup123 doup123 closed this as completed Sep 5, 2024