
HTTP liveness probe #17771

Closed
doup123 opened this issue Jul 22, 2024 · 6 comments

Comments

@doup123

doup123 commented Jul 22, 2024

Is your feature request related to a problem? Please describe.

Having an HTTP liveness probe provides a de-facto way to identify the health status of an application (e.g. https://github.com/influxdata/telegraf/blob/master/plugins/outputs/health/README.md).
In the cloud world, this would allow:

  1. To enable a liveness probe and restart any failed risingwave containers.
  2. To detect a malfunctioning risingwave cluster and redirect traffic to an operating cluster in high-availability scenarios (e.g. one cluster is upgraded to a newer version that hits a bug and breaks the application).

Describe the solution you'd like

A simple HTTP endpoint that returns 200 when risingwave works as expected and 503 when it does not.
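
As an illustration only, a minimal sketch of how such an endpoint could be consumed from a probe script; the /health path and the port are hypothetical placeholders, since no such HTTP endpoint exists in risingwave today:

    # Hypothetical endpoint: the path and port are placeholders, not an existing RisingWave API.
    if curl -sf -o /dev/null http://localhost:5691/health; then
        echo "risingwave healthy"      # endpoint answered 200
    else
        echo "risingwave unhealthy"    # endpoint answered 503 or was unreachable
    fi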

Describe alternatives you've considered

The current alternative requires:

  1. Downloading the proto definition: wget https://raw.githubusercontent.com/risingwavelabs/risingwave/main/proto/health.proto
  2. Installing grpcurl.
  3. Executing grpcurl -plaintext -d '{}' -import-path . -proto health.proto localhost:5690 health.Health/Check, which returns:

    {
      "status": "SERVING"
    }
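
Wrapped into a single probe-style command, a minimal sketch (assuming grpcurl is installed and health.proto sits in the current directory, as in the steps above):

    # Exit with 0 only if the meta node reports SERVING, so the command can back a probe script.
    grpcurl -plaintext -d '{}' -import-path . -proto health.proto localhost:5690 health.Health/Check \
        | grep -q '"SERVING"'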

Additional context

No response

@github-actions github-actions bot added this to the release-1.11 milestone Jul 22, 2024
@BugenZhao
Member

Hi, thanks for your feedback.

The recommended approach to health checking is currently through the SQL interface with pg_is_in_recovery() or rw_recovery_status(), which will be available in the upcoming v1.11 release and is also adopted by RisingWave Cloud.
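
For example, a minimal sketch of such a check from the command line, assuming the default frontend port 4566, user root, and database dev (adjust to your deployment); if pg_is_in_recovery() behaves like its PostgreSQL counterpart, it returns t while the cluster is recovering and f when it is serving normally:

    # Query the frontend over the Postgres wire protocol and print the recovery flag.
    psql -h localhost -p 4566 -U root -d dev -tAc "SELECT pg_is_in_recovery();"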

The gRPC health check is also a standard interface, but I'm afraid it's not correctly implemented yet.

@fuyufjh
Member

fuyufjh commented Aug 19, 2024

It's not trivial to define "liveness" here. For example:

  • If the cluster (i.e. Meta Service) is bootstrapping, what is the status of a compute node?
  • If the cluster (i.e. Meta Service) is under recovery, what is the status of a compute node? (Remember that the CN can serve batch queries now)
  • If the compute node has lost its heartbeat with the Meta service, what is the status of the compute node? (IIUC, the CN can still serve batch queries.)

@doup123
Author

doup123 commented Aug 23, 2024

@fuyufjh I totally agree with what you mention, but what would be the requirements for a cluster to be considered healthy?
IMHO, if all of the components are healthy (meaning they are able to perform their tasks), then the cluster could be considered healthy.
I am sure you can define the conditions that each component should satisfy better than I can.

@fuyufjh
Member

fuyufjh commented Sep 4, 2024

@fuyufjh I totally agree with what you mention, but what would be the requirements for a cluster to be considered healthy? IMHO, if all of the components are healthy (meaning they are able to perform their tasks), then the cluster could be considered healthy. I am sure you can define the conditions that each component should satisfy better than I can.

It's clear how to define whether a cluster is healthy. However, this issue is about how to identify whether a single component or Pod (e.g. compute node, frontend node, meta node, compactor node, etc.) is healthy, right? That will be ambiguous...

@xxchan
Member

xxchan commented Sep 5, 2024

May I ask whether you have met any real problems that you want to solve with a liveness probe?

According to the use cases you mentioned:

To enable a liveness probe and restart any failed risingwave containers

I think this can be handled by Kubernetes (risingwave-operator).

To detect a malfunctioning risingwave cluster and redirect traffic to an operating cluster in high-availability scenarios (e.g. one cluster is upgraded to a newer version that hits a bug and breaks the application).

May I ask how you want to do HA? Are you replicating data between two RisingWave clusters? A health check looks like the very last step...


BTW, for monitoring cluster health, perhaps you can also use the Grafana dashboards and Prometheus Alertmanager, which should provide more information about the cluster.

@doup123
Author

doup123 commented Sep 5, 2024

@xxchan thank you for your responses.

I think this can be handled by Kubernetes (risingwave-operator).

You are probably correct on this.

May I ask how you want to do HA? Are you replicating data between two RisingWave clusters? A health check looks like the very last step...

Considering the scenario where multiple clusters run anycasted via K8s in different geolocations, IMHO there should be a way to check whether a cluster is operational, in order to send traffic to it or withdraw traffic from it accordingly.

BTW, for monitoring cluster health, perhaps you can also use the Grafana dashboards and Prometheus Alertmanager, which should provide more information about the cluster.

The work that has been done on the metrics exposed via Prometheus and the corresponding dashboards is great, and I will rely on it to check the "status" of the cluster. What I was trying to say is that it is very common practice to have an HTTP endpoint exposing the health status of a service that can be used directly for monitoring/alerting.

@doup123 doup123 closed this as completed Sep 5, 2024