on-cluster health checks - [Epic] #141
philbrookes changed the title from "on-cluster health checks" to "Feature: on-cluster health checks" on May 28, 2024
philbrookes added kind/feature kind/epic Epic and removed kind/epic Epic kind/feature labels Jun 4, 2024
4 tasks
Prior Art
https://github.com/Kuadrant/multicluster-gateway-controller/tree/60f13a1f7ad8f2b82e3f344a425285f69fb91223/pkg/dns/health
Terminology
Tasks
Executing health checks
Consulting health checks
E2E Test cases
done under #282
done under x
Black box testing
Load Testing
Documenting Health Checks
Current State
Use cases we want to solve
Proposed approach
We will implement local health checks: a probe running on the cluster sends requests to the workload on that cluster through the external gateway, simulating real internet traffic.
This requires no changes to our API; we can reuse the existing health check specification in the DNS Policy exactly as is.
The results of each probe will be stored locally in a CR (one per probe) and also emitted as metrics.
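As a rough illustration of the probe described above, the following Go sketch sends one request per iteration and folds the outcome into a status value that would be written to the probe CR. Everything named here is an assumption made for illustration: `ProbeStatus`, `checkOnce` and `recordResult` are hypothetical, treating any non-2xx response as a failure is a guess rather than something this issue specifies, and a local `httptest` server stands in for the external gateway.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// ProbeStatus is a hypothetical stand-in for the status of the per-probe CR.
type ProbeStatus struct {
	ConsecutiveFailures int
	LastCheckedAt       time.Time
}

// checkOnce sends a single GET to url and reports whether it succeeded.
// Assumption: transport errors and non-2xx responses count as failures.
func checkOnce(client *http.Client, url string) bool {
	resp, err := client.Get(url)
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode >= 200 && resp.StatusCode <= 299
}

// recordResult folds one check result into the previous status:
// a success resets the failure counter, a failure increments it.
func recordResult(prev ProbeStatus, ok bool, now time.Time) ProbeStatus {
	next := ProbeStatus{LastCheckedAt: now}
	if !ok {
		next.ConsecutiveFailures = prev.ConsecutiveFailures + 1
	}
	return next
}

func main() {
	// Local test server that always fails, standing in for the gateway address.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusServiceUnavailable)
	}))
	defer srv.Close()

	client := &http.Client{Timeout: 5 * time.Second}
	var status ProbeStatus
	for i := 0; i < 3; i++ {
		status = recordResult(status, checkOnce(client, srv.URL), time.Now())
	}
	fmt.Println("consecutive failures:", status.ConsecutiveFailures)
}
```

Keeping `recordResult` free of I/O means the counting rule can be tested without a network, while `checkOnce` stays the only part that touches the gateway.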
When is a probe unhealthy
A probe will write to a probe CR a few pieces of information:
When is a record unhealthy
The DNS Policy will specify a fault tolerance. If the consecutive failures on the relevant probe CR exceed that number, the corresponding record is considered unhealthy, unless the last checked time is too old (i.e. the probe has stopped updating the probe CR).
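The rule above can be sketched as a small pure function. This is only one reading of the paragraph: `ProbeStatus`, `isRecordUnhealthy` and the `maxAge` staleness limit are hypothetical names, and the issue does not say how old "too old" is.

```go
package main

import (
	"fmt"
	"time"
)

// ProbeStatus is a hypothetical stand-in for the fields read from the probe CR.
type ProbeStatus struct {
	ConsecutiveFailures int
	LastCheckedAt       time.Time
}

// isRecordUnhealthy applies the rule described above: the record is unhealthy
// when consecutive failures exceed the fault tolerance, unless the probe
// result is stale (older than maxAge), in which case it is not trusted.
// maxAge is an assumed parameter; the issue does not define its value.
func isRecordUnhealthy(p ProbeStatus, faultTolerance int, now time.Time, maxAge time.Duration) bool {
	if now.Sub(p.LastCheckedAt) > maxAge {
		return false // stale result: the probe has stopped updating the CR
	}
	return p.ConsecutiveFailures > faultTolerance
}

func main() {
	now := time.Now()

	fresh := ProbeStatus{ConsecutiveFailures: 4, LastCheckedAt: now.Add(-10 * time.Second)}
	fmt.Println("fresh, 4 failures, tolerance 3:", isRecordUnhealthy(fresh, 3, now, time.Minute))

	stale := ProbeStatus{ConsecutiveFailures: 4, LastCheckedAt: now.Add(-10 * time.Minute)}
	fmt.Println("stale, 4 failures, tolerance 3:", isRecordUnhealthy(stale, 3, now, time.Minute))
}
```

Treating a stale result as "not unhealthy" means a crashed probe cannot, on its own, cause a record to be removed.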
When are unhealthy records removed from the zone
A record is removed from the zone if:
Before removing a record, the zone will be consulted. After a record is removed, the owner of that record will enter a validation loop to ensure that at least one record will still be returned (for its GEO or globally). If no records are found, it will re-publish its own record, regardless of health.
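One pass of the validation loop described above might look like the following sketch. The `Record` type, `hasAnswer` and `ensureAnswer` are hypothetical, and the handling of "for its GEO or globally" is an assumption: an empty `Geo` is taken to mean no geo routing, so any remaining record counts as an answer.

```go
package main

import "fmt"

// Record is a hypothetical, simplified view of a published DNS record.
type Record struct {
	Owner   string
	Geo     string // assumption: empty means the record is not geo-scoped
	Address string
}

// hasAnswer reports whether the zone still returns at least one record
// for the given geo, or any record at all when geo is empty.
func hasAnswer(zone []Record, geo string) bool {
	for _, r := range zone {
		if geo == "" || r.Geo == geo {
			return true
		}
	}
	return false
}

// ensureAnswer is one pass of the validation loop: if nothing would be
// returned for the owner's geo, the owner's record is re-published
// regardless of its health. The input slice is left unmodified.
func ensureAnswer(zone []Record, own Record) []Record {
	if hasAnswer(zone, own.Geo) {
		return zone
	}
	out := append([]Record(nil), zone...)
	return append(out, own)
}

func main() {
	own := Record{Owner: "cluster-a", Geo: "EU", Address: "192.0.2.10"}
	zone := []Record{{Owner: "cluster-b", Geo: "US", Address: "192.0.2.20"}}

	zone = ensureAnswer(zone, own)
	fmt.Println("records after validation:", len(zone))
}
```

Because the check is repeated in a loop, a record re-published in one pass is found by the next pass and is not added twice.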
Update our tests to include tests of the health check probes.
Tradeoffs
Related Information
initial thoughts on health checks, and potential for cross-cluster health checks in the future: here