Add monit health checks to detect and restart failing server job #77

Open

gberche-orange opened this issue Oct 30, 2024 · 0 comments

As an operator,
In order to avoid long-lasting k3s server job failures,
I need the k3s wrapper release to watch the k8s API health endpoints using monit and to automatically restart the k3s-server job upon prolonged failures.

References:

healthchecker is added to a boshrelease as a monit process under the Job that is to be monitored. It is configured to perform a healthcheck against the main process in the Job. If healthchecker detects a failure, it panics and exits. monit should be configured to run the restart-monit-job script on the failure of the healthchecker process. This script restarts the main monit process, up to ten failures in a row. After 10 consecutive failures it gives up, since restarting is clearly not helping: either the process is in a horrible state, or the healthchecker is misconfigured and should not be causing process downtime.
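For this release, the check performed by such a healthchecker process would presumably target the kube-apiserver liveness endpoint. A minimal sketch, assuming the k3s API server listens on the default https://127.0.0.1:6443 and that anonymous access to /livez is allowed (both assumptions to verify against the actual deployment):

# sketch only: probe the apiserver /livez endpoint and exit non-zero on failure,
# which is the signal the healthchecker/monit wiring would act on.
# --insecure is used because the server CA path is deployment-specific.
curl --fail --silent --show-error --insecure \
  'https://127.0.0.1:6443/livez?verbose' || exit 1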

Consider the following common server setup:

WEB-SERVER (a) -> APPLICATION-SERVER (b) -> DATABASE (c) -> FILESYSTEM (d)

If d does not run, when Monit runs it will first stop a, b and c, then start d, and finally start c, b and then a.
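As an illustration only (generic names and paths, not taken from any release), this dependency chain maps to monit configuration along these lines:

# illustrative only: c declares a dependency on d, so monit stops/starts them in order
check process d
  with pidfile /var/run/d.pid
  start program "/etc/init.d/d start"
  stop program "/etc/init.d/d stop"

check process c
  with pidfile /var/run/c.pid
  start program "/etc/init.d/c start"
  stop program "/etc/init.d/c stop"
  depends on d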

Sample
https://github.com/cloudfoundry/routing-release/blob/03cd155a7fec2a12de5aed7bbe1ebd220f655da3/jobs/gorouter/monit#L1-L15

check process gorouter
  with pidfile /var/vcap/sys/run/bpm/gorouter/gorouter.pid
  start program "/var/vcap/jobs/bpm/bin/bpm start gorouter"
    with timeout 60 seconds
  stop program "/var/vcap/jobs/bpm/bin/bpm stop gorouter"
  group vcap

check process gorouter-healthchecker
  with pidfile /var/vcap/sys/run/bpm/gorouter/gorouter-healthchecker.pid
  start program "/var/vcap/jobs/bpm/bin/bpm start gorouter -p gorouter-healthchecker"
    with timeout 65 seconds
  stop program "/var/vcap/jobs/bpm/bin/bpm stop gorouter -p gorouter-healthchecker"
  if 1 restarts within 1 cycles then exec "/var/vcap/packages/routing-healthchecker/bin/restart-monit-job gorouter  <%= p('healthchecker.failure_counter_file') %>"
  depends on gorouter
  group vcap
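Transposed to this release, the monit configuration could look roughly as follows. This is a sketch only: it assumes the main monit process is named k3s-server, that it runs under bpm, and that a healthchecker package would be vendored into the release with a restart-monit-job script and a healthchecker.failure_counter_file property analogous to routing-release; all of these names are to be confirmed.

check process k3s-server
  with pidfile /var/vcap/sys/run/bpm/k3s-server/k3s-server.pid
  start program "/var/vcap/jobs/bpm/bin/bpm start k3s-server"
    with timeout 60 seconds
  stop program "/var/vcap/jobs/bpm/bin/bpm stop k3s-server"
  group vcap

check process k3s-server-healthchecker
  with pidfile /var/vcap/sys/run/bpm/k3s-server/k3s-server-healthchecker.pid
  start program "/var/vcap/jobs/bpm/bin/bpm start k3s-server -p k3s-server-healthchecker"
    with timeout 65 seconds
  stop program "/var/vcap/jobs/bpm/bin/bpm stop k3s-server -p k3s-server-healthchecker"
  if 1 restarts within 1 cycles then exec "/var/vcap/packages/healthchecker/bin/restart-monit-job k3s-server <%= p('healthchecker.failure_counter_file') %>"
  depends on k3s-server
  group vcap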

Workaround

Look at the Grafana dashboard and Prometheus alerts for API server status.

Manually run the following command within a bosh ssh session (a sketch of the manual restart follows the output below):

server/987e7d4a-4a42-4bd8-8453-cbbba847df2c:/var/vcap/sys/log/k3s-server# kubectl get --raw='/livez?verbose'
Error from server (InternalError): an error on the server ("[+]ping ok
[+]log ok
[-]etcd failed: reason withheld
[+]poststarthook/start-kube-apiserver-admission-initializer ok
[+]poststarthook/generic-apiserver-start-informers ok
[+]poststarthook/priority-and-fairness-config-consumer ok
[+]poststarthook/priority-and-fairness-filter ok
[+]poststarthook/storage-object-count-tracker-hook ok
[+]poststarthook/start-apiextensions-informers ok
[+]poststarthook/start-apiextensions-controllers ok
[+]poststarthook/crd-informer-synced ok
[+]poststarthook/start-service-ip-repair-controllers ok
[+]poststarthook/rbac/bootstrap-roles ok
[+]poststarthook/scheduling/bootstrap-system-priority-classes ok
[+]poststarthook/priority-and-fairness-config-producer ok
[+]poststarthook/start-system-namespaces-controller ok
[+]poststarthook/bootstrap-controller ok
[+]poststarthook/start-cluster-authentication-info-controller ok
[+]poststarthook/start-kube-apiserver-identity-lease-controller ok
[+]poststarthook/start-kube-apiserver-identity-lease-garbage-collector ok
[+]poststarthook/start-legacy-token-tracking-controller ok
[+]poststarthook/aggregator-reload-proxy-client-cert ok
[+]poststarthook/start-kube-aggregator-informers ok
[+]poststarthook/apiservice-registration-controller ok
[+]poststarthook/apiservice-status-available-controller ok
[+]poststarthook/kube-apiserver-autoregistration ok
[+]autoregister-completion ok
[+]poststarthook/apiservice-openapi-controller ok
[+]poststarthook/apiservice-openapiv3-controller ok
[+]poststarthook/apiservice-discovery-controller ok
livez check failed") has prevented the request from succeeding

server/987e7d4a-4a42-4bd8-8453-cbbba847df2c:/var/vcap/sys/log/k3s-server# kubectl get --raw='/livez?verbose'
[+]ping ok
[+]log ok
[+]etcd ok
[+]poststarthook/start-kube-apiserver-admission-initializer ok
[+]poststarthook/generic-apiserver-start-informers ok
[+]poststarthook/priority-and-fairness-config-consumer ok
[+]poststarthook/priority-and-fairness-filter ok
[+]poststarthook/storage-object-count-tracker-hook ok
[+]poststarthook/start-apiextensions-informers ok
[+]poststarthook/start-apiextensions-controllers ok
[+]poststarthook/crd-informer-synced ok
[+]poststarthook/start-service-ip-repair-controllers ok
[+]poststarthook/rbac/bootstrap-roles ok
[+]poststarthook/scheduling/bootstrap-system-priority-classes ok
[+]poststarthook/priority-and-fairness-config-producer ok
[+]poststarthook/start-system-namespaces-controller ok
[+]poststarthook/bootstrap-controller ok
[+]poststarthook/start-cluster-authentication-info-controller ok
[+]poststarthook/start-kube-apiserver-identity-lease-controller ok
[+]poststarthook/start-kube-apiserver-identity-lease-garbage-collector ok
[+]poststarthook/start-legacy-token-tracking-controller ok
[+]poststarthook/aggregator-reload-proxy-client-cert ok
[+]poststarthook/start-kube-aggregator-informers ok
[+]poststarthook/apiservice-registration-controller ok
[+]poststarthook/apiservice-status-available-controller ok
[+]poststarthook/kube-apiserver-autoregistration ok
[+]autoregister-completion ok
[+]poststarthook/apiservice-openapi-controller ok
[+]poststarthook/apiservice-openapiv3-controller ok
[+]poststarthook/apiservice-discovery-controller ok
livez check passed
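Until the healthchecker wiring exists, a hedged sketch of the manual recovery, assuming the BOSH deployment is named k3s and the monit job is named k3s-server (adjust both to the actual deployment):

bosh -d k3s ssh server/987e7d4a-4a42-4bd8-8453-cbbba847df2c

# inside the VM, when /livez keeps failing, restart the job through monit
sudo /var/vcap/bosh/bin/monit restart k3s-server
# confirm the process comes back to "running"
sudo /var/vcap/bosh/bin/monit summary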