Add monit health checks to detect and restart failing server job #77

Open

gberche-orange opened this issue Oct 30, 2024 · 0 comments

As an operator,
In order to avoid long-lasting k3s server job failures,
I need the k3s wrapper release to watch the k8s API health endpoints using monit and to automatically restart the k3s-server job upon prolonged failures.

References:

healthchecker is added to a boshrelease as a monit process under the Job that is to be monitored. It is configured to perform a healthcheck against the main process in the Job. If healthchecker detects a failure, it panics and exits. monit should be configured to run the restart-monit-job script on the failure of the healthchecker process. This script restarts the main monit process, up to ten failures in a row. After 10 consecutive failures it gives up, since restarting is clearly not helping: either the process is in a horrible state, or the healthchecker is misconfigured and should not be causing process downtime.
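For this release, the check performed by such a healthchecker process would presumably target the kube-apiserver liveness endpoint. A minimal sketch, assuming the k3s API server listens on the default https://127.0.0.1:6443 and that anonymous access to /livez is allowed (both assumptions to verify against the actual deployment):

# sketch only: probe the apiserver /livez endpoint and exit non-zero on failure,
# which is the signal the healthchecker/monit wiring would act on.
# --insecure is used because the server CA path is deployment-specific.
curl --fail --silent --show-error --insecure \
  'https://127.0.0.1:6443/livez?verbose' || exit 1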

Consider the following common server setup:

WEB-SERVER (a) -> APPLICATION-SERVER (b) -> DATABASE (c) -> FILESYSTEM (d)

If d does not run, when Monit runs it will first stop a, b and c, then start d, and finally start c, b and then a.
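As an illustration only (generic names and paths, not taken from any release), this dependency chain maps to monit configuration along these lines:

# illustrative only: c declares a dependency on d, so monit stops/starts them in order
check process d
  with pidfile /var/run/d.pid
  start program "/etc/init.d/d start"
  stop program "/etc/init.d/d stop"

check process c
  with pidfile /var/run/c.pid
  start program "/etc/init.d/c start"
  stop program "/etc/init.d/c stop"
  depends on d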

Sample
https://github.com/cloudfoundry/routing-release/blob/03cd155a7fec2a12de5aed7bbe1ebd220f655da3/jobs/gorouter/monit#L1-L15

check process gorouter
  with pidfile /var/vcap/sys/run/bpm/gorouter/gorouter.pid
  start program "/var/vcap/jobs/bpm/bin/bpm start gorouter"
    with timeout 60 seconds
  stop program "/var/vcap/jobs/bpm/bin/bpm stop gorouter"
  group vcap

check process gorouter-healthchecker
  with pidfile /var/vcap/sys/run/bpm/gorouter/gorouter-healthchecker.pid
  start program "/var/vcap/jobs/bpm/bin/bpm start gorouter -p gorouter-healthchecker"
    with timeout 65 seconds
  stop program "/var/vcap/jobs/bpm/bin/bpm stop gorouter -p gorouter-healthchecker"
  if 1 restarts within 1 cycles then exec "/var/vcap/packages/routing-healthchecker/bin/restart-monit-job gorouter  <%= p('healthchecker.failure_counter_file') %>"
  depends on gorouter
  group vcap
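Transposed to this release, the monit configuration could look roughly as follows. This is a sketch only: it assumes the main monit process is named k3s-server, that it runs under bpm, and that a healthchecker package would be vendored into the release with a restart-monit-job script and a healthchecker.failure_counter_file property analogous to routing-release; all of these names are to be confirmed.

check process k3s-server
  with pidfile /var/vcap/sys/run/bpm/k3s-server/k3s-server.pid
  start program "/var/vcap/jobs/bpm/bin/bpm start k3s-server"
    with timeout 60 seconds
  stop program "/var/vcap/jobs/bpm/bin/bpm stop k3s-server"
  group vcap

check process k3s-server-healthchecker
  with pidfile /var/vcap/sys/run/bpm/k3s-server/k3s-server-healthchecker.pid
  start program "/var/vcap/jobs/bpm/bin/bpm start k3s-server -p k3s-server-healthchecker"
    with timeout 65 seconds
  stop program "/var/vcap/jobs/bpm/bin/bpm stop k3s-server -p k3s-server-healthchecker"
  if 1 restarts within 1 cycles then exec "/var/vcap/packages/healthchecker/bin/restart-monit-job k3s-server <%= p('healthchecker.failure_counter_file') %>"
  depends on k3s-server
  group vcap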

Workaround

Look at the Grafana dashboard and Prometheus alerts for API server status.

Manually run the following command within a bosh ssh session (a sketch of the manual restart follows the output below):

server/987e7d4a-4a42-4bd8-8453-cbbba847df2c:/var/vcap/sys/log/k3s-server# kubectl get --raw='/livez?verbose'
Error from server (InternalError): an error on the server ("[+]ping ok
[+]log ok
[-]etcd failed: reason withheld
[+]poststarthook/start-kube-apiserver-admission-initializer ok
[+]poststarthook/generic-apiserver-start-informers ok
[+]poststarthook/priority-and-fairness-config-consumer ok
[+]poststarthook/priority-and-fairness-filter ok
[+]poststarthook/storage-object-count-tracker-hook ok
[+]poststarthook/start-apiextensions-informers ok
[+]poststarthook/start-apiextensions-controllers ok
[+]poststarthook/crd-informer-synced ok
[+]poststarthook/start-service-ip-repair-controllers ok
[+]poststarthook/rbac/bootstrap-roles ok
[+]poststarthook/scheduling/bootstrap-system-priority-classes ok
[+]poststarthook/priority-and-fairness-config-producer ok
[+]poststarthook/start-system-namespaces-controller ok
[+]poststarthook/bootstrap-controller ok
[+]poststarthook/start-cluster-authentication-info-controller ok
[+]poststarthook/start-kube-apiserver-identity-lease-controller ok
[+]poststarthook/start-kube-apiserver-identity-lease-garbage-collector ok
[+]poststarthook/start-legacy-token-tracking-controller ok
[+]poststarthook/aggregator-reload-proxy-client-cert ok
[+]poststarthook/start-kube-aggregator-informers ok
[+]poststarthook/apiservice-registration-controller ok
[+]poststarthook/apiservice-status-available-controller ok
[+]poststarthook/kube-apiserver-autoregistration ok
[+]autoregister-completion ok
[+]poststarthook/apiservice-openapi-controller ok
[+]poststarthook/apiservice-openapiv3-controller ok
[+]poststarthook/apiservice-discovery-controller ok
livez check failed") has prevented the request from succeeding

server/987e7d4a-4a42-4bd8-8453-cbbba847df2c:/var/vcap/sys/log/k3s-server# kubectl get --raw='/livez?verbose'
[+]ping ok
[+]log ok
[+]etcd ok
[+]poststarthook/start-kube-apiserver-admission-initializer ok
[+]poststarthook/generic-apiserver-start-informers ok
[+]poststarthook/priority-and-fairness-config-consumer ok
[+]poststarthook/priority-and-fairness-filter ok
[+]poststarthook/storage-object-count-tracker-hook ok
[+]poststarthook/start-apiextensions-informers ok
[+]poststarthook/start-apiextensions-controllers ok
[+]poststarthook/crd-informer-synced ok
[+]poststarthook/start-service-ip-repair-controllers ok
[+]poststarthook/rbac/bootstrap-roles ok
[+]poststarthook/scheduling/bootstrap-system-priority-classes ok
[+]poststarthook/priority-and-fairness-config-producer ok
[+]poststarthook/start-system-namespaces-controller ok
[+]poststarthook/bootstrap-controller ok
[+]poststarthook/start-cluster-authentication-info-controller ok
[+]poststarthook/start-kube-apiserver-identity-lease-controller ok
[+]poststarthook/start-kube-apiserver-identity-lease-garbage-collector ok
[+]poststarthook/start-legacy-token-tracking-controller ok
[+]poststarthook/aggregator-reload-proxy-client-cert ok
[+]poststarthook/start-kube-aggregator-informers ok
[+]poststarthook/apiservice-registration-controller ok
[+]poststarthook/apiservice-status-available-controller ok
[+]poststarthook/kube-apiserver-autoregistration ok
[+]autoregister-completion ok
[+]poststarthook/apiservice-openapi-controller ok
[+]poststarthook/apiservice-openapiv3-controller ok
[+]poststarthook/apiservice-discovery-controller ok
livez check passed
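Until the healthchecker wiring exists, a hedged sketch of the manual recovery, assuming the BOSH deployment is named k3s and the monit job is named k3s-server (adjust both to the actual deployment):

bosh -d k3s ssh server/987e7d4a-4a42-4bd8-8453-cbbba847df2c

# inside the VM, when /livez keeps failing, restart the job through monit
sudo /var/vcap/bosh/bin/monit restart k3s-server
# confirm the process comes back to "running"
sudo /var/vcap/bosh/bin/monit summary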