As an operator,
In order to avoid long-running k3s server job failures,
I need the k3s wrapper release to watch the k8s API health endpoints using monit and to automatically restart k3s-server upon prolonged failures.
healthchecker is added to a BOSH release as a monit process under the job that is to be monitored. It is configured to perform a healthcheck against the main process in the job. If healthchecker detects a failure, it panics and exits. monit should be configured to run the restart-monit-job script on failure of the healthchecker process. This script restarts the main monit process, tolerating up to ten failures in a row. After 10 consecutive failures it gives up, since either the monitored process is in a state that restarting cannot fix, or the healthchecker is misconfigured and should not be causing process downtime.
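For illustration, here is a minimal sketch of what such a restart script could look like. The actual restart-monit-job ships with the healthchecker package in routing-release; the paths, counter handling, and give-up behaviour below are assumptions, not the real implementation:

```bash
#!/usr/bin/env bash
# Hypothetical restart-monit-job-style script: NOT the routing-release
# implementation, just a sketch of the behaviour described above.
# Usage: restart-monit-job <job-name> <failure-counter-file>
set -euo pipefail

job="$1"
counter_file="$2"
max_failures=10   # "ten failures in a row", as described above

failures=0
if [ -f "$counter_file" ]; then
  failures="$(cat "$counter_file")"
fi
failures=$((failures + 1))
echo "$failures" > "$counter_file"

if [ "$failures" -gt "$max_failures" ]; then
  # Restarting is evidently not helping; stop flapping the job.
  echo "giving up on ${job} after ${failures} consecutive failures" >&2
  exit 1
fi

# Ask monit to restart the monitored job (standard monit path on a BOSH VM).
/var/vcap/bosh/bin/monit restart "$job"
```

The gorouter job in routing-release (see References below) wires the healthchecker process and the restart hook together like this: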
```
check process gorouter
  with pidfile /var/vcap/sys/run/bpm/gorouter/gorouter.pid
  start program "/var/vcap/jobs/bpm/bin/bpm start gorouter"
    with timeout 60 seconds
  stop program "/var/vcap/jobs/bpm/bin/bpm stop gorouter"
  group vcap

check process gorouter-healthchecker
  with pidfile /var/vcap/sys/run/bpm/gorouter/gorouter-healthchecker.pid
  start program "/var/vcap/jobs/bpm/bin/bpm start gorouter -p gorouter-healthchecker"
    with timeout 65 seconds
  stop program "/var/vcap/jobs/bpm/bin/bpm stop gorouter -p gorouter-healthchecker"
  if 1 restarts within 1 cycles then exec "/var/vcap/packages/routing-healthchecker/bin/restart-monit-job gorouter <%= p('healthchecker.failure_counter_file') %>"
  depends on gorouter
  group vcap
```
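A minimal sketch of the same pattern for the k3s-server job could look as follows. All process names, ctl scripts, package paths, and the counter file location are assumptions, not existing release code:

```
# Hypothetical monit stanzas for the k3s wrapper release (names assumed)
check process k3s-server
  with pidfile /var/vcap/sys/run/k3s-server/k3s-server.pid
  start program "/var/vcap/jobs/k3s-server/bin/ctl start"
  stop program "/var/vcap/jobs/k3s-server/bin/ctl stop"
  group vcap

check process k3s-healthchecker
  with pidfile /var/vcap/sys/run/k3s-server/k3s-healthchecker.pid
  start program "/var/vcap/jobs/k3s-server/bin/healthchecker_ctl start"
  stop program "/var/vcap/jobs/k3s-server/bin/healthchecker_ctl stop"
  if 1 restarts within 1 cycles then exec "/var/vcap/packages/healthchecker/bin/restart-monit-job k3s-server /var/vcap/data/k3s-server/healthchecker.failures"
  depends on k3s-server
  group vcap
```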
Workaround
Look at the Grafana dashboard and Prometheus alerts for API server status.
Manually run the following command within a bosh ssh session:
```
server/987e7d4a-4a42-4bd8-8453-cbbba847df2c:/var/vcap/sys/log/k3s-server# kubectl get --raw='/livez?verbose'
Error from server (InternalError): an error on the server ("[+]ping ok
[+]log ok
[-]etcd failed: reason withheld
[+]poststarthook/start-kube-apiserver-admission-initializer ok
[+]poststarthook/generic-apiserver-start-informers ok
[+]poststarthook/priority-and-fairness-config-consumer ok
[+]poststarthook/priority-and-fairness-filter ok
[+]poststarthook/storage-object-count-tracker-hook ok
[+]poststarthook/start-apiextensions-informers ok
[+]poststarthook/start-apiextensions-controllers ok
[+]poststarthook/crd-informer-synced ok
[+]poststarthook/start-service-ip-repair-controllers ok
[+]poststarthook/rbac/bootstrap-roles ok
[+]poststarthook/scheduling/bootstrap-system-priority-classes ok
[+]poststarthook/priority-and-fairness-config-producer ok
[+]poststarthook/start-system-namespaces-controller ok
[+]poststarthook/bootstrap-controller ok
[+]poststarthook/start-cluster-authentication-info-controller ok
[+]poststarthook/start-kube-apiserver-identity-lease-controller ok
[+]poststarthook/start-kube-apiserver-identity-lease-garbage-collector ok
[+]poststarthook/start-legacy-token-tracking-controller ok
[+]poststarthook/aggregator-reload-proxy-client-cert ok
[+]poststarthook/start-kube-aggregator-informers ok
[+]poststarthook/apiservice-registration-controller ok
[+]poststarthook/apiservice-status-available-controller ok
[+]poststarthook/kube-apiserver-autoregistration ok
[+]autoregister-completion ok
[+]poststarthook/apiservice-openapi-controller ok
[+]poststarthook/apiservice-openapiv3-controller ok
[+]poststarthook/apiservice-discovery-controller ok
livez check failed") has prevented the request from succeeding
server/987e7d4a-4a42-4bd8-8453-cbbba847df2c:/var/vcap/sys/log/k3s-server# kubectl get --raw='/livez?verbose'
[+]ping ok
[+]log ok
[+]etcd ok
[+]poststarthook/start-kube-apiserver-admission-initializer ok
[+]poststarthook/generic-apiserver-start-informers ok
[+]poststarthook/priority-and-fairness-config-consumer ok
[+]poststarthook/priority-and-fairness-filter ok
[+]poststarthook/storage-object-count-tracker-hook ok
[+]poststarthook/start-apiextensions-informers ok
[+]poststarthook/start-apiextensions-controllers ok
[+]poststarthook/crd-informer-synced ok
[+]poststarthook/start-service-ip-repair-controllers ok
[+]poststarthook/rbac/bootstrap-roles ok
[+]poststarthook/scheduling/bootstrap-system-priority-classes ok
[+]poststarthook/priority-and-fairness-config-producer ok
[+]poststarthook/start-system-namespaces-controller ok
[+]poststarthook/bootstrap-controller ok
[+]poststarthook/start-cluster-authentication-info-controller ok
[+]poststarthook/start-kube-apiserver-identity-lease-controller ok
[+]poststarthook/start-kube-apiserver-identity-lease-garbage-collector ok
[+]poststarthook/start-legacy-token-tracking-controller ok
[+]poststarthook/aggregator-reload-proxy-client-cert ok
[+]poststarthook/start-kube-aggregator-informers ok
[+]poststarthook/apiservice-registration-controller ok
[+]poststarthook/apiservice-status-available-controller ok
[+]poststarthook/kube-apiserver-autoregistration ok
[+]autoregister-completion ok
[+]poststarthook/apiservice-openapi-controller ok
[+]poststarthook/apiservice-openapiv3-controller ok
[+]poststarthook/apiservice-discovery-controller ok
livez check passed
```
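An automated version of this manual check, i.e. what the healthchecker process could run periodically, might look like the following sketch. The kubectl binary and kubeconfig paths are assumptions:

```bash
#!/usr/bin/env bash
# Hypothetical periodic liveness probe mirroring the manual
# `kubectl get --raw='/livez?verbose'` check above. A non-zero exit lets
# monit count a failure and eventually trigger restart-monit-job.
set -uo pipefail

export KUBECONFIG="/var/vcap/store/k3s-server/kubeconfig"  # assumed path
KUBECTL="/var/vcap/packages/k3s/kubectl"                   # assumed path

if ! "$KUBECTL" get --raw='/livez' >/dev/null 2>&1; then
  echo "k8s API /livez check failed" >&2
  exit 1
fi
echo "k8s API /livez check passed"
```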
References:
Sample monit configuration from routing-release:
https://github.com/cloudfoundry/routing-release/blob/03cd155a7fec2a12de5aed7bbe1ebd220f655da3/jobs/gorouter/monit#L1-L15