Command(s): curl https://raw.githubusercontent.com/mattmattox/k8s-troubleshooting/master/kube-scheduler | bash
Example Output of a healthy cluster
kube-scheduler is the leader on node a1ubk8slabl03
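If the script is not available, the current leader can also be checked by hand. On releases where the scheduler still uses endpoint-based leader election, the holder identity is recorded in an annotation on the kube-scheduler endpoints object (newer releases record leadership in a Lease object in kube-system instead; the annotation key below is the upstream default and may not exist on your cluster):
Command(s): kubectl -n kube-system get endpoints kube-scheduler -o jsonpath='{.metadata.annotations.control-plane\.alpha\.kubernetes\.io/leader}'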
Command(s): docker exec etcd etcdctl member list
Example Output of a healthy cluster
2f080bc6ec98f39b, started, etcd-a1ubrkeat03, https://172.27.5.33:2380, https://172.27.5.33:2379,https://172.27.5.33:4001, false
9d7204f89b221ba3, started, etcd-a1ubrkeat01, https://172.27.5.31:2380, https://172.27.5.31:2379,https://172.27.5.31:4001, false
bd37bc0dc2e990b6, started, etcd-a1ubrkeat02, https://172.27.5.32:2380, https://172.27.5.32:2379,https://172.27.5.32:4001, false
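The v3 endpoint subcommands give the same membership view plus per-member health, leader status, and database size; these assume the etcd container already carries its ETCDCTL_* certificate environment, as RKE-built containers do:
Command(s):
docker exec etcd etcdctl endpoint health
docker exec etcd etcdctl endpoint status --write-out=table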
Command(s): curl https://raw.githubusercontent.com/mattmattox/etcd-troubleshooting/master/etcd-endpoints | bash
Example Output of a healthy cluster
Validating connection to https://172.27.5.33:2379/health
{"health":"true"}
Validating connection to https://172.27.5.31:2379/health
{"health":"true"}
Validating connection to https://172.27.5.32:2379/health
{"health":"true"}
health check for peer xxx could not connect: dial tcp IP:2380: getsockopt: connection refused
A connection to the peer address shown on port 2380 cannot be established. Check whether the etcd container is running on the host at that address.
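To confirm, check the container state and whether anything is listening on the peer port; the commands below assume an RKE-style node where etcd runs as a Docker container named etcd:
Command(s):
docker ps -a --filter name=etcd
docker logs --tail 50 etcd
ss -tlnp | grep 2380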
xxx is starting a new election at term x
The etcd cluster has lost its quorum and is trying to elect a new leader. This happens when a majority of the etcd nodes go down or become unreachable; a three-node cluster, for example, needs at least two reachable members to elect a leader.
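To see how long the election churn has been going on, scan the etcd logs for election and leadership messages:
Command(s): docker logs etcd 2>&1 | grep -iE 'election|leader'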
connection error: desc = "transport: Error while dialing dial tcp 0.0.0.0:2379: i/o timeout"; Reconnecting to {0.0.0.0:2379 0 <nil>}
The host firewall is preventing network communication.
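One quick check is to probe the client and peer ports from another etcd node and to look for firewall rules touching them (the IP below is taken from the example output above; substitute your own):
Command(s):
nc -zv 172.27.5.31 2379
nc -zv 172.27.5.31 2380
sudo iptables -nL | grep -E '2379|2380'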
rafthttp: request cluster ID mismatch
The etcd instance logging rafthttp: request cluster ID mismatch is trying to join a cluster that has already been formed with another peer. The node should be removed from the cluster and re-added.
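A remove/re-add cycle looks roughly like the following, run from a healthy member. The member ID, name, and peer URL are illustrative values lifted from the example output above; on RKE clusters the re-add is normally driven by rke up or Rancher rather than done by hand:
Command(s):
docker exec etcd etcdctl member remove 2f080bc6ec98f39b
docker exec etcd etcdctl member add etcd-a1ubrkeat03 --peer-urls=https://172.27.5.33:2380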
rafthttp: failed to find member
The local cluster state under /var/lib/etcd contains stale or incorrect information about the cluster. The node should be removed from the cluster, the state directory cleaned, and the node re-added.
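A sketch of the cleanup on the affected node; the path assumes the default RKE data directory, and moving the directory aside is safer than deleting it outright:
Command(s):
docker stop etcd
sudo mv /var/lib/etcd /var/lib/etcd.bak
Then remove and re-add the member from a healthy node as shown above.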
curl -XPUT -d '{"Level":"DEBUG"}' --cacert $(docker exec etcd printenv ETCDCTL_CACERT) --cert $(docker exec etcd printenv ETCDCTL_CERT) --key $(docker exec etcd printenv ETCDCTL_KEY) https://localhost:2379/config/local/log
curl -XPUT -d '{"Level":"INFO"}' --cacert $(docker exec etcd printenv ETCDCTL_CACERT) --cert $(docker exec etcd printenv ETCDCTL_CERT) --key $(docker exec etcd printenv ETCDCTL_KEY) https://localhost:2379/config/local/log
curl -X GET --cacert $(docker exec etcd printenv ETCDCTL_CACERT) --cert $(docker exec etcd printenv ETCDCTL_CERT) --key $(docker exec etcd printenv ETCDCTL_KEY) https://localhost:2379/metrics
wal_fsync_duration_seconds (99% under 10 ms)
A wal_fsync is called when etcd persists its log entries to disk before applying them.
backend_commit_duration_seconds (99% under 25 ms)
A backend_commit is called when etcd commits an incremental snapshot of its most recent changes to disk.
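Both histograms can be pulled out of the metrics dump from the command above:
Command(s): curl -s -X GET --cacert $(docker exec etcd printenv ETCDCTL_CACERT) --cert $(docker exec etcd printenv ETCDCTL_CERT) --key $(docker exec etcd printenv ETCDCTL_KEY) https://localhost:2379/metrics | grep -E 'wal_fsync_duration_seconds|backend_commit_duration_seconds'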
Run the following script on each control plane node:
https://raw.githubusercontent.com/mattmattox/k8s-troubleshooting/master/kube-apiserver-check-etcd
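Following the same pattern as the other scripts in this guide, it can presumably be piped straight to bash:
Command(s): curl https://raw.githubusercontent.com/mattmattox/k8s-troubleshooting/master/kube-apiserver-check-etcd | bash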
Check kubelet logging
Because the kubelet is the node agent, its logs contain the most detail about the operations it executes in response to scheduling requests.
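On RKE clusters the kubelet runs as a Docker container named kubelet, so its logs can be tailed with docker logs; on nodes where the kubelet is a systemd service, use journalctl instead:
Command(s):
docker logs --tail 100 kubelet
journalctl -u kubelet --since "1 hour ago"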
Check kubelet stats
https://raw.githubusercontent.com/mattmattox/k8s-troubleshooting/master/kubelet-stats
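As above, the script can presumably be piped straight to bash:
Command(s): curl https://raw.githubusercontent.com/mattmattox/k8s-troubleshooting/master/kubelet-stats | bash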