Replies: 2 comments 1 reply
-
Please only use issues for problems with hetzner-k3s itself; for anything else there are discussions (I converted this to one). You probably have some workload that uses more resources than allocated, or doesn't even have requests and limits set, and is therefore causing instability on the nodes. It's unlikely to be a problem with k3s and surely not a problem with hetzner-k3s. If you can share more details on what you are running in the cluster, I or others might be able to offer advice.
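For reference, a minimal sketch of what requests and limits look like on a container (workload name, image, and values are illustrative, not taken from this cluster):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app            # hypothetical workload
spec:
  replicas: 2
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      containers:
        - name: app
          image: nginx:1.25    # placeholder image
          resources:
            requests:          # what the scheduler reserves on the node
              cpu: 100m
              memory: 128Mi
            limits:            # hard ceiling; exceeding the memory limit gets the container OOM-killed
              cpu: 500m
              memory: 256Mi
```

Without requests, the scheduler can pack more pods onto a node than it can actually serve, which is one common way nodes end up unstable.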
-
Hello, I have exactly the same problem: after a few days one of my nodes becomes unstable and I have to reboot it.
-
Hey,
Situation
After several days of running, a cluster set up with hetzner-k3s becomes unstable: request processing times grow large, pods are restarted constantly, `context deadline exceeded` errors appear, and outgoing requests seemingly fail because cilium-envoy is failing as well. The cluster is set up with the configuration shown under "Cluster config" below.
I re-created the cluster multiple times with the same result. I initially suspected that this could be related to throttling, since I was using Shared vCPUs, which Hetzner has become more aggressive about throttling when they are used too heavily. But the exact same issue also happens with Dedicated vCPUs.
Anomalies / Noticed errors
Apologies for the rather random mix of anomalies; since I don't know the root cause, it is hard to differentiate between causes and symptoms:
- `context deadline exceeded` errors and timeouts
- Multiple pods constantly failing and being restarted (see the probe sketch below)
- Failing outgoing requests / timeouts
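One pattern that can produce exactly this cascade (a guess, not a confirmed diagnosis for this cluster) is tight liveness-probe timeouts: once a node is overloaded, probes start timing out with `context deadline exceeded`, the kubelet kills and restarts the pods, and the restarts add yet more load. A minimal sketch of relaxing probe settings, with a hypothetical path and port:

```yaml
livenessProbe:
  httpGet:
    path: /healthz     # hypothetical health endpoint
    port: 8080         # hypothetical container port
  timeoutSeconds: 5    # default is 1s, which an overloaded node easily exceeds
  periodSeconds: 10
  failureThreshold: 6  # tolerate transient slowness before the pod is killed
```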
Monitoring
CPU usage is very high, yet this could just be a symptom of pods being constantly restarted.
This is the `top` output on the worker node.
There is a lot of swapping going on, yet this too could be a symptom rather than the cause.
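If the node is indeed swapping itself into unresponsiveness, one mitigation worth trying (a sketch, assuming the k3s agents read `/etc/rancher/k3s/config.yaml`; the threshold values are illustrative) is to reserve headroom for the system and let the kubelet evict pods before memory runs out:

```yaml
# /etc/rancher/k3s/config.yaml on an agent node -- values are illustrative
kubelet-arg:
  - "system-reserved=cpu=250m,memory=512Mi"                      # headroom for the OS and k3s itself
  - "eviction-hard=memory.available<500Mi,nodefs.available<10%"  # evict pods before the node starts thrashing
```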
Cluster config
Additional
What complicates matters further is that I'm using Grafana Cloud for observability and, while these issues occur, no metrics are forwarded to Grafana Cloud anymore. I assume this is related to the same issue that is causing the failing outgoing requests (i.e. the pushes no longer reach Grafana Cloud either).
Weirdly, a similar or identical situation happens with clusters created by the terraform-hcloud-kube-hetzner project as well.
Lastly, there are some similarities to #424: the containers crashing there also crashed for me and exhibited similar errors (e.g. the `context deadline exceeded` error and failing outgoing requests). As the root cause remained undiscovered there, this may be a related or even the same issue.