Replies: 2 comments 1 reply
-
Please only use issues for problems with hetzner-k3s itself; for anything else there are discussions (I converted this to one). You probably have some workload that uses more resources than allocated, or doesn't even have requests and limits set, and is therefore causing instability on the nodes. It's unlikely to be a problem with k3s and surely not a problem with hetzner-k3s. If you can share more details on what you are running in the cluster, I or others might be able to offer advice.
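For reference, a minimal sketch of what requests and limits look like on a container (workload name, image, and values are illustrative, not taken from this cluster):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app            # hypothetical workload
spec:
  replicas: 2
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      containers:
        - name: app
          image: nginx:1.25    # placeholder image
          resources:
            requests:          # what the scheduler reserves on the node
              cpu: 100m
              memory: 128Mi
            limits:            # hard ceiling; exceeding the memory limit gets the container OOM-killed
              cpu: 500m
              memory: 256Mi
```

Without requests, the scheduler can pack more pods onto a node than it can actually serve, which is one common way nodes end up unstable.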
-
Hello, I have exactly the same problem: after a few days one of my nodes becomes unstable and I have to reboot it.
-
Hey,
Situation
After several days of running, a cluster set up with hetzner-k3s becomes unstable: request processing times grow large, pods are restarted constantly, `context deadline exceeded` errors appear, and outgoing requests seemingly fail because cilium-envoy is failing as well. The cluster is set up with the configuration shown under "Cluster config" below.
I re-created the cluster multiple times with the same result. I initially suspected that this could be related to throttling, since I was using Shared vCPUs, which Hetzner has become more aggressive about throttling when they are used too heavily. But the exact same issue also happens with Dedicated vCPUs.
Anomalies / Noticed errors
Apologies for the rather random mix of anomalies; since I don't know the root cause, it is hard to differentiate between causes and symptoms:
- `context deadline exceeded` errors and timeouts
- Multiple pods constantly failing and being restarted (see the probe sketch below)
- Failing outgoing requests / timeouts
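One pattern that can produce exactly this cascade (a guess, not a confirmed diagnosis for this cluster) is tight liveness-probe timeouts: once a node is overloaded, probes start timing out with `context deadline exceeded`, the kubelet kills and restarts the pods, and the restarts add yet more load. A minimal sketch of relaxing probe settings, with a hypothetical path and port:

```yaml
livenessProbe:
  httpGet:
    path: /healthz     # hypothetical health endpoint
    port: 8080         # hypothetical container port
  timeoutSeconds: 5    # default is 1s, which an overloaded node easily exceeds
  periodSeconds: 10
  failureThreshold: 6  # tolerate transient slowness before the pod is killed
```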
Monitoring
CPU usage is very high, yet this could just be a symptom of pods being constantly restarted.
This is the `top` output on the worker node.
There is a lot of swapping going on, yet this too could be a symptom rather than the cause.
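If the node is indeed swapping itself into unresponsiveness, one mitigation worth trying (a sketch, assuming the k3s agents read `/etc/rancher/k3s/config.yaml`; the threshold values are illustrative) is to reserve headroom for the system and let the kubelet evict pods before memory runs out:

```yaml
# /etc/rancher/k3s/config.yaml on an agent node -- values are illustrative
kubelet-arg:
  - "system-reserved=cpu=250m,memory=512Mi"                      # headroom for the OS and k3s itself
  - "eviction-hard=memory.available<500Mi,nodefs.available<10%"  # evict pods before the node starts thrashing
```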
Cluster config
Additional
What complicates matters further is that I'm using Grafana Cloud for observability and, while these issues occur, no metrics are forwarded to Grafana Cloud anymore. I assume this is related to the same issue that is causing the failing outgoing requests (i.e. the pushes no longer reach Grafana Cloud either).
Weirdly, a similar or identical situation happens with clusters created by the terraform-hcloud-kube-hetzner project as well.
Lastly, there are some similarities to #424: the containers crashing there also crashed for me and exhibited similar errors (e.g. the `context deadline exceeded` error and failing outgoing requests). As the root cause remained undiscovered there, this may be a related or even the same issue.