
K3s v1.29.10 and v1.30.8 produces a lot of core dumps if CrowdStrike falcon-sensor is installed #11207

Closed
chilicat opened this issue Nov 4, 2024 · 9 comments


chilicat commented Nov 4, 2024

Environmental Info:

Additional software:

  • CrowdStrike: falcon-sensor/now 7.17.0-17005

K3s Version:

k3s version v1.29.10+k3s1 (ae4df31)
go version go1.22.8

Node(s) CPU architecture, OS, and Version:

Linux host 5.15.0-56-generic #62-Ubuntu SMP Tue Nov 22 19:54:14 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Cluster Configuration:

3 server/agent

Describe the bug:

K3s produces a lot of core dumps during deployment of Helm charts. This issue was not present with v1.29.8 (I did not test v1.29.9).

find /var/cores -mindepth 1 -maxdepth 1 -exec du -sh {} \;
41M     /var/cores/core.ctr-159828-1730629323-24.10.0.341
41M     /var/cores/core.ctr-160290-1730629323-24.10.0.341
45M     /var/cores/core.ctr-160360-1730629323-24.10.0.341
41M     /var/cores/core.ctr-160378-1730629323-24.10.0.341
45M     /var/cores/core.ctr-160413-1730629324-24.10.0.341

I believe one core dump is created per pod crash.
The names of the core dumps suggest they are produced by ctr (the containerd CLI).
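As a side note, the dump file names appear to encode the crashing command, its PID, and a Unix timestamp (an assumed layout inferred from the listing above; the meaning of the trailing 24.10.0.341 component is not confirmed). A quick shell sketch to pull the fields apart:

```shell
# Assumed layout: core.<comm>-<pid>-<unix_timestamp>-<suffix>
name="core.ctr-159828-1730629323-24.10.0.341"   # first file from the listing
pid=$(echo "$name" | cut -d- -f2)               # -> 159828
ts=$(echo "$name" | cut -d- -f3)                # -> 1730629323
# GNU date converts the timestamp to a readable crash time
date -u -d "@$ts"
```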

Steps To Reproduce:

Given:

  • Ubuntu 22.04
  • Crowdstrike falcon-sensor installed and service running
# install k3s 
export INSTALL_K3S_VERSION=v1.29.10+k3s1
curl -sfL https://get.k3s.io | sh 

# define path for the core dumps
sysctl -w kernel.core_pattern=/var/crash/core

curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
chmod 700 get_helm.sh
./get_helm.sh 

# install something...
export KUBECONFIG=/etc/rancher/k3s/k3s.yaml
helm install sw oci://registry-1.docker.io/bitnamicharts/seaweedfs

# wait until some pods are running 

# kill pods to provoke the core dumps
kubectl delete po --all -A

# core dumps are created...
ls /var/crash/
core.29127  core.31341  core.32124  core.33311
   

The issue is only present if the falcon-sensor service is running.

# stop falcon
systemctl stop falcon-sensor 

# remove existing core dumps
rm -fr /var/crash/core.*

# delete pods to provoke core dumps
kubectl delete po --all -A

# watch dump folder... none created!
watch ls /var/crash/

Expected behavior:

No core dumps; same behavior as v1.29.8 (and earlier versions).

chilicat (Author) commented Nov 4, 2024

The problem is not specific to the nvidia device plugin, but it was the easiest way to reproduce it.

I delete the pod:

kubectl delete po -n nvidia-device-plugin   nvdp-node-feature-discovery-worker-jv6kx  

Then I see new core dumps appear:

ls /var/cores/

core.ctr-38912-1730717331-24.10.0.341
core.ctr-39049-1730717332-24.10.0.341
core.ctr-39070-1730717332-24.10.0.341
core.ctr-39089-1730717332-24.10.0.341

Here are the K3s logs:

Nov 04 10:48:51 kw-vm-41 k3s[2572]: I1104 10:48:51.449594    2572 event.go:376] "Event occurred" object="nvidia-device-plugin/nvdp-node-feature-discovery-worker" fieldPath="" kind="DaemonSet" apiVersion="apps/v1" type="Normal" reason="SuccessfulCreate" message="Created pod: nvdp-node-feature-discovery-worker-x2tqf"
Nov 04 10:48:51 kw-vm-41 k3s[2572]: I1104 10:48:51.458638    2572 topology_manager.go:215] "Topology Admit Handler" podUID="558daa24-f416-43f0-aa1d-70cfbeca2b8f" podNamespace="nvidia-device-plugin" podName="nvdp-node-feature-discovery-worker-x2tqf"
Nov 04 10:48:51 kw-vm-41 k3s[2572]: E1104 10:48:51.458724    2572 cpu_manager.go:395] "RemoveStaleState: removing container" podUID="a053d981-3197-43a5-b217-3c6dfceecde3" containerName="worker"
Nov 04 10:48:51 kw-vm-41 k3s[2572]: E1104 10:48:51.458740    2572 cpu_manager.go:395] "RemoveStaleState: removing container" podUID="53d81b56-b1d6-4770-8475-950a90394caa" containerName="worker"
Nov 04 10:48:51 kw-vm-41 k3s[2572]: I1104 10:48:51.458789    2572 memory_manager.go:354] "RemoveStaleState removing state" podUID="a053d981-3197-43a5-b217-3c6dfceecde3" containerName="worker"
Nov 04 10:48:51 kw-vm-41 k3s[2572]: I1104 10:48:51.508082    2572 reconciler_common.go:172] "operationExecutor.UnmountVolume started for volume \"host-usr-lib\" (UniqueName: \"kubernetes.io/host-path/a053d981-3197-43a5-b217-3c6dfceecde3-host-usr-lib\") pod \"a053d981-3197-43a5-b217-3c6dfceecde3\" (UID: \"a053d981-3197-43a5-b217-3c6dfceecde3\") "
Nov 04 10:48:51 kw-vm-41 k3s[2572]: I1104 10:48:51.508138    2572 reconciler_common.go:172] "operationExecutor.UnmountVolume started for volume \"host-boot\" (UniqueName: \"kubernetes.io/host-path/a053d981-3197-43a5-b217-3c6dfceecde3-host-boot\") pod \"a053d981-3197-43a5-b217-3c6dfceecde3\" (UID: \"a053d981-3197-43a5-b217-3c6dfceecde3\") "
Nov 04 10:48:51 kw-vm-41 k3s[2572]: I1104 10:48:51.508167    2572 reconciler_common.go:172] "operationExecutor.UnmountVolume started for volume \"host-sys\" (UniqueName: \"kubernetes.io/host-path/a053d981-3197-43a5-b217-3c6dfceecde3-host-sys\") pod \"a053d981-3197-43a5-b217-3c6dfceecde3\" (UID: \"a053d981-3197-43a5-b217-3c6dfceecde3\") "
Nov 04 10:48:51 kw-vm-41 k3s[2572]: I1104 10:48:51.508192    2572 reconciler_common.go:172] "operationExecutor.UnmountVolume started for volume \"host-lib\" (UniqueName: \"kubernetes.io/host-path/a053d981-3197-43a5-b217-3c6dfceecde3-host-lib\") pod \"a053d981-3197-43a5-b217-3c6dfceecde3\" (UID: \"a053d981-3197-43a5-b217-3c6dfceecde3\") "
Nov 04 10:48:51 kw-vm-41 k3s[2572]: I1104 10:48:51.508226    2572 reconciler_common.go:172] "operationExecutor.UnmountVolume started for volume \"nfd-worker-conf\" (UniqueName: \"kubernetes.io/configmap/a053d981-3197-43a5-b217-3c6dfceecde3-nfd-worker-conf\") pod \"a053d981-3197-43a5-b217-3c6dfceecde3\" (UID: \"a053d981-3197-43a5-b217-3c6dfceecde3\") "
Nov 04 10:48:51 kw-vm-41 k3s[2572]: I1104 10:48:51.508244    2572 operation_generator.go:887] UnmountVolume.TearDown succeeded for volume "kubernetes.io/host-path/a053d981-3197-43a5-b217-3c6dfceecde3-host-lib" (OuterVolumeSpecName: "host-lib") pod "a053d981-3197-43a5-b217-3c6dfceecde3" (UID: "a053d981-3197-43a5-b217-3c6dfceecde3"). InnerVolumeSpecName "host-lib". PluginName "kubernetes.io/host-path", VolumeGidValue ""
Nov 04 10:48:51 kw-vm-41 k3s[2572]: I1104 10:48:51.508258    2572 reconciler_common.go:172] "operationExecutor.UnmountVolume started for volume \"source-d\" (UniqueName: \"kubernetes.io/host-path/a053d981-3197-43a5-b217-3c6dfceecde3-source-d\") pod \"a053d981-3197-43a5-b217-3c6dfceecde3\" (UID: \"a053d981-3197-43a5-b217-3c6dfceecde3\") "
Nov 04 10:48:51 kw-vm-41 k3s[2572]: I1104 10:48:51.508218    2572 operation_generator.go:887] UnmountVolume.TearDown succeeded for volume "kubernetes.io/host-path/a053d981-3197-43a5-b217-3c6dfceecde3-host-boot" (OuterVolumeSpecName: "host-boot") pod "a053d981-3197-43a5-b217-3c6dfceecde3" (UID: "a053d981-3197-43a5-b217-3c6dfceecde3"). InnerVolumeSpecName "host-boot". PluginName "kubernetes.io/host-path", VolumeGidValue ""
Nov 04 10:48:51 kw-vm-41 k3s[2572]: I1104 10:48:51.508257    2572 operation_generator.go:887] UnmountVolume.TearDown succeeded for volume "kubernetes.io/host-path/a053d981-3197-43a5-b217-3c6dfceecde3-host-sys" (OuterVolumeSpecName: "host-sys") pod "a053d981-3197-43a5-b217-3c6dfceecde3" (UID: "a053d981-3197-43a5-b217-3c6dfceecde3"). InnerVolumeSpecName "host-sys". PluginName "kubernetes.io/host-path", VolumeGidValue ""
Nov 04 10:48:51 kw-vm-41 k3s[2572]: I1104 10:48:51.508218    2572 operation_generator.go:887] UnmountVolume.TearDown succeeded for volume "kubernetes.io/host-path/a053d981-3197-43a5-b217-3c6dfceecde3-host-usr-lib" (OuterVolumeSpecName: "host-usr-lib") pod "a053d981-3197-43a5-b217-3c6dfceecde3" (UID: "a053d981-3197-43a5-b217-3c6dfceecde3"). InnerVolumeSpecName "host-usr-lib". PluginName "kubernetes.io/host-path", VolumeGidValue ""
Nov 04 10:48:51 kw-vm-41 k3s[2572]: I1104 10:48:51.508284    2572 reconciler_common.go:172] "operationExecutor.UnmountVolume started for volume \"features-d\" (UniqueName: \"kubernetes.io/host-path/a053d981-3197-43a5-b217-3c6dfceecde3-features-d\") pod \"a053d981-3197-43a5-b217-3c6dfceecde3\" (UID: \"a053d981-3197-43a5-b217-3c6dfceecde3\") "
Nov 04 10:48:51 kw-vm-41 k3s[2572]: I1104 10:48:51.508311    2572 operation_generator.go:887] UnmountVolume.TearDown succeeded for volume "kubernetes.io/host-path/a053d981-3197-43a5-b217-3c6dfceecde3-source-d" (OuterVolumeSpecName: "source-d") pod "a053d981-3197-43a5-b217-3c6dfceecde3" (UID: "a053d981-3197-43a5-b217-3c6dfceecde3"). InnerVolumeSpecName "source-d". PluginName "kubernetes.io/host-path", VolumeGidValue ""
Nov 04 10:48:51 kw-vm-41 k3s[2572]: I1104 10:48:51.508318    2572 operation_generator.go:887] UnmountVolume.TearDown succeeded for volume "kubernetes.io/host-path/a053d981-3197-43a5-b217-3c6dfceecde3-features-d" (OuterVolumeSpecName: "features-d") pod "a053d981-3197-43a5-b217-3c6dfceecde3" (UID: "a053d981-3197-43a5-b217-3c6dfceecde3"). InnerVolumeSpecName "features-d". PluginName "kubernetes.io/host-path", VolumeGidValue ""
Nov 04 10:48:51 kw-vm-41 k3s[2572]: I1104 10:48:51.508350    2572 operation_generator.go:887] UnmountVolume.TearDown succeeded for volume "kubernetes.io/host-path/a053d981-3197-43a5-b217-3c6dfceecde3-host-os-release" (OuterVolumeSpecName: "host-os-release") pod "a053d981-3197-43a5-b217-3c6dfceecde3" (UID: "a053d981-3197-43a5-b217-3c6dfceecde3"). InnerVolumeSpecName "host-os-release". PluginName "kubernetes.io/host-path", VolumeGidValue ""
Nov 04 10:48:51 kw-vm-41 k3s[2572]: I1104 10:48:51.508352    2572 reconciler_common.go:172] "operationExecutor.UnmountVolume started for volume \"host-os-release\" (UniqueName: \"kubernetes.io/host-path/a053d981-3197-43a5-b217-3c6dfceecde3-host-os-release\") pod \"a053d981-3197-43a5-b217-3c6dfceecde3\" (UID: \"a053d981-3197-43a5-b217-3c6dfceecde3\") "
Nov 04 10:48:51 kw-vm-41 k3s[2572]: I1104 10:48:51.508398    2572 reconciler_common.go:172] "operationExecutor.UnmountVolume started for volume \"kube-api-access-wh7dc\" (UniqueName: \"kubernetes.io/projected/a053d981-3197-43a5-b217-3c6dfceecde3-kube-api-access-wh7dc\") pod \"a053d981-3197-43a5-b217-3c6dfceecde3\" (UID: \"a053d981-3197-43a5-b217-3c6dfceecde3\") "
Nov 04 10:48:51 kw-vm-41 k3s[2572]: I1104 10:48:51.508485    2572 reconciler_common.go:300] "Volume detached for volume \"host-usr-lib\" (UniqueName: \"kubernetes.io/host-path/a053d981-3197-43a5-b217-3c6dfceecde3-host-usr-lib\") on node \"kw-vm-41\" DevicePath \"\""
Nov 04 10:48:51 kw-vm-41 k3s[2572]: I1104 10:48:51.508502    2572 reconciler_common.go:300] "Volume detached for volume \"host-boot\" (UniqueName: \"kubernetes.io/host-path/a053d981-3197-43a5-b217-3c6dfceecde3-host-boot\") on node \"kw-vm-41\" DevicePath \"\""
Nov 04 10:48:51 kw-vm-41 k3s[2572]: I1104 10:48:51.508516    2572 reconciler_common.go:300] "Volume detached for volume \"host-lib\" (UniqueName: \"kubernetes.io/host-path/a053d981-3197-43a5-b217-3c6dfceecde3-host-lib\") on node \"kw-vm-41\" DevicePath \"\""
Nov 04 10:48:51 kw-vm-41 k3s[2572]: I1104 10:48:51.508531    2572 reconciler_common.go:300] "Volume detached for volume \"host-sys\" (UniqueName: \"kubernetes.io/host-path/a053d981-3197-43a5-b217-3c6dfceecde3-host-sys\") on node \"kw-vm-41\" DevicePath \"\""
Nov 04 10:48:51 kw-vm-41 k3s[2572]: I1104 10:48:51.508546    2572 reconciler_common.go:300] "Volume detached for volume \"source-d\" (UniqueName: \"kubernetes.io/host-path/a053d981-3197-43a5-b217-3c6dfceecde3-source-d\") on node \"kw-vm-41\" DevicePath \"\""
Nov 04 10:48:51 kw-vm-41 k3s[2572]: I1104 10:48:51.508562    2572 reconciler_common.go:300] "Volume detached for volume \"host-os-release\" (UniqueName: \"kubernetes.io/host-path/a053d981-3197-43a5-b217-3c6dfceecde3-host-os-release\") on node \"kw-vm-41\" DevicePath \"\""
Nov 04 10:48:51 kw-vm-41 k3s[2572]: I1104 10:48:51.508577    2572 reconciler_common.go:300] "Volume detached for volume \"features-d\" (UniqueName: \"kubernetes.io/host-path/a053d981-3197-43a5-b217-3c6dfceecde3-features-d\") on node \"kw-vm-41\" DevicePath \"\""
Nov 04 10:48:51 kw-vm-41 k3s[2572]: I1104 10:48:51.508775    2572 operation_generator.go:887] UnmountVolume.TearDown succeeded for volume "kubernetes.io/configmap/a053d981-3197-43a5-b217-3c6dfceecde3-nfd-worker-conf" (OuterVolumeSpecName: "nfd-worker-conf") pod "a053d981-3197-43a5-b217-3c6dfceecde3" (UID: "a053d981-3197-43a5-b217-3c6dfceecde3"). InnerVolumeSpecName "nfd-worker-conf". PluginName "kubernetes.io/configmap", VolumeGidValue ""
Nov 04 10:48:51 kw-vm-41 k3s[2572]: I1104 10:48:51.513833    2572 operation_generator.go:887] UnmountVolume.TearDown succeeded for volume "kubernetes.io/projected/a053d981-3197-43a5-b217-3c6dfceecde3-kube-api-access-wh7dc" (OuterVolumeSpecName: "kube-api-access-wh7dc") pod "a053d981-3197-43a5-b217-3c6dfceecde3" (UID: "a053d981-3197-43a5-b217-3c6dfceecde3"). InnerVolumeSpecName "kube-api-access-wh7dc". PluginName "kubernetes.io/projected", VolumeGidValue ""
Nov 04 10:48:51 kw-vm-41 k3s[2572]: I1104 10:48:51.609161    2572 reconciler_common.go:258] "operationExecutor.VerifyControllerAttachedVolume started for volume \"source-d\" (UniqueName: \"kubernetes.io/host-path/558daa24-f416-43f0-aa1d-70cfbeca2b8f-source-d\") pod \"nvdp-node-feature-discovery-worker-x2tqf\" (UID: \"558daa24-f416-43f0-aa1d-70cfbeca2b8f\") " pod="nvidia-device-plugin/nvdp-node-feature-discovery-worker-x2tqf"
Nov 04 10:48:51 kw-vm-41 k3s[2572]: I1104 10:48:51.609226    2572 reconciler_common.go:258] "operationExecutor.VerifyControllerAttachedVolume started for volume \"features-d\" (UniqueName: \"kubernetes.io/host-path/558daa24-f416-43f0-aa1d-70cfbeca2b8f-features-d\") pod \"nvdp-node-feature-discovery-worker-x2tqf\" (UID: \"558daa24-f416-43f0-aa1d-70cfbeca2b8f\") " pod="nvidia-device-plugin/nvdp-node-feature-discovery-worker-x2tqf"
Nov 04 10:48:51 kw-vm-41 k3s[2572]: I1104 10:48:51.609266    2572 reconciler_common.go:258] "operationExecutor.VerifyControllerAttachedVolume started for volume \"kube-api-access-p997j\" (UniqueName: \"kubernetes.io/projected/558daa24-f416-43f0-aa1d-70cfbeca2b8f-kube-api-access-p997j\") pod \"nvdp-node-feature-discovery-worker-x2tqf\" (UID: \"558daa24-f416-43f0-aa1d-70cfbeca2b8f\") " pod="nvidia-device-plugin/nvdp-node-feature-discovery-worker-x2tqf"
Nov 04 10:48:51 kw-vm-41 k3s[2572]: I1104 10:48:51.609300    2572 reconciler_common.go:258] "operationExecutor.VerifyControllerAttachedVolume started for volume \"host-boot\" (UniqueName: \"kubernetes.io/host-path/558daa24-f416-43f0-aa1d-70cfbeca2b8f-host-boot\") pod \"nvdp-node-feature-discovery-worker-x2tqf\" (UID: \"558daa24-f416-43f0-aa1d-70cfbeca2b8f\") " pod="nvidia-device-plugin/nvdp-node-feature-discovery-worker-x2tqf"
Nov 04 10:48:51 kw-vm-41 k3s[2572]: I1104 10:48:51.609326    2572 reconciler_common.go:258] "operationExecutor.VerifyControllerAttachedVolume started for volume \"host-usr-lib\" (UniqueName: \"kubernetes.io/host-path/558daa24-f416-43f0-aa1d-70cfbeca2b8f-host-usr-lib\") pod \"nvdp-node-feature-discovery-worker-x2tqf\" (UID: \"558daa24-f416-43f0-aa1d-70cfbeca2b8f\") " pod="nvidia-device-plugin/nvdp-node-feature-discovery-worker-x2tqf"
Nov 04 10:48:51 kw-vm-41 k3s[2572]: I1104 10:48:51.609357    2572 reconciler_common.go:258] "operationExecutor.VerifyControllerAttachedVolume started for volume \"host-lib\" (UniqueName: \"kubernetes.io/host-path/558daa24-f416-43f0-aa1d-70cfbeca2b8f-host-lib\") pod \"nvdp-node-feature-discovery-worker-x2tqf\" (UID: \"558daa24-f416-43f0-aa1d-70cfbeca2b8f\") " pod="nvidia-device-plugin/nvdp-node-feature-discovery-worker-x2tqf"
Nov 04 10:48:51 kw-vm-41 k3s[2572]: I1104 10:48:51.609379    2572 reconciler_common.go:258] "operationExecutor.VerifyControllerAttachedVolume started for volume \"nfd-worker-conf\" (UniqueName: \"kubernetes.io/configmap/558daa24-f416-43f0-aa1d-70cfbeca2b8f-nfd-worker-conf\") pod \"nvdp-node-feature-discovery-worker-x2tqf\" (UID: \"558daa24-f416-43f0-aa1d-70cfbeca2b8f\") " pod="nvidia-device-plugin/nvdp-node-feature-discovery-worker-x2tqf"
Nov 04 10:48:51 kw-vm-41 k3s[2572]: I1104 10:48:51.609408    2572 reconciler_common.go:258] "operationExecutor.VerifyControllerAttachedVolume started for volume \"host-sys\" (UniqueName: \"kubernetes.io/host-path/558daa24-f416-43f0-aa1d-70cfbeca2b8f-host-sys\") pod \"nvdp-node-feature-discovery-worker-x2tqf\" (UID: \"558daa24-f416-43f0-aa1d-70cfbeca2b8f\") " pod="nvidia-device-plugin/nvdp-node-feature-discovery-worker-x2tqf"
Nov 04 10:48:51 kw-vm-41 k3s[2572]: I1104 10:48:51.609473    2572 reconciler_common.go:258] "operationExecutor.VerifyControllerAttachedVolume started for volume \"host-os-release\" (UniqueName: \"kubernetes.io/host-path/558daa24-f416-43f0-aa1d-70cfbeca2b8f-host-os-release\") pod \"nvdp-node-feature-discovery-worker-x2tqf\" (UID: \"558daa24-f416-43f0-aa1d-70cfbeca2b8f\") " pod="nvidia-device-plugin/nvdp-node-feature-discovery-worker-x2tqf"
Nov 04 10:48:51 kw-vm-41 k3s[2572]: I1104 10:48:51.609538    2572 reconciler_common.go:300] "Volume detached for volume \"kube-api-access-wh7dc\" (UniqueName: \"kubernetes.io/projected/a053d981-3197-43a5-b217-3c6dfceecde3-kube-api-access-wh7dc\") on node \"kw-vm-41\" DevicePath \"\""
Nov 04 10:48:51 kw-vm-41 k3s[2572]: I1104 10:48:51.609561    2572 reconciler_common.go:300] "Volume detached for volume \"nfd-worker-conf\" (UniqueName: \"kubernetes.io/configmap/a053d981-3197-43a5-b217-3c6dfceecde3-nfd-worker-conf\") on node \"kw-vm-41\" DevicePath \"\""
Nov 04 10:48:51 kw-vm-41 k3s[2572]: I1104 10:48:51.750822    2572 scope.go:117] "RemoveContainer" containerID="adf515c63256d096fc84128c781398d27b94fd25d694f5b3038574dede319599"
Nov 04 10:48:51 kw-vm-41 k3s[2572]: I1104 10:48:51.758256    2572 scope.go:117] "RemoveContainer" containerID="adf515c63256d096fc84128c781398d27b94fd25d694f5b3038574dede319599"
Nov 04 10:48:51 kw-vm-41 k3s[2572]: E1104 10:48:51.758708    2572 remote_runtime.go:432] "ContainerStatus from runtime service failed" err="rpc error: code = NotFound desc = an error occurred when try to find container \"adf515c63256d096fc84128c781398d27b94fd25d694f5b3038574dede319599\": not found" containerID="adf515c63256d096fc84128c781398d27b94fd25d694f5b3038574dede319599"
Nov 04 10:48:51 kw-vm-41 k3s[2572]: I1104 10:48:51.758763    2572 pod_container_deletor.go:53] "DeleteContainer returned error" containerID={"Type":"containerd","ID":"adf515c63256d096fc84128c781398d27b94fd25d694f5b3038574dede319599"} err="failed to get container status \"adf515c63256d096fc84128c781398d27b94fd25d694f5b3038574dede319599\": rpc error: code = NotFound desc = an error occurred when try to find container \"adf515c63256d096fc84128c781398d27b94fd25d694f5b3038574dede319599\": not found"
Nov 04 10:48:51 kw-vm-41 k3s[2572]: I1104 10:48:51.980409    2572 kubelet_volumes.go:163] "Cleaned up orphaned pod volumes dir" podUID="a053d981-3197-43a5-b217-3c6dfceecde3" path="/var/lib/kubelet/pods/a053d981-3197-43a5-b217-3c6dfceecde3/volumes"
Nov 04 10:48:52 kw-vm-41 k3s[2572]: I1104 10:48:52.770683    2572 pod_startup_latency_tracker.go:102] "Observed pod startup duration" pod="nvidia-device-plugin/nvdp-node-feature-discovery-worker-x2tqf" podStartSLOduration=1.770632596 podStartE2EDuration="1.770632596s" podCreationTimestamp="2024-11-04 10:48:51 +0000 UTC" firstStartedPulling="0001-01-01 00:00:00 +0000 UTC" lastFinishedPulling="0001-01-01 00:00:00 +0000 UTC" observedRunningTime="2024-11-04 10:48:52.770190495 +0000 UTC m=+1294.436076613" watchObservedRunningTime="2024-11-04 10:48:52.770632596 +0000 UTC m=+1294.436518696"

chilicat (Author) commented Nov 4, 2024

The issue looks similar to:

Our system also has the CrowdStrike falcon sensor installed (unit: falcon-sensor.service).

coredumpctl info
           PID: 77467 (ctr)
           UID: 0 (root)
           GID: 0 (root)
        Signal: 31 (SYS)
     Timestamp: Mon 2024-11-04 11:52:02 UTC (1s ago)
  Command Line: $'ctr ' "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" >
    Executable: /var/lib/rancher/k3s/data/9f4d91d896c15e475c3d62297a5940f714a93339400de6381dc3bb48257dc23a/bin/k3s
 Control Group: /system.slice/falcon-sensor.service
          Unit: falcon-sensor.service
         Slice: system.slice
       Boot ID: a2a65af598f54e49bfa2a4f8562c2f2a
    Machine ID: 7adec23794d04727a494d59506c84af3
      Hostname: kw-vm-41
       Storage: none
       Message: Process 77467 (ctr) of user 0 dumped core.
                
                Found module /var/lib/rancher/k3s/data/9f4d91d896c15e475c3d62297a5940f714a93339400de6381dc3bb48257dc23a/bin/k3s with build-id: a598e1516c2bdfab5d6fa75382622b6cf2d5379b
                Found module linux-vdso.so.1 with build-id: 8f8bf0dc8238c446732bdfaf260a4b2e48bfc7a3
                Stack trace of thread 77480:
                #0  0x0000000000407c4e n/a (/var/lib/rancher/k3s/data/9f4d91d896c15e475c3d62297a5940f714a93339400de6381dc3bb48257dc23a/bin/k3s + 0x7c4e)
                #1  0x000000000049e8ca n/a (/var/lib/rancher/k3s/data/9f4d91d896c15e475c3d62297a5940f714a93339400de6381dc3bb48257dc23a/bin/k3s + 0x9e8ca)
                #2  0x000000000049f2bd n/a (/var/lib/rancher/k3s/data/9f4d91d896c15e475c3d62297a5940f714a93339400de6381dc3bb48257dc23a/bin/k3s + 0x9f2bd)
                #3  0x0000000000549795 n/a (/var/lib/rancher/k3s/data/9f4d91d896c15e475c3d62297a5940f714a93339400de6381dc3bb48257dc23a/bin/k3s + 0x149795)
                #4  0x00000000005445d2 n/a (/var/lib/rancher/k3s/data/9f4d91d896c15e475c3d62297a5940f714a93339400de6381dc3bb48257dc23a/bin/k3s + 0x1445d2)
                #5  0x00000000018a550e n/a (/var/lib/rancher/k3s/data/9f4d91d896c15e475c3d62297a5940f714a93339400de6381dc3bb48257dc23a/bin/k3s + 0x14a550e)
                #6  0x00000000018a6ac6 n/a (/var/lib/rancher/k3s/data/9f4d91d896c15e475c3d62297a5940f714a93339400de6381dc3bb48257dc23a/bin/k3s + 0x14a6ac6)
                #7  0x00000000004cc2a7 n/a (/var/lib/rancher/k3s/data/9f4d91d896c15e475c3d62297a5940f714a93339400de6381dc3bb48257dc23a/bin/k3s + 0xcc2a7)
                #8  0x0000000000643716 n/a (/var/lib/rancher/k3s/data/9f4d91d896c15e475c3d62297a5940f714a93339400de6381dc3bb48257dc23a/bin/k3s + 0x243716)
                #9  0x00000000018a56e8 n/a (/var/lib/rancher/k3s/data/9f4d91d896c15e475c3d62297a5940f714a93339400de6381dc3bb48257dc23a/bin/k3s + 0x14a56e8)
                #10 0x00000000018db5bd n/a (/var/lib/rancher/k3s/data/9f4d91d896c15e475c3d62297a5940f714a93339400de6381dc3bb48257dc23a/bin/k3s + 0x14db5bd)
                #11 0x000000000192fea5 n/a (/var/lib/rancher/k3s/data/9f4d91d896c15e475c3d62297a5940f714a93339400de6381dc3bb48257dc23a/bin/k3s + 0x152fea5)
                #12 0x000000000192fb08 n/a (/var/lib/rancher/k3s/data/9f4d91d896c15e475c3d62297a5940f714a93339400de6381dc3bb48257dc23a/bin/k3s + 0x152fb08)
                #13 0x000000000192f34f n/a (/var/lib/rancher/k3s/data/9f4d91d896c15e475c3d62297a5940f714a93339400de6381dc3bb48257dc23a/bin/k3s + 0x152f34f)
                #14 0x000000000192d539 n/a (/var/lib/rancher/k3s/data/9f4d91d896c15e475c3d62297a5940f714a93339400de6381dc3bb48257dc23a/bin/k3s + 0x152d539)
                #15 0x0000000001928345 n/a (/var/lib/rancher/k3s/data/9f4d91d896c15e475c3d62297a5940f714a93339400de6381dc3bb48257dc23a/bin/k3s + 0x1528345)
                #16 0x00000000004802c1 n/a (/var/lib/rancher/k3s/data/9f4d91d896c15e475c3d62297a5940f714a93339400de6381dc3bb48257dc23a/bin/k3s + 0x802c1)

brandond (Member) commented Nov 4, 2024

It sounds like CrowdStrike is causing the ctr process to crash; ctr is the containerd client CLI. I have no idea why your anti-malware tool would be interfering with it, but this sounds like something to raise with CrowdStrike. I don't see that there's anything wrong with K3s itself.

@brandond brandond closed this as completed Nov 4, 2024
chilicat (Author) commented Nov 5, 2024

The issue occurs with:

  • v1.29.10
  • v1.30.6

The issue is not reproducible with:

  • v1.29.9 or earlier

@chilicat chilicat changed the title K3s 1.29.10 produces a lot of core dumps K3s v1.29.10 and v1.30.8 produces a lot of core dumps if CrowdStrike falcon-sensor is installed Nov 5, 2024
chilicat (Author) commented Nov 5, 2024

I just updated the title and added more information in case somebody runs into the same issue.

  • CrowdStrike falcon-sensor does not crash a running pod.
  • However, it seems to produce the crash dump when Kubernetes removes/deletes a pod. So something in the shutdown process appears to have changed between v1.29.8 and v1.29.10 that CrowdStrike does not like.

brandond (Member) commented Nov 5, 2024

We've bumped containerd regularly for the last couple releases: https://docs.k3s.io/release-notes/v1.29.X

I would encourage you to take this up with CrowdStrike.

chilicat (Author) commented Nov 5, 2024

Thanks @brandond, I will try to raise it with CrowdStrike.
If you don't mind, I will still update this ticket to collect information; it might be helpful for others running into the same issue.

brandond (Member) commented Nov 5, 2024

I think you can feed Go the core dump and the binary and it will give you an actual stack trace? I'm not sure specifically how to do that.

It would be interesting to know if something else is running ctr and falcon is making it crash, or if falcon itself is running ctr and ctr is crashing on its own when some event occurs (like a pod being terminated). You might see if you can figure that out? If it's the latter and we can reproduce the crash without falcon involved, that would be something we could fix.
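The "who is launching ctr" question can be probed with standard procps tools: poll for ctr processes while deleting a pod and log each one's parent command. A rough sketch (the polling window is illustrative; requires procps-style ps and pgrep):

```shell
#!/bin/sh
# Print the command name of the parent of a given PID.
parent_comm() {
  ppid=$(ps -o ppid= -p "$1" | tr -d ' ')
  ps -o comm= -p "$ppid"
}

# Poll for ~10 seconds; in another terminal run: kubectl delete po --all -A
# Each hit shows whether containerd/k3s or falcon-sensor spawned ctr.
for i in $(seq 1 50); do
  for pid in $(pgrep -x ctr 2>/dev/null); do
    echo "ctr pid=$pid parent=$(parent_comm "$pid")"
  done
  sleep 0.2
done
```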

chilicat (Author) commented Nov 5, 2024

I have not worked with core dumps so far; I might give it a try later.

I suspect that Falcon is suspicious of the containerd fixes around containerd/containerd#10589 (noting it here for my reference).
