Environmental Info:
K3s Version:
k3s version v1.31.1+k3s1 (452dbbc)
go version go1.22.6
Node(s) CPU architecture, OS, and Version:
Linux consolidated-ch-3 5.15.0-125-generic #135-Ubuntu SMP Fri Sep 27 13:53:58 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Cluster Configuration:
3 masters in HA mode, 3 agents
Describe the bug:
The pod launches a script that connects to a DB (on the intranet, outside the k3s cluster) and runs a script against it. It starts executing, tries to connect to the DB within the very first second of the pod's life, and fails with an error:
[2024-12-19, 17:29:26 UTC] {pod_manager.py:490} INFO - [base] clickhouse_driver.errors.NetworkError: Code: 210. No route to host (10.15.21.167:9000)
After debugging I discovered that the IP is unreachable only during the first second of the pod's life; after that it becomes accessible. I have written a demo that reproduces the issue on my cluster:
apiVersion: batch/v1
kind: Job
metadata:
  name: if-ip-accessible-job
  namespace: demo
spec:
  template:
    spec:
      containers:
        - name: master
          image: "docker.io/busybox:1.36"
          env:
            - name: IP_ADDRESS
              value: "10.15.21.167"
            - name: ATTEMPTS
              value: "3"
          command:
            - /bin/sh
            - "-c"
            - |-
              is_failed=0
              cnt=0
              echo "If ${IP_ADDRESS} accessible ..."
              # Try a number of times, but report failure at the end if any attempt failed
              while [ ${cnt} -le ${ATTEMPTS} ]
              do
                echo "$( date -u '+%FT%T' ) Attempt #${cnt} ..."
                ping -c 1 ${IP_ADDRESS}
                if [ $? -ne 0 ]; then
                  echo "FAILED attempt!"
                  is_failed=1
                else
                  echo "SUCCESS attempt!"
                  break
                fi
                true $(( cnt++ ))
              done
              if [ ${is_failed} -ne 0 ]; then
                echo "FAILED!"
                exit 1
              fi
              echo "SUCCESS!"
              exit 0
      restartPolicy: Never
  backoffLimit: 5
It produces log output like this:
If 10.15.21.167 accessible ...
2024-12-20T13:35:09 Attempt #0 ...
PING 10.15.21.167 (10.15.21.167): 56 data bytes
--- 10.15.21.167 ping statistics ---
1 packets transmitted, 0 packets received, 100% packet loss
FAILED attempt!
2024-12-20T13:35:19 Attempt #1 ...
PING 10.15.21.167 (10.15.21.167): 56 data bytes
64 bytes from 10.15.21.167: seq=0 ttl=63 time=0.273 ms
--- 10.15.21.167 ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max = 0.273/0.273/0.273 ms
SUCCESS attempt!
FAILED!
Unfortunately I am unable to get any relevant info from the logs on the worker node with sudo journalctl -u k3s-agent -n 100 -f:
...
Dec 20 13:35:09 consolidated-ch-3 k3s[2841]: I1220 13:35:09.029542 2841 reconciler_common.go:245] "operationExecutor.VerifyControllerAttachedVolume started for volume \"kube-api-access-s2xqq\" (UniqueName: \"kubernetes.io/projected/7761d314-8130-44de-8456-5036d4b60b0a-kube-api-access-s2xqq\") pod \"if-ip-accessible-job-fc9qx\" (UID: \"7761d314-8130-44de-8456-5036d4b60b0a\") " pod="demo/if-ip-accessible-job-fc9qx"
Dec 20 13:35:09 consolidated-ch-3 k3s[2841]: I1220 13:35:09.933052 2841 pod_startup_latency_tracker.go:104] "Observed pod startup duration" pod="demo/if-ip-accessible-job-fc9qx" podStartSLOduration=1.93301954 podStartE2EDuration="1.93301954s" podCreationTimestamp="2024-12-20 13:35:08 +0000 UTC" firstStartedPulling="0001-01-01 00:00:00 +0000 UTC" lastFinishedPulling="0001-01-01 00:00:00 +0000 UTC" observedRunningTime="2024-12-20 13:35:09.928411281 +0000 UTC m=+2744912.682909355" watchObservedRunningTime="2024-12-20 13:35:09.93301954 +0000 UTC m=+2744912.687517594"
Dec 20 13:35:21 consolidated-ch-3 k3s[2841]: I1220 13:35:21.260561 2841 reconciler_common.go:159] "operationExecutor.UnmountVolume started for volume \"kube-api-access-s2xqq\" (UniqueName: \"kubernetes.io/projected/7761d314-8130-44de-8456-5036d4b60b0a-kube-api-access-s2xqq\") pod \"7761d314-8130-44de-8456-5036d4b60b0a\" (UID: \"7761d314-8130-44de-8456-5036d4b60b0a\") "
Dec 20 13:35:21 consolidated-ch-3 k3s[2841]: I1220 13:35:21.263130 2841 operation_generator.go:803] UnmountVolume.TearDown succeeded for volume "kubernetes.io/projected/7761d314-8130-44de-8456-5036d4b60b0a-kube-api-access-s2xqq" (OuterVolumeSpecName: "kube-api-access-s2xqq") pod "7761d314-8130-44de-8456-5036d4b60b0a" (UID: "7761d314-8130-44de-8456-5036d4b60b0a"). InnerVolumeSpecName "kube-api-access-s2xqq". PluginName "kubernetes.io/projected", VolumeGidValue ""
Dec 20 13:35:21 consolidated-ch-3 k3s[2841]: I1220 13:35:21.361700 2841 reconciler_common.go:288] "Volume detached for volume \"kube-api-access-s2xqq\" (UniqueName: \"kubernetes.io/projected/7761d314-8130-44de-8456-5036d4b60b0a-kube-api-access-s2xqq\") on node \"consolidated-ch-3\" DevicePath \"\""
...
Steps To Reproduce:
Execute the example job above.
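For example (assuming the manifest above is saved as if-ip-accessible-job.yaml; the filename is a placeholder):
kubectl create namespace demo
kubectl apply -f if-ip-accessible-job.yaml
kubectl logs -n demo job/if-ip-accessible-job
Re-running the test requires deleting the Job first (kubectl delete -n demo job if-ip-accessible-job), since Job names must be unique.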
The cluster is installed with the standard k3s script via an Ansible playbook. On the initial master the k3s_token is generated and passed as a parameter to the other masters and agents. The flannel_iface is set on each node because I have two interfaces, one for the intranet and one for external IPs; we use the internal interfaces for the whole setup (a sketch of the resulting per-node config is shown after this section).
Expected behavior:
The IP should be reachable the moment the pod starts executing; any error should be reflected in the logs.
Actual behavior:
The IP is not reachable during the first second of the pod's execution.
Additional context / logs:
Nothing to add; will collect and add on request.
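A rough sketch of the per-node agent configuration implied by the setup described under Steps To Reproduce, in /etc/rancher/k3s/config.yaml; the interface name and addresses are placeholders, not values from this report:
# /etc/rancher/k3s/config.yaml on an agent node (illustrative sketch)
server: https://<internal IP of a server node>:6443
token: <k3s_token generated on the initial master>
flannel-iface: <name of the intranet interface>
node-ip: <internal IP of this node>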
Pods must be resilient against being started up with different network connectivity than expected. If you need to make sure the pod can reach certain destinations before being started, you can use an init container to wait for those destinations to be reachable before kubelet starts the app containers.
tl;dr don't assume that pod networking has settled just because the pod is running.
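As a minimal sketch of that suggestion applied to the demo job above (the busybox image, the one-second ping wait, and the 60-attempt cap are illustrative assumptions, not part of the original report):
apiVersion: batch/v1
kind: Job
metadata:
  name: if-ip-accessible-job
  namespace: demo
spec:
  template:
    spec:
      initContainers:
        # The init container must exit successfully before kubelet starts the app
        # container, so the workload never sees the not-yet-settled pod network.
        - name: wait-for-db
          image: "docker.io/busybox:1.36"
          env:
            - name: IP_ADDRESS
              value: "10.15.21.167"
          command:
            - /bin/sh
            - "-c"
            - |-
              i=0
              # Retry roughly once per second; give up after 60 attempts.
              until ping -c 1 -W 1 ${IP_ADDRESS} > /dev/null 2>&1
              do
                i=$(( i + 1 ))
                if [ ${i} -ge 60 ]; then
                  echo "${IP_ADDRESS} never became reachable"
                  exit 1
                fi
                sleep 1
              done
              echo "${IP_ADDRESS} is reachable"
      containers:
        - name: master
          image: "docker.io/busybox:1.36"
          command: ["/bin/sh", "-c", "echo 'Network is up; run the real workload here.'"]
      restartPolicy: Never
  backoffLimit: 5
A TCP check against the ClickHouse port (9000) instead of ICMP would match the application's actual dependency more closely, but the structure is the same: the app container only starts once the init container has exited successfully.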