Kube-proxy doesn't remove stale CNI-HOSTPORT-DNAT rule after Kubernetes upgrade to 1.26 #3440
Comments
Thanks for reporting this issue. I'll try to reproduce it. Can you confirm that after doing the upgrade the old nginx pod is removed?
Thanks @manuelbuil, yes it is: the old pod is removed and a new one is started, but the entry for the old one is still there.
I am not able to reproduce the issue. Could it be that you did a ...?
@manuelbuil did you use RHEL 8? I was not able to reproduce it on Oracle Linux 9 or on Ubuntu.
Let me try again with something that might be the reason. By the way, the upgrade you are trying is not really supported, because you are jumping several RKE versions and several Kubernetes minor versions: https://www.suse.com/support/kb/doc/?id=000020061. We must update the docs to advise against these big jumps.
@manuelbuil, good point. I also tried upgrading one version at a time, but it happened as well (source version 1.23).
Here is the pod removal and re-creation:
Unfortunately, I don't have access to a RHEL 8 machine right now, but I can't reproduce it on RHEL 9. I tried changing a couple of things but I still see the pod being removed correctly. Could you try with a newer RKE version and see if you still get the error?
Thanks @manuelbuil. On RHEL 9 I was not able to reproduce it either. I just tested RKE 1.4.11 and it still doesn't work. I then tested with Calico and the issue is not present, so I suspect it could be related to flannel.
I have just tried on RHEL 8 and still can't reproduce it. I think it must be something in your environment. Could you check if the old IP is still in the directory where the CNI IPAM plugin stores its allocated IPs?
Are you able to reproduce it consistently, or did it just happen once?
I vaguely remember an issue with very old flannel versions in which a race condition sometimes made flannel "forget" to clean up pods correctly. The IPs stayed in that directory, the node eventually exhausted all of its IPs, and it was unable to create new pods. Maybe you are hitting that, since ...
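As a reference for that check, here is a minimal sketch of how leftover host-local IPAM reservations can be inspected. It assumes flannel delegates to the host-local IPAM plugin with its state under /var/lib/cni/networks/cbr0; that path is an assumption, since the directory follows the CNI network name and may differ on a given RKE setup.

```bash
# Assumed path: flannel usually delegates IP assignment to the host-local
# IPAM plugin, which keeps one file per reserved IP under this directory.
IPAM_DIR=/var/lib/cni/networks/cbr0

ls -l "$IPAM_DIR"

# Each IP-named file contains the ID of the container that owns the address,
# so stale reservations can be matched against the containers that still exist.
# (Bookkeeping files such as last_reserved_ip.* and lock will also show up.)
for f in "$IPAM_DIR"/*; do
  printf '%s -> %s\n' "$(basename "$f")" "$(head -n 1 "$f")"
done
```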
Thanks @manuelbuil, I just checked but the file is not there. This is the result:

After the fresh install there is just one entry in the chain.

After the upgrade the old IP is no longer there and the new one is in place, but now there are two entries in iptables, which is causing the issue.

Thanks for your help! Is there anything else you suggest I check? I also tried changing the flannel version from the very beginning by using:

    system_images:
      flannel: rancher/mirrored-flannel-flannel:v0.23.0

but it didn't help.
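For readers following along, the duplicated hostport rules can be dumped directly with plain iptables (nothing RKE-specific assumed); a sketch:

```bash
# CNI-HOSTPORT-DNAT holds one jump per container with a hostPort; the comment
# on each rule normally carries the container ID it was created for.
iptables -t nat -S CNI-HOSTPORT-DNAT

# Each jump targets a per-container CNI-DN-* chain that does the actual DNAT
# to the pod IP, so a stale pod IP shows up here.
iptables -t nat -S | grep '^-A CNI-DN-'
```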
I'm able to reproduce it consistently, unfortunately. Consider that I spawn a new VM every time, so the machine is brand new. Moreover, even if I delete the pod, only the last entry is updated; the first one remains there unless I reboot or remove it manually.
Strange; for some reason your portmap CNI plugin is not doing its job correctly when deleting the pod on your RHEL 8.6 OS (by the way, I tried with 8.7 and it worked for me, could you perhaps upgrade?). The portmap code outputs some error logs (https://github.com/containernetworking/plugins/blob/main/plugins/meta/portmap/portmap.go), so could you check the kubelet logs to see if there is a related error there?
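A sketch of how that search could be done, assuming the RKE-managed kubelet runs as a Docker container named kubelet (which the later comments about the "kubelet container" suggest):

```bash
# portmap is executed by the CNI machinery on pod setup/teardown, and its
# errors surface in the kubelet log stream; grep for plugin- and hostport-
# related messages.
docker logs kubelet 2>&1 | grep -iE 'portmap|hostport|cni' | tail -n 100
```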
Hi @manuelbuil, I checked the kubelet logs (v=6) and there are no error logs from portmap... I also moved to Oracle Linux 8.7 just to be sure and it's happening there as well... Could it be related to any sysctl parameter? Or is there a way to get all the logs from portmap? Thanks!
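Not confirmed as relevant to this bug, but the bridge-netfilter sysctls are the usual ones to rule out when pod traffic and iptables interact badly; a quick check:

```bash
# Kubernetes expects bridged pod traffic to be visible to iptables; both
# values should normally be 1 (br_netfilter must be loaded for them to exist).
lsmod | grep br_netfilter
sysctl net.bridge.bridge-nf-call-iptables net.bridge.bridge-nf-call-ip6tables
```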
Could you check the version of ...?
Yup, I already searched for that string but there are no results in the newly spawned kubelet container.

About versions:

Before:

After:
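For anyone comparing environments, a sketch of how the versions in play can be read off a node (assuming RKE's usual layout with CNI binaries under /opt/cni/bin; the path and image-name filters are assumptions):

```bash
# Image tags of the flannel / CNI-related containers RKE deployed:
docker ps -a --format '{{.Image}}' | grep -Ei 'flannel|cni' | sort -u

# CNI spec versions supported by the portmap binary installed on the host:
CNI_COMMAND=VERSION /opt/cni/bin/portmap
```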
Does it also happen when you try with newer RKE versions? Just wondering if it only happens with that specific flannel version.
I added some debug messages to portmap and I saw some interesting things.

On fresh install:

On upgrade (new kubelet container):

But this is not a known container; in fact, the new one is:

So it looks like the delete is called on a container ID that I do not see at all (running ...).
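A sketch of that cross-check, with <unknown-id> standing in for the container ID seen in the portmap DEL call (a placeholder, since the real value is not shown here):

```bash
# Search all containers, including exited ones, using full (untruncated) IDs;
# no output means Docker never created a container with that ID.
docker ps -a --no-trunc | grep '<unknown-id>'

# The pause/sandbox containers own the pod network namespace, so their IDs are
# the ones the CNI plugins normally see; list them for comparison (label as
# used by dockershim/cri-dockerd).
docker ps -a --no-trunc --filter 'label=io.kubernetes.docker.type=podsandbox' \
  --format '{{.ID}} {{.Names}}'
```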
Thanks for the logs. However, flannel got the correct ID, because the IP is gone. Strange. And if you query the Docker pods, do you see that strange ID anywhere?
Not at all. I tried a couple of times and the ID it reports doesn't exist in Docker.
RKE version: v1.4.6
Docker version (docker version, docker info preferred): 20.10.24
Operating system and kernel (cat /etc/os-release, uname -r preferred):
NAME="Red Hat Enterprise Linux"
VERSION="8.6 (Ootpa)"
Type/provider of hosts (VirtualBox/Bare-metal/AWS/GCE/DO): Openstack
cluster.yml file:
Steps to Reproduce (upgrade driven as sketched below):
Source versions -> rke: v1.4.3 - kubernetes_version: v1.23.10-rancher1-1
Dest versions -> rke: v1.4.6 - kubernetes_version: v1.26.4-rancher2-1
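For completeness, a sketch of the upgrade step itself, assuming the standard RKE workflow (the actual cluster.yml contents are not shown above):

```bash
# Bump the target version in cluster.yml, e.g.
#   kubernetes_version: "v1.26.4-rancher2-1"
# then re-run RKE with the same state files from the original provisioning run.
rke up --config cluster.yml
```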
While upgrading Kubernetes with RKE from v1.23.10 to v1.26.4, I was no longer able to reach my ingresses through the nginx-ingress-controller, which listens on hostPort 80 and 443. I investigated further and found that the CNI-HOSTPORT-DNAT chain still had the old entry.

Before the upgrade:
This looks good.
After the upgrade:
The second entry is the right one, which points to the new pod IP, but the first one should not be there.
It looks like kube-proxy doesn't delete the old entry, making it impossible to access the ingresses. As a workaround I had to reboot the server or delete the entry manually:

iptables -t nat -D CNI-HOSTPORT-DNAT 1
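To make the manual cleanup a bit safer, one can list the chain with rule numbers first (each rule's comment typically carries the container ID it was created for) and then confirm the hostPort is reachable again; a sketch, with <node-ip> as a placeholder:

```bash
# Identify the stale jump by its position and container-ID comment,
# then remove it by number (1 in the report above).
iptables -t nat -L CNI-HOSTPORT-DNAT -n --line-numbers
iptables -t nat -D CNI-HOSTPORT-DNAT 1

# Confirm the ingress controller answers on the hostPort again.
curl -sk -o /dev/null -w '%{http_code}\n' http://<node-ip>/
```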
Results:
After the upgrade, clients can't talk to the ingress controller listening on the hostPort.