'kube-node-drainer' does not work with 'kube2iam' in 0.9.9 #1105
I rebuilt the cluster without kube2iam and the node drainer worked. (The HTTP status code check on http://169.254.169.254/latest/meta-data/spot/termination-time didn't seem to work. But I guess that is only for spot fleet instances?) Is it possible that kube2iam is interfering with those metadata requests? Log from the node drainer on a terminating node:
|
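For reference, a minimal way to check that endpoint by hand (assuming curl is available on the instance; the spot/termination-time path only returns a timestamp once an interruption notice has been issued, and 404 otherwise):

```sh
# Returns 404 until a spot interruption notice is issued; then 200 plus a
# UTC timestamp. On non-spot instances the path is simply absent (404).
curl -s -o /dev/null -w '%{http_code}\n' \
  http://169.254.169.254/latest/meta-data/spot/termination-time
```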
I confirmed above that it all works fine when kube2iam is not installed. I guess we need to create and annotate IAM roles for the kube-system components. Log from the node drainer:
|
We're seeing the same issue with the nodeDrainer, thanks for the detailed report @whereisaaron - it helped us zero in on it. |
Thanks for looking into it @tyrannasaurusbanks! I believe the same issue applies to kube-resources-autosave. @mumoshu said there "...we should manage an IAM role dedicated to kube-resources-autosave which is discovered and assumed to be used by the autosave app via the kube2iam annotation." So I think we need to add IAM roles to the CloudFormation stack template for each of these system components. Also, if you, like me, like to use the kube2iam namespace restrictions, the kube-system namespace annotations will presumably need to allow those roles too. |
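As a sketch of what that per-component annotation could look like (the role name kube-node-drainer-role is hypothetical, and the role's trust policy would need to allow the node instance profile to assume it; kube2iam reads the iam.amazonaws.com/role annotation from the pod, so it goes on the pod template):

```sh
# Hypothetical role name; patch the DaemonSet's pod template so kube2iam can
# match the drainer pods to an IAM role.
kubectl -n kube-system patch daemonset kube-node-drainer-ds -p \
  '{"spec":{"template":{"metadata":{"annotations":{"iam.amazonaws.com/role":"kube-node-drainer-role"}}}}}'
```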
We've also run into this, and disabling kube2iam indeed removed the error in node-drainer-asg-status. Going forward though we would like kube2iam on, so it would be good to get to the bottom of this. |
cheers for the crash course in kube2iam @whereisaaron, will get cracking! |
Hey sorry @tyrannasaurusbanks, you guys are probably the kube2iam experts here and know all this already! |
hah! No we're the opposite - total kube2iam newbs - your proposed changes make sense to me, let's review them in the PR i'm hoping to push today. |
Thanks everyone - I'm now suspecting this may be due to slow responses from kube2iam? Update: fixed wrong link |
@mumoshu it is not a timing thing, is it? It looks like it might be. If it is timing somehow, you could also consider switching to kiam. |
@whereisaaron AFAIK, kube2iam is configured with the defaults there. And yes, kiam would be a good alternative anyway. Also see #1055! |
Does the node drainer boot before kube2iam? I’ve found several applications that boot before it get stuck like this, perhaps in some startup race condition. I filed an issue a while ago about the same thing. I thought perhaps always starting kube2iam first would solve it but now I’m thinking there are some additional problems. |
@c-knowles Thx! Possible. Would recreating pods fix the issue in that case? |
@c-knowles @mumoshu I think we should work out some proper IAM roles for the kube-system components. It would be ideal to be able to enable namespace role filters, to limit the roles that pods can assume in different namespaces. But I am a little unsure about how to identify the minimum roles needed for each component. |
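For context, kube2iam's namespace restriction works by starting it with --namespace-restrictions and annotating each namespace with the roles its pods may assume; a sketch (the account ID and role name are placeholders):

```sh
# Only takes effect when kube2iam runs with --namespace-restrictions.
# The value is a JSON array of role names/ARNs (glob patterns are allowed).
kubectl annotate namespace kube-system \
  iam.amazonaws.com/allowed-roles='["arn:aws:iam::123456789012:role/kube-node-drainer-role"]'
```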
@mumoshu unfortunately not, terminating the pod just means the new one goes into CrashLoopBackOff as well. It seems the cause is some 502s:
I can confirm though that at least one node in that same cluster has a healthy kube-node-drainer pod. On the healthy node, I've exec-ed into the pod and done this:
Seems like some problem with wget + kube2iam? |
@c-knowles Seems like you have a different problem than the one originally reported by this issue.
Probably. (Un)fortunately, my colleague @cw-sakamoto (thx!) spotted that yours can be fixed by replacing the wget invocation used by the drainer. Probably a related issue in wget in busybox: http://svn.dd-wrt.com/ticket/5771 |
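A sketch of the kind of comparison that points at the busybox wget behaviour (assuming an image that has both busybox wget and curl available; the pod name is taken from the log command in the original report):

```sh
# Compare how the two clients report the same metadata request: wget only
# exposes an exit code here, while curl -w prints the exact HTTP status,
# which makes the intermittent 502s visible.
kubectl -n kube-system exec kube-node-drainer-ds-289hr -- \
  sh -c 'wget -q -O /dev/null http://169.254.169.254/latest/meta-data/instance-id; echo "wget exit code: $?"'
kubectl -n kube-system exec kube-node-drainer-ds-289hr -- \
  sh -c 'curl -s -o /dev/null -w "HTTP status: %{http_code}\n" http://169.254.169.254/latest/meta-data/instance-id'
```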
Ah ok, I will file another issue for that then. |
@c-knowles Strange, as we've seen the same issues too. |
@kiich Hi, thanks for the info and your work! Just curious, but was the exact error you've seen in your cluster the same 502s? However, what @whereisaaron encountered seems different. The log seems to indicate that the drainer kept receiving 404 responses from the AWS API right up until the connection dropped. |
@mumoshu Hi! We did find that kube2iam was OOMing/restarting a lot, so restarting the node made it all work (apologies, I said in my previous post that restarting the pod made it work, but it was the node instead). We never dived in too deep to find the cause, but rather turned kube2iam off and are looking at alternatives for now. |
Though one thing to point out, as it just came to me, is that the hangs seemed to happen after the kube2iam pod had been restarted. Not sure if that is of any help though! |
@kiich Thanks for the clarification! As far as I have read the kube2iam code, it doesn't survive node restarts: the iptables rule added by kube2iam doesn't get updated after the kube2iam pod on the node is recreated with a different pod IP. That could be the cause of your hanging issue. |
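To see what is being described here, the rule can be inspected directly on a node; a minimal sketch (port 8181 and the exact rule shape reflect kube2iam defaults and may differ):

```sh
# Lists the NAT PREROUTING rules that intercept metadata traffic; the
# --to-destination address shows which pod IP the traffic is being sent to.
sudo iptables -t nat -S PREROUTING | grep 169.254.169.254
```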
@mumoshu Thanks for checking! Yeah, that's the conclusion we've come to as well. I'm quite interested in looking into kiam as an alternative. |
Note: Considering jtblin/kube2iam#9 and jtblin/kube2iam#126, probably what we'd need at best would be a dedicated pod to monitor the kube2iam pod and add/remove/update the iptables rule accordingly. |
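A very rough sketch of what such a watcher might do (this is not anything that exists in kube-aws or kube2iam; the label selector, the bare rule shape and port 8181 are all assumptions):

```sh
#!/bin/sh
# Re-point the metadata DNAT rule at the current kube2iam pod IP whenever it changes.
NODE_NAME=$(hostname)
while true; do
  POD_IP=$(kubectl -n kube-system get pods -l app=kube2iam \
    --field-selector spec.nodeName="$NODE_NAME" \
    -o jsonpath='{.items[0].status.podIP}')
  CURRENT=$(iptables -t nat -S PREROUTING | grep 169.254.169.254 \
    | grep -o 'to-destination [0-9.]*' | awk '{print $2}')
  if [ -n "$POD_IP" ] && [ "$CURRENT" != "$POD_IP" ]; then
    # Drop any stale rule, then install one pointing at the current pod IP.
    iptables -t nat -S PREROUTING | grep 169.254.169.254 | sed 's/^-A/-D/' \
      | xargs -r -L1 iptables -t nat
    iptables -t nat -A PREROUTING -d 169.254.169.254/32 -p tcp --dport 80 \
      -j DNAT --to-destination "$POD_IP:8181"
  fi
  sleep 10
done
```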
@kiich Thanks for confirming! Glad it wasn't only me who came to the conclusion. |
@pingles Hi, are you aware of this issue? In a nutshell, managing iptables from within kube2iam seems like an incomplete solution; we need an external process to monitor it so that the corresponding iptables rule can be added/removed/updated accordingly. No missing route, no compromise. Is this solved somehow in the kiam community? |
@mumoshu kiam has some deferred removal code but I don't think it will solve most of these issues. I didn't spot anything related to monitoring yet. Both kube2iam and kiam run in host network mode, so the pod IP should always match the node IP and hence it survives restarts. Both projects use the coreos wrapper for iptables and call methods which check for duplicates. |
@c-knowles it's not something I was aware of. Do you mean there's a situation where that rule removal doesn't apply/work, or that it's not sufficient for all cases? |
@c-knowles Thx for the correction! @kiich Could you confirm that your kube2iam pod does report node ip as pod ip? |
@mumoshu Hi, it sure does!
whereas other pods report hostIP and podIP differently. |
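One way to see that at a glance (the app=kube2iam label selector is an assumption; adjust it to the labels in the actual manifest):

```sh
# For a hostNetwork pod like kube2iam, HOST_IP and POD_IP should be identical.
kubectl -n kube-system get pods -l app=kube2iam \
  -o custom-columns=NAME:.metadata.name,HOST_IP:.status.hostIP,POD_IP:.status.podIP
```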
@mumoshu no worries- be glad to try some tests to see whether we're susceptible to the same issues that you're hitting. Trying to think up something similar without using kube-aws and node-drainer. |
@pingles the problem experienced by me and others is that if Pods that need access to an IAM role start before kube2iam is ready, they never manage to obtain credentials. Q1) With kiam, how are credential requests handled for pods that start before the server has them in its cache? Q2) With kiam, what happens to pods that start before the agent has installed its iptables rule? The answers should tell us whether kiam avoids the startup race we are hitting here. |
@whereisaaron interesting, I'll try and get some time tomorrow to try and test it. I'll try and answer your questions though.

Q1) kiam reuses a lot of the client-go cache/informer code. We also broke the system into 2 processes: an agent which runs on all nodes running user workloads that should have their metadata api requests intercepted, and a server which runs on a subset of nodes (we run these on our masters that don't run other user-facing things). The server process is the only one that maintains the caches from the k8s api server and the only process that talks to aws. These together mean that, in general, there's less movement around the server processes than the agents - so by the time an agent starts there's already likely a server process in place. We also do a lot of retries + backoffs when performing rpc against the server to handle when pods aren't yet stored in the cache. This server process is super important to let us prefetch quickly, and to maintain a high cache hit rate (as well as reducing the number of AWS API calls we make). However, all this relies on the timeouts of the sdk clients being something sensible. The agent http server generally has an unlimited timeout up to when the client disconnects, so if the pod can't be found it'll keep retrying until it's found or the client disconnects. We've noticed some client libraries behaving more strictly than others but in general it's been ok - we encourage teams to ensure they have retries and backoffs around their aws operations that would be requesting credentials so they recover nicely also.

Q2) the agent is what installs itself with iptables and uses a prerouting rule to capture stuff heading out of a particular interface destined for 169.254.169.254. I'm not 100% certain of what happens should a pod be started before the kiam agent. My guess is that it'd end up trying to talk to the AWS metadata api and fail to retrieve credentials (our nodes have almost 0 IAM permissions), causing the process to exit and be restarted - in effect it wouldn't succeed until kiam was started.

Hope that helps, happy to answer more here, on the Kubernetes slack (I'm @pingles there too) or you can email me (my GitHub profile has the address). |
Thanks for the extensive explanation @pingles! It sounds like using the 'informer' pattern is a smart move that should address the problems we are seeing with kube2iam. Regarding the agent/server model, that sounds good too, as agent restarts, or reboots of nodes with agents, don't lose any credentials state. I wondered, how do the server replicas manage their state and agent access? Are they active/active load balanced, or a controller election approach? I was hoping the former, both for scaling the number of agents and because I was thinking it would be better for at least two servers to maintain cached credentials, so that the loss of one server wouldn't cause lots of latency across the whole cluster while the credentials are repopulated into the cache. As for the clients, they ought to be delay tolerant, either with back-off retries, or else failing health checks or exiting so the scheduler can handle that for them. Pretty sure our 'node-drainer' tries to fetch credentials every cycle, so there should be no problem there. Containers that get wedged because the first attempt wasn't answered fast enough shouldn't apply 😄 |
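A minimal sketch of that kind of delay-tolerant client behaviour (the role name is hypothetical; the security-credentials path is the standard metadata endpoint that kube2iam/kiam intercept):

```sh
# Retry with exponential back-off until the metadata proxy serves credentials.
ROLE=kube-node-drainer-role
for delay in 1 2 4 8 16; do
  if CREDS=$(curl -sf "http://169.254.169.254/latest/meta-data/iam/security-credentials/${ROLE}"); then
    break
  fi
  echo "credentials not available yet, retrying in ${delay}s" >&2
  sleep "$delay"
done
```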
@whereisaaron the servers are active/active - they all run the same server process, with agents connecting via a k8s service. All servers maintain a cache of all pods and a credential cache (so we duplicate AWS requests for the same credentials across each server), but it means any server process can serve any request. It keeps it relatively simple and works well. Checking our Datadog stats right now on our busiest cluster (around 1k pods), the role name handler has a 95th percentile response of 54ms (where it checks the role of an attached pod) and the credentials handler is at 88ms. |
Thanks @pingles, sounds great, I'd love to try it in action with kube-aws. One thing I'm not clear on: do the system pods that need AWS access (node-drainer and friends) just need the role annotation, or is something else required for them? |
@whereisaaron good question. if a pod needs credentials it just needs the iam annotation. for us the k8s master components run on separate master nodes which have the relevant IAM policy granted directly (and so don't go via kiam). |
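For anyone following along, kiam keeps the same iam.amazonaws.com/role pod annotation but additionally requires each namespace to whitelist roles via a regex annotation; a sketch with a hypothetical role name:

```sh
# kiam only assumes roles that match the namespace's permitted regex.
kubectl annotate namespace kube-system iam.amazonaws.com/permitted='kube-node-drainer-.*'
# The pod-level annotation is the same one kube2iam uses.
kubectl -n kube-system patch daemonset kube-node-drainer-ds -p \
  '{"spec":{"template":{"metadata":{"annotations":{"iam.amazonaws.com/role":"kube-node-drainer-role"}}}}}'
```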
@pingles for the master components, do you run those pods with host networking so they hit the metadata API directly and bypass the kiam agent? |
@whereisaaron yep- we run api server and scheduler pods as host networked |
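A quick way to confirm that on a cluster (the grep pattern is illustrative; adjust to the actual pod names):

```sh
# Prints true for pods running in the host network namespace.
kubectl -n kube-system get pods \
  -o custom-columns=NAME:.metadata.name,HOST_NETWORK:.spec.hostNetwork | grep -E 'apiserver|scheduler'
```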
@pingles thanks for adding so much detail here! Between us it seems we'll be able to work this out pretty soon.
To clarify that, I just meant that the |
FYI I've just merged the KIAM support into master |
To finish this off, I think we probably need #1150 with the role for system pods generated. |
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
Rotten issues close after 30d of inactivity. Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
@fejta-bot: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
I have kube-aws 0.9.9 clusters (us-east-2) with kube-node-drainer enabled and running, but it doesn't actually appear to be working. I don't see any sign of pods draining during ASG termination. The controller doesn't reschedule any pods until well after the instance has terminated and the nodeMonitorGracePeriod has elapsed (which I set to 90s to test). All the pods then get rescheduled at the same instant.

The ASG instances spend 5-10 minutes in the Terminating:Pending state before transitioning to Terminating:Proceed. At that point all connection to the instance drops. But only a couple of minutes later do any pods get rescheduled.

A log from a kube-node-drainer on a node being terminated is below. It gets a 404 response from the AWS API, right up to the point when the connection drops.

The clusters have kube2iam installed. Does that interfere with or block the AWS API access for kube-node-drainer?

Output of kubectl -n kube-system logs kube-node-drainer-ds-289hr --follow
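A way to watch the lifecycle state described above, assuming the AWS CLI and suitable credentials on the operator's machine (the instance ID is a placeholder):

```sh
# Shows the instance's current ASG lifecycle state.
aws autoscaling describe-auto-scaling-instances \
  --instance-ids i-0123456789abcdef0 \
  --query 'AutoScalingInstances[].LifecycleState' --output text
```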