This repository has been archived by the owner on Sep 30, 2020. It is now read-only.

Moving ETCD and controllers outside CloudFormation Nested Stacks #1112

Closed
camilb opened this issue Jan 15, 2018 · 75 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@camilb
Contributor

camilb commented Jan 15, 2018

Hi @mumoshu, I want to discuss some possible changes to our CloudFormation setup, especially regarding nested stacks and DesiredCapacity.

I upgraded all the clusters to 1.9.1 last week. Everything went fine, except on the last cluster, where the last nested stack (a node pool) failed to upgrade and started a rollback.
The upgrade failed because a new instance was added but didn't respond in any way (no ssh, no ping, nothing in the AWS console "Get System Logs"). Since CF expects a signal from the launched instances, the quickest solution was to terminate the instance from the console, but the ASG didn't launch a new one, so I increased the ASG capacity to be able to receive the signal before the timeout. Right after that I received an error in CloudFormation:
New SetDesiredCapacity value 20 is below min value 21 for the AutoScalingGroup.
And the rollback started, rolling back all the node pools, controllers and etcd for ~2h. It's not the first time something like this has happened: once, CloudFormation displayed a message that a new instance was added, but it didn't show up in the console, so I had to change the ASG manually to avoid the signal timeout.

On our CF stacks we never use DesiredCapacity for ASGs, because they can be resized due to a traffic increase and CF will roll back the update if the size differs.
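For reference, this is roughly the shape I mean - a simplified, hypothetical CloudFormation fragment (resource names and values are made up, not kube-aws's actual template), with no DesiredCapacity and only MinSize/MaxSize plus resource signals driving rolling updates:

```yaml
# Hypothetical, simplified fragment - not the template kube-aws generates.
# The point: no DesiredCapacity, so resizing the ASG out-of-band (e.g. for a
# traffic spike) doesn't conflict with a later stack update.
Resources:
  Workers:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      MinSize: "3"
      MaxSize: "21"
      # DesiredCapacity intentionally omitted; the ASG keeps whatever size it
      # has been scaled to.
      LaunchConfigurationName: !Ref WorkerLaunchConfiguration   # defined elsewhere in the template
      VPCZoneIdentifier: !Ref WorkerSubnetIds                   # parameter, defined elsewhere
    CreationPolicy:
      ResourceSignal:
        Count: "3"
        Timeout: PT15M
    UpdatePolicy:
      AutoScalingRollingUpdate:
        MinInstancesInService: "3"
        MaxBatchSize: "1"
        WaitOnResourceSignals: true
        PauseTime: PT15M
```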
Then, regarding the nested stacks, I think etcd and the controllers should be in separate stacks. If the update fails on workers, we should not roll back the etcd and controller nodes that were successfully upgraded. Sometimes a rollback will not work for etcd or controllers at specific versions. For example, on test clusters I was unable to revert from etcd 3.2.X to etcd 3.0.X, or from Kubernetes 1.9.X to 1.7.X, and this requires repeating the upgrade. Workers are quite safe to roll back.

Also, the rollback process for nested stacks takes a very long time for large clusters with multiple node pools. I know that using nested stacks is much cleaner, but rolling back everything when we have updates to etcd or the Kubernetes version requires repeating the upgrade to recover the cluster, causing a very long downtime.

I don't know how we can remove DesiredCapacity from CF right now, but at least maybe we can move etcd and the controllers out of the nested stacks. What do you think?

@Fsero
Contributor

Fsero commented Jan 15, 2018

@camilb I've been maintaining and testing several k8s clusters using kube-aws over the last year and I agree with your idea; it would be good if we could pin a specific k8s version and amiId for the control plane and etcd, and another one for workers. That would tremendously help upgrades and updates.

@redbaron
Contributor

AFAIK amiId can be pinned per each node pool already

@mumoshu
Contributor

mumoshu commented Jan 15, 2018

@camilb Thanks a lot for sharing your hard-won experience.

Yes - I believe this should be "fixed" somehow.

On our CF stacks we never use DesiredCapacity for ASGs, because they can be resized due to a traffic increase and CF will roll back the update if the size differs.

I agree DesiredCapacity should not be used anywhere. Actually, no kube-aws-generated stack template has included DesiredCapacity since #142, as far as I can remember.

Sometimes a rollback will not work for etcd or controllers at specific versions. For example, on test clusters I was unable to revert from etcd 3.2.X to etcd 3.0.X, or from Kubernetes 1.9.X to 1.7.X, and this requires repeating the upgrade.

This is really worth noting!

So, we have two things to be considered separately, right?

  1. How to prevent uselessly rolling back updates to etcd and controller nodes when the cause of the failure was the worker nodes' config or AWS (possibly transient failures).
  2. How to prevent rolling back an upgrade that involves backward-incompatible change(s), especially for controller and etcd nodes.

For 1., we could split the root stack into three as you've suggested. A downside would be that we may need another workflow-engine-like system to reliably update etcd, then controller, and finally worker nodes, in this specific order.
Otherwise, if we don't need reliability here, we could just do it in the kube-aws command while instructing the user to re-run kube-aws update in case kube-aws failed in the middle of an update.

An alternative approach would be emitting a validation error when a user tries to update the whole cluster in one shot.

For 2., do we need some domain-specific logic to determine whether an update is backward-incompatible or not? Otherwise, we can just give up automating it and allow users to disable rollback via a command-line flag or a specific setting in cluster.yaml.

Regarding amiId, yeah - as @redbaron noted, we can configure it per pool. Would it be a matter of documentation?

@mumoshu
Contributor

mumoshu commented Jan 15, 2018

Btw, my take on this problem is improving kube-aws by doing all of the below:

  • Allow splitting the root stack into three: etcd, controllers, worker node pools.
  • Emit a validation error when changes in cluster.yaml are about to result in an update to the whole cluster.
  • Somehow instruct the user to double-check whether the update is backward-compatible or not, and suggest explicitly disabling rollback in case it is backward-incompatible.

What are your thoughts? Thanks!

@camilb
Contributor Author

camilb commented Jan 15, 2018

Hi @mumoshu, thanks for your fast response. In most of the upgrades I prefer not to touch the etcd nodes. Since 07/2016, when I started using kube-aws, I haven't had any issue with etcd on any cluster, and I prefer to update it less often than Kubernetes.
So ideally I'm thinking of having 3 separate stacks, with the possibility to update etcd and Kubernetes separately.

Something like:

  1. kube-aws update etcd for ETCD upgrade and kube-aws update to upgrade controllers and workers.
    or
  2. kube-aws update to update everything with a warning, kube-aws update etcd, kube-aws update controllers and kube-aws update workers.
    In the future we can benefit from a command like kube-aws upgrade workers for people who choose to use AWS's EKS for controllers and etcd.

Also, maybe we can add a bool in cluster.yaml like RollbackOnFailure to control the "Rollback on failure" CF stack option. I have nothing against rolling back when something fails, but this option would allow the user to control when to run the rollback. In my case, only the last node pool failed, and it already had the instances running, except one, which was not an issue.
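Something like this in cluster.yaml, for example (a purely hypothetical sketch; this key does not exist today and its naming/placement are illustrative only):

```yaml
# Hypothetical cluster.yaml sketch - rollbackOnFailure is not an existing key;
# naming and placement are illustrative only.
controller:
  rollbackOnFailure: true
etcd:
  rollbackOnFailure: true
worker:
  nodePools:
    - name: nodepool1
      rollbackOnFailure: false  # leave a failed worker update in place for manual inspection
```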

@Fsero
Contributor

Fsero commented Jan 15, 2018

thanks @redbaron, I didn't notice it 🍺

@mumoshu
Contributor

mumoshu commented Jan 16, 2018

@camilb Thanks!

Would you also mind sharing your thoughts on how we can trigger a rollback when the new worker nodes actually failed due to the preceding updates to controller nodes?

Do you just leave already-updated controller nodes as-is and re-run the updates to worker nodes (with appropriate changes in worker config so that new workers can adapt to the recent changes in controller nodes)?

@camilb
Contributor Author

camilb commented Jan 16, 2018

Would you also mind sharing your thoughts on how we can trigger a rollback when the new worker nodes actually failed due to the preceding updates to controller nodes?

@mumoshu When "Rollback on failure" is set to No and a stack fails, a user can trigger the rollback manually from the AWS console: Actions ==> Rollback Stack.

If the workers are in a separate stack, we can roll back the workers only, as we can run controllers at a greater version than the workers (up to 3 releases, if I remember correctly) until we fix the issues and update the workers stack again.

@mumoshu
Contributor

mumoshu commented Jan 16, 2018

@camilb Thanks! I believe I understood that part.

Excuse me, but what I wanted to sync up on was:

  • What would you do when worker nodes fail to update due to the "successfully" updated controller nodes? Theoretically, this could happen when e.g. the controller SG was mistakenly configured to reject any connection from worker nodes, and then you trigger updates to worker nodes.

In that case, as the stack for controller nodes was successfully updated, you can't trigger a manual rollback, right?

@camilb
Contributor Author

camilb commented Jan 16, 2018

@mumoshu I understand; unfortunately the only option in that case is to fix the controllers stack and update it again. But I think it can be observed as soon as the controllers finish updating: the existing workers will start to fail. If the stacks can be upgraded with separate commands, like on GKE where you have the option to update the controllers, all the workers, or just a node pool (gcloud container clusters upgrade --master, gcloud container clusters upgrade --cluster-version and gcloud container clusters upgrade --node-pool), then maybe we can add an option for the controllers CF stack to fail if existing nodes go NotReady. In that case we can trigger the rollback for controllers before starting the workers upgrade.

@mumoshu
Contributor

mumoshu commented Jan 16, 2018

@camilb Thanks for the confirmation. I'd like to achieve a similar u/x at least.

Under no circumstances should there be a surprise such as "why are my etcd/controller nodes being updated even though my changes in cluster.yaml are solely for workers?".
Unfortunately, this is the u/x we have now. Let's take some action!

@mumoshu
Contributor

mumoshu commented Jan 16, 2018

My updated suggestions for improvements:

  • Split the control-plane stack into two: controller and etcd.
  • Add a flag usable like kube-aws update --only controller,worker. If the etcd stack is being updated in this case, emit a validation error and tell the user to either include 'etcd' in the flag or revert the recent modifications in cluster.yaml affecting the etcd stack.
  • Improve kube-aws update to persist the AMI IDs used across worker, etcd and controller nodes, in something like the overrides.json suggested in #1049 (comment) ("Idea: kube-aws update --set key=value to override specific setting in cluster.yaml"). It should be named differently; maybe defaults.json?
    • AMI IDs in defaults.json are updated automatically by kube-aws update. kube-aws update --only worker updates the AMI ID just for workers. So `kube-aws update
  • Add a flag --yes (-y for short). A kube-aws update without the flag prompts the user to approve changes to stacks. Example: "3 stacks (etcd, worker, controller) are being updated. It may take a long time to roll back if this update fails at the very end of the process. Do you want to proceed? [y/n]". Exit with the same message but without a prompt in case a tty is missing.
  • Add rollbackOnFailure for the controller and etcd stacks, respectively. We won't need one for workers?

Use-cases:

  • Use kube-aws update -y to retain existing behavior. Good for GitOps.
  • Use kube-aws update --only controller,worker to never touch the etcd stack, possibly with controller.rollbackOnFailure: false set in cluster.yaml (@camilb)
  • Use kube-aws update for other use.

In any case, a spare cluster for faster disaster recovery, or blue-green cluster deployment (my preference), is recommended.

Down-sides:

  • It will be a backward-incompatible change, i.e. kube-aws update on an existing cluster with the current two nested stacks would incur downtime from the start of the etcd replacement until the controller stack comes up.

@mumoshu
Contributor

mumoshu commented Jan 16, 2018

Perhaps we need to move the VPC and subnet definitions from the control-plane stack to the root stack?

@mumoshu
Contributor

mumoshu commented Jan 16, 2018

@camilb @Fsero How about the above idea? To me, it seems to provide a smoother migration/development path while achieving your original goal.
@redbaron @c-knowles @danielfm Hi! Do you have any comments? (A backward-incompatible change again!)

@cknowles
Contributor

Sounds reasonable to align with the UX of gcloud; we use that as well and it seems to work well.

I’m not keen on the overrides.json as it splits the config in two. I’d prefer to pin the AMI at cluster.yaml generation or in source, and then add a command to update it using the existing code, i.e. fewer surprise updates, but slightly more surprise if a user is expecting everything to auto-update.

@camilb
Contributor Author

camilb commented Jan 16, 2018

@mumoshu From my point of view your proposal is enough to avoid CF issues in the future. In the last year and a half I haven't had any major issues with kube-aws except the CF updates, and almost every time it was on worker updates.

Add rollbackOnFailure for controller and etcd stack, respectively. We won't need one for workers?

I don't see any downside to adding rollbackOnFailure to all stacks, as the user can trigger the rollback whenever they want, but in some situations, in case of a failure, it can allow user intervention first (resizing a pool, launching separate controllers/workers, etc.).

Perhaps we need to move VPC and subnet definitions from controlplane stack to the root stack?
I think it will be safer.

@mumoshu
Contributor

mumoshu commented Jan 17, 2018

@c-knowles Thanks for your comment!

Yes, it would work too, as long as we give up having kube-aws itself automatically set the latest k8s version and the latest AMI ID at kube-aws update time for ease of use.

So, we could also enhance kube-aws here by:

  • Don't introduce something like overrides.json, but instead:
    • Populate kubernetesVersion and etcd.version with their respective default values for the specific kube-aws version in the generated cluster.yaml. kube-aws would never update them automatically.
    • Emit validation errors on a missing kubernetesVersion, amiId and/or etcd.version.
    • Emit validation errors on possibly unsupported values specified for kubernetesVersion and/or etcd.version, while printing the default versions embedded in the kube-aws binary being run.

@mumoshu
Contributor

mumoshu commented Jan 17, 2018

In case someone is actually relying on kube-aws update to automatically update the k8s and etcd versions and the AMI ID, I'd suggest writing a wrapper which perhaps runs sed to replace the version numbers automatically? 😄
Do you have something like that you could share with us?
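For illustration, such a wrapper might look something like this - a rough sketch only; the top-level kubernetesVersion/amiId keys in cluster.yaml and the way the new values are obtained are assumptions, not an agreed design:

```bash
#!/usr/bin/env bash
# Rough sketch of a "bump versions, then update" wrapper around kube-aws.
# Assumes top-level kubernetesVersion/amiId keys in cluster.yaml; the new
# values are supplied by the operator.
set -euo pipefail

NEW_K8S_VERSION="${1:?usage: $0 <k8s-version> <ami-id>}"
NEW_AMI_ID="${2:?usage: $0 <k8s-version> <ami-id>}"

# Naive in-place text replacement is fragile, so keep a backup of cluster.yaml.
cp cluster.yaml cluster.yaml.bak
sed -i \
  -e "s/^kubernetesVersion:.*/kubernetesVersion: ${NEW_K8S_VERSION}/" \
  -e "s/^amiId:.*/amiId: ${NEW_AMI_ID}/" \
  cluster.yaml

# Plus whatever flags your setup normally passes to kube-aws update.
kube-aws update
```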

@mumoshu
Contributor

mumoshu commented Jan 17, 2018

Implementation note: It would be possible to detect which stack(s) are being updated by creating a CFN change set and inspecting the result, so that we can emit a validation error accordingly.
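A minimal sketch of that idea using the AWS CLI against the root stack (the stack/change-set names and template path are placeholders; kube-aws itself would do the equivalent through the SDK):

```bash
#!/usr/bin/env bash
# Sketch: preview which nested stacks a template change would touch by creating
# a change set on the root stack and inspecting it, without executing it.
set -euo pipefail

STACK="my-kube-aws-cluster"                 # placeholder root stack name
CHANGESET="kube-aws-preview-$(date +%s)"

aws cloudformation create-change-set \
  --stack-name "$STACK" \
  --change-set-name "$CHANGESET" \
  --template-body file://stack.json \
  --capabilities CAPABILITY_IAM CAPABILITY_NAMED_IAM

aws cloudformation wait change-set-create-complete \
  --stack-name "$STACK" --change-set-name "$CHANGESET"

# Nested stacks show up as AWS::CloudFormation::Stack resources; if e.g. the
# etcd stack is listed here, kube-aws could emit the proposed validation error.
aws cloudformation describe-change-set \
  --stack-name "$STACK" --change-set-name "$CHANGESET" \
  --query 'Changes[].ResourceChange.[Action,LogicalResourceId,ResourceType]' \
  --output table

# Discard the preview.
aws cloudformation delete-change-set \
  --stack-name "$STACK" --change-set-name "$CHANGESET"
```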

@cknowles
Contributor

What about something like:

  • First generation of cluster.yaml runs the same AMI ID gathering code as now and populates it in the yaml.
  • Add a new command to update cluster.yaml to latest version.
  • Warn if not present in cluster.yaml but still populate existing cluster.yaml with latest version.
  • Error if the AMI ID is too old to be supported, e.g. old docker version. Not sure this is achievable given a custom AMI can be used.

Not sure about including k8s and etcd versions, as then what about container image versions, etc.? There's usually quite a bit more to updating than just those versions, and a lot of different incompatibilities, but the AMI is a little removed from that, other than the docker version.

@mumoshu
Contributor

mumoshu commented Jan 17, 2018

Hi @c-knowles!
I generally agree with the direction.

Not sure about including k8s and etcd versions, as then what about container image versions, etc.? There's usually quite a bit more to updating than just those versions, and a lot of different incompatibilities, but the AMI is a little removed from that, other than the docker version.

Regarding the etcd and k8s versions, I wanted to include them for auto-population as they do result in node replacement once you update the kube-aws binary, even if no cluster.yaml or stack templates are changed.

Add a new command to update cluster.yaml to latest version.

How would you like kube-aws to achieve it?
There's no YAML parser capable of preserving blank lines and comments AFAIK 😢
I'm afraid that a simple text replacement would break in various ways.

@mumoshu
Contributor

mumoshu commented Jan 22, 2018

Updated proposal for improving kube-aws for fewer surprises while upgrading

  • Split the control-plane stack into two: controller and etcd.
    • Backward-compatibility: Do we need a flag to selectively keep the existing architecture (a control-plane stack w/ controller and etcd)?
  • Add a flag usable like kube-aws update --only controller,worker.
  • Improve kube-aws update to persist the AMI IDs used across worker, etcd and controller nodes.
  • Add a flag --yes (-y for short).
    • A kube-aws update without the flag prompts the user to approve changes to stacks. Example: "3 stacks (etcd, worker, controller) are being updated. It may take a long time to roll back if this update fails at the very end of the process. Do you want to proceed? [y/n]". Exit with the same message but without a prompt in case a tty is missing.
  • Add rollbackOnFailure for the controller and etcd stacks, respectively.
    • We won't need one for workers?

Use-cases:

  • Use kube-aws update -y to retain existing behavior. Good for GitOps.
  • Use kube-aws update --only controller,worker to never touch the etcd stack, possibly with controller.rollbackOnFailure: false set in cluster.yaml (@camilb)
  • Use kube-aws update for other use.

In any case, a spare cluster for faster disaster recovery, or blue-green cluster deployment (my preference), is recommended.

Down-sides:

  • It will be a backward-incompatible change, i.e. kube-aws update on an existing cluster with the current two nested stacks would incur downtime from the start of the etcd replacement until the controller stack comes up.

@whereisaaron
Contributor

Interesting discussion and some good ideas. The ideas that jumped out as great to me were:

  1. Being able to disable the automatic rollback: I've been doing a lot of deployments with 0.9.9, and when something is wrong, it becomes a race between me and CF to diagnose the root cause before CF destroys the evidence. I'd love the option to disable it sometimes.

  2. Inject AMI IDs into cluster.yaml on init: I agree with @c-knowles that the separate json file sounds clumsy and maybe mysterious to users. I'd rather explicitly pinned AMIs be added to cluster.yaml, even if I have to manually update those from time to time. It would be a nice assist if kube-aws would gather and list the latest relevant AMI IDs and k8s versions available to help me with that. Later, if a reliable YAML patcher is available, we could integrate the option to auto-update cluster.yaml.

  3. Separate stacks: Compared with the old node-pool system (circa 0.9.3), I love being able to have all the settings in one cluster.yaml. But I don't actually enjoy the nested CF stacks. It scares the bacon off me, when updating worker node tags or instance type or something, that I'm going anywhere near the control plane. And when deploying many times to work out the kinks with 0.9.9 configs, I wasted a lot of time watching the etcd cluster being slowly made and unnecessarily destroyed while I worked out kinks with the controller and worker deployment.

@whereisaaron
Contributor

@camilb I'd love to hear your experience with upgrading k8s versions in kube-aws clusters. It has never really been a topic in the kube-aws documentation and often it has been noted as 'not yet supported', 'won't include etcd', or 'might not work'.

Are you able to deploy 1.9 clusters with kube-aws 0.9.9?

Are you able to upgrade 1.8 clusters to 1.9 with kube-aws 0.9.9?

Do you just update the hypercube version and go for it?

Any tips or gotchas you can share or add to the docs somewhere?

@mumoshu
Contributor

mumoshu commented Jan 29, 2018

Thanks for your feedback!

@mumoshu
Contributor

mumoshu commented Jan 29, 2018

Do we need backward-compatibility for this, e.g. a flag to split stacks or not?

Also, running kube-aws update on existing clusters after this change would recreate the whole control plane and the kube-aws-managed VPC/subnets/etc., which indeed introduces downtime.

I'm ok with recreating every cluster because I consider my clusters cattle rather than pets. How about you, everyone?
Any idea if you need a better update path?

@camilb
Contributor Author

camilb commented Jan 29, 2018

@whereisaaron I'm using the master branch to build kube-aws most of the time. Currently all of my clusters are running the latest changes in master. Some were upgraded from 1.8.x to 1.9.1, and the largest one from 1.7.8 to 1.9.1.
Most of the time everything works fine, but I've had situations where I had to regenerate all the service account tokens and restart most of the pods in kube-system. Except for this and the CF errors, I never had another issue upgrading kube-aws in almost 2 years.
We still don't have a method to upgrade the Kubernetes version only, so all the instances are replaced by kube-aws update. And the network is managed by a separate stack, similar to this one

@camilb
Contributor Author

camilb commented Jan 29, 2018

@mumoshu I'm planning to migrate existing clusters to new ones, without updating.

@whereisaaron
Contributor

@mumoshu when you have a large collection of heterogeneous applications in a cluster, the clusters may be cattle, but the applications on them are like pet ticks on the cow you have to locate and transplant to the next beast 😄

Hopefully at least minor (x.y.z) version in-place upgrades will be possible/supported, again trying to match the ease of upgrade that GKE and the like give you. Separate stacks do make this a little easier, since node pools and controllers are pretty much cattle within a cluster.

@mumoshu
Contributor

mumoshu commented Apr 20, 2018

I'll be merging #1233 and cutting v0.9.11-rc.1 once v0.9.10 is released. Any comments, opinions, etc.? 😃

@kevtaylor
Contributor

@mumoshu Just one comment so far - is it possible to split the worker stacks with this implementation? I.e. if I have a multiple-nodepool environment, I'd like to roll the nodepool stacks independently.

@mumoshu
Contributor

mumoshu commented Apr 20, 2018

@kevtaylor It wouldn't be so hard to implement naive support for that, like kube-aws update --targets nodepool1 rolling the single node pool only, where you have a node pool with name: nodepool1 in cluster.yaml. Would that be ok for you?

Anyway, may I ask you about your exact use-case for that?

I guess that if you had modified the first node pool only in your cluster.yaml, kube-aws update --targets worker would never affect the second and following pools.

Oh, maybe it relates to amiId, which is propagated from the top level to every node pool?

@mumoshu
Contributor

mumoshu commented Apr 20, 2018

Thank you so much for the comment, anyway!

@kevtaylor
Contributor

So the types of things we do are to split node pools into separate groups.
Group A may have c4 instances and be our standard pool - called NODEPOOLA, NODEPOOLB, etc.
Group B may have d instance types and tainting - called NODEPOOL-BIGA, NODEPOOL-BIGB.

We then add tolerations to certain pod types, which then favour that nodepool.

We would still use a consistent AMI across the nodepools.

But we may want to, say, change the instance type of Group A and then just roll that pool - or if we wanted to change an AMI id, just test that on a given stack first, etc.
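For concreteness, that setup might look roughly like this in cluster.yaml (pool names, instance types and taint fields here are illustrative only; the exact layout depends on the kube-aws version):

```yaml
# Hypothetical cluster.yaml excerpt - names and exact field layout are illustrative.
worker:
  nodePools:
    - name: nodepoola              # Group A: standard pool
      instanceType: c4.xlarge
    - name: nodepoolb
      instanceType: c4.xlarge
    - name: nodepool-biga          # Group B: tainted pool for workloads that tolerate it
      instanceType: d2.xlarge
      taints:
        - key: dedicated
          value: big
          effect: NoSchedule
    - name: nodepool-bigb
      instanceType: d2.xlarge
      taints:
        - key: dedicated
          value: big
          effect: NoSchedule
```

Changing Group A's instance type and rolling only that group would then mean updating just those pools, e.g. with the kube-aws update --targets nodepoola,nodepoolb extension discussed below.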

@mumoshu
Contributor

mumoshu commented Apr 20, 2018

@kevtaylor Thanks for the explanation! It is now clearer to me.

How about kube-aws update --targets nodepool-biga,nodepool-bigb?
I have already implemented the ability to select multiple targets at a time, so --targets etcd,controller is possible. I can extend the current impl to be able to also target multiple, but not all, node pools.

Does that sound good?

@kevtaylor
Contributor

@mumoshu That looks spot on to me

@mumoshu
Contributor

mumoshu commented Apr 24, 2018

@kevtaylor Thanks for your confirmation 👍 Just implemented it into #1233

mumoshu added a commit that referenced this issue May 10, 2018
(#1233)

* Prompt before updating

* feat: Support updating a subset of cfn stacks only

* Vendor changes for the stack subset update feature

* Rename some func-scoped variables for clarity

* feat: Extract etcd and network stack for more fine-grained cluster update

* feat: Ability to specify just one or more node pools for updates

* fix: Make update work with `--targets all`

Ref #1112
@davidmccormick
Contributor

Hi, I just rolled the latest master into an existing 0.9.10 cluster and it did something strange with the etcds - it created 3 new ones and left the old ones. I think we are going to have to think about how to migrate from the old to the new separated stacks.

@cknowles
Contributor

I guess as it's a separate stack now, nothing is managing that migration. Are we still advising to do a blue/green cluster switchover? We should also update the CLI reference in the docs.

@davidmccormick
Contributor

I don't think we should be allowing this to roll into an existing cluster if the result is going to be to effectively remove all the existing state. I think that we need a migration path or a hard stop that prevents users from accidentally wiping their existing clusters.

@mumoshu
Contributor

mumoshu commented Jun 4, 2018 via email

@davidmccormick
Contributor

davidmccormick commented Jun 4, 2018

That sounds like a very good quick safety change and could give us some time to put together a more comprehensive solution! :)

Regarding a migration strategy, I'm not very familiar with CloudFormation, but I have noticed three interesting-looking tags, e.g.

aws:cloudformation:logical-id : Etcd0EBS
aws:cloudformation:stack-id : arn:aws:cloudformation:us-west-2:034324643013:stack/davem-lab-secure-Etcd-URJB6X4OXGUT/8c400ba0-5f75-11e8-902e-503acbd4dc29
aws:cloudformation:stack-name : davem-lab-secure-Etcd-URJB6X4OXGUT

Does it sound plausible to write a function in kube-aws that looks for legacy etcd resources, instances, volumes (etc.) and re-tags them with the stack-id of the new stack? Would this make CloudFormation treat them as members of the new stack? I have no idea if this would work, but if it does, it might give us a clean migration path.

Can anyone with more CF experience give me a steer on whether this solution is worthwhile to spend some time trying out and testing?

@davidmccormick
Contributor

davidmccormick commented Jun 5, 2018

Thinking some more about it, I don't think that my suggestion above will work - given that we are creating a new etcd stack rather than updating it, we could probably expect an error about clashing with existing resources. There is clearly some complexity here regarding how CloudFormation works, and my casual investigation so far hasn't turned up much in the way of people migrating resources across different stacks.

@davidmccormick
Contributor

The first problem I found when trying to update a 0.9.9 cluster to the new Etcd stack is that the new ETCD0 fails to send its cfn-signal for some reason. I couldn't fathom how we get past this in a new, clean cluster, so I have created a PR that will bring the etcds up in parallel upon new creation of the stack and thus avoid this problem: #1357

@mumoshu
Contributor

mumoshu commented Sep 28, 2018

Oh, did we miss implementing the rollbackOnFailure: false that was proposed in the middle of this thread?

@davidmccormick
Contributor

Hiyah, I didn't disable rollback as part of the etcd migration code because, generally, if the etcd migration failed then rolling back is a good thing. If the controllers fail to come up then this should also trigger a rollback before the old etcds are deleted. Is this in relation to a desired feature or an issue upgrading?

@mumoshu
Contributor

mumoshu commented Sep 28, 2018

@davidmccormick As far as I remember, rollbackOnFailure: false was mostly wanted for disabling rollback for worker nodes only. It is useful when there are a large number of worker nodes. Recreating all those just due to e.g. temporary EC2/ASG issue/instability isn't desirable.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 25, 2019
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 25, 2019
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
