Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make install and upgrade retries number configurable #394

Closed
wants to merge 5 commits into from

Conversation

ubergesundheit
Copy link
Member

@ubergesundheit ubergesundheit commented Nov 28, 2024

What does this PR do?

This PR enables users of the cluster chart to configure the install and upgrade retries of HelmReleases. During cluster upgrades, HelmRelease upgrade tries are exhausted because of non-ready nodes.

What is the effect of this change to users?

All components using HelmReleases can have different install and upgrade retries

Any background context you can provide?

giantswarm/roadmap#3664

Should this change be mentioned in the release notes?

  • CHANGELOG.md has been updated (if it exists)

@taylorbot
Copy link
Contributor

Hey @ubergesundheit, a test pull request has been created for you in the cluster-aws repo! Go to pull request giantswarm/cluster-aws#941 in order to test your cluster chart changes on AWS.

@ubergesundheit
Copy link
Member Author

Looks like tests seem to work. China and cilium ENI mode tests are failing but to some unrelated reasons (See here)

Copy link

github-actions bot commented Dec 3, 2024

There were differences in the rendered Helm template, please check! ⚠️

Output
=== Differences when rendered with values file helm/cluster/ci/test-cgroupsv1-values.yaml ===

/spec  (helm.toolkit.fluxcd.io/v2beta1/HelmRelease/org-giantswarm/awesome-cilium)
  + one map entry added:
    upgrade:
      remediation:
        retries: 40

/spec/install/remediation/retries  (helm.toolkit.fluxcd.io/v2beta1/HelmRelease/org-giantswarm/awesome-cilium)
  ± value change
    - 30
    + 40

/spec  (helm.toolkit.fluxcd.io/v2beta1/HelmRelease/org-giantswarm/awesome-coredns)
  + one map entry added:
    upgrade:
      remediation:
        retries: 30

/spec  (helm.toolkit.fluxcd.io/v2beta1/HelmRelease/org-giantswarm/awesome-network-policies)
  + one map entry added:
    upgrade:
      remediation:
        retries: 30

/spec  (helm.toolkit.fluxcd.io/v2beta1/HelmRelease/org-giantswarm/awesome-vertical-pod-autoscaler-crd)
  + one map entry added:
    upgrade:
      remediation:
        retries: 30



=== Differences when rendered with values file helm/cluster/ci/test-required-values.yaml ===

/spec  (helm.toolkit.fluxcd.io/v2beta1/HelmRelease/org-giantswarm/awesome-cilium)
  + one map entry added:
    upgrade:
      remediation:
        retries: 40

/spec/install/remediation/retries  (helm.toolkit.fluxcd.io/v2beta1/HelmRelease/org-giantswarm/awesome-cilium)
  ± value change
    - 30
    + 40

/spec  (helm.toolkit.fluxcd.io/v2beta1/HelmRelease/org-giantswarm/awesome-coredns)
  + one map entry added:
    upgrade:
      remediation:
        retries: 30

/spec  (helm.toolkit.fluxcd.io/v2beta1/HelmRelease/org-giantswarm/awesome-network-policies)
  + one map entry added:
    upgrade:
      remediation:
        retries: 30

/spec  (helm.toolkit.fluxcd.io/v2beta1/HelmRelease/org-giantswarm/awesome-vertical-pod-autoscaler-crd)
  + one map entry added:
    upgrade:
      remediation:
        retries: 30



=== Differences when rendered with values file helm/cluster/ci/test-zot-mc-and-local-values.yaml ===

/spec  (helm.toolkit.fluxcd.io/v2beta1/HelmRelease/org-giantswarm/awesome-cilium)
  + one map entry added:
    upgrade:
      remediation:
        retries: 40

/spec/install/remediation/retries  (helm.toolkit.fluxcd.io/v2beta1/HelmRelease/org-giantswarm/awesome-cilium)
  ± value change
    - 30
    + 40

/spec  (helm.toolkit.fluxcd.io/v2beta1/HelmRelease/org-giantswarm/awesome-coredns)
  + one map entry added:
    upgrade:
      remediation:
        retries: 30

/spec  (helm.toolkit.fluxcd.io/v2beta1/HelmRelease/org-giantswarm/awesome-network-policies)
  + one map entry added:
    upgrade:
      remediation:
        retries: 30

/spec  (helm.toolkit.fluxcd.io/v2beta1/HelmRelease/org-giantswarm/awesome-vertical-pod-autoscaler-crd)
  + one map entry added:
    upgrade:
      remediation:
        retries: 30



=== Differences when rendered with values file helm/cluster/ci/test-zot-mc-values.yaml ===

/spec  (helm.toolkit.fluxcd.io/v2beta1/HelmRelease/org-giantswarm/awesome-cilium)
  + one map entry added:
    upgrade:
      remediation:
        retries: 40

/spec/install/remediation/retries  (helm.toolkit.fluxcd.io/v2beta1/HelmRelease/org-giantswarm/awesome-cilium)
  ± value change
    - 30
    + 40

/spec  (helm.toolkit.fluxcd.io/v2beta1/HelmRelease/org-giantswarm/awesome-coredns)
  + one map entry added:
    upgrade:
      remediation:
        retries: 30

/spec  (helm.toolkit.fluxcd.io/v2beta1/HelmRelease/org-giantswarm/awesome-network-policies)
  + one map entry added:
    upgrade:
      remediation:
        retries: 30

/spec  (helm.toolkit.fluxcd.io/v2beta1/HelmRelease/org-giantswarm/awesome-vertical-pod-autoscaler-crd)
  + one map entry added:
    upgrade:
      remediation:
        retries: 30



=== Differences when rendered with values file helm/cluster/ci/test-zot-only-local-values.yaml ===

/spec  (helm.toolkit.fluxcd.io/v2beta1/HelmRelease/org-giantswarm/awesome-cilium)
  + one map entry added:
    upgrade:
      remediation:
        retries: 40

/spec/install/remediation/retries  (helm.toolkit.fluxcd.io/v2beta1/HelmRelease/org-giantswarm/awesome-cilium)
  ± value change
    - 30
    + 40

/spec  (helm.toolkit.fluxcd.io/v2beta1/HelmRelease/org-giantswarm/awesome-coredns)
  + one map entry added:
    upgrade:
      remediation:
        retries: 30

/spec  (helm.toolkit.fluxcd.io/v2beta1/HelmRelease/org-giantswarm/awesome-network-policies)
  + one map entry added:
    upgrade:
      remediation:
        retries: 30

/spec  (helm.toolkit.fluxcd.io/v2beta1/HelmRelease/org-giantswarm/awesome-vertical-pod-autoscaler-crd)
  + one map entry added:
    upgrade:
      remediation:
        retries: 30

@ubergesundheit ubergesundheit marked this pull request as ready for review December 3, 2024 13:05
@ubergesundheit ubergesundheit requested a review from a team as a code owner December 3, 2024 13:05
Copy link
Member

@AverageMarcus AverageMarcus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please see my comment here: giantswarm/roadmap#3664 (comment)

@Gacko
Copy link
Member

Gacko commented Dec 3, 2024

In general I agree with Marcus, but on the other hand this feels like setting timeouts in network apps: Of course you can always just set huge timeouts, but this sometimes hides root causes you want to fix. So I'm not sure if we should rather pick a high value or use -1.

Whichever way we go, having it configurable makes sense to me as you we might end up in an incident where it would be useful to not have it hardcoded one day.

@AverageMarcus
Copy link
Member

In general I agree with Marcus, but on the other hand this feels like setting timeouts in network apps: Of course you can always just set huge timeouts, but this sometimes hides root causes you want to fix. So I'm not sure if we should rather pick a high value or use -1.

We should still have alerting in place for when these are stuck pending for too long. That doesn't change. But all default apps (what we're talking about here) need to install for a WC to be considered successful. So there's no reason not to keep trying from what I see.

Whichever way we go, having it configurable makes sense to me as you we might end up in an incident where it would be useful to not have it hardcoded one day.

I don't think it adds anything to introduce more complexity "just in case". When we have an actual need, sure, but I don't see that we actually need to configure it right now. We just need it not to timeout.

@Gacko Gacko closed this Dec 14, 2024
@Gacko Gacko deleted the make-helmrelease-retries-configurable branch December 14, 2024 12:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants