Make install and upgrade retries number configurable #394

ubergesundheit · 2024-11-28T12:05:15Z

What does this PR do?

This PR lets users of the cluster chart configure the install and upgrade retries of HelmReleases. During cluster upgrades, HelmRelease upgrade retries are exhausted because of non-ready nodes.

What is the effect of this change to users?

Each component deployed through a HelmRelease can have its own install and upgrade retry counts.
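
As a rough illustration of where these settings end up, the fragment below sketches the relevant part of a rendered Flux HelmRelease. The resource name, namespace, and retry counts are taken from the rendered template diff in this thread. The rest of the spec is omitted, and the chart values that feed these fields are not shown because the thread does not name them.

```yaml
# Illustrative fragment only: required fields such as spec.chart and
# spec.interval are left out, so this is not a complete manifest.
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: awesome-cilium        # name as it appears in the rendered diff
  namespace: org-giantswarm   # namespace as it appears in the rendered diff
spec:
  install:
    remediation:
      retries: 40   # retries after a failed install
  upgrade:
    remediation:
      retries: 40   # retries after a failed upgrade
```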

Any background context you can provide?

giantswarm/roadmap#3664

Should this change be mentioned in the release notes?

CHANGELOG.md has been updated (if it exists)

taylorbot · 2024-11-28T13:20:15Z

Hey @ubergesundheit, a test pull request has been created for you in the cluster-aws repo! Go to pull request giantswarm/cluster-aws#941 in order to test your cluster chart changes on AWS.

ubergesundheit · 2024-12-03T13:05:45Z

The tests seem to pass. The China and Cilium ENI mode tests are failing, but for unrelated reasons (see here).

github-actions · 2024-12-03T13:05:46Z

There were differences in the rendered Helm template, please check! ⚠️

Output

=== Differences when rendered with values file helm/cluster/ci/test-cgroupsv1-values.yaml ===

/spec  (helm.toolkit.fluxcd.io/v2beta1/HelmRelease/org-giantswarm/awesome-cilium)
  + one map entry added:
    upgrade:
      remediation:
        retries: 40

/spec/install/remediation/retries  (helm.toolkit.fluxcd.io/v2beta1/HelmRelease/org-giantswarm/awesome-cilium)
  ± value change
    - 30
    + 40

/spec  (helm.toolkit.fluxcd.io/v2beta1/HelmRelease/org-giantswarm/awesome-coredns)
  + one map entry added:
    upgrade:
      remediation:
        retries: 30

/spec  (helm.toolkit.fluxcd.io/v2beta1/HelmRelease/org-giantswarm/awesome-network-policies)
  + one map entry added:
    upgrade:
      remediation:
        retries: 30

/spec  (helm.toolkit.fluxcd.io/v2beta1/HelmRelease/org-giantswarm/awesome-vertical-pod-autoscaler-crd)
  + one map entry added:
    upgrade:
      remediation:
        retries: 30



=== Differences when rendered with values file helm/cluster/ci/test-required-values.yaml ===

/spec  (helm.toolkit.fluxcd.io/v2beta1/HelmRelease/org-giantswarm/awesome-cilium)
  + one map entry added:
    upgrade:
      remediation:
        retries: 40

/spec/install/remediation/retries  (helm.toolkit.fluxcd.io/v2beta1/HelmRelease/org-giantswarm/awesome-cilium)
  ± value change
    - 30
    + 40

/spec  (helm.toolkit.fluxcd.io/v2beta1/HelmRelease/org-giantswarm/awesome-coredns)
  + one map entry added:
    upgrade:
      remediation:
        retries: 30

/spec  (helm.toolkit.fluxcd.io/v2beta1/HelmRelease/org-giantswarm/awesome-network-policies)
  + one map entry added:
    upgrade:
      remediation:
        retries: 30

/spec  (helm.toolkit.fluxcd.io/v2beta1/HelmRelease/org-giantswarm/awesome-vertical-pod-autoscaler-crd)
  + one map entry added:
    upgrade:
      remediation:
        retries: 30



=== Differences when rendered with values file helm/cluster/ci/test-zot-mc-and-local-values.yaml ===

/spec  (helm.toolkit.fluxcd.io/v2beta1/HelmRelease/org-giantswarm/awesome-cilium)
  + one map entry added:
    upgrade:
      remediation:
        retries: 40

/spec/install/remediation/retries  (helm.toolkit.fluxcd.io/v2beta1/HelmRelease/org-giantswarm/awesome-cilium)
  ± value change
    - 30
    + 40

/spec  (helm.toolkit.fluxcd.io/v2beta1/HelmRelease/org-giantswarm/awesome-coredns)
  + one map entry added:
    upgrade:
      remediation:
        retries: 30

/spec  (helm.toolkit.fluxcd.io/v2beta1/HelmRelease/org-giantswarm/awesome-network-policies)
  + one map entry added:
    upgrade:
      remediation:
        retries: 30

/spec  (helm.toolkit.fluxcd.io/v2beta1/HelmRelease/org-giantswarm/awesome-vertical-pod-autoscaler-crd)
  + one map entry added:
    upgrade:
      remediation:
        retries: 30



=== Differences when rendered with values file helm/cluster/ci/test-zot-mc-values.yaml ===

/spec  (helm.toolkit.fluxcd.io/v2beta1/HelmRelease/org-giantswarm/awesome-cilium)
  + one map entry added:
    upgrade:
      remediation:
        retries: 40

/spec/install/remediation/retries  (helm.toolkit.fluxcd.io/v2beta1/HelmRelease/org-giantswarm/awesome-cilium)
  ± value change
    - 30
    + 40

/spec  (helm.toolkit.fluxcd.io/v2beta1/HelmRelease/org-giantswarm/awesome-coredns)
  + one map entry added:
    upgrade:
      remediation:
        retries: 30

/spec  (helm.toolkit.fluxcd.io/v2beta1/HelmRelease/org-giantswarm/awesome-network-policies)
  + one map entry added:
    upgrade:
      remediation:
        retries: 30

/spec  (helm.toolkit.fluxcd.io/v2beta1/HelmRelease/org-giantswarm/awesome-vertical-pod-autoscaler-crd)
  + one map entry added:
    upgrade:
      remediation:
        retries: 30



=== Differences when rendered with values file helm/cluster/ci/test-zot-only-local-values.yaml ===

/spec  (helm.toolkit.fluxcd.io/v2beta1/HelmRelease/org-giantswarm/awesome-cilium)
  + one map entry added:
    upgrade:
      remediation:
        retries: 40

/spec/install/remediation/retries  (helm.toolkit.fluxcd.io/v2beta1/HelmRelease/org-giantswarm/awesome-cilium)
  ± value change
    - 30
    + 40

/spec  (helm.toolkit.fluxcd.io/v2beta1/HelmRelease/org-giantswarm/awesome-coredns)
  + one map entry added:
    upgrade:
      remediation:
        retries: 30

/spec  (helm.toolkit.fluxcd.io/v2beta1/HelmRelease/org-giantswarm/awesome-network-policies)
  + one map entry added:
    upgrade:
      remediation:
        retries: 30

/spec  (helm.toolkit.fluxcd.io/v2beta1/HelmRelease/org-giantswarm/awesome-vertical-pod-autoscaler-crd)
  + one map entry added:
    upgrade:
      remediation:
        retries: 30

AverageMarcus

Please see my comment here: giantswarm/roadmap#3664 (comment)

Gacko · 2024-12-03T15:04:41Z

In general I agree with Marcus, but on the other hand this feels like setting timeouts in network apps: Of course you can always just set huge timeouts, but this sometimes hides root causes you want to fix. So I'm not sure if we should rather pick a high value or use -1.

Whichever way we go, having it configurable makes sense to me, as we might one day end up in an incident where it would be useful not to have it hardcoded.
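
For reference, the `-1` option mentioned above would look roughly like the fragment below. In the Flux HelmRelease API, a negative `retries` value means unlimited retries. This is a sketch of the alternative under discussion, not something taken from this PR's diff.

```yaml
# Sketch of the "-1" alternative; not part of this PR's rendered output.
spec:
  install:
    remediation:
      retries: -1   # negative value: keep retrying indefinitely
  upgrade:
    remediation:
      retries: -1   # same behaviour for failed upgrades
```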

AverageMarcus · 2024-12-03T15:14:39Z

> In general I agree with Marcus, but on the other hand this feels like setting timeouts in network apps: Of course you can always just set huge timeouts, but this sometimes hides root causes you want to fix. So I'm not sure if we should rather pick a high value or use -1.

We should still have alerting in place for when these are stuck pending for too long. That doesn't change. But all default apps (what we're talking about here) need to install for a WC to be considered successful. So there's no reason not to keep trying from what I see.

> Whichever way we go, having it configurable makes sense to me, as we might one day end up in an incident where it would be useful not to have it hardcoded.

I don't think it adds anything to introduce more complexity "just in case". When we have an actual need, sure, but I don't see that we actually need to configure it right now. We just need it not to time out.

ubergesundheit added 4 commits November 28, 2024 13:04

Make install and upgrade retries number configurable

c270476

Determine latest release using gh for provider test PR

dd8cdba

Specify github token for using gh

c9b5eb7

REVERT ME

938ca89

taylorbot mentioned this pull request Nov 28, 2024

Test cluster chart PR #394 giantswarm/cluster-aws#941

Closed

ubergesundheit mentioned this pull request Dec 2, 2024

Cilium helmrelease max retries exhausted on larger clusters giantswarm/roadmap#3664

Open

Merge branch 'main' into make-helmrelease-retries-configurable

2533dea

ubergesundheit marked this pull request as ready for review December 3, 2024 13:05

ubergesundheit requested a review from a team as a code owner December 3, 2024 13:05

AverageMarcus requested changes Dec 3, 2024

Gacko closed this Dec 14, 2024

Gacko deleted the make-helmrelease-retries-configurable branch December 14, 2024 12:06
