Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: remove digest check to never ignore helm uninstall errors #1024

Merged
merged 3 commits into from
Sep 20, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
54 changes: 54 additions & 0 deletions docs/spec/v2/helmreleases.md
Original file line number Diff line number Diff line change
Expand Up @@ -1224,6 +1224,60 @@ Using `flux`:
flux reconcile helmrelease <helmrelease-name> --reset
```

### Handling failed uninstall

At times, a Helm uninstall may fail due to the resource deletion taking a long
time, resources getting stuck in deleting phase due to some resource delete
policy in the cluster or some failing delete hooks. Depending on the scenario,
this can be handled in a few different ways.

For resources that take long to delete but are certain to get deleted without
any intervention, failed uninstall will be retried until they succeeds. The
HelmRelease object will remain in a failed state until the uninstall succeeds.
Once uninstall is successful, the HelmRelease object will get deleted.

If resources get stuck at deletion due to some dependency on some other
resource or policy, the controller will keep retrying to delete the resources.
The HelmRelease object will remain in a failed state. Once the cause of resource
deletion issue is resolved by intervention, HelmRelease uninstallation will
succeed and the HelmRelease object will get deleted. In case the cause of the
deletion issue can't be resolved, the HelmRelease can be force deleted by
manually deleting the [Helm storage
secret](https://helm.sh/docs/topics/advanced/#storage-backends) from the
respective release namespace. When the controller retries uninstall and cannot
find the release, it assumes that the release has been deleted, Helm uninstall
succeeds and the HelmRelease object gets deleted. This leaves behind all the
release resources. They have to be manually deleted.

If a chart with pre-delete hooks fail, the controller will re-run the hooks
until they succeed and unblock the uninstallation. The Helm uninstall error
will be present in the status of HelmRelease. This can be used to identify which
hook is failing. If the hook failure persists, to run uninstall without the
hooks, equivalent of running `helm uninstall --no-hooks`, update the HelmRelease
to set `.spec.uninstall.disableHooks` to `true`.

```yaml
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
...
spec:
...
uninstall:
disableHooks: true
```

In the next reconciliation, the controller will run Helm uninstall without the
hooks. On success, the HelmRelease will get deleted. Otherwise, check the status
of the HelmRelease for other failure that may be blocking the uninstall.

In case of charts with post-delete hooks, since the hook runs after the deletion
of the resources and the Helm storage, the hook failure will result in an
initial uninstall failure. In the subsequent reconciliation to retry uninstall,
since the Helm storage for the release got deleted, uninstall will succeed and
the HelmRelease object will get deleted.

Any leftover pre or post-delete hook resources have to be manually deleted.

### Waiting for `Ready`

When a change is applied, it is possible to wait for the HelmRelease to reach a
Expand Down
64 changes: 64 additions & 0 deletions internal/controller/helmrelease_controller_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -2628,6 +2628,70 @@ func TestHelmReleaseReconciler_reconcileUninstall(t *testing.T) {
g.Expect(conditions.GetMessage(obj, meta.ReadyCondition)).To(ContainSubstring("no namespace provided"))
g.Expect(obj.GetConditions()).To(HaveLen(1))
})

t.Run("error due to failing delete hook", func(t *testing.T) {
g := NewWithT(t)

ns, err := testEnv.CreateNamespace(context.TODO(), "reconcile-uninstall")
g.Expect(err).ToNot(HaveOccurred())
t.Cleanup(func() {
_ = testEnv.Delete(context.TODO(), ns)
})

rls := testutil.BuildRelease(&helmrelease.MockReleaseOptions{
Name: "reconcile-uninstall",
Namespace: ns.Name,
Version: 1,
Chart: testutil.BuildChart(testutil.ChartWithFailingHook()),
Status: helmrelease.StatusDeployed,
}, testutil.ReleaseWithFailingHook())

obj := &v2.HelmRelease{
ObjectMeta: metav1.ObjectMeta{
Name: "reconcile-uninstall",
Namespace: ns.Name,
DeletionTimestamp: &metav1.Time{Time: time.Now()},
},
Spec: v2.HelmReleaseSpec{
Uninstall: &v2.Uninstall{
KeepHistory: true,
Timeout: &metav1.Duration{Duration: time.Millisecond},
},
},
Status: v2.HelmReleaseStatus{
StorageNamespace: ns.Name,
History: v2.Snapshots{
release.ObservedToSnapshot(release.ObserveRelease(rls)),
},
},
}

r := &HelmReleaseReconciler{
Client: testEnv.Client,
GetClusterConfig: GetTestClusterConfig,
EventRecorder: record.NewFakeRecorder(32),
}

// Store the Helm release mock in the test namespace.
getter, err := r.buildRESTClientGetter(context.TODO(), obj)
g.Expect(err).ToNot(HaveOccurred())

cfg, err := action.NewConfigFactory(getter, action.WithStorage(helmdriver.SecretsDriverName, obj.Status.StorageNamespace))
g.Expect(err).ToNot(HaveOccurred())

store := helmstorage.Init(cfg.Driver)
g.Expect(store.Create(rls)).To(Succeed())

err = r.reconcileUninstall(context.TODO(), getter, obj)
g.Expect(err).To(HaveOccurred())

// Verify status of Helm release has not been updated.
g.Expect(obj.Status.StorageNamespace).ToNot(BeEmpty())

// Verify Helm release has not been uninstalled.
_, err = store.History(rls.Name)
g.Expect(err).ToNot(HaveOccurred())
})
}

func TestHelmReleaseReconciler_checkDependencies(t *testing.T) {
Expand Down
5 changes: 1 addition & 4 deletions internal/reconcile/uninstall.go
Original file line number Diff line number Diff line change
Expand Up @@ -138,10 +138,7 @@ func (r *Uninstall) Reconcile(ctx context.Context, req *Request) error {
// Handle any error.
if err != nil {
r.failure(req, logBuf, err)
if req.Object.Status.History.Latest().Digest == cur.Digest {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cwrau the tests are failing, please run make test before opening a PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I thought I didn't understand the code well enough 😅

I don't understand why one wouldn't want to return an error in case of an error, even in this test case. The error is timed out waiting for the condition, why not return an error to be retried?

Or rather, why is this working in this case, but not in the case of #1021?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, I shared my research about this in a separate comment below. I hope that makes this clear. We can still discuss any other doubts about it.

return err
}
return nil
return err
}

// Mark success.
Expand Down
8 changes: 8 additions & 0 deletions internal/reconcile/uninstall_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -61,6 +61,10 @@ func TestUninstall_Reconcile(t *testing.T) {
status func(releases []*helmrelease.Release) v2.HelmReleaseStatus
// wantErr is the error that is expected to be returned.
wantErr error
// wantErrString is the error string that is expected to be in the
// returned error. This is used for scenarios that return
// untyped/unwrapped error that can't be asserted for their value.
wantErrString string
// expectedConditions are the conditions that are expected to be set on
// the HelmRelease after running rollback.
expectConditions []metav1.Condition
Expand Down Expand Up @@ -150,6 +154,7 @@ func TestUninstall_Reconcile(t *testing.T) {
}
},
expectFailures: 1,
wantErrString: "timed out waiting",
},
{
name: "uninstall failure without storage update",
Expand Down Expand Up @@ -244,6 +249,7 @@ func TestUninstall_Reconcile(t *testing.T) {
}
},
expectFailures: 1,
wantErrString: "Failed to purge the release",
},
{
name: "uninstall without current",
Expand Down Expand Up @@ -482,6 +488,8 @@ func TestUninstall_Reconcile(t *testing.T) {
})
if tt.wantErr != nil {
g.Expect(errors.Is(got, tt.wantErr)).To(BeTrue())
} else if tt.wantErrString != "" {
g.Expect(got.Error()).To(ContainSubstring(tt.wantErrString))
} else {
g.Expect(got).ToNot(HaveOccurred())
}
Expand Down