Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: remove digest check to never ignore helm uninstall errors #1024

Merged
merged 3 commits into from
Sep 20, 2024

Conversation

cwrau
Copy link
Contributor

@cwrau cwrau commented Jul 11, 2024

Closes #1021

@cwrau cwrau changed the title fix: remove digest check to never ignore helm errors Closes #1021 fix: remove digest check to never ignore helm errors Jul 11, 2024
@cwrau cwrau force-pushed the fix/dont-ignore-helm-errors branch from b830e08 to dbf6ede Compare August 7, 2024 12:13
@@ -138,10 +138,7 @@ func (r *Uninstall) Reconcile(ctx context.Context, req *Request) error {
// Handle any error.
if err != nil {
r.failure(req, logBuf, err)
if req.Object.Status.History.Latest().Digest == cur.Digest {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cwrau the tests are failing, please run make test before opening a PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I thought I didn't understand the code well enough 😅

I don't understand why one wouldn't want to return an error in case of an error, even in this test case. The error is timed out waiting for the condition, why not return an error to be retried?

Or rather, why is this working in this case, but not in the case of #1021?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, I shared my research about this in a separate comment below. I hope that makes this clear. We can still discuss any other doubts about it.

@darkowlzz
Copy link
Contributor

Hi, I'll help take this forward so that we can include this in the upcoming release.

As mentioned in #1021 (comment), the change seems harmless and we want to go ahead with it.

I did some research and testing to understand the issue and discussed the consequences of this change and how to handle them privately with other maintainers. I would like to share the details below.

A major concern about this change is that this has the potential to result in HelmReleases to get stuck in an uninstall retry loop if the cause of the uninstall failure never gets resolved on its own. To understand this better and also to find out how the Helm CLI behaves in such scenarios, I did some manual testing. For a simple chart to test with, I added a delete hook in the podinfo chart, see the diff. With this, I was able to observe the difference in behavior of pre and post-delete hooks.

In case of post-delete hooks, helm first deletes the resources and the helm storage and then runs the post-delete hook. If the hook fails, it results in the following error:

Error: uninstallation completed with 1 error(s): 1 error occurred:
	* pod mypodinfo-uninstall-fail-qmyll failed

As evident from the error, the uninstallation completed, and the release has been deleted completely. Uninstall can't be re-run anymore. Depending on the delete policy of the hook, if the hook resources are not deleted automatically, they have to be deleted manually.

In case of pre-delete hooks, helm first runs the hooks. If a hook fails, it blocks the whole uninstallation with the following error

Error: 1 error occurred:
	* pod mypodinfo-uninstall-fail-cre44 failed

The release enters uninstalling state, but is stuck due to the failing hook. Uninstallation can be re-run to re-run the hook. If the hook doesn't succeed, helm uninstall --no-hooks can be run to bypass the hook and uninstall the resources. Another way to delete the release is to delete the helm storage secret from the release namespace. But this leaves behind all the release resources and they have to be manually deleted. --no-hooks is nicer if that works for the scenario.

I tried the same scenarios with helm-controller along with the patch proposed in this change. It behaves the same as the CLI now. Uninstall error is not ignored, HelmRelease object remains until uninstall succeeds. For post-delete hook failure, uninstall fails initially, but because the helm storage gets deleted, the uninstall retry cannot find the release and results in a successful uninstall, deleting the HelmRelease object.

An equivalent of --no-hooks can be achieved in HelmRelease by setting .spec.uninstall.disableHooks field to true, even while it's marked for deletion. In the next retry reconciliation to uninstall, the disableHooks is taken into consideration and helm uninstall is run without the hooks.

Based on the above observations, I believe we understand the consequences of the change and what to tell the users to do if uninstall gets into a retry loop. I'll add a docs section about it under the "Working with HelmRelease" section in our spec docs.

Regarding the code change, it made me wonder about the UninstallRemediation reconciler. If that uninstall code also needs to change. After analyzing how and where it's used, I concluded that Uninstall and UninstallRemediation are used differently and what we are changing is more relevant to Uninstall.
Uninstall is called outside of the AtomicRelease reconciliation. Whenever an object is marked to be deleted or the release target configuration has changed, Uninstall is called directly and the result of Uninstall is taken as it is by the caller, which is mostly the main reconciliation loop. Any error returned by Uninstall is critical to determine the result of it.
In case of UninstallRemediation, it is always called from the atomic reconciliation, which is not strict about the result of the action reconciler. After running an action reconciler, the atomic reconciler separately determines the state of the release to make a decision on the next action. It checks for any failure in the release due to the previous action by analyzing the state of the release in the storage. It doesn't need to depend on the returned error from the UninstallRemediation.
I believe this justifies not changing UninstallRemediation.

I have also added a controller test for uninstall failure due to a failing delete hook, in addition to fixing the Uninstall reconciler unit tests.

I'll push the changes as separate commits. We can discuss further if there's anything incorrect in my analysis and reasoning above.

@darkowlzz darkowlzz force-pushed the fix/dont-ignore-helm-errors branch from dbf6ede to 7fee60e Compare September 19, 2024 00:03
@cwrau
Copy link
Contributor Author

cwrau commented Sep 19, 2024

Hi, I'll help take this forward so that we can include this in the upcoming release.

Amazing, I'm currently struggling with the tests 🙏

Copy link
Member

@matheuscscp matheuscscp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Uninstall is called outside of the AtomicRelease reconciliation. Whenever an object is marked to be deleted or the release target configuration has changed, Uninstall is called directly and the result of Uninstall is taken as it is by the caller, which is mostly the main reconciliation loop. Any error returned by Uninstall is critical to determine the result of it.
In case of UninstallRemediation, it is always called from the atomic reconciliation, which is not strict about the result of the action reconciler. After running an action reconciler, the atomic reconciler separately determines the state of the release to make a decision on the next action. It checks for any failure in the release due to the previous action by analyzing the state of the release in the storage. It doesn't need to depend on the returned error from the UninstallRemediation.

@darkowlzz I can understand (and am OK with) everything before the paragraph above. If you have time, could you please elaborate this part a bit more? Thanks! 🙏

In any case, LGTM!

Thanks @cwrau and @darkowlzz for working on this!

Copy link
Member

@stefanprodan stefanprodan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Thanks @darkowlzz and @cwrau

@stefanprodan stefanprodan merged commit a7c83f6 into fluxcd:main Sep 20, 2024
6 checks passed
@cwrau
Copy link
Contributor Author

cwrau commented Sep 23, 2024

Amazing! Thanks for the help @darkowlzz!

@cwrau cwrau deleted the fix/dont-ignore-helm-errors branch September 23, 2024 09:19
@darkowlzz
Copy link
Contributor

If you have time, could you please elaborate this part a bit more?

@matheuscscp HelmRelease reconciler has multiple sub-reconcilers for different helm actions that are managed by the atomic release reconciler. In

func (r *AtomicRelease) Reconcile(ctx context.Context, req *Request) error {
see how the atomic reconciler determines about the current state of the release and decides which action to run. UninstallRemediation is one of the actions that is run by atomic reconciler. Uninstall is not run from the atomic reconciler. Both of these are used in different ways. If the release is marked to be deleted, the atomic reconciliation is not called, only Uninstall is called directly from the main reconciliation loop, refer
if err = r.reconcileUninstall(ctx, getter, obj); err != nil && !errors.Is(err, intreconcile.ErrNoLatest) {
. I hope the links would make it clear. I would repeat the same paragraph to explain it further.

@darkowlzz darkowlzz changed the title fix: remove digest check to never ignore helm errors fix: remove digest check to never ignore helm uninstall errors Sep 24, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

HelmRelease deletion not waiting for helm uninstall
4 participants