integration-cleanup: Enable reaper delete #713
Merged
The integration test resource cleanup program, reaper, previously ran in dry-run mode and only reported leftover resources that were more than a day old. Recently, once every few days, some GKE clusters have been provisioned in an unusable state. In the GKE web console, the cluster shows the following error:
Trying to connect to the cluster using Cloud Shell results in connection errors and the following warning:
The cluster is not usable.
When this happens, Terraform waits for the cluster to become ready and times out. Because of the timeout, the cluster is not written to the Terraform state file, so it can't be deleted by running terraform destroy.
The GKE and individual node logs and monitoring pages don't show any other details about the issue.
In the past, these resources were manually deleted from the web console. Since this failure can happen at any time, enabling the reaper cleanup tool with a retention period of 1 hour helps ensure that such unusable clusters get deleted within an hour. Test clusters don't run for more than 20 minutes. Only resources with the tag ci=true will get deleted. Also, update the cron schedule so that the cleanup runs every hour.
Failed test run with provision timeout: https://github.com/fluxcd/image-reflector-controller/actions/runs/7470958065/job/20330456160 .
Example run that deletes the recently leftover test cluster from the image-reflector-controller repository: https://github.com/fluxcd/pkg/actions/runs/7475015909/job/20342320391#step:11:29 .
The previous attempt to solve this problem in #712 doesn't work because the cluster never gets registered in the state file. With this change, the cleanup should be more robust, since there are now multiple pieces in place to perform cleanup automatically.
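For reviewers, the selection rule described above (tag ci=true, retention period of 1 hour) can be sketched roughly as follows. This is only an illustration, not the reaper implementation: the Resource type, its fields, and select_for_deletion are hypothetical names, and treating a resource exactly one hour old as expired is an assumption.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone


@dataclass
class Resource:
    """Hypothetical stand-in for a cloud resource; not a real reaper type."""
    name: str
    created_at: datetime
    tags: dict = field(default_factory=dict)


# Retention period from the description: leftovers older than 1 hour are removed.
RETENTION = timedelta(hours=1)


def select_for_deletion(resources, now, retention=RETENTION):
    """Return resources tagged ci=true that are at least `retention` old."""
    cutoff = now - retention
    return [
        r
        for r in resources
        if r.tags.get("ci") == "true" and r.created_at <= cutoff
    ]


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    resources = [
        # Leftover test cluster: tagged and past the retention period.
        Resource("stale-ci", now - timedelta(hours=3), {"ci": "true"}),
        # Test still in progress (tests run for at most about 20 minutes).
        Resource("running-ci", now - timedelta(minutes=15), {"ci": "true"}),
        # Old but untagged, so it must never be touched.
        Resource("untagged", now - timedelta(days=2), {}),
    ]
    for r in select_for_deletion(resources, now):
        print(f"would delete {r.name}")
```

Since test clusters don't run for more than 20 minutes, a 1 hour retention period leaves a margin before an in-progress test cluster could be picked up, and the tag check keeps untagged resources out of scope regardless of age.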