integration-cleanup: Enable reaper delete #713

darkowlzz · 2024-01-10T12:45:00Z

The integration test resource cleanup program, reaper, previously ran only in dry-run mode, reporting leftover resources that were more than a day old. Recently, once every few days, some GKE clusters have been provisioned in an unusable state. In the GKE web console, the cluster shows the following error:

All cluster resources were brought up, but: only 2 nodes out of 3 have
registered; cluster may be unhealthy.

Trying to connect to the cluster using Cloud Shell results in connection errors and the following warning:

WARNING: cluster flux-test-casual-oryx is not RUNNING. The kubernetes
API may or may not be available. Check the cluster status for more
information.

The cluster is not usable.
When this happens, terraform waits for the cluster to be ready and times out. Due to the timeout, the cluster is not written to the terraform state file, so it can't be deleted by running terraform destroy.

The GKE and individual node logs and monitoring pages don't show any other details about the issue.

In the past, these resources were manually deleted from the web console. Since this failure can happen at any time, enabling the reaper cleanup tool with a retention period of 1 hour helps ensure that such unusable clusters get deleted within an hour. Test clusters don't run for more than 20 minutes. Only resources with the tag ci=true will get deleted. Also, update the cron schedule so that the cleanup runs every hour.
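
To illustrate the selection rule described above (tag ci=true and older than the 1 hour retention period), here is a minimal sketch in Python. It is not reaper's actual implementation: the `Resource` type, `should_delete`, and `select_for_deletion` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone

# Values taken from the description above; everything else is assumed.
RETENTION = timedelta(hours=1)
REQUIRED_TAG = ("ci", "true")


@dataclass
class Resource:
    """Hypothetical stand-in for a cloud resource found by the cleanup job."""
    name: str
    created_at: datetime
    tags: dict = field(default_factory=dict)


def should_delete(resource, now, retention=RETENTION):
    """Return True only for CI-tagged resources older than the retention period."""
    key, value = REQUIRED_TAG
    tagged = resource.tags.get(key) == value
    expired = now - resource.created_at > retention
    return tagged and expired


def select_for_deletion(resources, now, retention=RETENTION):
    """Return the names of the resources that the rule would delete."""
    return [r.name for r in resources if should_delete(r, now, retention)]


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    found = [
        Resource("stuck-cluster", now - timedelta(hours=2), {"ci": "true"}),
        Resource("running-test", now - timedelta(minutes=15), {"ci": "true"}),
        Resource("untagged", now - timedelta(days=3)),
    ]
    print(select_for_deletion(found, now))
```

With a 1 hour retention and test clusters that finish within 20 minutes, a cluster belonging to a running test is never old enough to be selected, and untagged resources are skipped regardless of age.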

Failed test run with a provision timeout: https://github.com/fluxcd/image-reflector-controller/actions/runs/7470958065/job/20330456160 .
Example run deleting the cluster recently left over by the image-reflector-controller repository tests: https://github.com/fluxcd/pkg/actions/runs/7475015909/job/20342320391#step:11:29 .

The previous attempt to solve this problem in #712 doesn't work because the cluster never gets registered in the state file. With this change, the cleanup should be more robust, with multiple pieces in place to perform cleanup automatically.

Signed-off-by: Sunny <[email protected]>

darkowlzz added the area/testing Testing related issues and pull requests label Jan 10, 2024

darkowlzz requested a review from a team as a code owner January 10, 2024 12:45

stefanprodan approved these changes Jan 10, 2024

darkowlzz merged commit 8197e2d into main Jan 10, 2024
15 checks passed

darkowlzz deleted the enable-reaper-delete branch January 10, 2024 13:20
