integration-cleanup: Enable reaper delete #713

Merged
merged 1 commit into from
Jan 10, 2024

@darkowlzz (Contributor) commented Jan 10, 2024

The integration test resources cleanup program, reaper, previously ran in dry-run mode and only reported leftover resources that were more than a day old. Recently, once every few days, a GKE cluster has been provisioned in an unusable state. In the GKE web console, the cluster shows the following error:

All cluster resources were brought up, but: only 2 nodes out of 3 have
registered; cluster may be unhealthy.

Trying to connect to the cluster using Cloud Shell results in connection errors and the following warning:

WARNING: cluster flux-test-casual-oryx is not RUNNING. The kubernetes
API may or may not be available. Check the cluster status for more
information.

The cluster is not usable.
When this happens, Terraform waits for the cluster to become ready and times out. Because of the timeout, the cluster is not written to the Terraform state file, so it can't be deleted by running terraform destroy.

The GKE and individual node logs and monitoring pages don't show any other details about the issue.

In the past, these resources were manually deleted from the web console. Since this failure can happen at any time, enabling reaper cleanup with a retention period of 1 hour helps ensure such unusable clusters get deleted within an hour. Test clusters don't run for more than 20 minutes. Only resources with the tag ci=true will be deleted. Also, update the cron schedule so the cleanup runs every hour.
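The selection rule described above (tagged `ci=true` and older than the retention period) can be sketched roughly as follows. This is only an illustration in Python, not reaper's actual code: the `Resource` type, the `select_stale` and `reap` helpers, and the resource names other than `flux-test-casual-oryx` are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Resource:
    # Hypothetical stand-in for a cloud resource as a cleanup tool might see it.
    name: str
    created_at: datetime
    tags: dict


def select_stale(resources, now, retention=timedelta(hours=1), tag=("ci", "true")):
    """Return resources that carry the tag and are older than the retention period."""
    key, value = tag
    return [
        r for r in resources
        if r.tags.get(key) == value and now - r.created_at > retention
    ]


def reap(resources, now, delete, dry_run=False):
    """Delete stale resources, or only report them when dry_run is set."""
    stale = select_stale(resources, now)
    for r in stale:
        if dry_run:
            print(f"would delete {r.name}")
        else:
            delete(r)
    return [r.name for r in stale]


now = datetime(2024, 1, 10, 13, 0, tzinfo=timezone.utc)
resources = [
    # Stuck cluster left behind by a timed-out provision.
    Resource("flux-test-casual-oryx", now - timedelta(hours=3), {"ci": "true"}),
    # Test still in progress: younger than the retention period, so kept.
    Resource("flux-test-running", now - timedelta(minutes=10), {"ci": "true"}),
    # Old but not tagged ci=true, so never touched.
    Resource("untagged-old", now - timedelta(days=2), {}),
]

deleted = []
names = reap(resources, now, delete=deleted.append)
```

Because test clusters live for at most about 20 minutes, a 1 hour retention period leaves a margin that keeps in-progress runs out of the selection, and the tag check keeps the cleanup away from anything not created by CI.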

Failed test run with a provision timeout: https://github.com/fluxcd/image-reflector-controller/actions/runs/7470958065/job/20330456160 .
Example run deleting the cluster recently left over from an image-reflector-controller repository test run: https://github.com/fluxcd/pkg/actions/runs/7475015909/job/20342320391#step:11:29 .

The previous attempt to solve this problem in #712 doesn't work because the cluster never gets registered in the state file. With this change, the cleanup should be more robust, with multiple pieces in place to perform cleanup automatically.

The integration test resources cleanup program, reaper, previously ran
in dry-run mode and only reported leftover resources that were more
than a day old. Recently, once every few days, a GKE cluster has been
provisioned in an unusable state. In the GKE web console, the cluster
shows the following error:

```
All cluster resources were brought up, but: only 2 nodes out of 3 have
registered; cluster may be unhealthy.
```

Trying to connect to the cluster using Cloud Shell results in a
connection error:

```
WARNING: cluster flux-test-casual-oryx is not RUNNING. The kubernetes
API may or may not be available. Check the cluster status for more
information.
```

The cluster is not usable.
When this happens, Terraform waits for the cluster to become ready and
times out. Because of the timeout, the cluster is not written to the
Terraform state file, so it can't be deleted by running terraform destroy.

The GKE and individual node logs and monitoring pages don't show any
other details about the issue.

In the past, these resources were manually deleted from the web console.
Since this failure can happen at any time, enabling reaper cleanup with
a retention period of 1 hour helps ensure such unusable clusters get
deleted within an hour. Test clusters don't run for more than 20
minutes. Only resources with the tag `ci=true` will be deleted. Also,
update the cron schedule so the cleanup runs every hour.

Signed-off-by: Sunny <[email protected]>
@darkowlzz darkowlzz added the area/testing Testing related issues and pull requests label Jan 10, 2024
@darkowlzz darkowlzz requested a review from a team as a code owner January 10, 2024 12:45
@darkowlzz darkowlzz merged commit 8197e2d into main Jan 10, 2024
15 checks passed
@darkowlzz darkowlzz deleted the enable-reaper-delete branch January 10, 2024 13:20