oci/int: Add separate resource cleanup step #712

Merged 1 commit on Jan 8, 2024

@darkowlzz commented Jan 5, 2024

⚠️ Depends on fluxcd/test-infra#29

Introduce a destroy-only mode in the integration test runner to run terraform destroy for the respective cloud provider configurations. This can be used to destroy cloud resources without going through the whole provision-test process.

Add a new step to the GitHub Actions workflow that runs the test binary in destroy-only mode at the very end, irrespective of the result of the previous steps. Because the step uses the always() condition, it runs even when the CI job is cancelled, so the infrastructure is destroyed in that case as well.
Refer: https://docs.github.com/en/actions/learn-github-actions/expressions#always

This change doesn't use the Options from test-infra/tftestenv for the destroy-only flag as that has not been adopted by the current test setup yet.

This is added to solve a recent CI failure in which a failure in GCP caused cluster provisioning to take more than 30 minutes, which is the test timeout duration. After the timeout, the test binary was terminated and couldn't perform a graceful cleanup. To work around such scenarios, the cleanup can be run separately at the end with its own timeout so that it does not affect the test runtime.
Refer: https://github.com/fluxcd/pkg/actions/runs/7409263553/job/20159178143#step:13:338

Example test run: https://github.com/fluxcd/pkg/actions/runs/7423829019/job/20202121975#step:14:22
Another example run where cancelled job runs the cleanup: https://github.com/fluxcd/pkg/actions/runs/7424467088/job/20204287128#step:13:59
This still depends on the resources having been recorded in the Terraform state file. If the job is cancelled too early, before they are written to the state file, the cleanup won't be able to detect and delete them.

@darkowlzz added the area/testing label Jan 5, 2024
@darkowlzz requested review from stefanprodan and a team as code owners January 5, 2024 16:21
@darkowlzz force-pushed the oci-int-destroy-only branch 3 times, most recently from 0d94296 to 62c1416 January 8, 2024 13:54
Introduce a destroy-only mode in the test runner to run terraform
destroy for the respective cloud provider configurations. This can be
used to destroy cloud resources without going through the whole
provision-test process.

Add a new step in the GitHub Actions workflow to run the test binary in
destroy-only mode at the very end, irrespective of the result of the
previous steps. This ensures that the infrastructure is always
destroyed, even if the CI job is cancelled.

This is added to solve a recent CI failure in which a failure in GCP
caused cluster provisioning to take more than 30 minutes, which is the
test timeout duration. After the timeout, the test binary was
terminated and couldn't perform a graceful stop and cleanup. To work
around such scenarios, the cleanup can be run separately at the end with
its own timeout so that it does not affect the test runtime.

Signed-off-by: Sunny <[email protected]>