Use Scylla Manager cluster labels for cluster reconciliation #2156

Merged

@rzetelskik (Member) commented Oct 15, 2024

Description of your changes: Currently, when the manager controller fails to save the manager's cluster ID in the ScyllaCluster's status after registering the cluster with manager, the cluster is deleted and recreated. Because update conflicts are common, this often causes many unnecessary recreation attempts.
To make the reconciliation more robust, this PR changes that behaviour. Instead of relying on the ID stored in status, it uses labels from the manager state. A cluster is created with a label holding the owner's UID, which lets us maintain and recognize the cluster's identity without relying on the status of our API resources. In turn, a cluster is only deleted when its owner UID label does not match the UID of the current owner, in order to avoid name collisions.
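
The decision described above can be sketched roughly as follows. This is an illustration only, not the code from this PR: the label key `example.scylladb.com/owner-uid`, the `managerCluster` struct, and the `decide` helper are all hypothetical names made up for the example.

```go
package main

import "fmt"

// ownerUIDLabel is a hypothetical label key used only for this illustration.
const ownerUIDLabel = "example.scylladb.com/owner-uid"

// managerCluster is a simplified, assumed stand-in for a cluster as reported by Scylla Manager.
type managerCluster struct {
	ID     string
	Name   string
	Labels map[string]string
}

type decision string

const (
	decisionCreate decision = "create"
	decisionKeep   decision = "keep"
	decisionDelete decision = "delete"
)

// decide picks what to do with the cluster registered under our name, if any.
// Identity comes from the owner UID label, not from the ScyllaCluster status.
func decide(existing *managerCluster, ownerUID string) decision {
	if existing == nil {
		// Nothing is registered under this name yet.
		return decisionCreate
	}
	if existing.Labels[ownerUIDLabel] != ownerUID {
		// Same name, different (or missing) owner: a name collision, so delete it.
		return decisionDelete
	}
	// The cluster is ours, even if its ID never made it into status.
	return decisionKeep
}

func main() {
	c := &managerCluster{
		ID:     "id-1",
		Name:   "ns/name",
		Labels: map[string]string{ownerUIDLabel: "uid-a"},
	}
	fmt.Println(decide(nil, "uid-a"))
	fmt.Println(decide(c, "uid-a"))
	fmt.Println(decide(c, "uid-b"))
}
```

With this shape, a failed status update no longer changes the outcome: the next reconciliation still finds the matching label and keeps the cluster.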

The labels are also extended with a managed hash label to align the cluster update logic with changes recently introduced in #2142.
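
As a rough sketch of how a managed hash label can drive updates: hash the desired cluster definition, store the hash as a label, and update only when the stored hash differs. The label key, the `desiredCluster` fields, and the hashing scheme (SHA-512 over JSON) are assumptions for illustration; the actual approach in this PR and in #2142 may differ.

```go
package main

import (
	"crypto/sha512"
	"encoding/base64"
	"encoding/json"
	"fmt"
)

// managedHashLabel is a hypothetical label key used only for this illustration.
const managedHashLabel = "example.scylladb.com/managed-hash"

// desiredCluster is an assumed, simplified view of the fields we manage.
type desiredCluster struct {
	Name      string
	Host      string
	AuthToken string
}

// hashOf returns a stable hash of the desired definition.
// json.Marshal emits struct fields in declaration order, so the output is deterministic.
func hashOf(c desiredCluster) (string, error) {
	data, err := json.Marshal(c)
	if err != nil {
		return "", fmt.Errorf("can't marshal cluster: %w", err)
	}
	sum := sha512.Sum512(data)
	return base64.StdEncoding.EncodeToString(sum[:]), nil
}

// needsUpdate reports whether the labels of the registered cluster
// carry a hash different from the hash of the desired definition.
func needsUpdate(currentLabels map[string]string, desired desiredCluster) (bool, error) {
	h, err := hashOf(desired)
	if err != nil {
		return false, err
	}
	return currentLabels[managedHashLabel] != h, nil
}

func main() {
	desired := desiredCluster{Name: "ns/name", Host: "10.0.0.1", AuthToken: "token-1"}
	h, err := hashOf(desired)
	if err != nil {
		panic(err)
	}
	labels := map[string]string{managedHashLabel: h}

	upToDate, _ := needsUpdate(labels, desired)
	desired.AuthToken = "token-2"
	changed, _ := needsUpdate(labels, desired)
	fmt.Println(upToDate, changed)
}
```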

The logic for creating "actions" is modified to produce only one cluster-related action at a time and then requeue, so that any further actions are scheduled on the next iteration. The reasoning is to avoid errors in task actions when a cluster action is required first, e.g. when the auth token needs to be updated.
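
A minimal sketch of that ordering, with made-up types: `action`, `actionKind`, and `planActions` are hypothetical names for this example and do not mirror the controller's real types.

```go
package main

import "fmt"

type actionKind string

const (
	clusterUpdate actionKind = "cluster-update"
	taskUpdate    actionKind = "task-update"
)

// action is an assumed, simplified description of one call to Scylla Manager.
type action struct {
	Kind   actionKind
	Target string
}

// planActions returns the actions for this iteration and whether to requeue.
// A pending cluster action is returned alone; task actions wait for the next iteration.
func planActions(clusterAction *action, taskActions []action) ([]action, bool) {
	if clusterAction != nil {
		return []action{*clusterAction}, true
	}
	return taskActions, false
}

func main() {
	tasks := []action{{Kind: taskUpdate, Target: "repair-weekly"}}
	pending := &action{Kind: clusterUpdate, Target: "ns/name"}

	// First iteration: only the cluster action runs, and we requeue.
	first, requeue := planActions(pending, tasks)
	fmt.Println(len(first), first[0].Kind, requeue)

	// Next iteration: no cluster action is pending, so tasks are scheduled.
	second, requeue := planActions(nil, tasks)
	fmt.Println(len(second), second[0].Kind, requeue)
}
```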

Additionally, the manager state computed in each reconciliation loop is reduced to only one cluster, since cluster names in manager are unique and propagating additional clusters to the state is redundant.
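
For illustration, narrowing the state to one cluster could look like the sketch below. The `managerCluster` struct, the `clusterByName` helper, and the `namespace/name` naming scheme are assumptions for the example, not the PR's actual code.

```go
package main

import "fmt"

// managerCluster is an assumed, simplified stand-in for a cluster listed by Scylla Manager.
type managerCluster struct {
	ID   string
	Name string
}

// clusterByName returns the only cluster with the given name, if present.
// Names are unique in manager, so the first match is the only match.
func clusterByName(all []managerCluster, name string) (managerCluster, bool) {
	for _, c := range all {
		if c.Name == name {
			return c, true
		}
	}
	return managerCluster{}, false
}

func main() {
	all := []managerCluster{
		{ID: "id-1", Name: "ns-a/cluster"},
		{ID: "id-2", Name: "ns-b/cluster"},
	}
	c, ok := clusterByName(all, "ns-b/cluster")
	fmt.Println(c.ID, ok)
}
```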

Unit tests are also extended to cover these scenarios and unified for consistency.

Which issue is resolved by this Pull Request:
Resolves #1902

/kind bug
/priority important-soon
/cc

Contributor

@rzetelskik: GitHub didn't allow me to request PR reviews from the following users: rzetelskik.

Note that only scylladb members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

Description of your changes: wip

Which issue is resolved by this Pull Request:
Resolves #1902

/kind bug
/priority important-soon
/cc


@scylla-operator-bot added the do-not-merge/work-in-progress, kind/bug, priority/important-soon, and size/XXL labels Oct 15, 2024
@rzetelskik force-pushed the manager-cluster-deletion-fix branch 3 times, most recently from f28d149 to 5c455fd, October 15, 2024 10:08
@rzetelskik changed the title from "[WIP] Use clusters' OwnerUID labels to reconcile clusters registered with Scylla Manager" to "Use Scylla Manager cluster labels for cluster reconciliation" Oct 15, 2024
@scylla-operator-bot removed the do-not-merge/work-in-progress label Oct 15, 2024
@rzetelskik changed the title from "Use Scylla Manager cluster labels for cluster reconciliation" to "[WIP] Use Scylla Manager cluster labels for cluster reconciliation" Oct 15, 2024
@scylla-operator-bot added the do-not-merge/work-in-progress label Oct 15, 2024
@rzetelskik changed the title from "[WIP] Use Scylla Manager cluster labels for cluster reconciliation" to "Use Scylla Manager cluster labels for cluster reconciliation" Oct 15, 2024
@scylla-operator-bot removed the do-not-merge/work-in-progress label Oct 15, 2024
@rzetelskik (Member Author)

/cc zimnx tnozicka

pkg/controller/manager/sync.go (outdated, resolved)
pkg/controller/manager/sync_test.go (resolved)
@tnozicka (Member) left a comment

/approve

/assign @zimnx

pkg/controller/manager/status.go (outdated, resolved)
@scylla-operator-bot added the approved label Oct 17, 2024
@rzetelskik (Member Author)

@rzetelskik: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-gke-parallel-clusterip | cf03afe | link | true | /test e2e-gke-parallel-clusterip |

cluster provisioning failed
/retest

@rzetelskik (Member Author)

@rzetelskik: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-gke-parallel-clusterip | cf03afe | link | true | /test e2e-gke-parallel-clusterip |

tls test flake, can't possibly be related?
#2096 (comment)
/retest

@zimnx (Collaborator) left a comment

/lgtm

@scylla-operator-bot added the lgtm label Oct 21, 2024
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: rzetelskik, tnozicka, zimnx

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@rzetelskik (Member Author)

@rzetelskik: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-gke-parallel-clusterip | cf03afe | link | true | /test e2e-gke-parallel-clusterip |

/test images
/retest

Labels: approved, kind/bug, lgtm, priority/important-soon, size/XXL
Development

Successfully merging this pull request may close these issues.

Manager controller recreates clusters when manager cluster ID is missing from status