Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Set managed hash labels in manager tasks and only update tasks when managed hash changes #2142

Merged

Conversation

rzetelskik
Copy link
Member

@rzetelskik rzetelskik commented Oct 3, 2024

Description of your changes: Currently, Scylla Manager integration controller decides to update the tasks defined in ScyllaCluster's spec by checking for deep equality between the task definition in ScyllaCluster's spec and the task status obtained from the Manager's state.
Due to some discrepancies between how we and Scylla Manager keep tasks' properties, and the fact that some values would be converted, but not converted back when reading the status, this would often lead to task update hot loops, where a task update requests would be sent despite no changes in spec.

Such scenario can be observed by specifying a repair task with e.g. smallTableThreshold set:

  repairs:
  - failFast: true
    intensity: "1"
    name: repair
    numRetries: 1
    parallel: 1
    smallTableThreshold: 1GiB

The threshold would be converted to bytes, so that the corresponding status would be:

  repairs:
  - cron: '{"spec":"","start_date":"0001-01-01T00:00:00Z"}'
    failFast: true
    id: d0550c36-f799-4ac4-8b01-aab7aaf042bf
    intensity: "1"
    interval: ""
    name: repair
    numRetries: 0
    parallel: 1
    smallTableThreshold: "1073741824"
    startDate: "2024-10-02T15:39:22.907Z"
    timezone: ""

The value would not be converted back, causing the deep equality test to always return false, and so an update would be triggered on every comparison, causing a hot loop.

There were also cases reported by users in which we suspected the constant updates were causing a high cpu load: scylladb/scylla-manager#3920.

This PR addresses this issue by making use of labels recently added to the manager's task API. Similarily to how we handle managed k8s resources, the scylla-manager controller now computes a hash from the task's spec and sets a managed hash label. Therefore, an update request is only sent when a task spec changes in ScyllaCluster and the managed hash no longer matches the one set in scylla manager's state (and correspondingly, ScyllaCluster's status).
Changes in the manager's state which are not reflected in our API (e.g. when a user manually updates a task with sctool) will not trigger an update, but they will be overwritten when we eventually reconcile.

This PR does not extend ScyllaCluster's API with the labels field for Scylla Manager's tasks, at this point we're only using them internally and don't expose them in spec to the users of our API. If there ever is a need to do add it, the managed hash would take precedence over colliding labels.

As this PR further extends the logic invoked on every task sync, some parts of the code are deduplicated to reduce the bug-prone surface. Consequently we're loosing some of the duplicated unit tests and unnecessary wrappers/interfaces.

A few new unit tests are added to test that the changes in task's state do not trigger an update unless the managed hash doesn't match.

These changes also uncovered a bug being hidden by the hotloop.

Requires:

Which issue is resolved by this Pull Request:
Resolves #1827

/kind bug
/priority important-soon
/cc

Copy link
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

Copy link
Contributor

@rzetelskik: GitHub didn't allow me to request PR reviews from the following users: rzetelskik.

Note that only scylladb members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

Description of your changes: wip

Which issue is resolved by this Pull Request:
Resolves #1827

/kind bug
/priority important-soon
/cc

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@scylla-operator-bot scylla-operator-bot bot added kind/bug Categorizes issue or PR as related to a bug. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Oct 3, 2024
@rzetelskik rzetelskik changed the title [WIP] Set managed hash labels in manager tasks and only update tasks when managed hash changes Set managed hash labels in manager tasks and only update tasks when managed hash changes Oct 3, 2024
@rzetelskik rzetelskik marked this pull request as ready for review October 3, 2024 11:00
@scylla-operator-bot scylla-operator-bot bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 3, 2024
@rzetelskik
Copy link
Member Author

@rzetelskik: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-gke-parallel-clusterip 77d6328 link true /test e2e-gke-parallel-clusterip
ci/prow/e2e-gke-parallel 77d6328 link true /test e2e-gke-parallel
Full PR test history. Your PR dashboard.

#2061 (comment)
known flake with colliding snapshot tags, unrelated
/retest

@rzetelskik
Copy link
Member Author

@rzetelskik: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-gke-parallel 77d6328 link true /test e2e-gke-parallel
ci/prow/e2e-gke-parallel-clusterip 77d6328 link true /test e2e-gke-parallel-clusterip
Full PR test history. Your PR dashboard.

#1028 (comment)
known shardawareness flake, unrelated
/retest

@rzetelskik
Copy link
Member Author

@rzetelskik: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-gke-parallel-clusterip 77d6328 link true /test e2e-gke-parallel-clusterip
Full PR test history. Your PR dashboard.

Timed out waiting for tasks to sync, I'll need to look into it.
/hold

@scylla-operator-bot scylla-operator-bot bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 3, 2024
@rzetelskik rzetelskik changed the title Set managed hash labels in manager tasks and only update tasks when managed hash changes [WIP] Set managed hash labels in manager tasks and only update tasks when managed hash changes Oct 3, 2024
@scylla-operator-bot scylla-operator-bot bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 3, 2024
@rzetelskik rzetelskik force-pushed the manager-task-update-fix branch 4 times, most recently from 6655698 to c62eb49 Compare October 7, 2024 09:20
@rzetelskik
Copy link
Member Author

@rzetelskik: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
Test name Commit Details Required Rerun command
ci/prow/e2e-gke-parallel-clusterip 77d6328 link true /test e2e-gke-parallel-clusterip
Full PR test history. Your PR dashboard.

Timed out waiting for tasks to sync, I'll need to look into it. /hold

Being fixed in #2143.
No longer WIP, still waiting for the prerequisite to merge.

@rzetelskik rzetelskik changed the title [WIP] Set managed hash labels in manager tasks and only update tasks when managed hash changes Set managed hash labels in manager tasks and only update tasks when managed hash changes Oct 8, 2024
@scylla-operator-bot scylla-operator-bot bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 8, 2024
@rzetelskik
Copy link
Member Author

#2143 merged
rebased
/hold cancel

@scylla-operator-bot scylla-operator-bot bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 9, 2024
@rzetelskik
Copy link
Member Author

@rzetelskik: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-gke-parallel-clusterip e6eaa65 link true /test e2e-gke-parallel-clusterip
Full PR test history. Your PR dashboard.

#2096 (comment)
known flake, unrelated
/retest

@rzetelskik
Copy link
Member Author

@rzetelskik: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-gke-parallel-clusterip e6eaa65 link true /test e2e-gke-parallel-clusterip
Full PR test history. Your PR dashboard.

/test images
/retest

Copy link
Collaborator

@zimnx zimnx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm
/assign tnozicka

@scylla-operator-bot scylla-operator-bot bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 10, 2024
@rzetelskik
Copy link
Member Author

@rzetelskik: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-gke-parallel-clusterip e6eaa65 link true /test e2e-gke-parallel-clusterip
Full PR test history. Your PR dashboard.

suite timed out
/retest

@tnozicka
Copy link
Member

/lgtm
/approve

@rzetelskik rzetelskik changed the title Set managed hash labels in manager tasks and only update tasks when managed hash changes [WIP] Set managed hash labels in manager tasks and only update tasks when managed hash changes Oct 14, 2024
@scylla-operator-bot scylla-operator-bot bot added lgtm Indicates that a PR is ready to be merged. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. labels Oct 14, 2024
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: rzetelskik, tnozicka, zimnx

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@rzetelskik
Copy link
Member Author

/hold
I want to see if we can move away from converting the types of tasks in manager state and make it less confusing (it would also let me handle tasks and clusters the same way in the other PR), bear with me

@scylla-operator-bot scylla-operator-bot bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 14, 2024
@rzetelskik
Copy link
Member Author

/hold I want to see if we can move away from converting the types of tasks in manager state and make it less confusing (it would also let me handle tasks and clusters the same way in the other PR), bear with me

/hold cancel
since the PR got a lgtm already I can also do this separately if I find it to be the case

@scylla-operator-bot scylla-operator-bot bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 14, 2024
@rzetelskik rzetelskik changed the title [WIP] Set managed hash labels in manager tasks and only update tasks when managed hash changes Set managed hash labels in manager tasks and only update tasks when managed hash changes Oct 14, 2024
@scylla-operator-bot scylla-operator-bot bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 14, 2024
@scylla-operator-bot scylla-operator-bot bot merged commit 04a5ae3 into scylladb:master Oct 14, 2024
13 checks passed
@rzetelskik rzetelskik deleted the manager-task-update-fix branch October 14, 2024 12:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. kind/bug Categorizes issue or PR as related to a bug. lgtm Indicates that a PR is ready to be merged. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Scylla Manager controller will update tasks despite no changes in spec
3 participants