controller: export wallclock lag metrics also for storage collections #30568

Open: wants to merge 3 commits into base: main


@teskje teskje commented Nov 19, 2024

This PR rewires the existing mz_dataflow_wallclock_lag_seconds metric so that it also includes storage collections. To this end, a new ControllerMetrics type is introduced to define metrics that are exported by both the compute and the storage controller, and the wallclock lag metrics are moved there. The ControllerMetrics type is then passed to both controllers, so they can export wallclock lag metrics for their respective collections.
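
The sharing pattern described above can be sketched roughly as follows. This is a minimal, std-only illustration, not the code from this PR: the `ControllerMetrics` struct here is a hypothetical stand-in that stores gauge values in a shared map, and the `set_wallclock_lag` and `series_count` methods, as well as the simplified controller structs, are assumptions made for the example.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Label values of one series: (instance_id, replica_id, collection_id).
type Labels = (String, String, String);

/// Hypothetical stand-in for the shared metrics type. Clones share one map,
/// so both controllers write into the same metric.
#[derive(Clone, Default)]
pub struct ControllerMetrics {
    wallclock_lag_seconds: Arc<Mutex<HashMap<Labels, f64>>>,
}

impl ControllerMetrics {
    /// Record the latest lag for the series identified by the given labels.
    pub fn set_wallclock_lag(
        &self,
        instance_id: &str,
        replica_id: &str,
        collection_id: &str,
        lag_seconds: f64,
    ) {
        let key = (
            instance_id.to_string(),
            replica_id.to_string(),
            collection_id.to_string(),
        );
        self.wallclock_lag_seconds
            .lock()
            .unwrap()
            .insert(key, lag_seconds);
    }

    /// Number of distinct series recorded so far.
    pub fn series_count(&self) -> usize {
        self.wallclock_lag_seconds.lock().unwrap().len()
    }
}

/// Simplified controllers (assumed shape) that each hold a handle to the
/// shared metrics.
pub struct ComputeController {
    pub metrics: ControllerMetrics,
}

pub struct StorageController {
    pub metrics: ControllerMetrics,
}
```

Constructing one `ControllerMetrics` value and handing a clone to each controller means both report into the same metric, which is the property the description relies on.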

Note that in contrast to compute collections, the wallclock lag for storage collections is not per replica (as we are comparing with the global persist frontier here), and in some cases not even per cluster (as not all storage collections are associated with clusters). As a result, the replica_id label is always empty for storage collections, and the instance_id label is sometimes empty.
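
To make the label behavior concrete, here is a small sketch of how the `instance_id` and `replica_id` label values could be derived for each kind of collection. The `CollectionKind` enum and the `wallclock_lag_labels` function are hypothetical names introduced for illustration; the actual code in this PR may structure this differently.

```rust
/// Hypothetical description of the collection a series belongs to.
pub enum CollectionKind<'a> {
    /// Compute collections are tracked per cluster and per replica.
    Compute {
        instance_id: &'a str,
        replica_id: &'a str,
    },
    /// Storage collections have no replica, and only some have a cluster.
    Storage { instance_id: Option<&'a str> },
}

/// Returns the (instance_id, replica_id) label values for a collection.
pub fn wallclock_lag_labels(kind: &CollectionKind<'_>) -> (String, String) {
    match kind {
        CollectionKind::Compute {
            instance_id,
            replica_id,
        } => ((*instance_id).to_string(), (*replica_id).to_string()),
        // The replica label is always empty for storage collections; the
        // instance label is empty when no cluster is associated.
        CollectionKind::Storage { instance_id } => {
            (instance_id.unwrap_or("").to_string(), String::new())
        }
    }
}
```

Anyone aggregating the metric by `replica_id` should therefore expect storage series to land in the empty-label group.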

Motivation

  • This PR adds a known-desirable feature.

Part of https://github.com/MaterializeInc/database-issues/issues/8235

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered. (trigger-ci for additional test/nightly runs)
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • If this PR includes major user-facing behavior changes, I have pinged the relevant PM to schedule a changelog post.

@teskje teskje force-pushed the storage-wallclock-metrics branch 2 times, most recently from 36834e8 to 6ce30eb Compare November 21, 2024 15:09
@teskje teskje marked this pull request as ready for review November 21, 2024 17:04
@teskje teskje requested a review from a team as a code owner November 21, 2024 17:04

shepherdlybot bot commented Nov 21, 2024

Risk Score: 81 / 100, Bug Hotspots: 3, Resilience Coverage: 0%

Mitigations

Completing required mitigations increases Resilience Coverage.

  • (Required) Code Review
  • (Required) Feature Flag
  • (Required) Integration Test
  • (Required) Observability
  • (Required) QA Review
  • (Required) Run Nightly Tests
  • Unit Test

Risk Summary:

The pull request carries a high risk score of 81, driven by the predictors "Sum Bug Reports Of Files" and "Delta of Executable Lines." Historically, PRs with these predictors are 115% more likely to cause a bug than the repository baseline. Despite the decreasing trends in both observed and predicted bug rates, the presence of three file hotspots further elevates the risk.

Note: The risk score is not based on semantic analysis but on historical predictors of bug occurrence in the repository. The attributes above were deemed the strongest predictors based on that history. Predictors and the score may change as the PR evolves in code, time, and review activity.

Bug Hotspots:

File Percentile
../src/lib.rs 95
../src/history.rs 91
../controller/instance.rs 97

In preparation for having the storage controller export wallclock lag
metrics too, this commit factors out the common infrastructure from
`mz-compute-client` and moves it into `mz-cluster-client`.

This commit makes minor changes to the code structure around
`CollectionState` initialization in the storage controller. This removes
some redundancy, but more importantly makes it easier to attach
`WallclockLagMetrics` to the `CollectionState` in the next commit.

This commit wires the `ControllerMetrics` through to the storage
controller, creates `WallclockLagMetrics` objects in the
`CollectionState`s/`ExportState`s of all storage collections, and uses
those to update the wallclock lag metrics during every maintenance call.
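
As a rough sketch of what a per-maintenance update could look like, the example below recomputes a lag value for every collection on each call. It assumes that wallclock lag is the difference between the current wallclock time and a collection's write frontier, both expressed as milliseconds since the Unix epoch; the `wallclock_lag` and `maintain` functions and the simplified `CollectionState` fields are hypothetical and do not mirror the real types.

```rust
use std::collections::HashMap;
use std::time::Duration;

/// Lag of a frontier behind the wallclock. Saturates at zero if the frontier
/// is ahead of the wallclock reading (assumed behavior for this sketch).
pub fn wallclock_lag(now_ms: u64, write_frontier_ms: u64) -> Duration {
    Duration::from_millis(now_ms.saturating_sub(write_frontier_ms))
}

/// Simplified per-collection state holding the frontier and the last
/// exported gauge value.
pub struct CollectionState {
    pub write_frontier_ms: u64,
    pub wallclock_lag_seconds: f64,
}

/// One maintenance pass: refresh the lag gauge of every collection.
pub fn maintain(collections: &mut HashMap<String, CollectionState>, now_ms: u64) {
    for state in collections.values_mut() {
        state.wallclock_lag_seconds =
            wallclock_lag(now_ms, state.write_frontier_ms).as_secs_f64();
    }
}
```

Keeping the gauge next to the collection state means the metric is refreshed in the same pass that already visits each collection.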