fix(discovery): configure sharding every time MetricsHandler.Run runs #2478

wallee94 · 2024-08-16T17:14:01Z

What this PR does / why we need it:

I see the following issue in a ksm deployment with custom CRD config enabled. Whenever a new CRD event is added to the cache, ksm stops watching and metrics stop reflecting new changes.

In the graph, ksm recovers after updating the Statefulset with any change. The metrics are rate(kube_state_metrics_watch_total) and the counter kube_state_metrics_custom_resource_state_add_events_total without rate.

Looking into the code, the problem seems to be the validation shardingUnchanged in AddFunc (here).

Without a CRD config, MetricsHandler.Run runs only once, and the vars m.curShard and m.curTotalShards initially hold their zero values, which makes shardingUnchanged = false (here).

However, if a CRD config is present, discovery runs MetricsHandler.Run every time a CRD event is detected (here). If the Statefulset number of replicas/shards didn't change, the new CRD event will cancel the old metrics handler, but won't initiate a new one because shardingUnchanged = true in AddFunc.
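The failure mode described above can be reproduced in a small, self-contained sketch. The `handler` type and its `run` and `onAdd` methods are hypothetical stand-ins, not the actual kube-state-metrics code; the sketch only assumes that each Run stops the previous writers and that the informer's initial Add event goes through a guard comparing the stored shard values with the observed ones.

```go
package main

import "fmt"

// handler is a simplified, hypothetical stand-in for MetricsHandler.
type handler struct {
	curShard       int32
	curTotalShards int
	running        bool // whether metrics writers are currently active
}

// run mimics one invocation of Run: the previous writers are stopped, then
// the informer delivers its initial Add event for the StatefulSet.
func (h *handler) run(shard int32, totalShards int) {
	h.running = false // the old context is cancelled
	h.onAdd(shard, totalShards)
}

// onAdd mimics AddFunc with the shardingUnchanged guard in place.
func (h *handler) onAdd(shard int32, totalShards int) {
	shardingUnchanged := h.curShard == shard && h.curTotalShards == totalShards
	if shardingUnchanged {
		return // nothing is restarted
	}
	h.curShard, h.curTotalShards = shard, totalShards
	h.running = true
}

func main() {
	h := &handler{}

	// First Run: the stored values are still zero, so the guard lets it through.
	h.run(0, 2)
	fmt.Println("after first Run:", h.running)

	// Second Run, triggered by a CRD event while the replica count is unchanged.
	h.run(0, 2)
	fmt.Println("after second Run:", h.running)
}
```

In this sketch the second call leaves the handler stopped, which matches the symptom of metrics no longer updating until the Statefulset changes.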

This change removes the check shardingUnchanged in the AddFunc event handler. I don't think it's necessary because, in most cases, it's only called when the informer is synced at the end of MetricsHandler.Run.

This change updates CRDiscoverer.PollForCacheUpdates to rebuild the metrics writers in the already running metrics handler, instead of running a new one every time a CRD event occurs.

How does this change affect the cardinality of KSM:

No change in cardinality.

Which issue(s) this PR fixes:

Fixes #2372

k8s-ci-robot · 2024-08-16T17:14:10Z

Welcome @wallee94!

It looks like this is your first PR to kubernetes/kube-state-metrics 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/kube-state-metrics has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

logicalhan · 2024-08-22T16:50:49Z

/assign @CatherineF-dev
/triage accepted

CatherineF-dev · 2024-08-24T01:01:54Z

Hello, this code has been around for five years. Why is it only now experiencing this issue? Could there be another underlying problem? If we can figure out how to call Run only once inside https://github.com/kubernetes/kube-state-metrics/pull/1851/files, that would be the fix.

			m.mtx.RLock()
			shardingUnchanged := m.curShard == shard && m.curTotalShards == totalShards
			m.mtx.RUnlock()

			if shardingUnchanged {
				return
			}

wallee94 · 2024-08-24T01:32:59Z

Sorry, I mentioned it in a thread in the kube-state-metrics Slack channel and forgot to put it here.

The bug isn't exactly in metrics_handler if Run runs only once, which is how it works without CR. The issue is that discovery started running it multiple times in this PR if there's a CR config.

Removing the validation in AddFunc made sense to me to reconfigure every time the index informer is initially synced. I don't think it should depend on the previous m.curShard and m.curTotalShards values of earlier runs.

CatherineF-dev · 2024-08-24T01:37:22Z

> The issue is that discovery started running it multiple times in this PR if there's a CR config.

Thanks for spotting this! I'm wondering whether we should run it only once.

wallee94 · 2024-08-24T01:58:26Z

That's a good point, I can look into that. I think m.Run could probably be just m.ConfigureSharding, but discovery doesn't have access to the Statefulset or k8s client to calculate shard and totalShards.

On the other hand, if Run runs only once, the validation in this change is usually false because m.curShard and m.curTotalShards are initially 0.

wallee94 · 2024-08-27T02:03:54Z

I've made some changes to use m.ConfigureSharding in discovery and to run m.Run only once. I'm testing it on a test cluster and so far it's working fine. I'll let it run for a few more hours and push it tomorrow if everything looks good.

wallee94 · 2024-08-27T21:20:49Z

@CatherineF-dev I added new changes to run m.Run only once when the pod starts, and to reconfigure sharding in discovery instead of recreating the IndexInformer. I also removed some of the contexts in PollForCacheUpdates because m.ConfigureSharding already cancels the previous context.
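The polling side of this approach could look roughly like the sketch below. The `discoverer` type, its `updated` flag, and the `pollOnce` helper are hypothetical names chosen for illustration and are not the real CRDiscoverer API. The sketch assumes the poller only asks the long-lived handler to reconfigure when the cache actually changed.

```go
package main

import (
	"fmt"
	"sync"
)

// discoverer is a hypothetical, stripped-down stand-in for CRDiscoverer.
type discoverer struct {
	mtx     sync.Mutex
	updated bool // set when a CRD event changed the discovery cache
}

// markUpdated records that the discovery cache changed.
func (d *discoverer) markUpdated() {
	d.mtx.Lock()
	defer d.mtx.Unlock()
	d.updated = true
}

// pollOnce runs a single polling iteration. If the cache changed since the
// previous iteration, it clears the flag and calls reconfigure on the handler
// that is already running. It reports whether reconfigure was called.
func (d *discoverer) pollOnce(reconfigure func()) bool {
	d.mtx.Lock()
	changed := d.updated
	d.updated = false
	d.mtx.Unlock()

	if changed {
		reconfigure()
	}
	return changed
}

func main() {
	d := &discoverer{}
	reconfigs := 0
	reconfigure := func() { reconfigs++ }

	d.pollOnce(reconfigure) // no CRD event yet
	d.markUpdated()         // a CRD is added to the cluster
	d.pollOnce(reconfigure) // the handler is reconfigured in place
	d.pollOnce(reconfigure) // nothing new since the last poll

	fmt.Println("reconfigurations:", reconfigs)
}
```

The point of the design is that the handler is started once and then only reconfigured, so no second Run (and no second informer) is created per CRD event.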

I deployed the change to a few clusters and it seems to be working. This is the watch rate after adding a new CRD to the cluster:

I see the event, then a brief drop, which is when ksm is populating the cache after the reconfigure, and then it comes back to normal. All the metrics look good as well.

internal/discovery/discovery.go

wallee94 · 2024-09-27T18:21:18Z

A summary of changes per file to help with the review:

internal/discovery/discovery.go: Use m.BuildWriters(ctx) instead of m.Run(olderContext), so the writers are rebuilt without recreating the whole handler. BuildWriters cancels the old context, so the cancellations in this file are no longer needed.
pkg/metricshandler/metrics_handler.go: Add a function BuildWriters that rebuilds the metrics writers. discovery.go calls it when the resources in the StoreBuilder have changed.
tests/e2e/discovery_test.go: Update the test to cover custom metrics after a CRD update. I split it into subfunctions because it failed the cyclomatic complexity linter.

internal/store/builder.go

mrueg

Small typo in the comments, otherwise looks good to me. Thanks for the contribution!

/lgtm
and
/hold
for others to review.

pkg/metricshandler/metrics_handler.go

mal-berbatov-ci · 2024-10-03T11:37:44Z

Is there an ETA for when this will get merged?

mrueg · 2024-10-08T08:02:22Z

/lgtm

I'll still ping the other maintainers to review, currently everyone seems to be busy.

k8s-ci-robot · 2024-10-08T08:02:31Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mrueg, wallee94

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [mrueg]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

mrueg · 2024-10-15T07:51:20Z

/hold cancel

continuing here as no other reviews came in.

Thanks for the debugging and your contribution!

mrueg · 2024-10-15T08:19:32Z

/hold cancel

k8s-ci-robot added the cncf-cla: yes and needs-triage labels Aug 16, 2024

k8s-ci-robot requested review from CatherineF-dev and dgrisonnet August 16, 2024 17:14

k8s-ci-robot added the size/XS label Aug 16, 2024

fix(discovery): configure sharding every time MetricsHandler.Run runs

dec7a1b

Signed-off-by: Walther Lee <[email protected]>

wallee94 force-pushed the remove-shard-validation-on-sts-create branch from 102f391 to dec7a1b Compare August 16, 2024 17:16

wallee94 changed the title ~~fix(discovery) configure sharding every time MetricsHandler.Run runs~~ fix(discovery): configure sharding every time MetricsHandler.Run runs Aug 16, 2024

k8s-ci-robot assigned CatherineF-dev Aug 22, 2024

k8s-ci-robot added the triage/accepted label and removed the needs-triage label Aug 22, 2024

update discovery to reconfigure sharding

082aba2

Signed-off-by: Walther Lee <[email protected]>

k8s-ci-robot added the size/M label and removed the size/XS label Aug 27, 2024

fix race condition in metrics handler

4aced25

Signed-off-by: Walther Lee <[email protected]>

wallee94 force-pushed the remove-shard-validation-on-sts-create branch from 0616572 to 4aced25 Compare September 5, 2024 21:02

dgrisonnet reviewed Sep 10, 2024

internal/discovery/discovery.go (outdated, resolved)

dgrisonnet reviewed Sep 12, 2024

internal/discovery/discovery.go (resolved)

rexagod mentioned this pull request Sep 12, 2024

KEP-4785: Resource State Metrics kubernetes/enhancements#4811

Open

wallee94 added 3 commits September 12, 2024 10:45

Merge branch 'main' into remove-shard-validation-on-sts-create

8195dbe

decouple BuildWriters from ConfigureSharding

1a1ca73

Signed-off-by: Walther Lee <[email protected]>

update func docstring

012121e

Signed-off-by: Walther Lee <[email protected]>

fix resource duplication in store builder

9aad0c0

Signed-off-by: Walther Lee <[email protected]>

k8s-ci-robot added the size/L label and removed the size/M label Sep 13, 2024

wallee94 requested a review from dgrisonnet September 13, 2024 19:40

wallee94 added 2 commits September 24, 2024 12:17

fix ciclomatic complexity lint in TestVariableVKsDiscoveryAndResolution

fb46163

Merge branch 'main' into remove-shard-validation-on-sts-create

294163d

mrueg reviewed Sep 30, 2024

internal/store/builder.go (outdated, resolved)

undo changes in internal/store/builder.go

7b375e9

wallee94 requested a review from mrueg September 30, 2024 16:23

mrueg reviewed Sep 30, 2024

pkg/metricshandler/metrics_handler.go (outdated, resolved)

pkg/metricshandler/metrics_handler.go (outdated, resolved)

k8s-ci-robot added the do-not-merge/hold label Sep 30, 2024

k8s-ci-robot assigned mrueg Sep 30, 2024

k8s-ci-robot added the lgtm and approved labels Sep 30, 2024

fix typo in comments

166921b

k8s-ci-robot removed the lgtm label Sep 30, 2024

mrueg added this to the v2.14.0 milestone Oct 8, 2024

k8s-ci-robot added the lgtm label Oct 8, 2024

k8s-ci-robot removed the do-not-merge/hold label Oct 15, 2024

k8s-ci-robot merged commit f3b7593 into kubernetes:main Oct 15, 2024
12 checks passed

dgrisonnet mentioned this pull request Oct 15, 2024

WIP: Testing: discovery: check number of goroutines #2525

Open
