Report failures of periodic jobs to the cluster-api Slack channel #10520

Closed
sbueringer opened this issue Apr 26, 2024 · 23 comments
Assignees: Sunnatillo

Labels
  • area/e2e-testing: Issues or PRs related to e2e testing
  • kind/feature: Categorizes issue or PR as related to a new feature.
  • lifecycle/stale: Denotes an issue or PR has remained open with no activity and has become stale.
  • needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.
  • priority/backlog: Higher priority than priority/awaiting-more-evidence.

Comments

@sbueringer
Member

I noticed that CAPO is reporting periodic test failures to Slack, e.g.: https://kubernetes.slack.com/archives/CFKJB65G9/p1713540048571589

I think this is a great way to surface issues with CI (and folks can directly start a thread based on a Slack message like this).

This could be configured roughly like this: https://github.com/kubernetes/test-infra/blob/5d7e1db75dce28537ba5f17476882869d1b94b0a/config/jobs/kubernetes-sigs/cluster-api-provider-openstack/cluster-api-provider-openstack-periodics.yaml#L48-L55
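For reference, the linked CAPO job enables this via Prow's `reporter_config`. Below is a minimal sketch of what it could look like for a CAPI periodic; the job name, channel, and message template are placeholders, not the actual values from the linked file:

```yaml
periodics:
- name: periodic-cluster-api-e2e-main   # placeholder job name
  # ... rest of the job spec omitted ...
  reporter_config:
    slack:
      channel: cluster-api              # placeholder; CAPO reports to its own provider channel
      job_states_to_report:
        - failure
        - error
      report_template: 'Job {{.Spec.Job}} ended with state {{.Status.State}}. Details: {{.Status.URL}}'
```

Limiting `job_states_to_report` to `failure` and `error` means successful runs would not post anything to the channel.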

What do you think?

@sbueringer
Member Author

cc @chrischdi @fabriziopandini

@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

CAPI contributors will take a look as soon as possible, apply one of the triage/* labels and provide further guidance.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot added the needs-triage label on Apr 26, 2024
@sbueringer added the area/e2e-testing label on Apr 26, 2024
@chrischdi
Member

Oh wow, yeah that would be a great thing. I just fear that it may pollute the channel too much. But we could try it, fail fast, and ask for feedback later on, in the community meeting or via a Slack thread/poll, if it turns out to be too much.

@killianmuldoon
Contributor

Do we know if this respects `testgrid-num-failures-to-alert`? If so, it could be great.

@sbueringer
Member Author

I'm not sure if it respects that. We could try it and roll back if it doesn't?

@sbueringer
Member Author

If it still pollutes the channel too much after considering `testgrid-num-failures-to-alert`, we have to focus more on CI :D

(I'm currently guessing that we would get one Slack message for every mail that we get today, but I don't know.)

@killianmuldoon
Contributor

One Slack message per mail would be perfect; more would disrupt the channel.

WDYT about enabling it for CAPV first?

@killianmuldoon
Contributor

Also fine with making the change and rolling back if it doesn't work

@sbueringer
Member Author

> One Slack message per mail would be perfect; more would disrupt the channel.
> WDYT about enabling it for CAPV first?

Fine for me. We can also ask the OpenStack folks how spammy it is for them today (cc @mdbooth @lentzi90).

@lentzi90
Contributor

For CAPO we get a Slack message for every failure and an email only after 2 failures in a row. I think it has been tolerable for us, but it indicates that the Slack messages do not respect `testgrid-num-failures-to-alert` (at least the way we have it configured).

@sbueringer
Member Author

Hm okay, a message for every failure is just too much, so we should probably take a closer look at the configuration / implementation. One message per failure doesn't make sense for the number of tests/failures we have (the signal-to-noise ratio would just be wrong).

@fabriziopandini
Member

+1 to testing this if we can find a config that is reasonably noisy (but not too noisy).
cc @kubernetes-sigs/cluster-api-release-team

/priority backlog
/kind feature

@k8s-ci-robot added the priority/backlog and kind/feature labels on Apr 29, 2024
@adilGhaffarDev
Contributor

+1 from my side too. Tagging CI lead @Sunnatillo.
I will add this to the improvement tasks for the v1.8 cycle. The CI team can look into this one.

@Sunnatillo
Contributor

Sounds great. I will take a look.

@Sunnatillo
Contributor

I guess `testgrid-num-failures-to-alert` should help with the amount of noise. If we set it to, for example, 5, we can be sure we only receive messages about consistently failing tests, because it makes the config send the alert only after 5 consecutive failures.
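For illustration, that threshold lives in the job's TestGrid annotations. A rough sketch, with the job name, dashboard, and alert address as placeholders:

```yaml
periodics:
- name: periodic-cluster-api-e2e-main   # placeholder job name
  # ... rest of the job spec omitted ...
  annotations:
    testgrid-dashboards: sig-cluster-lifecycle-cluster-api   # placeholder dashboard
    testgrid-alert-email: some-alias@example.com              # placeholder address
    testgrid-num-failures-to-alert: "5"                       # alert only after 5 consecutive failures
```

Note that, as reported later in the thread, this setting has only been observed to gate the TestGrid email alerts, not the per-run Slack messages sent via `reporter_config`.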

@Sunnatillo
Contributor

/assign @Sunnatillo

@lentzi90
Contributor

@Sunnatillo `testgrid-num-failures-to-alert` does not affect the Slack messages, at least for CAPO. Only the emails are affected by it, in my experience.

@Sunnatillo
Contributor

> @Sunnatillo `testgrid-num-failures-to-alert` does not affect the Slack messages, at least for CAPO. Only the emails are affected by it, in my experience.

Thank you for the update. I will open an issue in test-infra and try to find a way to do it.

@Sunnatillo
Contributor

I opened an issue regarding this in test-infra:
kubernetes/test-infra#32687

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Sep 1, 2024
@sbueringer
Member Author

sbueringer commented Sep 2, 2024

Maybe let's close this here until kubernetes-sigs/prow#195 has been implemented? (which might take a very long time if nobody volunteers for it)

@fabriziopandini
Member

As per comment above
/close

@k8s-ci-robot
Contributor

@fabriziopandini: Closing this issue.

In response to this:

> As per comment above
> /close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
