🫐 🐛 Operator gets stuck with bad configurations #196

Closed

c4milo opened this issue Aug 16, 2024 · 3 comments

Comments

c4milo (Member) commented Aug 16, 2024

What happened?

Whenever a configuration change results in a Redpanda pod falling into an unschedulable or crashlooping state, it is impossible to correct the situation by only fixing the CR values. The new values are accepted, but they are never reconciled by the operator and the StatefulSet keeps running with the wrong configuration.

See the screen recording in: https://redpandadata.slack.com/archives/C01H6JRQX1S/p1723752154395579?thread_ts=1723751900.722069&cid=C01H6JRQX1S

What did you expect to happen?

If we make a mistake configuring container resources and/or limits in the Redpanda Custom Resource (CR), or any other configuration that results in a broker crashlooping, we want to be able to correct it through the Redpanda CR and see the change applied by the operator immediately, with no delays.
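
For illustration, a minimal sketch of such a correction, assuming the CR's spec.clusterSpec mirrors the chart's values.yaml; the concrete field values here are hypothetical and not taken from this report:

apiVersion: cluster.redpanda.com/v1alpha2
kind: Redpanda
metadata:
  name: redpanda
spec:
  clusterSpec:
    resources:
      cpu:
        cores: 1        # lowered from a request the nodes could not schedule
      memory:
        container:
          max: 2.5Gi    # raised back above Redpanda's minimum to stop the crashloop

Applying a corrected spec like this should be enough on its own; today the operator ignores it while the release is unhealthy.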

How can we reproduce it (as minimally and precisely as possible)? Please include your values file.

$ helm get values <redpanda-release-name> -n <redpanda-release-namespace> --all
# paste output here

Anything else we need to know?

No response

Which are the affected charts?

Operator

Chart Version(s)

5.9.0

Cloud provider

Azure

JIRA Link: K8S-323

JIRA Link: K8S-324

@chrisseto chrisseto transferred this issue from redpanda-data/helm-charts Aug 16, 2024
@chrisseto (Contributor)

So this is a bit nastier than I thought. I was under the impression that we could just set force in the upgrade spec, but the operator itself won't update the Helm release if it sees that it's unhealthy, which makes this even harder to get out of.

isResourceReady := r.checkIfResourceIsReady(
    log, msgNotReady, msgReady, resourceTypeHelmRepository,
    isGenerationCurrent, isStatusConditionReady,
    isStatusReadyNILorTRUE, isStatusReadyNILorFALSE, rp,
)
if !isResourceReady {
    // need to requeue in this case
    return v1alpha2.RedpandaNotReady(rp, "ArtifactFailed", msgNotReady), ctrl.Result{RequeueAfter: r.RequeueHelmDeps}, nil
}

For reference, this is how to set force, but it doesn't really do anything given the operator's behavior.

  chartRef:
    upgrade:
      force: true

I'd vote to change the behavior to always update the Helm release regardless of its existing status, since the current behavior prevents users from fixing forward. @RafalKorepta WDYT?
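
For illustration, a minimal sketch of that change, assuming the readiness gate quoted above is what blocks the update; apart from the names reused from that snippet, everything here is hypothetical:

isResourceReady := r.checkIfResourceIsReady(
    log, msgNotReady, msgReady, resourceTypeHelmRepository,
    isGenerationCurrent, isStatusConditionReady,
    isStatusReadyNILorTRUE, isStatusReadyNILorFALSE, rp,
)
if !isResourceReady {
    // Instead of returning early and requeueing until the (already broken)
    // release reports healthy, record the condition and keep reconciling so
    // a corrected spec still reaches the HelmRelease.
    log.Info("helm dependency not ready; continuing so spec changes are still applied",
        "reason", msgNotReady)
}
// ...continue and update the HelmRelease from the latest CR values...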

@RafalKorepta (Contributor)

Agree with you @chrisseto

@andrewstucki (Contributor)

Fixed in #227
