🫐 🐛 Operator gets stuck with bad configurations #196

Closed

c4milo opened this issue Aug 16, 2024 · 3 comments

Comments

c4milo (Member) commented Aug 16, 2024

What happened?

Whenever a configuration change results in a Redpanda pod falling into an unschedulable or crashlooping state, it is impossible to correct the situation by only fixing the CR values. The new values are accepted, but they are never reconciled by the operator and the StatefulSet keeps running with the wrong configuration.

See the screen recording in: https://redpandadata.slack.com/archives/C01H6JRQX1S/p1723752154395579?thread_ts=1723751900.722069&cid=C01H6JRQX1S

What did you expect to happen?

If we make a mistake configuring container resources and/or limits in the Redpanda Custom Resource (CR), or any other configuration that results in a broker crashlooping, we want to be able to correct it through the Redpanda CR and see the change applied by the operator immediately, with no delays.
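
For illustration, a minimal sketch of such a correction, assuming the CR's spec.clusterSpec mirrors the chart's values.yaml; the concrete field values here are hypothetical and not taken from this report:

apiVersion: cluster.redpanda.com/v1alpha2
kind: Redpanda
metadata:
  name: redpanda
spec:
  clusterSpec:
    resources:
      cpu:
        cores: 1        # lowered from a request the nodes could not schedule
      memory:
        container:
          max: 2.5Gi    # raised back above Redpanda's minimum to stop the crashloop

Applying a corrected spec like this should be enough on its own; today the operator ignores it while the release is unhealthy.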

How can we reproduce it (as minimally and precisely as possible)? Please include your values file.

$ helm get values <redpanda-release-name> -n <redpanda-release-namespace> --all
# paste output here

Anything else we need to know?

No response

Which are the affected charts?

Operator

Chart Version(s)

5.9.0

Cloud provider

Azure

JIRA Link: K8S-323

JIRA Link: K8S-324

@chrisseto chrisseto transferred this issue from redpanda-data/helm-charts Aug 16, 2024
@chrisseto (Contributor)

So this is a bit nastier than I thought. I was under the impression that we could just set force in the upgrade spec, but the operator itself won't update the Helm release if it sees that it's unhealthy, which makes this even harder to get out of.

isResourceReady := r.checkIfResourceIsReady(
    log, msgNotReady, msgReady, resourceTypeHelmRepository,
    isGenerationCurrent, isStatusConditionReady,
    isStatusReadyNILorTRUE, isStatusReadyNILorFALSE, rp,
)
if !isResourceReady {
    // need to requeue in this case
    return v1alpha2.RedpandaNotReady(rp, "ArtifactFailed", msgNotReady), ctrl.Result{RequeueAfter: r.RequeueHelmDeps}, nil
}

For reference, this is how to set force, but it doesn't really do anything given the operator's behavior.

  chartRef:
    upgrade:
      force: true

I'd vote to change the behavior to always update the Helm release regardless of its existing status, since the current behavior prevents users from fixing forward. @RafalKorepta WDYT?
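
For illustration, a minimal sketch of that change, assuming the readiness gate quoted above is what blocks the update; apart from the names reused from that snippet, everything here is hypothetical:

isResourceReady := r.checkIfResourceIsReady(
    log, msgNotReady, msgReady, resourceTypeHelmRepository,
    isGenerationCurrent, isStatusConditionReady,
    isStatusReadyNILorTRUE, isStatusReadyNILorFALSE, rp,
)
if !isResourceReady {
    // Instead of returning early and requeueing until the (already broken)
    // release reports healthy, record the condition and keep reconciling so
    // a corrected spec still reaches the HelmRelease.
    log.Info("helm dependency not ready; continuing so spec changes are still applied",
        "reason", msgNotReady)
}
// ...continue and update the HelmRelease from the latest CR values...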

@RafalKorepta (Contributor)

Agree with you @chrisseto

@andrewstucki (Contributor)

Fixed in #227
