Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Upgrade operator, the default auto-instrumentation image in Instrumentation has not been upgraded #3468

Open
chinaran opened this issue Nov 17, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@chinaran
Copy link
Contributor

Component(s)

auto-instrumentation

What happened?

Description

Upgrade operator, the default auto-instrumentation image in Instrumentation has not been upgraded

Steps to Reproduce

operator v0.107.0 was deployed using helm chart, to be migrated to OLM deployment and upgraded to v0.113.0.
Also, the default-auto-instrumentation-java-image version should be upgraded.
When v0.113.0 was deployed after removing v0.107.0, the mutate webhook failed during the upgrade of Instrumentation because the service was not ready.
As a result, auto-instrumentation-java-image was not upgraded.

Expected Result

apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  annotations:
    instrumentation.opentelemetry.io/default-auto-instrumentation-java-image: xxx-v2
spec:
  java:
    image: xxx-v2

Actual Result

apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  annotations:
    instrumentation.opentelemetry.io/default-auto-instrumentation-java-image: xxx-v1
spec:
  java:
    image: xxx-v1

Kubernetes Version

v1.29.2

Operator version

v0.113.0

Collector version

v0.108.1

Environment information

No response

Log output

{"level":"ERROR","timestamp":"2024-11-17T03:30:27.899239101Z","logger":"instrumentation-upgrade","message":"failed to apply changes to instance","name":"xxx-java","namespace":"cpaas-system","error":"Internal error occurred: failed calling webhook \"minstrumentation.kb.io\": failed to call webhook: Post \"https://opentelemetry-operator-controller-manager-service.xxx-ns.svc:443/mutate-opentelemetry-io-v1alpha1-instrumentation?timeout=10s\": dial tcp xxx.xxx.xxx.xxx:443: connect: connection refused","stacktrace":
"github.com/open-telemetry/opentelemetry-operator/pkg/instrumentation/upgrade.(*InstrumentationUpgrade).ManagedInstances
/workspace/source/pkg/instrumentation/upgrade/upgrade.go:99
main.addDependencies.func2
/workspace/source/main.go:473
sigs.k8s.io/controller-runtime/pkg/manager.RunnableFunc.Start
/workspace/cache/gomod/sigs.k8s.io/[email protected]/pkg/manager/manager.go:307
sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1
/workspace/cache/gomod/sigs.k8s.io/[email protected]/pkg/manager/runnable_group.go:226"}

Additional context

Should I add retry logic to upgrade Instrumentation?

@chinaran chinaran added bug Something isn't working needs triage labels Nov 17, 2024
@swiatekm
Copy link
Contributor

I don't think controller-runtime lets us specify dependencies between runnables, so it won't help us in this case. It does sound like we have a race condition at startup, where the upgraders depend on webhooks to run, but we don't guarantee that the webhooks are started first.

Maybe the solution is to just run the upgraders periodically forever? That they're only run at startup doesn't make that much sense to me. @pavolloffay @jaronoff97 wdyt?

@jaronoff97
Copy link
Contributor

I thought we run the upgraders on the reconcile loop which should be called every 15mins?

On second look, we don't do this on the webhook or otherwise, which seems wrong so yeah i think we should run them just with the reconcile loop most likely.

@swiatekm
Copy link
Contributor

I thought we run the upgraders on the reconcile loop which should be called every 15mins?

On second look, we don't do this on the webhook or otherwise, which seems wrong so yeah i think we should run them just with the reconcile loop most likely.

Upgraders are just runnables added to the controller runtime manager, and right now they run once at operator startup. When they should run exactly isn't entirely clear to me. I don't like the idea of running them during reconciliation, because reconciliation already does a lot, and it not modifying the CR (other than the status) makes reasoning about it much simpler. Running it in the webhook is safe, but arguably not that useful. It makes most sense to me to upgrade any eligible CR as soon as possible, and that's most in-line with how the feature works right now, but is also quite invasive.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants