Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update otelcontribcol to latest #1391

Open
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

tiffanny29631
Copy link
Contributor

@tiffanny29631 tiffanny29631 commented Aug 13, 2024

Updating otelcontribcol to latest to include fix of open-telemetry/opentelemetry-collector-contrib#35006 and other essential security updates in components like Prometheus exporter.

The upgrade includes a breaking change of defaulting to local host instead of 0.0.0.0, the feature is now stable and feature gate has been removed, Config Sync should explicitly set endpoint to host 0.0.0.0 to expose port for external pod access. This applies to the metric receiver opencensus, and the exporter prometheus.

@google-oss-prow google-oss-prow bot requested a review from nan-yu August 13, 2024 23:20
Copy link

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from tiffanny29631. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@karlkfi
Copy link
Contributor

karlkfi commented Aug 13, 2024

Why id this safe to re-apply now?

@tiffanny29631
Copy link
Contributor Author

Why id this safe to re-apply now?

Are you referring to the upstream fix or our release schedule?

@karlkfi
Copy link
Contributor

karlkfi commented Aug 14, 2024

I mean, if we reverted it before, why is it now safe to reapply? What changed?

@tiffanny29631
Copy link
Contributor Author

I mean, if we reverted it before, why is it now safe to reapply? What changed?

To ensure correct behavior, we've included a no_op_label in the metricstransform processor's label_set filter (explained in the description). This is a temporary solution until the upstream fix is available, but we need to update to the latest otelcontribcol regardless.

@sdowell
Copy link
Contributor

sdowell commented Aug 14, 2024

I mean, if we reverted it before, why is it now safe to reapply? What changed?

To ensure correct behavior, we've included a no_op_label in the metricstransform processor's label_set filter (explained in the description). This is a temporary solution until the upstream fix is available, but we need to update to the latest otelcontribcol regardless.

This isn't really a revert then, since new changes were added to the original commit. Suggest just creating a new commit instead of reapplying.

we need to update to the latest otelcontribcol regardless.

why?

@tiffanny29631
Copy link
Contributor Author

Re @sdowell While not currently anticipated, there's a possibility that future vulnerability fixes need dependency upgrades that are incompatible with current version we use (0.103). In such cases, the version upgrade process will be similar. Otelcontribcol versions beyond 0.104.0 will introduce breaking changes.

I could start a new PR.

@tiffanny29631 tiffanny29631 changed the title Reapply "Bump otelcontribcol version to v0.106.0-gke.2 (#1369)" (#1376) Update otelcontribcol to 0.106.0-gke.2 Aug 16, 2024
@tiffanny29631
Copy link
Contributor Author

Recreated the commit and message.

@tiffanny29631 tiffanny29631 changed the title Update otelcontribcol to 0.106.0-gke.2 Update otelcontribcol to latest Dec 16, 2024
@tiffanny29631 tiffanny29631 marked this pull request as draft December 16, 2024 22:33
This change updates otelcoontribcol to latest and modifies to fit breaking changes from 0.104.0 and 0.106.0.

Breaking change in 0.104.0 https://github.com/open-telemetry/opentelemetry-collector-contrib/releases/tag/v0.104.0
Breaking change in 0.106.0 open-telemetry/opentelemetry-collector-contrib#34430

- Localhost is now the default setting, while otel-agent and otel-collector require 0.0.0.0, so the feature gate has been removed.
- The format of the environment variable was updated to meet the new syntax requirements. The otel-agent ConfigMap was split between the reconciler and controllers, ensuring that sync-related labels are only applied to reconcilers.
- A `no_op_label` has been added to ensure that the aggregation in the metricstransform processor filters on all metric labels. This is a temporary workaround until a permanent fix is implemented upstream.
Also specifying host 0.0.0.0 instead of the new default localhost.
For all receivers and health_check.

Adding networkpolicy for ingress on essential ports.
- port 13133: health_check for readiness check. Apply to otel-agent in reconciler-manager and reconcilers; Otel-collector.
- port 55678: opencensus receiver for otel-collector to receive metric from other pods

Fix test ConfigMap.
Only keep networkpolicy for otel-collector
label_set: []
# Using a no_op_label to get around issue in the upstream
# https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/34430
label_set: [no_op_label]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does setting label_set to nil or removing it work?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like it. Maybe clarify the comment?

exporters:
opencensus:
endpoint: otel-collector.config-management-monitoring:55678
tls:
insecure: true
processors:
# Attributes processor adds custom configsync metric labels to applicable
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did these attributes get removed from the agent config?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to be replaced by manifests/otel-agent-reconciler-cm.yaml

@@ -81,6 +81,7 @@ spec:
ports:
- containerPort: 55678 # Default OpenCensus receiver port.
- containerPort: 8888 # Metrics.
- containerPort: 13133 # Health check
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the otel-agent readiness probe port?

@tiffanny29631
Copy link
Contributor Author

/blocked open-telemetry/opentelemetry-collector-contrib#34844

Generated files (go.mod etc.) were removed, which would break AutoVC integration. No affect to the continuous build of the internal mirror so far.

A CI can be added to restore the generated files to a new branch, along with base image update for CVE scan purpose.

configMap:
name: otel-agent
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this ConfigMap still used anywhere?

exporters:
opencensus:
endpoint: otel-collector.config-management-monitoring:55678
tls:
insecure: true
processors:
# Attributes processor adds custom configsync metric labels to applicable
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to be replaced by manifests/otel-agent-reconciler-cm.yaml

@janetkuo
Copy link
Contributor

@code-review-assist review

Copy link

@code-review-assist code-review-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This pull request updates the otelcontribcol version and introduces several important changes to address a breaking change in the default endpoint behavior. The upgrade to v0.115.0-gke.0 defaults the opencensus receiver and prometheus exporter endpoints to localhost, requiring explicit configuration for external access. The PR description effectively communicates this change and its implications. The subsequent modifications to set the opencensus endpoint to 0.0.0.0:55678, the prometheus endpoint to 0.0.0.0:8675, and the health_check endpoint to 0.0.0.0:13133 are necessary and correctly implemented.

The introduction of the otel-agent-reconciler ConfigMap is a valuable addition. The accompanying comments explaining the rationale for the changes, including the removal of the attributes processor in the otel-agent ConfigMap and the use of no_op_label in the metric pipelines, enhance code clarity and maintainability. The comprehensive updates to the test files ensure consistency and validate the implemented changes.

To further improve this PR, I recommend the following:

  • README Update: Document the new otel-agent-reconciler ConfigMap, its purpose, and the changes to the default endpoints. This will aid users in understanding the modifications and configuring their deployments accordingly.
  • Version Bump Justification: While the PR description mentions the breaking change, it's beneficial to explicitly state the reason for the version bump to v0.115.0-gke.0 in the commit message or PR description. This helps track changes and provides valuable context for future maintainers.
  • NetworkPolicy Documentation: Add a comment to the NetworkPolicy in manifests/templates/otel-collector.yaml explaining its purpose and the ports it opens. This enhances readability and maintainability.
  • ConfigMap Clarification: Add a comment to test/kustomization/expected.yaml clarifying the relationship between the updated otel-agent-config.yaml and the otel-agent-reconciler-config.yaml.
  • Upstream Issue Link: Include a link to the upstream issue ([processor/metricstransform] Aggregation skipped when label_set is empty but not nil open-telemetry/opentelemetry-collector-contrib#34430) related to the use of no_op_label directly in the comments within pkg/metrics/otel.go. This facilitates quick access to the relevant context.

@@ -89,7 +89,7 @@ COSIGN := $(BIN_DIR)/cosign
GIT_SYNC_VERSION := v4.3.0-gke.4__linux_amd64
GIT_SYNC_IMAGE_NAME := gcr.io/config-management-release/git-sync:$(GIT_SYNC_VERSION)

OTELCONTRIBCOL_VERSION := v0.103.0-gke.6
OTELCONTRIBCOL_VERSION := v0.115.0-gke.0

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

low

Document the reason for updating to v0.115.0-gke.0 (e.g., bug fixes, security updates, feature additions). This helps track changes and understand the context of the update.

Comment on lines 36 to 39
batch:
# Populate resource attributes from OTEL_RESOURCE_ATTRIBUTES env var and
# the GCE metadata service, if available.
resourcedetection:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

low

Since the attributes processor is removed, delete these comments to maintain clarity and prevent confusion.

Suggested change
batch:
# Populate resource attributes from OTEL_RESOURCE_ATTRIBUTES env var and
# the GCE metadata service, if available.
resourcedetection:
# Populate resource attributes from OTEL_RESOURCE_ATTRIBUTES env var and
# the GCE metadata service, if available.
resourcedetection:
detectors: [env, gcp]

Comment on lines +74 to +89
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: allow-port-ingress
namespace: config-management-monitoring
spec:
podSelector:
matchLabels:
app: opentelemetry
component: otel-collector
ingress:
- from:
- namespaceSelector: {}
ports:
- protocol: TCP
port: 55678

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

low

Adding a NetworkPolicy is a good security practice. Document its purpose (allowing ingress traffic on port 55678) for better maintainability.

Comment on lines +188 to +189
# Using a no_op_label to get around issue in the upstream
# https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/34430

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

low

Include a direct link to the upstream issue: open-telemetry/opentelemetry-collector-contrib#34430 for easier reference.

Comment on lines 5652 to +5679
otel-agent-config.yaml: |
receivers:
opencensus:
endpoint: 0.0.0.0:55678
exporters:
opencensus:
endpoint: otel-collector.config-management-monitoring:55678
tls:
insecure: true
processors:
batch:
# Populate resource attributes from OTEL_RESOURCE_ATTRIBUTES env var and
# the GCE metadata service, if available.
resourcedetection:
detectors: [env, gcp]
extensions:
health_check:
endpoint: 0.0.0.0:13133
service:
extensions: [health_check]
pipelines:
metrics:
receivers: [opencensus]
processors: [batch, resourcedetection]
exporters: [opencensus]
telemetry:
logs:
level: "INFO"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

low

Clarify the purpose of this otel-agent-config.yaml and its relationship to otel-agent-reconciler-config.yaml. For example, explain that this config is used for the main otel-agent, while the reconciler config is used for the otel-agent-reconciler.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants