-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Record pod Ready status, or all pod status conditions #32941
Comments
Pinging code owners:
See Adding Labels via Comments if you do not have permissions to add labels yourself. |
This feels reasonable, especially if there is existing precedence in KSM. Is |
Do you mean k8sclusterreceiver? It has existing watch support for Pod, which should cover the Status struct AFAIK. k8sobjectsreceiver, I'm not certain, but glancing at the field selection language I'm not sure how you would get the ready condition out of the list. |
I think given we have Regarding pod.status.condition does Kube State Metrics have it? Would be interesting to see what they do around that and what kind of values are there. |
Please no. See #29157 for a generic |
In general, I feel like the piecemeal and fragmented approach to Kubernetes metrics in OTel is going to make it impossible to have any vendors build meaningful user experiences on top of this dataset, ultimately delaying any possible standardization in this space. |
In this case Regarding:
I guess your against this enum approach? IMO then this should be added to the spec, how to correctly represent enums in otel, so we can solve it once and for all. |
We have had an uptick in k8s components feature request, it would be be very nice to make progress on open-telemetry/semantic-conventions#1032 |
Thanks @TylerHelmuth . In general, my team has been happy with the metrics collected by |
I made a bit of a side comment there. I personally feel the value-as-attribute approach to enums has the better user experience, and that's what we should favor, but the clear downside is extraneous metric volume when reporting zeroes. Machines use the numeric enum, but humans want the string. I think some of this can be resolved in spec, and some ideas are outlined in open-telemetry/opentelemetry-specification#1711, but given there's no consensus for some time, I'd suggest we move forward with existing patterns here. Whichever direction we go there's a need to review and refactor existing metrics in the collector. More than we need something ideal, we need something usable.
I think the problem with resource metrics for that (as I understand it) is they're effectively metadata, emitted to enrich an actual point. In the pod state example, the state change itself is the point. Can we move forward, regardless of the enum end-game? I think we need to choose whether we represent pod ready status as e.g. |
I agree on:
I personally would be okay with Regarding |
Whoops, thought I answered your question earlier, but apparently didn't hit send. The generic approach would take any condition and record points for it. Here's a sample from a GKE cluster: status:
conditions:
- lastProbeTime: null
lastTransitionTime: null
message: 'Pod is in NEG "Key{\"k8s1-7929355b-foo-80-6e399202\",
zone: \"foo-b\"}". NEG is not attached to any BackendService with health
checking. Marking condition "cloud.google.com/load-balancer-neg-ready" to True.'
reason: LoadBalancerNegWithoutHealthCheck
status: "True"
type: cloud.google.com/load-balancer-neg-ready
- lastProbeTime: null
lastTransitionTime: "2024-05-28T19:14:30Z"
status: "True"
type: Initialized
- lastProbeTime: null
lastTransitionTime: "2024-05-28T19:14:36Z"
status: "True"
type: Ready
- lastProbeTime: null
lastTransitionTime: "2024-05-28T19:14:35Z"
status: "True"
type: ContainersReady
- lastProbeTime: null
lastTransitionTime: "2024-05-28T19:14:20Z"
status: "True"
type: PodScheduled
The enum is the
Edit: as pointed out below, message is probably too high cardinality for many systems, and has less value, so nix that. |
IMO Also, this looks like more like events to me rather than metric that tracks state changes? Maybe instead if should be |
I think it has potentially unbounded cardinality! Agree it's not a fit. I think
The k8scluster receiver works by watching over a pod resource, so we emit the current condition each time it changes. I don't think it makes sense to call it "last condition"? |
What about something like
I like the idea of each status having it's own enableable metric, with |
This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping Pinging code owners:
See Adding Labels via Comments if you do not have permissions to add labels yourself. |
Component(s)
receiver/k8scluster
Is your feature request related to a problem? Please describe.
I would like to be able to see a pod's Ready status in metrics.
Describe the solution you'd like
A named metric with an analog to
k8s.container.ready
,likek8s.pod.ready
.Or, a more generic solution where all pod status conditions are represented, i.e.
k8s.pod.status.condition
. I might prefer this as it captures more states.I would like to include attributes for the detail info fields (
message
,reason
), for either approach.Describe alternatives you've considered
Using the existing k8s.container.ready metric. It's unfortunately not complete because of pod ReadinessGates.
Additional context
See #32933 for an impl of
k8s.pod.ready
, currently sans attributes.If we can agree, I think the condition status metric makes more sense. I think we need agreement on how to represent it: other things in this receiver (e.g.
k8s.pod.phase
) use values to represent states, so 0=false, 1=true, 2=unknown could fit, but the ergonomics are a bit poor.The text was updated successfully, but these errors were encountered: