kubernetes 1.30.5 support #23230

karatkep · 2024-11-04T14:43:12Z

Summary

Dear Community,

Could you please help me verify if Eclipse Che 7.93.0 supports Kubernetes 1.30.5? The che-dashboard and che pods stopped working when our Kubernetes cluster was updated to version 1.30.5.

Here is a sample of the error in the che-dashboard:

ERROR[12:03:22 UTC]: [HTTP request failed[
    err: {
      "type": "le",
      "message": "HTTP request failed",
      "stack":
          HttpError: HTTP request failed
              at q._callback (/backend/server/backend.js:8:898957)
              at t._callback.t.callback.t.callback (/backend/server/backend.js:14:1087840)
              at q.emit (node:events:517:28)
              at q.<anonymous> (/backend/server/backend.js:14:1100418)
              at q.emit (node:events:517:28)
              at IncomingMessage.<anonymous> (/backend/server/backend.js:14:1099250)
              at Object.onceWrapper (node:events:631:28)
              at IncomingMessage.emit (node:events:529:35)
              at endReadableNT (node:internal/streams/readable:1400:12)
              at process.processTicksAndRejections (node:internal/process/task_queues:82:21)
      "response": {
        "statusCode": 401,
        "body": {
          "kind": "Status",
          "apiVersion": "v1",
          "metadata": {},
          "status": "Failure",
          "message": "Unauthorized",
          "reason": "Unauthorized",
          "code": 401
        },
        "headers": {
          "audit-id": "6b14e1b5-8a08-41a8-a093-5e00693737a6",
          "cache-control": "no-cache, private",
          "content-type": "application/json",
          "date": "Mon, 04 Nov 2024 12:03:21 GMT",
          "content-length": "129",
          "connection": "close"
        },
        "request": {
          "uri": {
            "protocol": "https:",
            "slashes": true,
            "auth": null,
            "host": "10.1.0.1:443",
            "port": "443",
            "hostname": "10.1.0.1",
            "hash": null,
            "search": null,
            "query": null,
            "pathname": "/apis/org.eclipse.che/v2/checlusters",
            "path": "/apis/org.eclipse.che/v2/checlusters",
            "href": "https://10.1.0.1:443/apis/org.eclipse.che/v2/checlusters"
          },
          "method": "GET",
          "headers": {
            "Accept": "application/json",
            "Authorization": "Bearer MASKED"
          }
        }
      },
      "body": {
        "type": "Object",
        "message": "Unauthorized",
        "stack":
            
        "kind": "Status",
        "apiVersion": "v1",
        "metadata": {},
        "status": "Failure",
        "reason": "Unauthorized",
        "code": 401
      },
      "statusCode": 401,
      "name": "HttpError"
    }

The same issue affects the che pod. It appears that both lost access to the Kubernetes API after the upgrade to version 1.30.5.

ServiceAccounts, Cluster Roles and Bindings are in place for both che-dashboard and che pods

Relevant information

No response


tolusha · 2024-11-04T16:46:13Z

@karatkep
Could you show che pod logs?

I've tried to reproduce on Minikube with Kubernetes 1.31.0, but no luck

karatkep · 2024-11-06T10:14:19Z

@tolusha
According to the che logs, the che pod starts receiving 401 errors from the kube-api exactly one hour after the pod starts working/launches:

06-Nov-2024 08:26:02.136 INFO [main] org.apache.catalina.startup.HostConfig.deployWAR Deployment of web application archive [/home/user/eclipse-che/tomcat/webapps/ROOT.war] has finished in [2,488] ms
06-Nov-2024 08:26:02.138 INFO [main] org.apache.coyote.AbstractProtocol.start Starting ProtocolHandler ["http-nio-8080"]
06-Nov-2024 08:26:02.144 INFO [main] org.apache.catalina.startup.Catalina.start Server startup in [40907] milliseconds
2024-11-06 09:26:32,950[c4d-k5x9l-37628]  [WARN ] [o.j.p.kubernetes.KUBE_PING 115]      - failed getting JSON response from Kubernetes Client[masterUrl=https://10.1.0.1:443/api/v1, headers={Authorization=#MASKED:1868#}, connectTimeout=5000, readTimeout=30000, operationAttempts=3, operationSleep=1000, streamProvider=org.jgroups.protocols.kubernetes.stream.TokenStreamProvider@6c199c1d] for cluster [RemoteSubscriptionChannel], namespace [eclipse-che], labels [app.kubernetes.io/component=che,app.kubernetes.io/instance=che,app.kubernetes.io/managed-by=che-operator,app.kubernetes.io/name=che,app.kubernetes.io/part-of=che.eclipse.org]; encountered [java.lang.Exception: 3 attempt(s) with a 1000ms sleep to execute [OpenStream] failed. Last failure was [java.io.IOException: Server returned HTTP response code: 401 for URL: https://10.1.0.1:443/api/v1/namespaces/eclipse-che/pods?labelSelector=app.kubernetes.io%2Fcomponent%3Dche%2Capp.kubernetes.io%2Finstance%3Dche%2Capp.kubernetes.io%2Fmanaged-by%3Dche-operator%2Capp.kubernetes.io%2Fname%3Dche%2Capp.kubernetes.io%2Fpart-of%3Dche.eclipse.org]]
2024-11-06 09:26:42,473[4c4d-k5x9l-3460]  [WARN ] [o.j.p.kubernetes.KUBE_PING 115]      - failed getting JSON response from Kubernetes Client[masterUrl=https://10.1.0.1:443/api/v1, headers={Authorization=#MASKED:1868#}, connectTimeout=5000, readTimeout=30000, operationAttempts=3, operationSleep=1000, streamProvider=org.jgroups.protocols.kubernetes.stream.TokenStreamProvider@f31944b] for cluster [WorkspaceStateCache], namespace [eclipse-che], labels [app.kubernetes.io/component=che,app.kubernetes.io/instance=che,app.kubernetes.io/managed-by=che-operator,app.kubernetes.io/name=che,app.kubernetes.io/part-of=che.eclipse.org]; encountered [java.lang.Exception: 3 attempt(s) with a 1000ms sleep to execute [OpenStream] failed. Last failure was [java.io.IOException: Server returned HTTP response code: 401 for URL: https://10.1.0.1:443/api/v1/namespaces/eclipse-che/pods?labelSelector=app.kubernetes.io%2Fcomponent%3Dche%2Capp.kubernetes.io%2Finstance%3Dche%2Capp.kubernetes.io%2Fmanaged-by%3Dche-operator%2Capp.kubernetes.io%2Fname%3Dche%2Capp.kubernetes.io%2Fpart-of%3Dche.eclipse.org]]
2024-11-06 09:26:47,468[c4d-k5x9l-46003]  [WARN ] [o.j.p.kubernetes.KUBE_PING 115]      - failed getting JSON response from Kubernetes Client[masterUrl=https://10.1.0.1:443/api/v1, headers={Authorization=#MASKED:1868#}, connectTimeout=5000, readTimeout=30000, operationAttempts=3, operationSleep=1000, streamProvider=org.jgroups.protocols.kubernetes.stream.TokenStreamProvider@5ed91d32] for cluster [WorkspaceLocks], namespace [eclipse-che], labels [app.kubernetes.io/component=che,app.kubernetes.io/instance=che,app.kubernetes.io/managed-by=che-operator,app.kubernetes.io/name=che,app.kubernetes.io/part-of=che.eclipse.org]; encountered [java.lang.Exception: 3 attempt(s) with a 1000ms sleep to execute [OpenStream] failed. Last failure was [java.io.IOException: Server returned HTTP response code: 401 for URL: https://10.1.0.1:443/api/v1/namespaces/eclipse-che/pods?labelSelector=app.kubernetes.io%2Fcomponent%3Dche%2Capp.kubernetes.io%2Finstance%3Dche%2Capp.kubernetes.io%2Fmanaged-by%3Dche-operator%2Capp.kubernetes.io%2Fname%3Dche%2Capp.kubernetes.io%2Fpart-of%3Dche.eclipse.org]]

karatkep · 2024-11-11T16:38:59Z

@tolusha, as I can see, the issue is that the token is not being refreshed. It is generated for 1 hour, and after that time, the che-dashboard continues to use it despite its expiration. Is there any way to prompt the che-dashboard to refresh it before using it for kube-api calls?

tolusha · 2024-11-12T13:59:27Z

@karatkep
Could you share CheCluster CR?
What OIDC provider do you use?

karatkep · 2024-11-12T18:44:22Z

@tolusha,
Yes, of course, I will provide the CheCluster CR. However, I don't think that the issue lies with the CheCluster CR or OIDC. The same version of Eclipse Che 7.93.0 was deployed in two identical AKS clusters (Kubernetes version 1.27.9), and everything was fine until one of the clusters was upgraded to 1.30.5. Immediately after this update, problems with the kube-api started. Reviewing the token used, for example, by the che-dashboard, I see that the expiration field "exp" is always the same and is in the past. From this, I conclude that for Kubernetes version 1.30.5, the token is not being updated.
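For anyone who wants to repeat the "exp" check described above, here is a minimal sketch in Python. It only decodes the JWT payload and does not verify the signature; the helper names `token_expiry` and `is_expired` are made up for illustration and are not part of Che or Kubernetes.

```python
import base64
import json
from datetime import datetime, timezone
from typing import Optional


def token_expiry(jwt: str) -> datetime:
    """Return the `exp` claim of a JWT as a UTC datetime (signature is NOT verified)."""
    payload_b64 = jwt.split(".")[1]
    # JWT segments are base64url without padding; restore the padding before decoding.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    claims = json.loads(base64.urlsafe_b64decode(payload_b64))
    return datetime.fromtimestamp(claims["exp"], tz=timezone.utc)


def is_expired(jwt: str, now: Optional[datetime] = None) -> bool:
    """True if the token's `exp` is at or before `now` (defaults to the current time)."""
    now = now or datetime.now(timezone.utc)
    return token_expiry(jwt) <= now
```

Inside a pod, the input would be the contents of /var/run/secrets/kubernetes.io/serviceaccount/token. Comparing the result for the file on disk with the result for the token a process actually sends should show whether that process is holding a stale copy.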

karatkep · 2024-11-12T22:06:01Z

@tolusha , @ibuziuk , We found the root cause of the issue. In Kubernetes 1.27.9, the token (located at the path /var/run/secrets/kubernetes.io/serviceaccount/token) is issued for one year, although it is refreshed every hour (or more precisely every 50 minutes). At the same time, in Kubernetes 1.30.5, the token is issued for one hour and is also refreshed every 50 minutes. However, Che (che-dashboard, che, and most likely che-gateway) caches this token at startup and uses it. Consequently, in Kubernetes 1.27.9 there is no problem since the token is issued for one year, but in Kubernetes 1.30.5, the problem begins after the first hour from startup because the cached token is used.
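To illustrate the difference between caching the token once at startup and following the file as it is rotated, here is a rough Python sketch. It is not how Che is implemented; `FileTokenSource` is a hypothetical name, and the only thing taken from this thread is the token path.

```python
import os
from typing import Optional


class FileTokenSource:
    """Serve a bearer token from a file, re-reading it whenever the file changes."""

    def __init__(self, path: str = "/var/run/secrets/kubernetes.io/serviceaccount/token"):
        self._path = path
        self._mtime_ns: Optional[int] = None
        self._token: Optional[str] = None

    def get(self) -> str:
        # os.stat follows symlinks, so a rotated target shows up as a new mtime.
        mtime_ns = os.stat(self._path).st_mtime_ns
        if self._token is None or mtime_ns != self._mtime_ns:
            with open(self._path, encoding="utf-8") as fh:
                self._token = fh.read().strip()
            self._mtime_ns = mtime_ns
        return self._token
```

A client that calls `get()` before every kube-api request picks up the refreshed token without a pod restart, whereas a client that reads the file once at startup keeps sending the original token until it expires.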

tolusha · 2024-11-13T08:32:58Z

@karatkep
So, if you restart all pods, Che will continue working, right?

karatkep · 2024-11-13T09:13:11Z

@tolusha
Correct, we need to restart the Che pods every hour to ensure they remain operational.

karatkep · 2024-11-15T09:30:27Z

@tolusha, @ibuziuk,
Could you please share information and plans regarding this issue? Is everything clear and understandable? Were you able to reproduce it? Are you currently working on a resolution, or do you have plans to start working on it soon?

Just to be on the same page - there is absolutely no pressure from my side. I just want to understand the current status and plans regarding this issue. On my part, I have already used one of the possible workarounds and written a CronJob that restarts the necessary Che pods. If other Eclipse Che users are facing or will face the same issue, I am more than willing to share this workaround.

ibuziuk · 2024-11-15T10:09:47Z

@karatkep Thank you for the follow-up and investigation details - #23230 (comment)

I'm still wondering if the token lifetime is configurable on the k8s end in general?
Do you happen to have the link to the Release Notes, docs, or commit where this change with the lifetime was introduced? Could it be some AKS config?

The issue has been planned for the next sprint (Nov 20 - Dec 10), however, so far @tolusha was not able to reproduce it on vanilla minikube.

@karatkep also contributions from the Community are most welcome if you would like to change or update the caching mechanism in the project ;-)

karatkep · 2024-11-15T10:48:07Z

@ibuziuk,
When I was researching this issue, I came across the documentation at https://kubernetes.io/docs/reference/access-authn-authz/service-accounts-admin/#tokenrequest-api which contains detailed information about configuring token lifetime. Moreover, I conducted an experiment where I disabled the che-operator (so it wouldn’t interfere with making changes) and used the expirationSeconds to modify the lifetime of the token. I tried setting it to one day or 86400 seconds for the che-dashboard in the deployment. After restarting the che-dashboard pod, I confirmed that the lifetime of the token (located in /var/run/secrets/kubernetes.io/serviceaccount/token) had indeed changed.

P.S. But frankly speaking, I do not like the option of using a long-lived token, as it contradicts security best practices. It seems to me that whoever made this change (token lifetime: 1y -> 1h) took a step in the right direction by moving to short-lived tokens. And in my opinion, a well-written application should not cache the token indefinitely.

vinokurig · 2024-12-06T11:08:08Z

I managed to decrease the Kubernetes token lifetime to 10 minutes, and I can confirm that Kubernetes connection failure warnings appear every second right after the token expires. However, since Kubernetes updates the token in every pod, I could not reproduce the dashboard error, and all Kubernetes-related actions work fine even after the token expiration.
I am currently working on updating the jgroups-kubernetes dependency of che-server. This is the library that writes the error to the che-server log after the token expires.

vinokurig · 2024-12-09T09:27:52Z

Unfortunately, updating the jgroups-kubernetes dependency to the latest version did not solve the cyclic warning in the che-server log, so I filed an upstream issue.
As for the dashboard log error, I could not reproduce it with the refreshed Kubernetes token. All Kubernetes-related dashboard actions work fine, e.g. adding and listing PAT tokens.

vinokurig · 2024-12-09T15:21:18Z

@karatkep could you please elaborate on what exactly does not work, apart from the errors in the logs? Can you open the dashboard page and navigate to the user preferences?

vinokurig · 2024-12-10T13:49:49Z

To summarize:

If the Kubernetes service account token is refreshed after expiration, all functionality works as expected except for the cyclic error in the che-server logs.
The che-server log error is caused by the jgroups-kubernetes dependency. The dependency is not used by the current functionality, so we should either update it once a version with the fix is available, or remove it as a leftover and check that this does not break the current functionality.
We are going to update the fabric8 Kubernetes client to the latest version.

ibuziuk · 2024-12-10T15:36:53Z

@karatkep my understanding is that so far @vinokurig was not able to reproduce the error even with the short-lived token. Steps to reproduce would be highly appreciated.

Basically, all k8s interactions are happening using Fabric8-Kubernetes-Client for che-server and we plan to bump it to version 7.0.0 next sprint.
cc @manusa maybe you have some input on this situation? do we need to care about updating the token / /var/run/secrets/kubernetes.io/serviceaccount/token, or client handles the update under the hood - #23230 (comment) ?

manusa · 2024-12-11T04:47:17Z

@ibuziuk wrote: "cc @manusa maybe you have some input on this situation? do we need to care about updating the token / /var/run/secrets/kubernetes.io/serviceaccount/token, or client handles the update under the hood - #23230 (comment) ?"

I understand that the Kubernetes Client in use is 6.10.0.

In this case, yes there's a TokenRefreshInterceptor that reloads the config in case there is an auth client error in the HTTP response.

https://github.com/fabric8io/kubernetes-client/blob/9101a2fa4a8f912ff6cda23e4d4b59895ccdc755/kubernetes-client-api/src/main/java/io/fabric8/kubernetes/client/utils/TokenRefreshInterceptor.java#L123-L126

The interceptor logic will work and reload the Config as long as the Config was not provided manually.
Does this ring any bell? Setting a breakpoint in the mentioned lines of code should allow you to debug what's going on the moment the authorization fails.
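For readers who do not want to dig into the linked Java source, the general idea of reloading credentials after an auth failure can be sketched in a few lines of Python. This is not the fabric8 implementation, only an illustration of the retry-on-401 pattern; `call_with_token_refresh`, `send`, and `load_token` are hypothetical names.

```python
from typing import Callable, Tuple

Response = Tuple[int, str]  # (HTTP status code, response body)


def call_with_token_refresh(
    send: Callable[[str], Response],
    load_token: Callable[[], str],
    cached_token: str,
) -> Tuple[Response, str]:
    """Send a request with the cached token; on 401, reload the token and retry once.

    Returns the final response together with the token that is now in use.
    """
    response = send(cached_token)
    if response[0] != 401:
        return response, cached_token
    fresh_token = load_token()
    if fresh_token == cached_token:
        # Nothing newer is available, so a retry would fail the same way.
        return response, cached_token
    return send(fresh_token), fresh_token
```

Here `load_token` stands in for whatever re-reads the credentials (for example the mounted service account token file). If the token was handed to the client manually, there is nothing to reload from, which matches the caveat above about a manually provided Config.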

karatkep added the kind/question Questions that haven't been identified as being feature requests or bugs. label Nov 4, 2024

che-bot added the status/need-triage An issue that needs to be prioritized by the curator responsible for the triage. See https://github. label Nov 4, 2024

ibuziuk added this to Eclipse Che Team A Backlog Nov 5, 2024

ibuziuk moved this to 📅 Planned in Eclipse Che Team A Backlog Nov 15, 2024

ibuziuk added this to Red Hat OpenShift Dev Spaces and Web Terminal Priorities Nov 15, 2024

ibuziuk moved this to Todo in Red Hat OpenShift Dev Spaces and Web Terminal Priorities Nov 15, 2024

tolusha assigned akurinnoy, tolusha and vinokurig Nov 20, 2024

ibuziuk unassigned tolusha and akurinnoy Dec 5, 2024

ibuziuk moved this from Todo to In Progress in Red Hat OpenShift Dev Spaces and Web Terminal Priorities Dec 5, 2024

vinokurig moved this from 📅 Planned to 🚧 In Progress in Eclipse Che Team A Backlog Dec 6, 2024
