
Pods with unready Containers exist on this node, we can't clean the slots yet #681

Open
rpieczon opened this issue Nov 16, 2023 · 5 comments
Labels: bug (Something isn't working), keep-alive

Comments

@rpieczon

Describe the bug

The Akri agent daemonset keeps reporting the following error whenever any pod running on the cluster is not ready:

[2023-11-16T13:44:46Z TRACE agent::util::slot_reconciliation] reconcile - Pods with unready Containers exist on this node, we can't clean the slots yet

In my case the failing pod doesn't use any USB resources.

Output of kubectl get pods,akrii,akric -o wide

lpfe04@f1725b929a:~$ kubectl get pod,akrii,akric -n akri
NAME                                              READY   STATUS    RESTARTS   AGE
pod/akri-agent-daemonset-9gwl2                    1/1     Running   0          10m
pod/akri-controller-deployment-7c6455f79-zt779    1/1     Running   0          11m
pod/akri-udev-discovery-daemonset-2d9hp           1/1     Running   0          10m
pod/akri-webhook-configuration-7bf6656b45-mclth   1/1     Running   0          11m

NAME                                  CONFIG        SHARED   NODES            AGE
instance.akri.sh/gsm-dongle-6e977d    gsm-dongle    false    ["f1725b929a"]   10m
instance.akri.sh/wifi-dongle-254c38   wifi-dongle   false    ["f1725b929a"]   10m
instance.akri.sh/wifi-dongle-ac917e   wifi-dongle   false    ["f1725b929a"]   10m

NAME                                CAPACITY   AGE
configuration.akri.sh/gsm-dongle    1          11h
configuration.akri.sh/wifi-dongle   1          11h

Kubernetes Version:

kubernetes: v1.26.8+rke2r1

Expected behavior

I would expect the reconciliation process to continue if the failing pod is outside the USB usage/management context.

@rpieczon rpieczon added the bug Something isn't working label Nov 16, 2023
@github-project-automation github-project-automation bot moved this to Triage needed in Akri Roadmap Nov 16, 2023
@kate-goldenring
Contributor

@rpieczon just to clarify, are you saying that if any pod (even one unassociated with Akri) is unready, it causes this slot reconciliation error? From what I remember, slot reconciliation should only check pods with an expected annotation.

@bfjelds
Collaborator

bfjelds commented Dec 6, 2023

I lose track a little, but the annotations are on the container, not the pod, I think ... and it might be that an unready pod is considered a potential place where an annotated container could eventually exist. It might be worth looking at the resource requests to limit where this early exit happens.
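The filtering idea suggested here could be sketched roughly as follows. This is an illustrative sketch only, not Akri's actual implementation: `PodInfo` and `pod_blocks_reconciliation` are hypothetical names, and the real code in `agent::util::slot_reconciliation` works against Kubernetes API types. The idea is that an unready pod would only block slot cleanup if one of its containers actually requests an `akri.sh/` resource.

```rust
// Hypothetical sketch: instead of aborting reconciliation when ANY pod on
// the node is unready, skip unready pods that do not request an Akri
// resource. Names and types here are illustrative, not Akri's real ones.

struct PodInfo {
    ready: bool,
    // Resource names requested by the pod's containers,
    // e.g. "akri.sh/wifi-dongle-254c38" or plain "cpu"/"memory".
    requested_resources: Vec<String>,
}

/// An unready pod only blocks slot cleanup if it actually requests a
/// resource under the Akri prefix (i.e. it could hold a device slot).
fn pod_blocks_reconciliation(pod: &PodInfo, resource_prefix: &str) -> bool {
    !pod.ready
        && pod
            .requested_resources
            .iter()
            .any(|r| r.starts_with(resource_prefix))
}

fn main() {
    // A failing Prometheus pod: unready, but requests no Akri resources,
    // so it should not hold up slot reconciliation.
    let prometheus = PodInfo {
        ready: false,
        requested_resources: vec!["cpu".into(), "memory".into()],
    };
    // An unready broker pod that does request a USB device slot.
    let broker = PodInfo {
        ready: false,
        requested_resources: vec!["akri.sh/wifi-dongle-254c38".into()],
    };
    assert!(!pod_blocks_reconciliation(&prometheus, "akri.sh/"));
    assert!(pod_blocks_reconciliation(&broker, "akri.sh/"));
}
```

As the next comment notes, this check may be hard to perform in practice: if the pod is unready because its containers were never created, the resource-request context may not be fully available to the agent.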

@bfjelds
Collaborator

bfjelds commented Dec 6, 2023

It might be hard to check for the resource, though. If the pod isn't ready and the container doesn't exist, there isn't much context to check the instances against.

@rpieczon
Author

> @rpieczon just to clarify, are you saying that if any pod (even one unassociated with Akri) is unready, it causes this slot reconciliation error? From what I remember, slot reconciliation should only check pods with an expected annotation.

Exactly. In my case I have a failing Prometheus pod which has zero requirements related to USB allocation.

@kate-goldenring kate-goldenring moved this from Triage needed to Investigating in Akri Roadmap Jan 9, 2024
@rpieczon
Author

Any update on this?
