Describe the bug
Kubernetes worker pods, perhaps only ones which did not register properly, accumulate forever.
The worker process exits "normally", with no indication in kubectl logs that anything went wrong, but Kubernetes immediately restarts that worker (which then fails again).
Nothing causes these workers to go away.
For example, here is one that has been restarted 1607 times since it was initially launched over two weeks ago.
root@amber:~# minikube kubectl get pods
NAME READY STATUS RESTARTS AGE
funcx-1632329996841 0/1 CrashLoopBackOff 1607 (4m31s ago) 8d
[...]
In the parsl model of execution, a worker that fails should go away - the LRM/provider layer shouldn't be restarting workers. Instead, the feedback loop involves the parsl scaling strategy layer, which decides whether to launch new workers to replace failed ones if there is still pressure for that many workers to exist.
Perhaps funcx's fork of htex preserves that model - I'm unsure.
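To illustrate the model described above, here is a minimal, hypothetical sketch of that feedback loop. None of the names here (provider, submit_worker_pods, pending_tasks, connected_workers) are real parsl or funcx APIs; the sketch only shows where the "restart failed workers?" decision is meant to live.

```python
import math

# Hypothetical sketch of the scaling-strategy feedback loop described above.
# The names (provider, submit_worker_pods, ...) are illustrative only and are
# not the actual parsl/funcx interfaces.
def strategize(provider, pending_tasks: int, connected_workers: int,
               tasks_per_worker: int = 1) -> None:
    """Decide whether failed workers should be replaced."""
    needed = math.ceil(pending_tasks / tasks_per_worker)
    if connected_workers < needed:
        # Scale out: the strategy layer, not Kubernetes, launches replacements.
        provider.submit_worker_pods(needed - connected_workers)
    # If workers have failed but there is no remaining task pressure, do
    # nothing: the failed pods simply stay gone instead of being restarted
    # in place by a Kubernetes restartPolicy.
```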
In the Kubernetes context, maybe that means that worker pods should have restartPolicy: Never, so that they go away on failure.
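For example, here is a minimal sketch of creating a worker pod with restartPolicy: Never through the official Python kubernetes client. The pod name, labels, image and command are placeholders, not the values the funcx endpoint actually uses.

```python
from kubernetes import client, config

config.load_kube_config()

# Placeholder name/labels/image/command - not the actual funcx endpoint values.
pod = client.V1Pod(
    metadata=client.V1ObjectMeta(
        name="funcx-worker-example",
        labels={"app": "funcx-worker"},
    ),
    spec=client.V1PodSpec(
        restart_policy="Never",  # a failed worker pod stays terminated
        containers=[
            client.V1Container(
                name="worker",
                image="example/worker-image:tag",
                command=["process_worker_pool.py"],
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```

With restartPolicy: Never, a pod whose worker process exits shows up as Error or Completed rather than cycling through CrashLoopBackOff, and the strategy layer can then decide whether to submit a replacement.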
This is a continuation of the change introduced in parsl in Parsl/parsl#1073 to use pods rather than deployments for worker management, moving more of that management into the strategy code and away from Kubernetes, which does not manage it correctly.
In addition to my initial comment about these being pods that "did not register properly", I am also seeing this when killing and restarting an endpoint with "k8 delete pod funcx-endpoint-....". The worker pods managed by that endpoint are left behind forever.
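In the meantime, the leftover worker pods can be cleaned up by hand. Below is a rough sketch using the Python kubernetes client; the app=funcx-worker label selector is an assumption - substitute whatever labels the endpoint actually puts on its worker pods.

```python
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# "app=funcx-worker" is an assumed label selector, not necessarily what the
# endpoint really sets; adjust it to match the orphaned worker pods.
pods = v1.list_namespaced_pod("default", label_selector="app=funcx-worker")
for pod in pods.items:
    statuses = pod.status.container_statuses or []
    crash_looping = any(
        cs.state.waiting and cs.state.waiting.reason == "CrashLoopBackOff"
        for cs in statuses
    )
    if crash_looping or pod.status.phase in ("Failed", "Succeeded"):
        print(f"deleting {pod.metadata.name}")
        v1.delete_namespaced_pod(pod.metadata.name, "default")
```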
Cross-ref Parsl/parsl#2132 -- it's possible/likely that the parsl Kubernetes code also has this behaviour.
To Reproduce
Start a worker with the Python versions incorrectly configured, so that it fails to register properly.
Expected behavior
Broken worker pods should not accumulate without bound.
Environment
My minikube environment on Ubuntu.