Describe the bug
A few days ago, one line above, someone reported suspicious behaviour with the system's ability to recover from endpoint worker processes dying.
I have broadly recreated this by manually killing a worker process in my kubernetes dev environment. The killed worker sits in Z (zombie) state, which means the process that launched it hasn't got round to retrieving its exit code yet.
A task subsequently launched against this endpoint goes as far as this:
Task is pending due to waiting-for-launch
and then nothing further happens.
Eventually I killed the worker pod, and that progressed the task to this failure from get_result: Serialization Error during: Task's exception object deserialization
I'll note that the parsl fork of htex seems able to detect worker loss, reporting: parsl.executors.high_throughput.errors.WorkerLost: Task failure due to loss of worker 0 on host parsl-dev-3-9-5568
I think that on a multi-worker endpoint, with users assuming that funcx "hangs sometimes, I'll just retry", this would manifest as a performance problem rather than as ongoing hangs: one worker vanished or hung and blocked on an abandoned task, while the others continue to perform work, means subsequent work still proceeds, just at a reduced pace. As long as at least one worker is left, hung/missing workers manifest as a performance reduction plus a single hung task per lost worker.
To Reproduce
This was easily reproducible on my kubernetes dev cluster by putting a sys.exit into a funcx function and invoking it.
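For concreteness, a minimal sketch of that reproduction, assuming the funcx SDK's FuncXClient with its register_function / run interface; the endpoint UUID is a placeholder for a real dev endpoint:

```python
# Minimal reproduction sketch (assumed funcx SDK usage; endpoint UUID is a placeholder).
from funcx.sdk.client import FuncXClient


def kill_worker():
    # sys.exit inside the function takes the worker process down with it,
    # leaving the worker as a Z (zombie) process inside the pod.
    import sys
    sys.exit(1)


fxc = FuncXClient()
function_id = fxc.register_function(kill_worker)
endpoint_id = "00000000-0000-0000-0000-000000000000"  # placeholder: your dev endpoint

task_id = fxc.run(endpoint_id=endpoint_id, function_id=function_id)

# Tasks submitted to the same endpoint after this one stall at
# "waiting-for-launch", as described above.
```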
Expected behavior
Disappeared workers should be restarted, or some other recovery behaviour should occur (for example, failing the affected task with a WorkerLost-style error, as the parsl fork of htex does).
Environment
my kubernetes dev cluster; main branch of everything as of 2022-02-28