Investigate restarting pods in-place as a mechanism for faster failure recovery #467
Comments
Reading this, why wouldn't one just want to use the Job's backoffLimit?

BackoffLimit only applies to that particular Job, but we need any pod failure in any Job to trigger pod restarts across all Jobs.

Sorry, I just joined this project and I'm still trying to understand many things, so I'm not sure if I'm thinking about this correctly. I want to confirm whether the intention is to restart the entire ReplicatedJob when any Job fails, or to restart the entire JobSet.

Right now it means the entire JobSet, but after #381 is implemented, it will depend on whether (and how) a failure policy is configured.
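To make the distinction concrete, here is a minimal JobSet sketch; the names, image, and values are illustrative only and assume the v1alpha2 API. The child Job's backoffLimit governs pod retries within that one Job, while the JobSet-level failurePolicy governs how many times the whole set of child Jobs is restarted when any of them fails:

```yaml
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: training-run              # illustrative name
spec:
  failurePolicy:
    maxRestarts: 3                # JobSet-level: recreate ALL child Jobs, up to 3 times
  replicatedJobs:
  - name: workers
    replicas: 2
    template:                     # an ordinary batch/v1 Job template
      spec:
        parallelism: 4
        completions: 4
        backoffLimit: 0           # Job-level: a pod failure fails only this Job
        template:
          spec:
            restartPolicy: Never
            containers:
            - name: trainer
              image: example/trainer:latest   # placeholder image
```

Because backoffLimit is scoped to a single child Job, the only mechanism that turns one pod failure into a restart of every Job is the JobSet-level failure policy, and that restart is currently implemented by deleting and recreating all child Jobs and rescheduling all of their pods; that full recreation is the cost this issue wants to avoid.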
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to its standard lifecycle rules. You can mark this issue as fresh with /remove-lifecycle stale, close it with /close, or offer to help out with Issue Triage. Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale
/remove-lifecycle stale
Update: After prototyping and investigating this further, I've identified two upstream changes that will be needed for approaches based on "kill the container process to force the Kubelet to recreate it" logic.

I'll continue pursuing these upstream while I investigate alternatives for the short term.
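To illustrate the Kubelet behavior this approach leans on, here is a minimal sketch (names and image are placeholders): when a container's main process exits and the pod-level restartPolicy is OnFailure or Always, the Kubelet recreates just that container inside the existing pod sandbox, on the same node, without going back through the scheduler. Forcing the process to die from outside the container is the part that needs the upstream changes mentioned above.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: in-place-restart-demo     # illustrative name
spec:
  restartPolicy: OnFailure        # Kubelet restarts failed containers in place
  containers:
  - name: trainer
    image: example/trainer:latest # placeholder image
    command: ["/bin/trainer"]     # if this process dies, the Kubelet restarts
                                  # the container here; the pod keeps its node,
                                  # IP, and volumes, and is never rescheduled
```

In `kubectl get pods` this shows up as the pod's RESTARTS count incrementing while the pod itself stays on the same node, which is the fast path being compared against recreating all child Jobs.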
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to its standard lifecycle rules. You can mark this issue as fresh with /remove-lifecycle stale. Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale
/remove-lifecycle stale

Actually, I have another idea for how to do this; I may try prototyping it.
What would you like to be added:
A faster restart/failure-recovery mechanism that doesn't involve recreating all the child Jobs, rescheduling all of the pods, etc.
Why is this needed:
For faster failure recovery, reducing downtime for batch/training workloads that hit some kind of error.