Investigate using Elastic Index Job in exclusive placement implementation #482
Comments
I did look into supporting elastic jobs with jobset. I opened up a PR but I found out that we enforce immutability on all replicated jobs. I'm still waiting for some comments on what kind of validation we would want for replicated jobs. |
We don't need to mutate replicatedJobs for this issue, the controller would be performing this logic on individual Jobs. The thing you're exploring is different (elastic replicated jobs, enabling number of replicas to scale up or down). This issue is scaling the number of Job pods up after creation, which doesn't require mutating the replicatedJob. |
Yikes. That's a good distinction, but I'm not sure we want to support it that way. I guess the main thing you'd want to check is whether it breaks JobSet status counting. JobSet is going to converge to the values in its spec. If users patch the underlying Jobs, I wonder what happens to the JobSet statuses. I think this is why Deployment/StatefulSet encourage users to use scale on those applications rather than editing the pods directly. |
It's not the user patching the Jobs, it is the JobSet controller. It creates each child Job with 1 pod (the leader pod) and, once it's scheduled, updates the child Job's completions and parallelism to match what's in the JobSet spec. This way we avoid spamming the apiserver with pod creation requests that we know will be rejected, which cause the Job controller to undergo exponential backoff if the leader pod takes too long to schedule, delaying the time for all the workers to be ready. Regardless, this is just an issue to prototype it and see how it performs versus the current webhook-based implementation. |
Okay. I was just thinking that this would make reproducibility a bit challenging, as the spec is not the desired state. But for a prototype I think it's worth exploring. |
What is the current state of this work? I see it is unassigned, if nobody is actively working on it, I am interested in picking it up. |
@dejanzele That would be great! Let me know if you have any questions on the idea or the implementation. |
Thanks, I'll get familiar and ping you soon /assign |
/kind feature |
@dejanzele are you still planning to explore this? No rush, this isn't needed urgently but just want to check in. |
@danielvegamyhre yes, I plan to start this week |
@dejanzele how is this going? |
@danielvegamyhre I got sidetracked with some other work, I don't think I'll have capacity in the next 2-3 weeks. /unassign |
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale |
/remove-lifecycle stale I still think this would be useful to prototype and benchmark against the current implementation; in theory it should be more performant. I can prototype it but currently don't have access to large clusters for scale testing. |
Right now, the implementation of exclusive job placement per topology domain (node pool, zone, etc) relies on a pod webhook which allows leader pods (index 0) to be admitted, created, and scheduled, but blocks follower pods (all other indexes) until the leader pod for that child Job is scheduled.
The leader pods have pod affinity/anti-affinity constraints ensuring each leader pod lands in a different topology domain. Follower pods have nodeSelectors injected by the pod mutating webhook, ensuring they land in the same topology domain as their leader.
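To make the two kinds of constraints concrete, here is a minimal sketch that builds the relevant pod spec fragments as plain dictionaries. The field names (`podAffinity`, `podAntiAffinity`, `requiredDuringSchedulingIgnoredDuringExecution`, `labelSelector`, `topologyKey`, `nodeSelector`) follow the Kubernetes Pod API. The label key `example.com/job-key` and the helper names are hypothetical and chosen for illustration only; they are not taken from the JobSet code.

```python
# Hypothetical label that identifies which child Job a pod belongs to.
JOB_KEY_LABEL = "example.com/job-key"


def leader_affinity(job_key: str, topology_key: str) -> dict:
    """Build the `affinity` stanza for a leader pod (index 0).

    Anti-affinity keeps the leader out of any topology domain that already
    hosts pods of a *different* Job; affinity keeps it with its own Job.
    """
    same_job = {"key": JOB_KEY_LABEL, "operator": "In", "values": [job_key]}
    has_job_key = {"key": JOB_KEY_LABEL, "operator": "Exists"}
    other_job = {"key": JOB_KEY_LABEL, "operator": "NotIn", "values": [job_key]}
    return {
        "podAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": [
                {
                    "labelSelector": {"matchExpressions": [same_job]},
                    "topologyKey": topology_key,
                }
            ]
        },
        "podAntiAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": [
                {
                    "labelSelector": {"matchExpressions": [has_job_key, other_job]},
                    "topologyKey": topology_key,
                }
            ]
        },
    }


def follower_node_selector(topology_key: str, leader_node_labels: dict) -> dict:
    """Build the `nodeSelector` for a follower pod.

    The follower is pinned to the topology domain of the node on which its
    leader was scheduled, so no affinity computation is needed for it.
    """
    return {topology_key: leader_node_labels[topology_key]}
```

For example, with a topology key of `topology.kubernetes.io/zone` and a leader scheduled on a node labeled with zone `zone-a`, `follower_node_selector` returns `{"topology.kubernetes.io/zone": "zone-a"}`.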
This is an improvement over the original implementation of using pod affinity/anti-affinity constraints on all pods (which did not scale well due to the pod affinity rule computation time scaling linearly with the number of pods and nodes). However, the repeated wasted follower pod creation attempts are putting unnecessary pressure on the apiserver.
One possible option is to use Elastic Indexed Jobs to first create every Job with completions == parallelism == 1 (so this will only create index 0 / leader pods). The pod webhook will still inject pod affinity/anti-affinity constraints into the leader pods as it currently does.
Once a leader pod for a given Job is scheduled, resize the Job so that completions and parallelism match the values in the JobSet spec. The follower pods will then be created and have the nodeSelector injected to follow the leader, as is currently done. This will minimize pressure on the apiserver by avoiding unnecessary pod creation attempts.
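The two-phase flow described above can be sketched as a small in-memory model. This is only an illustration of the proposed control logic, not the JobSet controller: `ChildJob`, `desired_pods`, `leader_scheduled`, and `reconcile` are hypothetical names, and no Kubernetes API is called. It relies on the fact that an elastic Indexed Job allows completions and parallelism to be changed after creation as long as the two stay equal.

```python
from dataclasses import dataclass


@dataclass
class ChildJob:
    """Simplified stand-in for a child Job managed by the controller."""

    name: str
    desired_pods: int               # pod count taken from the JobSet spec
    completions: int = 1            # phase 1: only index 0 (the leader) exists
    parallelism: int = 1
    leader_scheduled: bool = False  # set once the leader pod has a node


def reconcile(job: ChildJob) -> bool:
    """Resize the Job once its leader is scheduled.

    Returns True if the Job was resized in this pass, False otherwise.
    """
    if not job.leader_scheduled:
        # Phase 1: wait. No follower pods exist, so nothing gets rejected.
        return False
    if job.completions == job.desired_pods and job.parallelism == job.desired_pods:
        # Already at full size; reconcile is idempotent.
        return False
    # Phase 2: scale up, keeping completions == parallelism.
    job.completions = job.desired_pods
    job.parallelism = job.desired_pods
    return True


if __name__ == "__main__":
    job = ChildJob(name="workers-0", desired_pods=4)
    print(reconcile(job), job.completions)  # leader not scheduled yet
    job.leader_scheduled = True
    print(reconcile(job), job.completions)  # resized to the spec value
```

Because the follower pods are only created after the resize, every follower creation request is one that is expected to be admitted, which is the source of the hoped-for reduction in apiserver load.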