Retire pending workers #820

Closed
wants to merge 0 commits into from

Conversation

BitTheByte

Based on discussion at #817

BitTheByte marked this pull request as draft September 18, 2023 16:44
BitTheByte marked this pull request as ready for review September 18, 2023 17:36
@BitTheByte
Author

Everything looks good, at least for now. I deployed the change to our production cluster and will provide updates if needed. Hopefully everything works well 😄

@BitTheByte
Author

BitTheByte commented Sep 18, 2023

Surprisingly, this was the most stable run I've had so far. One thing to note: if a pod is restarting, its deployment is in an unready state, and there is a small chance of the following:

  1. The operator observes the unready state.
  2. The pod restarts fast enough and starts executing work.
  3. The operator kills the pod mid-run.

So I was thinking of a way to run this logic only on deployments that have pods in a Pending state; however, I can't find a way to do that using kr8s.

@BitTheByte
Author

Done! Now the operator takes action based on pending pods rather than the deployments.
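
For context, a minimal sketch (not the actual diff in this PR) of how pending worker pods could be listed with kr8s; the dask.org labels and the label_selector/field_selector keyword arguments are assumptions about the cluster's labelling and about kr8s's get() signature:

```python
# Hypothetical sketch: list Pending worker pods for one cluster via kr8s.
# The labels below are assumptions; adjust to whatever the operator actually sets.
import kr8s.asyncio

async def pending_worker_pods(cluster_name: str, namespace: str):
    return await kr8s.asyncio.get(
        "pods",
        namespace=namespace,
        label_selector={
            "dask.org/cluster-name": cluster_name,  # assumed label
            "dask.org/component": "worker",         # assumed label
        },
        # Phase "Pending" covers pods that are not yet scheduled or whose
        # containers have not all started (e.g. image pulls still running).
        field_selector={"status.phase": "Pending"},
    )
```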

@jacobtomlinson left a comment
Member


Thanks for taking the time here. A few thoughts.

Comment on lines 626 to 630
not in [
"ContainerCreating",
"Initializing",
"Terminating",
"Running",
@jacobtomlinson
Member

I think you might be confusing the Pod phases with the status that kubectl displays. The possible values for phase are Pending, Running, Succeeded, Failed and Unknown.

https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#podstatus-v1-core
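
For illustration, a minimal sketch (not the code in this PR) of checking the actual phase field rather than the status column that kubectl/k9s display; it assumes the pod manifest is available as a plain dict (e.g. the raw object returned by the Kubernetes API):

```python
# Hypothetical helper: a pod counts as pending only when status.phase == "Pending";
# phase is one of Pending, Running, Succeeded, Failed or Unknown.
def is_pending(pod_manifest: dict) -> bool:
    phase = pod_manifest.get("status", {}).get("phase", "Unknown")
    return phase == "Pending"
```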

@BitTheByte
Author

Yup, you're correct. I've been debugging this using k9s.

dask_kubernetes/operator/controller/controller.py (outdated review thread, resolved)
@BitTheByte
Author

Sorry for the long delay; unfortunately, I don't have much time to work on this. In the meantime, I'll close this PR to allow someone else to finish it.

BitTheByte closed this Oct 22, 2023
BitTheByte reopened this Jan 8, 2024
@BitTheByte
Author

@jacobtomlinson I believe this is ready to merge

@jacobtomlinson left a comment
Member


The CI appears to be hanging as a result of these changes. This needs looking into before we can merge.

@BitTheByte
Author

It seems to be related to building some Go code. I tried to find out what was wrong but couldn't figure it out. It appears to be something with the CI itself, as I see most of the other PRs are failing too.

@BitTheByte
Author

@jacobtomlinson If possible, can you suggest a solution to this issue?

@jacobtomlinson
Member

Thanks for being patient here. I've nudged the CI back into a happy state and pulled main into this PR. So hopefully now any failures will only be related to this PR.

@jacobtomlinson
Member

OK, the tests are still failing here, so they are certainly hanging due to the changes in this PR.

@BitTheByte
Author

It seems like they're timeout issues rather than something actually failing.

@jacobtomlinson
Member

Sure but I expect the CI is timing out because the tests are hanging. The tests on main do not hang, which suggests the changes here cause the tests to hang.
