KubeCluster with service type LoadBalancer on AKS stuck in "Pending" state on startup #906

Closed
catzc opened this issue Sep 16, 2024 · 7 comments · Fixed by #908
Comments

@catzc
Contributor

catzc commented Sep 16, 2024

Describe the issue:
I'm trying to create a KubeCluster with an AKS LoadBalancer, but the DaskCluster gets stuck in a Pending state at startup.

Minimal Complete Verifiable Example:

Trying to create a cluster like this:

from dask_kubernetes.operator import KubeCluster, make_cluster_spec


def make_sim_cluster(
        name: str,
        cpu_cores: int,
        memory_in_gb: int,
        image: str = '<image>',
        tag: str = 'x.x.x',
        namespace: str = 'namespace',
):
    base_cluster_spec = make_cluster_spec(
        name=name,
        image=f'{image}:{tag}',
        n_workers=cpu_cores,
        scheduler_service_type="LoadBalancer",
    )
    # Request an internal Azure load balancer via the standard AKS annotation.
    base_cluster_spec['metadata']['annotations'] = {
        "service.beta.kubernetes.io/azure-load-balancer-internal": "true",
    }

    return KubeCluster(custom_cluster_spec=base_cluster_spec, namespace=namespace)

but the DaskCluster is stuck in the Pending state until it times out.

Anything else we need to know?:

I looked at our service spec, and when the scheduler service comes up its status reports the load balancer under the key "loadBalancer" (as documented at https://kubernetes.io/docs/concepts/services-networking/service/#loadbalancer), while dask-kubernetes looks for "load_balancer" in the status:

if spec["type"] == "LoadBalancer" and not len(
status.get("load_balancer", {}).get("ingress", [])
):
phase = "Pending"

For what it's worth, the kr8s library also seems to look for "load_balancer".

When I switched these over to "loadBalancer", the DaskCluster came up without any issues. Even without the code change, the service, pods, deployments, etc. were all set up and running; only the DaskCluster was stuck in Pending.

Any help is appreciated, thanks!

Environment:

  • Dask version:
    • dask-core: 2024.2.1
    • dask-kubernetes: 2024.5.0
  • Python version: 3.11.9
  • Operating System: RHEL 7.9 (Maipo)
  • Install method (conda, pip, source): conda
@jacobtomlinson
Member

Thanks for raising this. Internally kr8s uses the box library to make accessing nested data structures more pleasant and Pythonic. Early on we used the "camel killer" box to make keys like loadBalancer more Pythonic by converting them to snake case, like load_balancer.

>>> import box
>>> data_in = {"loadBalancer": True}
>>> data = box.Box(data_in, camel_killer_box=True)

You can access the loadBalancer key using either .loadBalancer or .load_balancer, which is very pleasant.

>>> data.load_balancer
True
>>> data.loadBalancer
True

Unfortunately this operation in box is destructive and modifies the underlying data structure to store the data in snake case, which means that when you convert back to a dictionary you don't get the same structure you put in.

>>> data.to_dict()
{'load_balancer': True}

Given that kr8s needs to be able to write data back to the Kubernetes API, we can't mangle the data structures like this, so at some point we dropped the camel killer box.
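For comparison, a plain Box without camel_killer_box should round-trip the original keys unchanged; a quick sketch:

>>> plain = box.Box({"loadBalancer": True})
>>> plain.loadBalancer
True
>>> plain.to_dict()
{'loadBalancer': True}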

It looks like this bit of code where we look up the load balancer info is still using the snake case and needs updating to camel case.
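Presumably the check would end up looking something like this (just a sketch of the direction, not the exact patch):

if spec["type"] == "LoadBalancer" and not len(
    status.get("loadBalancer", {}).get("ingress", [])
):
    phase = "Pending"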

@jacobtomlinson
Member

I've opened up #907 to resolve this. @blankhold could you confirm that these are the changes you made that fixed things for you? It's a little hard to test this fix, as we can't easily provision load balancers in our test suite.

@catzc
Contributor Author

catzc commented Sep 17, 2024

Awesome, thanks for the explanation and PR @jacobtomlinson!

It seems like there are a few more patches I had to make. I pulled in the changes from your PR, along with the changes you recently made in kr8s (kr8s-org/kr8s@665ff53).

Some other things:

  1. Startup still gets stuck in _wait_for_controller.

    • I got around this by having it asyncio.sleep(5) and then return immediately, skipping the check (for some reason the DaskCluster status is never updated to phase: "Running"). Doing an immediate return gives a BoxKeyError.
  2. lb.hostname also gives a BoxKeyError

    • I removed that entirely, but it seems like we might need default_box_none_transform for Box, or to refactor that section (see the sketch below).

With those changes, I was able to get a DaskCluster with a LoadBalancer service and access the dashboard externally.
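For illustration, a defensive lookup of the ingress entry might look something like this (field names follow the Kubernetes Service status; this is only a sketch, not the actual change):

ingress = status.get("loadBalancer", {}).get("ingress", [])
if ingress:
    lb = ingress[0]
    # A LoadBalancer ingress entry exposes "ip" or "hostname", depending on the provider.
    host = lb.get("ip") or lb.get("hostname")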

@jacobtomlinson
Member

Ok great! Perhaps it's easiest if you put up a PR that supersedes mine with the changes you need, and we can discuss the specifics there.

catzc added a commit to catzc/dask-kubernetes that referenced this issue Sep 17, 2024
@catzc
Contributor Author

catzc commented Sep 17, 2024

Thanks! I created #908, which I think should fix the issues. Also, do you know when kr8s can cut a new release so that a new dask-kubernetes release can pick up your recent changes? Or how will the release process go so that we can pick up these changes once they're merged?

catzc added a commit to catzc/dask-kubernetes that referenced this issue Sep 17, 2024
@catzc
Contributor Author

catzc commented Sep 17, 2024

I didn't realize how to test the controller before, so my comment about getting stuck in _wait_for_controller might not apply, or might simply be fixed once I rebuild the controller to test. Will let you know if that works.

@jacobtomlinson
Member

I've just released kr8s version 0.17.4 and dask-kubernetes version 2024.9.0 with these fixes in.
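Upgrading both packages should be enough to pick up the fix, for example (the explicit version pins are just for clarity; any later release should also work):

pip install --upgrade "dask-kubernetes>=2024.9.0" "kr8s>=0.17.4"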
