Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"nvidia-gpu-operator" does not exist #10

Open
kabelo-twala opened this issue Aug 14, 2024 · 2 comments
Open

"nvidia-gpu-operator" does not exist #10

kabelo-twala opened this issue Aug 14, 2024 · 2 comments

Comments

@kabelo-twala
Copy link

kabelo-twala commented Aug 14, 2024

Hi, this used to work alright a few days ago but now after running cdk deploy Comfyui-Cluster the stack rolls back with the following error.

Received response status [FAILED] from custom resource. Message returned: Error: b'Release "nvidia-gpu-operator" does not exist. Installing it now.\nError: failed to fetch https://helm.ngc.nvidia.com/nvidia/charts/gpu-operator-v24.3.0.tgz : 401 Unauthorized\n' Logs: /aws/lambda/Comfyui-Cluster-awscdkawseksKubect-Handler886CB40B-Pn5FFiLl2QgE at invokeUserFunction

As you can imagine working with Cloudformation takes a while to complete events and I've spent a couple of hours on this issue. Please assist.

Screenshot 2024-08-14 at 12 17 26
@kabelo-twala
Copy link
Author

kabelo-twala commented Aug 14, 2024

I looked up the issue and found this which is a workaround that worked.

I've updated the repository value in node_modules/@aws-quickstart/eks-blueprints/dist/addons/gpu-operator/index.js

from

const defaultProps = {
name: "gpu-operator-addon",
namespace: "gpu-operator",
chart: "gpu-operator",
version: "v24.3.0",
release: "nvidia-gpu-operator",
repository: "https://helm.ngc.nvidia.com/nvidia",
createNamespace: true,
values: {}
};

to

const defaultProps = {
name: "gpu-operator-addon",
namespace: "gpu-operator",
chart: "gpu-operator",
version: "v24.3.0",
release: "nvidia-gpu-operator",
repository: "https://nvidia.github.io/gpu-operator",
createNamespace: true,
values: {}
};

and after running cdk deploy Comfyui-Cluster all the Cloudformation events were created.

Again this is a temporary fix. It would be great if this was addressed. I'll also create an issue on the package for the @aws-quickstart owners to maybe resolve

@Shellmode
Copy link
Contributor

Thanks for your deep dive, @kabelo-twala. I'll try to reproduce the issue in a new environment.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants