Resource limitation for the sidecar container on Autopilot #35

bhack · 2023-06-04T21:32:38Z

Looking at the default PyTorch example in this repository, I see some performance-related incompatibilities with Autopilot's minimum resource requests [1].
I think we will have many problems allocating sidecar resources if Autopilot keeps these high minimum limits.

gcs-fuse-csi-driver/examples/pytorch/train-job-pytorch.yaml

Lines 35 to 39 in 8a8d871

    
    annotations:
      gke-gcsfuse/volumes: "true"
      gke-gcsfuse/cpu-limit: "10"
      gke-gcsfuse/memory-limit: 40Gi
      gke-gcsfuse/ephemeral-storage-limit: 20Gi

[1] https://cloud.google.com/kubernetes-engine/docs/concepts/autopilot-resource-requests

bhack · 2023-06-04T21:36:40Z

Please also consider that Autopilot has officially been the default and recommended GKE mode since April.

bhack · 2023-06-08T12:51:46Z

@songjiaxun Would you prefer to have this on https://issuetracker.google.com?

songjiaxun · 2023-06-09T17:57:13Z

Thanks for the question. I admit that the PyTorch example may not work in Autopilot clusters. I am actively working on the AI/ML application tests and will update the example YAML soon.

@bhack are you a Googler by any chance? Could you DM me with more context?

bhack · 2023-06-09T18:06:18Z

I've sent you a DM.

It is not only PyTorch. It will not work in any real DL scenario, because on large nodes the sidecar is limited to a maximum of 2 CPU and 14GB memory.
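
To make the gap concrete, the example's sidecar annotations can be compared against the ceiling reported here. This is only a minimal sketch and not part of the driver: `parse_quantity` and `over_ceiling` are hypothetical helpers, the parser handles just the quantity suffixes that appear in this thread, and the 2 CPU / 14GB ceiling is the figure quoted in this comment (read here as 14Gi), not an official constant.

```python
from decimal import Decimal

# Suffix multipliers for the Kubernetes quantity forms that appear in this
# thread. This is a deliberately small subset, not a full quantity parser.
_SUFFIXES = {
    "m": Decimal("0.001"),  # milli-units, e.g. "6000m" CPU
    "Ki": Decimal(2) ** 10,
    "Mi": Decimal(2) ** 20,
    "Gi": Decimal(2) ** 30,
    "Ti": Decimal(2) ** 40,
}


def parse_quantity(text: str) -> Decimal:
    """Convert a quantity such as "10", "6000m" or "40Gi" to a plain number."""
    text = text.strip()
    # Check the two-character suffixes before the one-character "m".
    for suffix in sorted(_SUFFIXES, key=len, reverse=True):
        if text.endswith(suffix):
            return Decimal(text[: -len(suffix)]) * _SUFFIXES[suffix]
    return Decimal(text)


# Assumption: the ceiling reported in this thread for the sidecar on large
# Autopilot nodes (2 CPU, 14GB memory, read here as 14Gi).
REPORTED_CEILING = {
    "gke-gcsfuse/cpu-limit": "2",
    "gke-gcsfuse/memory-limit": "14Gi",
}


def over_ceiling(annotations: dict, ceiling: dict = REPORTED_CEILING) -> dict:
    """Return {annotation: (requested, ceiling)} for each limit above the ceiling."""
    violations = {}
    for key, limit in ceiling.items():
        requested = annotations.get(key)
        if requested is not None and parse_quantity(requested) > parse_quantity(limit):
            violations[key] = (requested, limit)
    return violations


# Annotations from examples/pytorch/train-job-pytorch.yaml as quoted in the
# first comment of this issue.
example = {
    "gke-gcsfuse/volumes": "true",
    "gke-gcsfuse/cpu-limit": "10",
    "gke-gcsfuse/memory-limit": "40Gi",
    "gke-gcsfuse/ephemeral-storage-limit": "20Gi",
}

for name, (requested, limit) in over_ceiling(example).items():
    print(f"{name}: requested {requested}, reported ceiling {limit}")
```

With the example's values, both the CPU and the memory limit are flagged. The ephemeral storage limit is not checked because no ceiling for it is mentioned in this thread.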

bhack · 2024-04-11T11:18:13Z

I think we have regressed a bit here.

Autopilot now accepts unlimited/burstable resources on the sidecar: #61

But it "secretly" overrides them with minimal resources.
From a usability point of view this is very confusing: users get no direct notification of the override, so they may expect to be running in a burstable context.
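
Until the override is surfaced somewhere, one way to spot it is to put the limits requested through the annotations next to the limits that actually ended up on the sidecar container. The following is a rough sketch under stated assumptions: `sidecar_limit_report` is a hypothetical helper, the container name `gke-gcsfuse-sidecar` and the annotation-to-resource mapping are assumptions, and the pod used below is made-up illustrative input, not output captured from a cluster.

```python
# Assumption: the injected sidecar container is named "gke-gcsfuse-sidecar".
SIDECAR_NAME = "gke-gcsfuse-sidecar"

# Assumed mapping: annotation key -> resource name under resources.limits.
ANNOTATION_TO_RESOURCE = {
    "gke-gcsfuse/cpu-limit": "cpu",
    "gke-gcsfuse/memory-limit": "memory",
    "gke-gcsfuse/ephemeral-storage-limit": "ephemeral-storage",
}


def sidecar_limit_report(pod: dict) -> dict:
    """Pair each requested sidecar limit with the limit actually set on the pod.

    `pod` is a Pod object as a plain dict, e.g. parsed from
    `kubectl get pod <name> -o json`. A value of None means "not set".
    """
    annotations = pod.get("metadata", {}).get("annotations") or {}
    spec = pod.get("spec", {})
    # The sidecar may be a regular or an init container; look in both lists.
    containers = (spec.get("containers") or []) + (spec.get("initContainers") or [])
    sidecar = next((c for c in containers if c.get("name") == SIDECAR_NAME), None)
    if sidecar is None:
        raise ValueError(f"no container named {SIDECAR_NAME!r} in pod")
    effective = (sidecar.get("resources") or {}).get("limits") or {}
    return {
        resource: (annotations.get(key), effective.get(resource))
        for key, resource in ANNOTATION_TO_RESOURCE.items()
    }


# Illustrative input only: a hand-written pod whose sidecar limits differ
# from what the annotations asked for.
pod = {
    "metadata": {
        "annotations": {
            "gke-gcsfuse/volumes": "true",
            "gke-gcsfuse/cpu-limit": "10",
            "gke-gcsfuse/memory-limit": "40Gi",
        }
    },
    "spec": {
        "containers": [
            {
                "name": SIDECAR_NAME,
                "resources": {"limits": {"cpu": "2", "memory": "14Gi"}},
            },
            {"name": "trainer"},
        ]
    },
}

report = sidecar_limit_report(pod)
for resource, (requested, actual) in report.items():
    # Plain string comparison: equal quantities spelled differently
    # (e.g. "2" and "2000m") would also be flagged.
    flag = "" if requested == actual else "  <- differs"
    print(f"{resource}: requested={requested} effective={actual}{flag}")
```

The comparison is intentionally naive. Its only purpose is to make a silent difference between requested and effective limits visible.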

Manually scaling up the sidecar CPU resources prevents the pod from being scheduled on Autopilot (e.g. >6000m on H100):

    Violations details: {"[denied by autogke-no-node-updates]":["Operation on nodes with changes in addition to cordon is not allowed in Autopilot."]}
    Requested by user: 'system:serviceaccount:gpu-operator:node-feature-discovery', groups: 'system:serviceaccounts,system:serviceaccounts:gpu-operator,system:authenticated'.

ronaldpanape · 2024-06-25T10:46:13Z

What's the status here as it relates to Autopilot?

bhack mentioned this issue Jun 4, 2023

Profile/pressure analisys best practices #34

Open

songjiaxun mentioned this issue Jul 14, 2023

Sidecar crash #36

Closed

songjiaxun added the enhancement label Jul 14, 2023

songjiaxun changed the title ~~Autopilot resources/perf~~ Resource limitation for the sidecar container on Autopilot using GPU: 2 CPU and 14GB Memory Jul 14, 2023

bhack changed the title ~~Resource limitation for the sidecar container on Autopilot using GPU: 2 CPU and 14GB Memory~~ Resource limitation for the sidecar container on Autopilot Apr 11, 2024
