Sidecar crash #36
Comments
Thanks for the question. Could you share the following information with us:
As this issue is actually caused by the underlying gcsfuse, and it is also tracked by GoogleCloudPlatform/gcsfuse#402, we will work with the GCSFuse team to troubleshoot and fix it.
On Autopilot, the maximum resources for the sidecar on 8/16 A100 nodes are 2 CPU and 14GB. It worked fine up until yesterday, but of course with these limits it is quite hard to have enough resources to keep the DataLoader workers busy and GPU occupancy high. Since yesterday I can always reproduce this error. Adding the suggested debug flags doesn't add any specific details.
Just an update on this. Otherwise you have to check "to the millimeter" all the timestamps between the pod and the sidecar.
The gcsfuse error As for the resource limit on Autopilot clusters, let's use #35 to track the limitation. Closing this issue for now as there is no action needed. Feel free to re-open if detailed logs show that the timestamps between the pod and the sidecar are out of sync.
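For anyone who wants to compare the pod and sidecar timestamps as suggested above, here is a minimal sketch rather than an official tool. It assumes both log streams were captured with a leading RFC 3339 UTC timestamp on every line (for example via `kubectl logs --timestamps`); the function names `parse_ts` and `merge_logs` are hypothetical and not part of the driver or gcsfuse.

```python
from datetime import datetime, timezone


def parse_ts(line):
    """Split a log line into (timestamp, message).

    Assumption: the line starts with an RFC 3339 UTC timestamp ending in 'Z',
    e.g. 2023-05-10T12:00:00.123456789Z, followed by a space and the message.
    Timestamps with numeric offsets (+02:00) are not handled by this sketch.
    """
    stamp, _, message = line.partition(" ")
    base, _, frac = stamp.rstrip("Z").partition(".")
    # Keep microsecond precision; pad or truncate the fractional part.
    micros = int((frac + "000000")[:6])
    ts = datetime.strptime(base, "%Y-%m-%dT%H:%M:%S").replace(
        microsecond=micros, tzinfo=timezone.utc
    )
    return ts, message


def merge_logs(pod_lines, sidecar_lines):
    """Return (timestamp, message, source) tuples in chronological order."""
    tagged = [(*parse_ts(line), "pod") for line in pod_lines]
    tagged += [(*parse_ts(line), "sidecar") for line in sidecar_lines]
    return sorted(tagged, key=lambda entry: entry[0])


if __name__ == "__main__":
    # Placeholder lines for illustration only, not real driver output.
    pod = ["2023-05-10T12:00:02.5Z hypothetical workload message"]
    sidecar = [
        "2023-05-10T12:00:01.123456789Z hypothetical sidecar message A",
        "2023-05-10T12:00:03Z hypothetical sidecar message B",
    ]
    for ts, message, source in merge_logs(pod, sidecar):
        print(ts.isoformat(), f"[{source}]", message)
```

Reading the merged output top to bottom makes it easier to see whether the workload error precedes or follows the sidecar's last message.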
Hi, I encountered the same issue using the same sidecar architecture in Cloud Run. Below are the complete logs
Edit:
Recently the sidecar started to crash constantly a few minutes after the pod starts, with this repeating error:
Driver version
Running Google Cloud Storage FUSE CSI driver sidecar mounter version v0.1.3-gke.0
GKE
1.26.3-gke.1000