Sidecar crash #36

Closed
bhack opened this issue Jun 9, 2023 · 5 comments

Comments

@bhack

bhack commented Jun 9, 2023

Recently the sidecar started crashing constantly, a few minutes after the pod starts, with this repeating error:

E0609 LookUpInode: interrupted system call, list objects: Error in iterating through objects: context canceled
E0609 fuse: *fuseops.LookUpInodeOp error: interrupted system call
E0609 LookUpInode: interrupted system call, list objects: Error in iterating through objects: context canceled
E0609  fuse: *fuseops.LookUpInodeOp error: interrupted system call
E0609  ReadFile: interrupted system call, fh.reader.ReadAt: readFull: context canceled
E0609  *fuseops.ReadFileOp error: interrupted system call

Driver version
Running Google Cloud Storage FUSE CSI driver sidecar mounter version v0.1.3-gke.0

GKE
1.26.3-gke.1000

@songjiaxun
Contributor

Thanks for the question.

Could you share the following information with us:

  • The sidecar container resource configuration.
  • What is the I/O pattern in your workload before the crash happens? Is the workload trying to ls a folder with many files?

As this issue is actually caused by the underlying gcsfuse, and it is also tracked by GoogleCloudPlatform/gcsfuse#402, we will work with the GCSFuse team to troubleshoot and fix it.

@bhack
Author

bhack commented Jun 9, 2023

The sidecar container resource configuration.

The sidecar is set to the Autopilot maximum for 8/16 A100 nodes, which is 2 CPU and 14GB.

It worked fine up until yesterday, but of course with these limits it is quite hard to have enough resources to keep the Dataloader workers and GPU occupancy high.

Since yesterday I can always reproduce this error. Adding the suggested debug flags doesn't add any specific details.

@bhack
Author

bhack commented Jun 10, 2023

Just an update on this.
I found there was a small problem in the pod, so I think the sidecar just needs to give a better message when the problem comes from the pod, instead of emitting all these messages.

Otherwise you have to check "to the millimeter" all the timestamps between the pod and the sidecar.
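
The timestamp comparison described above can be partly automated. The following is a minimal sketch in Python, assuming both the pod log and the sidecar log have been exported as text lines that start with an ISO 8601 timestamp followed by a space. That line format and the names `parse_line` and `merge_logs` are assumptions made for illustration, not something the driver provides.

```python
from datetime import datetime
import heapq


def parse_line(source, line):
    """Split 'TIMESTAMP message' into (datetime, source, message).

    Assumes the line starts with an ISO 8601 timestamp ending in 'Z'
    (a hypothetical export format, not the driver's native one).
    """
    stamp, _, message = line.partition(" ")
    # datetime.fromisoformat() rejects a trailing 'Z' before Python 3.11.
    when = datetime.fromisoformat(stamp.replace("Z", "+00:00"))
    return when, source, message


def merge_logs(pod_lines, sidecar_lines):
    """Interleave two logs, each already in chronological order."""
    pod = (parse_line("pod", line) for line in pod_lines)
    sidecar = (parse_line("sidecar", line) for line in sidecar_lines)
    # heapq.merge keeps the combined stream sorted by timestamp.
    return list(heapq.merge(pod, sidecar))
```

Reading the merged timeline makes it easier to see whether a pod-side event immediately precedes a burst of sidecar errors. Each input must already be sorted oldest first.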

@songjiaxun
Contributor

The gcsfuse error E0609 ReadFile: interrupted system call, fh.reader.ReadAt: readFull: context canceled indicates that the read operation was cancelled by the user's application. gcsfuse itself did not crash in this case.
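
When reading a sidecar log, it can help to count how many error lines are cancellations of this kind and how many are something else. The following is a minimal sketch in Python, assuming the log is available as plain text lines. The function name `classify` and the matching rule (a line containing "context canceled" counts as a cancellation) are illustrative assumptions based only on the messages quoted in this issue.

```python
from collections import Counter

CANCEL_MARKER = "context canceled"
INTERRUPT_MARKER = "interrupted system call"


def classify(lines):
    """Count log lines by kind, using plain substring matching.

    'cancelled'   : the line mentions a cancelled context
    'interrupted' : the line mentions an interrupted system call only
    'other'       : any other non-empty line
    """
    counts = Counter()
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        if CANCEL_MARKER in line:
            counts["cancelled"] += 1
        elif INTERRUPT_MARKER in line:
            counts["interrupted"] += 1
        else:
            counts["other"] += 1
    return counts
```

If nearly every line falls into the first two buckets, the log is consistent with operations being cancelled from the application side rather than with a gcsfuse crash.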

As for the resource limit on Autopilot clusters, let's use #35 to track the limitation.

Closing this issue for now as there is no action needed. Feel free to re-open if detailed logs show that the timestamps between the pod and the sidecar are messed up.

@haposan06

haposan06 commented Jan 30, 2024

Hi, I encountered the same issue using the same sidecar architecture in Cloud Run. Below are the complete logs:

DEFAULT 2024-01-30T08:56:31Z {"seconds":1706604991,"nanos":251195873},"severity":"ERROR","message":"fuse: *fuseops.ReadDirOp error: interrupted system call"}
DEFAULT 2024-01-30T08:56:31Z {"seconds":1706604991,"nanos":251118283},"severity":"ERROR","message":"ReadDir: interrupted system call, readAllEntries: ReadEntries: read objects: ListObjects: Error in iterating through objects: context canceled"}
INFO 2024-01-30T08:56:20.991183Z [protoPayload.serviceName: run.googleapis.com] [protoPayload.methodName: v1] [protoPayload.resourceName: namespaces/spse-prod/revisions/spse-kabupaten-pegunungan-bintang-00030-wtp] Ready condition status changed to True for Revision spse-kabupaten-pegunungan-bintang-00030-wtp.
DEFAULT 2024-01-30T08:56:19Z {"seconds":1706604979,"nanos":928807842},"severity":"ERROR","message":"fuse: *fuseops.ReadDirOp error: interrupted system call"}
DEFAULT 2024-01-30T08:56:19Z {"seconds":1706604979,"nanos":928732702},"severity":"ERROR","message":"ReadDir: interrupted system call, readAllEntries: ReadEntries: read objects: ListObjects: Error in iterating through objects: context canceled"}
DEFAULT 2024-01-30T08:56:10.778800Z 2024/01/30 08:56:10 [info] 29#29: *3 client 169.254.1.1 closed keepalive connection

Edit:
It seems this only reproduces in Cloud Run. I tried to reproduce it in my local Docker, but it seems fine there. One thing to note: I have 3 levels of directory hierarchy, and most directories have more than 500 child directories, so in total there are more than 2000 directories. Could it be a concurrency issue?
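
To put numbers on the directory layout described above, one could count the direct child directories of every directory under the mount point. The following is a minimal sketch in Python using only the standard library. The names `child_dir_counts` and `summarize` and the threshold of 500 are assumptions for illustration, and this thread does not establish that large listings are what triggers the error.

```python
import os


def child_dir_counts(root):
    """Map each directory under root to its number of direct child directories."""
    counts = {}
    for dirpath, dirnames, _filenames in os.walk(root):
        counts[dirpath] = len(dirnames)
    return counts


def summarize(counts, threshold=500):
    """Return (total directories visited, sorted paths with more than threshold children)."""
    total = len(counts)
    large = sorted(path for path, n in counts.items() if n > threshold)
    return total, large
```

Calling `summarize(child_dir_counts(path))` with the local path where the bucket is mounted gives the total directory count and the directories that exceed the threshold. Note that walking the tree itself issues list requests against the mount.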
