Sidecar crash #36

bhack · 2023-06-09T10:30:58Z

Recently the sidecar started crashing constantly, a few minutes after the pod starts, with this repeating error:

E0609 LookUpInode: interrupted system call, list objects: Error in iterating through objects: context canceled
E0609 fuse: *fuseops.LookUpInodeOp error: interrupted system call
E0609 LookUpInode: interrupted system call, list objects: Error in iterating through objects: context canceled
E0609  fuse: *fuseops.LookUpInodeOp error: interrupted system call
E0609  ReadFile: interrupted system call, fh.reader.ReadAt: readFull: context canceled
E0609  *fuseops.ReadFileOp error: interrupted system call

Driver version
Running Google Cloud Storage FUSE CSI driver sidecar mounter version v0.1.3-gke.0

GKE
1.26.3-gke.1000


songjiaxun · 2023-06-09T18:01:25Z

Thanks for the question.

Could you share the following information with us:

The sidecar container resource configuration.
What is the IO pattern in your workload before the crash happens? Is the workload trying to ls a folder with many files?

As this issue is actually caused by the underlying gcsfuse, and it is also tracked by GoogleCloudPlatform/gcsfuse#402, we will work with the GCSFuse team to troubleshoot and fix it.

bhack · 2023-06-09T18:14:06Z

The sidecar container resource configuration.

On Autopilot, the maximum resources for the sidecar on 8/16 A100 nodes are 2 CPU and 14GB.

It worked fine up until yesterday, although with these limits it is quite hard to have enough resources to keep the DataLoader workers busy and GPU occupancy high.

Since yesterday I can always reproduce this error. Adding the suggested debug flags doesn't add any specific details.

bhack · 2023-06-10T11:27:26Z

Just an update on this.
I found a small problem in the pod itself, so I think the sidecar just needs to give a clearer message when the problem originates in the pod, instead of emitting all these messages.

Otherwise you have to compare all the timestamps between the pod and the sidecar "to the millimeter".
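
One way to make that comparison less manual is to interleave the two log streams by timestamp. The sketch below is only an illustration under stated assumptions: every line starts with an RFC 3339 UTC timestamp ending in `Z` (the shape `kubectl logs --timestamps` produces), each stream is already in chronological order, and the function names are hypothetical.

```python
import heapq
from datetime import datetime, timezone


def parse_line(source, line):
    """Split 'TIMESTAMP message' into a sortable (time, nanos, source, message) tuple.

    Assumes a UTC timestamp such as 2023-06-09T10:30:58.123456789Z; the
    fractional part may have anywhere from 0 to 9 digits.
    """
    stamp, _, message = line.partition(" ")
    whole, _, frac = stamp.rstrip("Z").partition(".")
    seconds = datetime.strptime(whole, "%Y-%m-%dT%H:%M:%S").replace(tzinfo=timezone.utc)
    # Pad the fraction to nanoseconds so ".2" and ".200" compare as equal.
    nanos = int(frac.ljust(9, "0")[:9]) if frac else 0
    return (seconds, nanos, source, message)


def merge_logs(pod_lines, sidecar_lines):
    """Interleave two already-chronological log streams into one timeline."""
    pod = (parse_line("pod", line) for line in pod_lines)
    sidecar = (parse_line("sidecar", line) for line in sidecar_lines)
    return list(heapq.merge(pod, sidecar))
```

Reading the merged timeline top to bottom shows which pod event immediately precedes each sidecar error, which is the comparison that otherwise has to be done by eye.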

songjiaxun · 2023-07-14T20:12:25Z

The gcsfuse error E0609 ReadFile: interrupted system call, fh.reader.ReadAt: readFull: context canceled indicates that the read operation was cancelled by the user's application. gcsfuse itself did not crash in this case.
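
To check quickly whether a sidecar log contains anything other than this kind of cancellation, the lines can be split into two groups. This is a rough heuristic sketch, not part of gcsfuse or the CSI driver: the marker strings are taken from the log lines quoted in this issue, and the function names are hypothetical.

```python
# Substrings that, per the explanation above, point at an operation
# cancelled from the application side rather than a gcsfuse failure.
CANCEL_MARKERS = ("context canceled", "interrupted system call")


def is_caller_cancellation(log_line):
    """Return True if the line looks like a cancelled/interrupted FUSE operation."""
    return any(marker in log_line for marker in CANCEL_MARKERS)


def split_errors(log_lines):
    """Separate cancellation noise from lines that may need a closer look."""
    cancelled, other = [], []
    for line in log_lines:
        (cancelled if is_caller_cancellation(line) else other).append(line)
    return cancelled, other
```

If the second list comes back empty, the errors are consistent with the application cancelling its own reads, and the pod logs are the next place to look.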

As for the resource limit on Autopilot clusters, let's use #35 to track the limitation.

Closing this issue for now as there is no action needed. Feel free to re-open if detailed logs show that the timestamps between the pod and the sidecar are messed up.

haposan06 · 2024-01-30T09:56:37Z

Hi, I encountered the same issue using the same sidecar architecture in Cloud Run. Below are the complete logs:

DEFAULT 2024-01-30T08:56:31Z {"seconds":1706604991,"nanos":251195873},"severity":"ERROR","message":"fuse: *fuseops.ReadDirOp error: interrupted system call"}
DEFAULT 2024-01-30T08:56:31Z {"seconds":1706604991,"nanos":251118283},"severity":"ERROR","message":"ReadDir: interrupted system call, readAllEntries: ReadEntries: read objects: ListObjects: Error in iterating through objects: context canceled"}
INFO 2024-01-30T08:56:20.991183Z [protoPayload.serviceName: run.googleapis.com] [protoPayload.methodName: v1] [protoPayload.resourceName: namespaces/spse-prod/revisions/spse-kabupaten-pegunungan-bintang-00030-wtp] Ready condition status changed to True for Revision spse-kabupaten-pegunungan-bintang-00030-wtp.
DEFAULT 2024-01-30T08:56:19Z {"seconds":1706604979,"nanos":928807842},"severity":"ERROR","message":"fuse: *fuseops.ReadDirOp error: interrupted system call"}
DEFAULT 2024-01-30T08:56:19Z {"seconds":1706604979,"nanos":928732702},"severity":"ERROR","message":"ReadDir: interrupted system call, readAllEntries: ReadEntries: read objects: ListObjects: Error in iterating through objects: context canceled"}
DEFAULT 2024-01-30T08:56:10.778800Z 2024/01/30 08:56:10 [info] 29#29: *3 client 169.254.1.1 closed keepalive connection

Edit:
It seems to reproduce only in Cloud Run. I tried to reproduce it in my local Docker setup, but it seems fine there. One thing to note: I have 3 levels of directory hierarchy, and most directories have more than 500 child directories, which results in more than 2000 directories in total. Could it be a concurrency issue?

bhack mentioned this issue Jun 9, 2023

fh.reader.ReadAt: readFull: context canceled GoogleCloudPlatform/gcsfuse#402

Closed

songjiaxun closed this as completed Jul 14, 2023
