diff --git a/docs/installation.md b/docs/installation.md index 73ea41a4e..2579c9d36 100644 --- a/docs/installation.md +++ b/docs/installation.md @@ -54,10 +54,10 @@ limitations under the License. ``` ## Install -- Run the following command to install the latest driver with version `v0.1.13`. The driver will be installed under a new namespace `gcs-fuse-csi-driver`. The installation may take a few minutes. +- Run the following command to install the latest driver with version `v1.2.0`. The driver will be installed under a new namespace `gcs-fuse-csi-driver`. The installation may take a few minutes. ```bash # Replace with your cluster project ID. - make install STAGINGVERSION=v0.1.13 PROJECT= + make install STAGINGVERSION=v1.2.0 PROJECT= ``` - If you would like to build your own images, follow the [Cloud Storage FUSE CSI Driver Development Guide](development.md) to build and push the images. Run the following command to install the driver. diff --git a/docs/known-issues.md b/docs/known-issues.md index 268720591..c8ca2b57f 100644 --- a/docs/known-issues.md +++ b/docs/known-issues.md @@ -31,12 +31,16 @@ After the CSI driver creates the mount point, it will inform kubelet to proceed In the sidecar container, which is an unprivileged container, a process connects to the UDS and calls [recvmsg(2)](https://man7.org/linux/man-pages/man2/recvmsg.2.html) to receive the file descriptor. Then the process calls Cloud Storage FUSE passing the file descriptor to start to serve the FUSE mount point. Instead of passing the actual mount point path, we pass the file descriptor to Cloud Storage FUSE as it supports the [magic /dev/fd/N syntax](https://github.com/GoogleCloudPlatform/gcsfuse/blob/8ab11cd07016a247f64023697383c6e88bc022b0/vendor/github.com/jacobsa/fuse/mount_linux.go#L128-L134). Before the Cloud Storage FUSE takes over the file descriptor, any operations against the mount point will hang. 
+Since the CSI driver sets `requiresRepublish: true`, it periodically checks whether the GCSFuse volume is still needed by the containers. When the CSI driver detects all the main workload containers have terminated, it creates an exit file in a Pod emptyDir volume to notify the sidecar container to terminate. + ### Implications of the sidecar container design Until the Cloud Storage FUSE takes over the file descriptor, the mount point is not accessible. Any operations against the mount point will hang, including [stat(2)](https://man7.org/linux/man-pages/man2/lstat.2.html) that is used to check if the mount point exists. The sidecar container, or more precisely, the Cloud Storage FUSE process that serves the mount point needs to remain running for the full duration of the Pod's lifecycle. If the Cloud Storage FUSE process is killed, the workload application will throw IO error `Transport endpoint is not connected`. +The sidecar container auto-termination depends on Kubernetes API correctly reporting the Pod status. However, due to a [Kubernetes issue](https://github.com/kubernetes/kubernetes/issues/106896), container status is not updated after termination caused by Pod deletion. As a result, the sidecar container may not automatically terminate in some scenarios. 
+ ### Issues - [The CSI driver does not support volumes for initContainers](https://github.com/GoogleCloudPlatform/gcs-fuse-csi-driver/issues/38) @@ -44,24 +48,16 @@ The sidecar container, or more precisely, the Cloud Storage FUSE process that se - [subPath does not work when Anthos Service Mesh is enabled](https://github.com/GoogleCloudPlatform/gcs-fuse-csi-driver/issues/47) - ["Error: context deadline exceeded" when Anthos Service Mesh is enabled](https://github.com/GoogleCloudPlatform/gcs-fuse-csi-driver/issues/46) - [The sidecar container does not work well with istio-proxy sidecar container](https://github.com/GoogleCloudPlatform/gcs-fuse-csi-driver/issues/53) +- [The sidecar container does not respect terminationGracePeriodSeconds when the Pod restartPolicy is OnFailure or Always](https://github.com/GoogleCloudPlatform/gcs-fuse-csi-driver/issues/168) ### Solutions -Unfortunately, there is no good short-term solution or workaround for the above issues due to the restrictions of the sidecar container mode design. - -The [sidecar containers KEP](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/753-sidecar-containers) is implemented in this PR [Add SidecarContainers feature](https://github.com/kubernetes/kubernetes/pull/116429). +The GCS FUSE CSI Driver now utilizes the [Kubernetes native sidecar container feature](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/753-sidecar-containers), available in GKE versions 1.29.3-gke.1093000 or later. -> The new feature gate "SidecarContainers" is now available. This feature introduces sidecar containers, a new type of init container that starts before other containers but remains running for the full duration of the pod's lifecycle and will not block pod termination. 
+The Kubernetes native sidecar container feature introduces sidecar containers, a new type of init container that starts before other containers but remains running for the full duration of the pod's lifecycle and will not block pod termination. -This new feature is a good long-term solution. Instead of injecting the sidecar container as a regular container, we will leverage the new SidecarContainers feature to inject the container as an init container, so that other non-sidecar init container can also use the CSI driver. - -We are currently testing the SidecarContainers feature, and will adopt the feature when it is available on GKE. +Instead of injecting the sidecar container as a regular container, the sidecar container is now injected as an init container, so that other non-sidecar init containers can also use the CSI driver. Moreover, the sidecar container lifecycle, such as auto-termination, is managed by Kubernetes. ## Issues in Autopilot clusters - [Resource limitation for the sidecar container on Autopilot using GPU: 2 CPU and 14GB Memory](https://github.com/GoogleCloudPlatform/gcs-fuse-csi-driver/issues/35) -- [Cannot upload files larger than 10Gi in Autopilot clusters](https://github.com/GoogleCloudPlatform/gcs-fuse-csi-driver/issues/21) - -## Other issues - -- [Multiple PVs referring to the same bucket does not work](https://github.com/GoogleCloudPlatform/gcs-fuse-csi-driver/issues/48) diff --git a/docs/releases.md b/docs/releases.md index cec70623c..931e06eaa 100644 --- a/docs/releases.md +++ b/docs/releases.md @@ -34,6 +34,7 @@ limitations under the License. 
| [v0.1.12](https://github.com/GoogleCloudPlatform/gcs-fuse-csi-driver/releases/tag/v0.1.12) | Released | 2024-01-25 | [v1.4.0](https://github.com/GoogleCloudPlatform/gcsfuse/releases/tag/v1.4.0) | [7898e40bf57f](https://gcr.io/gke-release/gcs-fuse-csi-driver-sidecar-mounter@sha256:7898e40bf57f159dc828511f4217cb42c08fa4df0c9ad732a0b0747b66e415c6) | None | 1.25.16-gke.1268000 | 1.26.12-gke.1111000 | 1.27.9-gke.1092000 | None | 1.29.0-gke.1381000 | | [v0.1.13](https://github.com/GoogleCloudPlatform/gcs-fuse-csi-driver/releases/tag/v0.1.13) | Released | 2024-02-08 | [v1.4.1](https://github.com/GoogleCloudPlatform/gcsfuse/releases/tag/v1.4.1) | [972699a4bf89](https://gcr.io/gke-release/gcs-fuse-csi-driver-sidecar-mounter@sha256:972699a4bf8973f7614f09908412a1fca24ea939eac2d3fcca599109f71fc162) | None | 1.25.16-gke.1360000 | 1.26.13-gke.1052000 | 1.27.10-gke.1055000 | 1.28.6-gke.1095000 | 1.29.1-gke.1425000 | | [v0.1.14](https://github.com/GoogleCloudPlatform/gcs-fuse-csi-driver/releases/tag/v0.1.14) | Released | 2024-02-20 | [v1.4.1](https://github.com/GoogleCloudPlatform/gcsfuse/releases/tag/v1.4.1) | [c83609ecf50d](https://gcr.io/gke-release/gcs-fuse-csi-driver-sidecar-mounter@sha256:c83609ecf50d05a141167b8c6cf4dfe14ff07f01cd96a9790921db6748d40902) | None | 1.25.16-gke.1537000 | 1.26.14-gke.1006000 | 1.27.11-gke.1018000 | 1.28.6-gke.1456000 | 1.29.2-gke.1060000 | +| [v1.2.0](https://github.com/GoogleCloudPlatform/gcs-fuse-csi-driver/releases/tag/v1.2.0) | Released | 2024-04-04 | [v2.0.0](https://github.com/GoogleCloudPlatform/gcsfuse/releases/tag/v2.0.0) | [31880114306b](https://gcr.io/gke-release/gcs-fuse-csi-driver-sidecar-mounter@sha256:31880114306b1fb5d9e365ae7d4771815ea04eb56f0464a514a810df9470f88f) | None | TBD | TBD | TBD | TBD | 1.29.3-gke.1093000 | > Note: The above GKE versions may not be valid any more, please follow the [GKE documentation](https://cloud.google.com/kubernetes-engine/docs/concepts/release-channels#what_versions_are_available_in_a_channel) to 
check what versions are available in a channel. @@ -41,6 +42,16 @@ The new CSI driver version will be first available in GKE Rapid channel on its r ## Releases +### v1.2.0 + +- Update gcsfuse to v2.0.0. +- Update golang version to 1.22.2. +- Add GCSFuse file cache features. +- Add volume attributes support. +- Adopt Kubernetes native sidecar container features in GKE 1.29 to support init container volume mounting. Fix the [issue](https://github.com/GoogleCloudPlatform/gcs-fuse-csi-driver/issues/168) where the sidecar container does not respect terminationGracePeriodSeconds when the Pod restartPolicy is OnFailure or Always. +- Add a rate limiter to the CSI node server to avoid GCP API throttling errors. +- Refactor code to increase stability and readability. + ### v0.1.14 - Fix sidecar container auto-termination logic for Pods with restart policy OnFailure. @@ -84,15 +95,15 @@ This release is abandoned. - Updated go modules. - Updated gcsfuse version to v1.2.1-gke.0. - Updated CSI driver golang builder version to go1.21.4. -- Allow users to override sidecar grace-period to fix https://github.com/GoogleCloudPlatform/gcs-fuse-csi-driver/issues/91. -- Add CSI fsgroup delegation support to fix https://github.com/GoogleCloudPlatform/gcs-fuse-csi-driver/issues/16. +- Allow users to override sidecar grace-period to fix . +- Add CSI fsgroup delegation support to fix . ### v0.1.6 - Updated go modules. - Updated sidecar container versions. - Updated CSI driver golang builder version to go1.21.2. 
-- Make the sidecar container follow the [Restricted Pod Security Standard](https://kubernetes.io/docs/concepts/security/pod-security-standards/#restricted), setting securityContext.capabilities.drop=["ALL"] to fix the issue https://github.com/GoogleCloudPlatform/gcs-fuse-csi-driver/issues/52 +- Make the sidecar container follow the [Restricted Pod Security Standard](https://kubernetes.io/docs/concepts/security/pod-security-standards/#restricted), setting securityContext.capabilities.drop=["ALL"] to fix the issue - Fixed the behavior when users pass "0" to the pod annotation to configure the sidecar container resources, allowing the sidecar container to consume unlimited resources on Standard clusters. - Fixed sidecar container validation logic in webhook. @@ -131,7 +142,7 @@ This release is abandoned. - Fixed copyright information. - Updated documentation. - Added ARM node support. -- Fixed issue https://github.com/GoogleCloudPlatform/gcs-fuse-csi-driver/issues/23. +- Fixed issue . - Fixed other issues. ### v0.1.2 @@ -151,4 +162,4 @@ This release is abandoned. ### v0.1.0 -- Initial alpha release of the Google Cloud Storage FUSE CSI Driver. \ No newline at end of file +- Initial alpha release of the Google Cloud Storage FUSE CSI Driver. diff --git a/docs/terraform.md b/docs/terraform.md index f6113ec3c..9057d8cf9 100644 --- a/docs/terraform.md +++ b/docs/terraform.md @@ -21,7 +21,7 @@ If you are using Terraform to create GKE clusters, use `gcs_fuse_csi_driver_conf The following example is a `.tf` file excerpt showing how to enable the CSI driver, GKE Workload Identity, and GKE Metadata Server: -``` +```terraform resource "google_container_cluster" "primary" { # Enable GKE Workload Identity. diff --git a/docs/troubleshooting.md b/docs/troubleshooting.md index 53245d827..d5f928fed 100644 --- a/docs/troubleshooting.md +++ b/docs/troubleshooting.md @@ -22,22 +22,22 @@ limitations under the License. Run the following queries on GCP Logs Explorer to check logs. 
- Sidecar container and gcsfuse logs: - - ``` + + ```text resource.type="k8s_container" resource.labels.container_name="gke-gcsfuse-sidecar" ``` - Cloud Storage FUSE CSI Driver logs: - - ``` + + ```text resource.type="k8s_container" resource.labels.container_name="gcs-fuse-csi-driver" ``` - Cloud Storage FUSE CSI Driver Webhook logs (only for manual installation users): - - ``` + + ```text resource.type="k8s_container" resource.labels.container_name="gcs-fuse-csi-driver-webhook" ``` @@ -60,34 +60,102 @@ Run the following queries on GCP Logs Explorer to check logs. If your workload Pods cannot start up, please run `kubectl describe pod -n ` to check the Pod events. Find the troubleshooting guide below according to the Pod event. -- Pod event warning: `MountVolume.MountDevice failed for volume "xxx" : kubernetes.io/csi: attacher.MountDevice failed to create newCsiDriverClient: driver name gcsfuse.csi.storage.gke.io not found in the list of registered CSI drivers`, or Pod event warning: `MountVolume.SetUp failed for volume "xxx" : kubernetes.io/csi: mounter.SetUpAt failed to get CSI client: driver name gcsfuse.csi.storage.gke.io not found in the list of registered CSI drivers` +### CSI driver enablement issues + +- Pod event warning examples: + + - > MountVolume.MountDevice failed for volume "xxx" : kubernetes.io/csi: attacher.MountDevice failed to create newCsiDriverClient: driver name gcsfuse.csi.storage.gke.io not found in the list of registered CSI drivers + + - > MountVolume.SetUp failed for volume "xxx" : kubernetes.io/csi: mounter.SetUpAt failed to get CSI client: driver name gcsfuse.csi.storage.gke.io not found in the list of registered CSI drivers + +- Solutions: This warning indicates that the CSI driver is not enabled, or the CSI driver is not up and running. Please double check if the CSI driver is enabled on your cluster. 
See [Enable the Cloud Storage FUSE CSI driver](https://cloud.google.com/kubernetes-engine/docs/how-to/persistent-volumes/cloud-storage-fuse-csi-driver#enable) for details. If the CSI is enabled, on each node you should see a Pod called `gcsfusecsi-node-xxxxx` up and running. If the cluster was just scaled, updated, or upgraded, this warning is normal and should be transient because it takes a few minutes for the CSI driver Pods to be functional after the cluster operations. -- Pod event warning: `MountVolume.SetUp failed for volume "xxx" : rpc error: code = Unauthenticated desc = failed to prepare storage service: storage service manager failed to setup service: timed out waiting for the condition` +### MountVolume.SetUp failures + +> Note: the rpc error code can be used to triage `MountVolume.SetUp` issues. For example, `Unauthenticated` and `PermissionDenied` usually mean the authentication was not configured correctly. An rpc error code of `Internal` means that an unexpected issue occurred in the CSI driver; in that case, please create a [new issue](https://github.com/GoogleCloudPlatform/gcs-fuse-csi-driver/issues/new) on the GitHub project page. + +#### Unauthenticated +- Pod event warning examples: + + - > MountVolume.SetUp failed for volume "xxx" : rpc error: code = Unauthenticated desc = failed to prepare storage service: storage service manager failed to setup service: timed out waiting for the condition + +- Solutions: + After you follow the documentation [Configure access to Cloud Storage buckets using GKE Workload Identity](./authentication.md) to configure the Kubernetes service account, it usually takes a few minutes for the credentials being propagated. Whenever the credentials are propagated into the Kubernetes cluster, this warning will disappear, and your Pod scheduling should continue. 
If you still see this warning after 5 minutes, please double check the documentation [Configure access to Cloud Storage buckets using GKE Workload Identity](./authentication.md) to make sure your Kubernetes service account is set up correctly. Make sure your workload Pod is using the Kubernetes service account in the same namespace. -- Pod event warning: `MountVolume.SetUp failed for volume "xxx" : rpc error: code = PermissionDenied desc = failed to get GCS bucket "xxx": googleapi: Error 403: xxx@xxx.iam.gserviceaccount.com does not have storage.objects.list access to the Google Cloud Storage bucket. Permission 'storage.objects.list' denied on resource (or it may not exist)., forbidden` - +#### PermissionDenied + +- Pod event warning examples: + + - > MountVolume.SetUp failed for volume "xxx" : rpc error: code = PermissionDenied desc = failed to get GCS bucket "xxx": googleapi: Error 403: xxx@xxx.iam.gserviceaccount.com does not have storage.objects.list access to the Google Cloud Storage bucket. Permission 'storage.objects.list' denied on resource (or it may not exist)., forbidden + +- Solutions: + Please double check the documentation [Configure access to Cloud Storage buckets using GKE Workload Identity](./authentication.md) to make sure your Kubernetes service account is set up correctly. Make sure your workload Pod is using the Kubernetes service account in the same namespace. -- Pod event warning: `MountVolume.SetUp failed for volume "xxx" : rpc error: code = NotFound desc = failed to get GCS bucket "xxx": storage: bucket doesn't exist` - +#### NotFound + +- Pod event warning examples: + + - > MountVolume.SetUp failed for volume "xxx" : rpc error: code = NotFound desc = failed to get GCS bucket "xxx": storage: bucket doesn't exist + +- Solutions: + The Cloud Storage bucket does not exist. Make sure the Cloud Storage bucket is created, and the Cloud Storage bucket name is specified correctly. 
-- Pod event warning: `MountVolume.SetUp failed for volume "xxx" : rpc error: code = FailedPrecondition desc = failed to find the sidecar container in Pod spec` - +#### FailedPrecondition + +- Pod event warning examples: + + - > MountVolume.SetUp failed for volume "xxx" : rpc error: code = FailedPrecondition desc = failed to find the sidecar container in Pod spec + +- Solutions: + The Cloud Storage FUSE sidecar container was not injected. Please check the Pod annotation `gke-gcsfuse/volumes: "true"` is set correctly. -- Pod event warning: `MountVolume.SetUp failed for volume "xxx" : rpc error: code = InvalidArgument desc = the sidecar container failed with error: Incorrect Usage. flag provided but not defined: -xxx` +#### InvalidArgument + +- Pod event warning examples: + + - > MountVolume.SetUp failed for volume "xxx" : rpc error: code = InvalidArgument desc = the sidecar container failed with error: Incorrect Usage. flag provided but not defined: -xxx + +- Solutions: Invalid mount flags are passed to Cloud Storage FUSE. Please check [Configure how Cloud Storage FUSE buckets are mounted](https://cloud.google.com/kubernetes-engine/docs/how-to/persistent-volumes/cloud-storage-fuse-csi-driver#mounting-flags) for more details. -- Pod event warning: `MountVolume.SetUp failed for volume "xxx" : rpc error: code = ResourceExhausted desc = the sidecar container failed with error: signal: killed` +#### ResourceExhausted + +- Pod event warning examples: + + - > MountVolume.SetUp failed for volume "xxx" : rpc error: code = ResourceExhausted desc = the sidecar container failed with error: signal: killed + + - > MountVolume.SetUp failed for volume "xxx" : rpc error: code = ResourceExhausted desc = the sidecar container terminated due to OOMKilled, exit code: 137 + +- Solutions: The gcsfuse process was killed, which is usually caused by OOM. Please consider increasing the sidecar container memory limit by using the annotation `gke-gcsfuse/memory-limit`. 
-- Other Pod event warnings: `MountVolume.SetUp failed for volume "xxx" : rpc error: code = Internal desc = xxx` or `UnmountVolume.TearDown failed for volume "xxx" : rpc error: code = Internal desc = xxx` - +#### Aborted + +- Pod event warning examples: + + - > MountVolume.SetUp failed for volume "xxx" : rpc error: code = Aborted desc = NodePublishVolume request is aborted due to rate limit: xxx + + - > MountVolume.SetUp failed for volume "xxx" : rpc error: code = Aborted desc = An operation with the given volume key xxx already exists + +- Solutions: + + The volume mount operation was aborted due to rate limiting or an existing operation for the same volume. This warning is normal and should be transient. + +#### Internal + +- Pod event warning examples: + + - > `MountVolume.SetUp failed for volume "xxx" : rpc error: code = Internal desc = xxx` or `UnmountVolume.TearDown failed for volume "xxx" : rpc error: code = Internal desc = xxx` + +- Solutions: + Warnings that are not listed above and include a rpc error code `Internal` mean that other unexpected issues occurred in the CSI driver, please create a [new issue](https://github.com/GoogleCloudPlatform/gcs-fuse-csi-driver/issues/new) on the GitHub project page. Please include your workload information as detailed as possible, and the Pod event warning in the issue.
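The rpc-error-code triage described in the troubleshooting sections above can be sketched as a small lookup. This is only an illustrative helper, not part of the CSI driver: the names `TRIAGE_HINTS`, `extract_rpc_code`, and `triage` are hypothetical, and the hints simply paraphrase the solutions listed in this guide. It assumes the Pod event message contains the `rpc error: code = <Code>` fragment shown in the examples.

```python
import re
from typing import Optional

# Hypothetical lookup table; each hint paraphrases the matching
# "Solutions" entry in the troubleshooting guide.
TRIAGE_HINTS = {
    "Unauthenticated": "Check the Kubernetes service account and Workload Identity setup; credentials may need a few minutes to propagate.",
    "PermissionDenied": "Check the Kubernetes service account and Workload Identity setup for the bucket.",
    "NotFound": "Check that the Cloud Storage bucket exists and its name is specified correctly.",
    "FailedPrecondition": "The sidecar container was not injected; check the Pod annotation gke-gcsfuse/volumes.",
    "InvalidArgument": "Invalid mount flags were passed to Cloud Storage FUSE; review the mount options.",
    "ResourceExhausted": "gcsfuse was killed, usually by OOM; consider raising gke-gcsfuse/memory-limit.",
    "Aborted": "Rate limiting or an existing operation; normally transient.",
    "Internal": "Unexpected CSI driver issue; file a new issue on the GitHub project page.",
}

# Matches the code in messages such as "rpc error: code = NotFound desc = ...".
_RPC_CODE = re.compile(r"rpc error: code = (\w+)")


def extract_rpc_code(event_message: str) -> Optional[str]:
    """Return the rpc error code found in a Pod event message, or None."""
    match = _RPC_CODE.search(event_message)
    return match.group(1) if match else None


def triage(event_message: str) -> str:
    """Map a Pod event warning to a hint paraphrased from this guide."""
    if "not found in the list of registered CSI drivers" in event_message:
        return "The CSI driver is not enabled or not yet running on the node."
    code = extract_rpc_code(event_message)
    if code is None:
        return "No rpc error code found in the event message."
    return TRIAGE_HINTS.get(code, f"Unlisted rpc error code: {code}.")


if __name__ == "__main__":
    sample = (
        'MountVolume.SetUp failed for volume "xxx" : rpc error: '
        "code = ResourceExhausted desc = the sidecar container failed "
        "with error: signal: killed"
    )
    print(triage(sample))
```

A helper like this could be fed the warning text copied from `kubectl describe pod` output; codes it does not recognize fall through to a generic message rather than a guess.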