We've encountered a significant performance bottleneck in our training jobs when recursively listing files, for example with Path.rglob, to enumerate trainable assets stored on gcs-fuse-csi mounted volumes. The problem is already evident with datasets of typical size and causes considerable cold-start delays before training can begin.
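To make the listing cost measurable, here is a minimal sketch of the pattern described above. The helper name `list_assets` and the example mount path are hypothetical and not taken from our training code. It only times a recursive `Path.rglob` walk so the delay can be reproduced on a mounted volume.

```python
import time
from pathlib import Path


def list_assets(root: Path, pattern: str = "*") -> tuple[list[Path], float]:
    """Recursively list files under root; return them with the elapsed seconds.

    list_assets is a hypothetical helper used only for illustration.
    """
    start = time.perf_counter()
    # is_file() issues a stat per entry, in addition to the directory
    # listings that rglob itself performs while walking the tree.
    files = [p for p in root.rglob(pattern) if p.is_file()]
    elapsed = time.perf_counter() - start
    return files, elapsed


if __name__ == "__main__":
    # "/data" is an assumed mount point; replace it with the real volume path.
    found, seconds = list_assets(Path("/data"))
    print(f"listed {len(found)} files in {seconds:.1f}s")
```

Running the same script against a local disk copy of the dataset gives a baseline to compare against the mounted volume.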
This latency slows the initial start-up of our training jobs and is a substantial problem on GKE spot instances. Each time a job is preempted and restarts from the last saved checkpoint, it pays the cold-start penalty again because the data loaders must be re-prepared.
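One possible stopgap on our side, sketched here under assumptions, is to store the file list next to the checkpoint so a restarted job can skip the recursive walk. The function `load_or_build_manifest` and the JSON manifest format are hypothetical. This does not address the underlying listing performance, and it returns a stale list if the dataset changes after the manifest is written.

```python
import json
from pathlib import Path


def load_or_build_manifest(root: Path, manifest: Path, pattern: str = "*") -> list[str]:
    """Return file paths relative to root, reusing a saved manifest if present.

    Hypothetical helper: the manifest is a JSON list of relative paths.
    """
    if manifest.exists():
        # Restart path: one small file read instead of a full recursive listing.
        return json.loads(manifest.read_text())
    # Cold path: do the slow walk once, then persist the result.
    files = sorted(str(p.relative_to(root)) for p in root.rglob(pattern) if p.is_file())
    manifest.write_text(json.dumps(files))
    return files
```

The manifest would need to be deleted or rebuilt whenever files are added to or removed from the dataset.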
This recurring overhead directly impacts cost-efficiency and resource utilization, particularly in a dynamic scaling environment where jobs are frequently interrupted and resumed. Addressing this file listing performance issue could significantly reduce start-up times and improve the overall efficiency of training jobs on GKE spot instances.