Drastic speed difference between GCS and S3 #395
Replies: 4 comments · 15 replies
-
cfgrib codec versions: s3: gcs:
-
Where do I get the "gribberish" codec?
-
I got for GCS:
(where the initial connection includes extra latency for auth and SSL negotiation) and for S3:
.. so, the same?
-
I was able to run the notebook from GCS (us-central1)
-
I am not sure whether this is an fsspec issue or a kerchunk one (probably the former), but I am posting here because it is affecting the kerchunking workflow.
I have the same kerchunked dataset created pointing at s3 and gcs:
s3:
s3://nextgen-dmac/kerchunk/hrrr_subhourly.json
gcs:
gs://squall-hrrr/hrrr_subhourly.json
The s3 version has references to chunks in s3, the gcs version has references to chunks in gcs.
The notebook with the comparison is available here: https://github.com/mpiannucci/ocean-notebooks/blob/main/hrrr_timeseries.ipynb
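For anyone reproducing this: reference JSONs like the ones above are typically opened with xarray's zarr engine through fsspec's `reference://` protocol. A minimal sketch of building the `backend_kwargs` for that call (the `kerchunk_backend_kwargs` helper name and the `anon=True` option are my own, not taken from the notebook):

```python
def kerchunk_backend_kwargs(refs_url, remote_protocol, **remote_options):
    """Build backend_kwargs for xr.open_dataset("reference://", engine="zarr").

    refs_url: location of the kerchunk reference JSON (e.g. the s3:// or gs:// paths above).
    remote_protocol: protocol of the referenced chunks ("s3" or "gcs").
    remote_options: passed through to the remote filesystem (e.g. anon=True).
    """
    return {
        "consolidated": False,
        "storage_options": {
            "fo": refs_url,
            "remote_protocol": remote_protocol,
            "remote_options": remote_options,
        },
    }

# Hypothetical usage (requires network access plus xarray, zarr, and kerchunk):
# import xarray as xr
# ds = xr.open_dataset(
#     "reference://",
#     engine="zarr",
#     backend_kwargs=kerchunk_backend_kwargs(
#         "s3://nextgen-dmac/kerchunk/hrrr_subhourly.json", "s3", anon=True),
# )
```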
The difference in processing time is staggering. There are 72 references per variable, and we load time, u, and v before calling compute, so that is the time to download and transform 216 references.
Load Dataset
Extract Wind Speed and Direction Timeseries
However, when I simply cat the chunk files and the reference JSON, there is no difference in speed between the two backends.
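One way to narrow this down is to wall-clock the raw byte reads separately from the full decode pipeline, per backend. A minimal stdlib sketch (the `timed` helper is hypothetical, not from the notebook):

```python
import time

def timed(label, fn, *args, **kwargs):
    """Run fn once and print its wall-clock duration, so the raw-read
    time can be compared against the full download+decode path."""
    t0 = time.perf_counter()
    result = fn(*args, **kwargs)
    dt = time.perf_counter() - t0
    print(f"{label}: {dt:.3f} s")
    return result, dt

# Hypothetical usage (needs network access and fsspec installed), e.g.:
#   import fsspec
#   with fsspec.open("s3://nextgen-dmac/kerchunk/hrrr_subhourly.json", anon=True) as f:
#       timed("read reference json (s3)", f.read)
```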
Note that I am using my custom grib parser, but I tested the same datasets with the cfgrib codec and got the same results.
Thanks for any help!