Use kerchunk to join variables and ensembles!? #8
Comments
@rsignell-usgs I think this could work. I will have a go at using the file names, as in #4, to append the ensemble member number to the variable names, which I think would allow creating a single virtual dataset.
I suspect there may be a better way to do this, but for now the following works to create a virtual dataset across all ensemble members: https://nbviewer.org/gist/peterm790/f50179e19ca54de63896d7340bd5c878. This obviously only deals with creating a virtual dataset across the ensemble, not updating it when new runs become available.
Nice to show it can be done!
@peterm790 nice work making this workflow compact and efficient! I was wondering whether it would be possible to create an ensemble dimension?
This would have labels rather than coordinates? I don't see why not; we would need to specify an object dtype, and maybe somewhere specify the encoding for this (JSON would be fine).
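As a sketch of what such a label coordinate could look like on the xarray side (toy data; the names `t2m` and `member` are hypothetical, not from this thread):

```python
import numpy as np
import xarray as xr

# Hypothetical ensemble-member labels stored as an object-dtype coordinate
labels = np.array(["mem1", "mem2", "mem3"], dtype=object)
ds = xr.Dataset(
    {"t2m": (("member", "x"), np.arange(6.0).reshape(3, 2))},
    coords={"member": ("member", labels)},
)

print(ds["member"].dtype)                    # object
print(ds.sel(member="mem2")["t2m"].values)   # select by label
```

Selecting by label (`ds.sel(member="mem2")`) then works just like any other coordinate, which is what makes the labelled dimension attractive here.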
@rsignell-usgs I think having the ensembles along a dimension would be neater, although I'm not entirely sure if we can add a dimension using kerchunk? I could maybe modify the crs dimension to achieve this? This will work, however:
If xarray can concat it, kerchunk can :)
In xarray concat it's possible to specify a new dimension name. |
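For reference, this is what that looks like in plain xarray (toy data, hypothetical names): passing a name that is not an existing dimension to `concat` creates a brand-new dimension, which can then be labelled.

```python
import numpy as np
import xarray as xr

# Three hypothetical single-member datasets sharing the same dims
members = [
    xr.Dataset({"t2m": (("x", "y"), np.full((4, 4), float(i)))})
    for i in range(3)
]

# "member" is not an existing dimension, so concat creates it
ens = xr.concat(members, dim="member")
ens = ens.assign_coords(member=["m00", "m01", "m02"])

print(ens["t2m"].shape)  # (3, 4, 4)
```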
I added the following test to the codebase to show that kerchunk already supports this:

```python
def test_var_dimension(refs):
    mzz = MultiZarrToZarr(
        [refs["simple1"], refs["simple_var1"]],
        remote_protocol="memory",
        coo_map={"varname": "VARNAME", "var": "output"},
    )
    out = mzz.translate()
    z = xr.open_dataset(
        "reference://",
        backend_kwargs={
            "storage_options": {"fo": out, "remote_protocol": "memory"},
            "consolidated": False,
        },
        engine="zarr",
        chunks={},
    )
    assert list(z) == ["output"]
    assert list(z.dims) == ["varname", "x", "y"]
    assert list(z.varname) == ["data", "datum"]
    assert z.output.shape == (2, 10, 10)
    assert (z.output.values == arr).all()
```

We say that dimension `"varname"` is created from the names of the variables in the input, and that the variable name in the output is to be `"output"`.
Will leave this open pending a successful example more complex than my test case.
Hopefully @peterm790 can try this out on the National Water Model ensemble members!
Okay, so I have had some success creating the ensemble using the approach above.
(Here's the full workflow: https://gist.github.com/peterm790/c8c08a1e23aacfa007844cbc5affef66.) I did however have to modify the regex used. I am not sure if that is necessary, and it could well be due to me being quite new to regex, although perhaps having an option for the user to set this would help. Then a final issue is what the above results in.
In fact the above does not work, as all variables are only NaN. The following will map all coordinates correctly, but the variables are all still NaN:
Which is odd, as the mappings within the final JSONs (for instance ['nudge/0.0.0']) are identical between this and the initial version, which renamed the variables using postprocess.
The following regex appears to work: Also I see
By the way, you may find it helpful in debugging this kind of thing to turn on logging, e.g.:

```python
fsspec.utils.setup_logging(logger_name="fsspec.reference")  # the reference file systems
fsspec.utils.setup_logging(logger_name="s3fs")              # S3 calls
```
Aah right! Sorry, forgetting to change the "remote_protocol" was a very silly error. In that case, the following workflow works to create a virtual dataset from the NWM ensemble: https://nbviewer.org/gist/peterm790/6e4826ed83218c111f71c9fe95e69163
Very cool Peter! Just a note to answer an issue I saw you raised in that notebook -- that you can't create the JSON for the new data without actually accessing the files, because the byte ranges change. Unless the data is stored with no compression (e.g. NetCDF3), the byte ranges will always change, because the blocks compress by different amounts as the values change. And even for NetCDF3, the byte ranges would likely change because the metadata changes. Make sense?
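The point about compressed block sizes can be illustrated with plain zlib (an extreme toy case, not NWM data): two arrays of identical shape and uncompressed size compress to very different byte lengths depending on their values, so the byte offsets of any chunks that follow them cannot be predicted in advance.

```python
import zlib
import numpy as np

rng = np.random.default_rng(0)
smooth = np.zeros(1000)          # highly compressible "model run"
noisy = rng.normal(size=1000)    # barely compressible "model run"

c_smooth = zlib.compress(smooth.tobytes())
c_noisy = zlib.compress(noisy.tobytes())

# Same uncompressed size, very different compressed sizes --
# so the byte ranges of everything after these chunks shift too.
print(len(smooth.tobytes()), len(c_smooth), len(c_noisy))
```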
@rsignell-usgs: I can't remember, is CRS supposed to be considered a coordinate, or is it "identical" in all inputs? It doesn't matter that it has 300 8-byte chunks, but perhaps it's a little untidy.
@rsignell-usgs thanks. Yes, this makes sense; a pity, as it would be pretty neat to automatically generate kerchunk references for new model runs. @martindurant CRS is identical across each ensemble member (as well as across time).
And yes, as @peterm790 said, CRS will always be just a single value for a variable, and usually just a single value for the entire dataset.
Not really sure where I was going with this, but just to explore removing CRS as a variable while keeping the CRS attributes, I set this up: https://nbviewer.org/gist/peterm790/35ca1858db8981c3d4a234f2f48b5975
Again, not really sure if this is necessary, but it is maybe a good demonstration of how to remove variables as well as how to modify attributes using postprocess.
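The same idea can be sketched in plain xarray on a toy dataset (variable names hypothetical): drop the scalar grid-mapping variable but copy its attributes up onto the dataset so the CRS metadata survives.

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {
        "t2m": (("x",), np.arange(3.0)),
        # Scalar CRS variable whose only payload is its attributes
        "crs": xr.DataArray(0, attrs={"grid_mapping_name": "lambert_conformal_conic"}),
    }
)

# Remove the CRS variable but keep its metadata on the dataset
crs_attrs = dict(ds["crs"].attrs)
ds2 = ds.drop_vars("crs").assign_attrs(crs_attrs)

print("crs" in ds2, ds2.attrs["grid_mapping_name"])
```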
@peterm790, please take a look at https://discourse.pangeo.io/t/efficient-access-of-ensemble-data-on-aws/2530 and see if you think the current version of kerchunk could handle it!