xarray.concat fails due to inconsistent leap year shapes #330
Hey there @kjdoore, thanks for trying out VirtualiZarr and opening a clear MRE. The ZEP003 variable-length-chunks limitation is known. There is some discussion around it in a few issues in this repo and in the zarr-python repo, including, recently, this issue in the icechunk repo. A while ago I made a VirtualiZarr reference to gridMET that points to the data on the ClimatologyLab's site. Happy to share the reference and a gist for generating it if that's helpful.
Yes, unfortunately @kjdoore, this is a fundamental limitation of the whole "virtual dataset" approach (chunk sizes must be compatible with the Zarr data model, which currently requires regular-length chunks), so whilst we are planning to implement this, it's going to take many months at least until this works.
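To make the constraint concrete, here is a small sketch (plain NumPy, not VirtualiZarr code) of why a run of yearly files cannot be described by any single chunk size, assuming a leap year comes first:

```python
# The current Zarr data model allows one fixed chunk length per dimension,
# with only the final chunk permitted to be short. A year-per-file layout
# would need mid-array chunks of different lengths, which no single size fits.
import numpy as np

lengths = [366, 365, 365, 365]        # days contributed by each yearly file
boundaries = np.cumsum(lengths)[:-1]  # where the interior chunk edges fall
print(all(b % lengths[0] == 0 for b in boundaries))  # False: grid is irregular
```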
xref #12
That is what I figured and wanted to confirm. Thanks for the info and I am looking forward to this eventually being implemented.
@norlandrhagen, I have been able to make a virtual Zarr store for the gridMET data using Kerchunk, but was hoping to also be able to do so with VirtualiZarr. If you have an example using VirtualiZarr and would be willing to share, it would be much appreciated!
FYI Kerchunk has this same limitation around variable-length chunks.
@kjdoore here is a gist: https://gist.github.com/norlandrhagen/dc4da9aad6ceb52ec4871b9689e5e5aa
I have looked into this further and have found that the issue lies in the chunking of the time coordinate, not the data variables. The variables in the gridMET data are all uniformly chunked and should have no problem being virtualized; even with the leap year included, the NetCDFs are consistently chunked. However, the coordinates are not chunked, which means that for the time coordinate, leap years have different "chunking" than non-leap years. This is where VirtualiZarr gets snagged. Since the coordinates have variable-length chunks they cannot be concatenated, but if they were uniformly chunked, even if some chunks were partial, they could be. To help explain this, I have made a gist of a toy example: https://gist.github.com/kjdoore/7b1dc18c5459cb56482278bc8198517e

I find this interesting, as it is typical to chunk only the data variables and not the coordinates, so I would expect a situation like this not to be a problem. Let me know what you think.
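For reference, a sketch in the spirit of that toy example (hypothetical file names and sizes, not the gist itself): explicitly chunking the coordinate at the same size as the data variable makes every file declare the same chunk size, so only the partial trailing chunks differ.

```python
import numpy as np
import xarray as xr

# One leap-year file and one non-leap-year file, mirroring gridMET's layout.
for year, ndays in [(2000, 366), (2001, 365)]:
    ds = xr.Dataset(
        {"pr": ("day", np.random.rand(ndays))},
        coords={"day": np.arange(ndays)},
    )
    ds.to_netcdf(
        f"pr_{year}.nc",
        encoding={
            "pr": {"chunksizes": (61,)},
            "day": {"chunksizes": (61,)},  # chunk the coordinate too
        },
    )
```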
I think that anything that is possible in the file format will eventually be exercised by a real dataset. Every weird edge case will eventually come up if we process enough data.
Wouldn't you want to load the `time` coordinate anyway?
Thanks for looking deeper @kjdoore. FYI all that matters is the chunking of each individual variable, not of the files as a whole.

If I understand correctly, you could just load the `time` variable instead of virtualizing it:

```python
vds = open_virtual_dataset(<file>, loadable_variables=['time'], decode_times=True, indexes={})
```

If that doesn't work then let me know and I'll try running your example + notebook. If that does work we should make a note of it in the docs on `loadable_variables`.
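For context, a hedged sketch of how that fits into the full workflow (file names are hypothetical, the coordinate is assumed to be named `day` as in the gridMET files, and the `xr.concat` keywords are the ones the VirtualiZarr docs suggest for virtual datasets):

```python
import xarray as xr
from virtualizarr import open_virtual_dataset

files = [f"pr_{year}.nc" for year in range(2000, 2004)]
vds_list = [
    open_virtual_dataset(f, loadable_variables=["day"], decode_times=True, indexes={})
    for f in files
]
# The coordinate is now held as real in-memory values rather than virtual
# chunk references, so its irregular per-file chunk lengths no longer matter.
combined = xr.concat(vds_list, dim="day", coords="minimal", compat="override")
```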
@TomNicholas, that did the trick. Loading in the time coordinate with `loadable_variables` resolved the error and the virtual datasets now concatenate.

If you would like me to open up a PR with some updates to the docs on this, I'd be happy to.
Great!
That would be amazing, thank you @kjdoore - I basically just want to add this case to the list of scenarios in which you might want to use `loadable_variables`.
As a final question, is there a reason one would not want to always include all coordinates in `loadable_variables`?
Good question. One answer is that anything duplicated could become out of sync with the referenced original files, but I think now that icechunk exists we never want anyone to try to update values in a file without creating (and committing) new virtual references for the affected chunks.

The main reason, though, is that coordinates can be N-dimensional in general, in which case you might use a lot of storage duplicating them. But for low-dimensional coordinates I agree we should recommend people load them. Perhaps we should even add another option for loading them automatically.

Also note that the kwarg is called `loadable_variables`, so it is not limited to coordinates.
I am trying to make a virtual Zarr store for some daily gridMET data. The data are stored in yearly NetCDF files, which results in some data files having an additional day due to leap years. I can read them in as virtual datasets, but when I go to concatenate them, an error is thrown saying the arrays have inconsistent chunk shapes.
I have checked and confirmed all NetCDF files have the same chunking, with a chunk shape along the `day` dimension of 61. Is this error actually due to the chunk shapes, or is it due to the inconsistent data shape between files? Any help or insight would be appreciated!

Here is a minimal reproducible example:
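A sketch of the example's shape, with hypothetical `pr_<year>.nc` file names standing in for the gridMET downloads:

```python
import xarray as xr
from virtualizarr import open_virtual_dataset

files = [f"pr_{year}.nc" for year in range(2000, 2003)]
vds_list = [open_virtual_dataset(f, indexes={}) for f in files]

# Raises an inconsistent-chunk-shape error: the unchunked `day` coordinate is
# a single chunk per file, 366 entries long in leap years and 365 otherwise.
combined = xr.concat(vds_list, dim="day", coords="minimal", compat="override")
```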