WIP: Add rechunking example #197
Conversation
Great to see you trying this! First time someone has tried to use these two libraries together!

I don't think you should import the rechunk primitive from Cubed. Instead, open the kerchunked dataset as an Xarray dataset using cubed-xarray, then call Xarray's chunk method with the desired chunks. That should smooth out the low-level chunk-type considerations for you, and if it doesn't, that's a bug.
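For concreteness, a minimal sketch of that suggested workflow, assuming the combined kerchunk references live in a file called combined.json (a hypothetical path) and that cubed-xarray is installed so its chunk manager is registered:

```python
# Sketch of the suggested approach: open the kerchunk references as an
# Xarray dataset backed by Cubed arrays, then rechunk with .chunk().
# "combined.json" is a hypothetical path; the target chunk sizes below
# are taken from the PR description.
import fsspec
import xarray as xr

fs = fsspec.filesystem("reference", fo="combined.json")
ds = xr.open_dataset(
    fs.get_mapper(""),
    engine="zarr",
    backend_kwargs={"consolidated": False},
    chunked_array_type="cubed",  # cubed-xarray registers this chunk manager
    chunks={},                   # open lazily with the on-disk chunking
)
rechunked = ds.chunk({"Time": 5, "south_north": 25, "west_east": 32})
```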
On Fri, Jul 19, 2024, 12:24 PM, Timothy Hodson wrote:
This PR adds an example script demonstrating how to rechunk a VirtualiZarr dataset with Cubed.

However, this is still a WIP. I'm creating the PR to elicit feedback about what changes might be necessary for the script to run as intended. @TomNicholas and @norlandrhagen might have some thoughts.

After creating the combined virtual dataset, I specify the source chunking before passing it off to Cubed for rechunking:
```python
source_chunks = {'Time': 1, 'south_north': 250, 'west_east': 320}
combined_chunked = combined_ds.chunk(chunks=source_chunks)
combined_chunked.chunks
```
which returns:

```
Frozen({'Time': (1, 1, 1, 1), 'south_north': (250,), 'west_east': (320,), 'interp_levels': (9,), 'soil_layers_stag': (4,)})
```
The virtual dataset contains four files, indicated by `'Time': (1, 1, 1, 1)`.
Then I attempt to rechunk:
```python
from cubed.primitive.rechunk import rechunk

target_chunks = {'Time': 5, 'south_north': 25, 'west_east': 32}
rechunk(
    combined_chunked['TMAX'],  # requires a shape attr, so can't pass the full Dataset
    target_chunks=target_chunks,
    source_array_name='virtual',
    int_array_name='temp',
    allowed_mem=2000,
    reserved_mem=1000,
    target_store="test.zarr",
    # temp_store="s3://cubed-thodson-temp",
)
```
which errors with
```
TypeError: can't multiply sequence by non-int of type 'tuple'
```
Apparently, Cubed won't tolerate the Time chunk tuple `'Time': (1, 1, 1, 1)`. Is there a simple way to convert it to `'Time': (1,)`? Alternatively, I could prepare a PR to Cubed that sets the memory constraint based on the largest chunk size when chunks are variable.
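As a stopgap on the user side, the per-dimension block tuples could be collapsed to a single size per dimension before calling rechunk. A hedged sketch (`uniform_chunks` is a hypothetical helper, not a Cubed or Xarray API):

```python
# Hypothetical helper: reduce Xarray's per-dimension block-size tuples,
# e.g. 'Time': (1, 1, 1, 1), to one int per dimension by taking the
# largest block, which is the form a uniform-chunks API expects.
def uniform_chunks(ds):
    return {dim: max(sizes) for dim, sizes in ds.chunks.items()}

uniform_chunks(combined_chunked)
# -> {'Time': 1, 'south_north': 250, 'west_east': 320,
#     'interp_levels': 9, 'soil_layers_stag': 4}
```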
kerchunk
I also tested this workflow with kerchunk, but I ran into a bug while following the Pythia cookbook example (https://projectpythia.org/kerchunk-cookbook/notebooks/foundations/02_kerchunk_multi_file.html):
```
/home/runner/miniconda3/envs/kerchunk-cookbook/lib/python3.10/site-packages/kerchunk/combine.py:370: UserWarning: Concatenated coordinate 'Time' contains less than expected number of values across the datasets: [0]
  warnings.warn(
```
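For reference, the combine step in that cookbook looks roughly like the sketch below; `reference_jsons` and the dimension arguments are assumptions based on the dataset described above, not the exact cookbook code:

```python
# Approximate shape of the cookbook's combine step that emitted the
# warning above. `reference_jsons` (a list of per-file kerchunk
# reference sets) is a hypothetical name.
from kerchunk.combine import MultiZarrToZarr

mzz = MultiZarrToZarr(
    reference_jsons,
    concat_dims=["Time"],
    identical_dims=["south_north", "west_east"],
)
combined_refs = mzz.translate()
```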
Commit Summary
- 5b3a12f Add rechunking example

File Changes (4 files, https://github.com/zarr-developers/VirtualiZarr/pull/197/files)
- A examples/rechunking/Dockerfile_virtualizarr (59)
- A examples/rechunking/README.md (15)
- A examples/rechunking/cubed-rechunk.py (81)
- A examples/rechunking/requirements.txt (9)
Yes, that seems to work, but I'm still working through several errors when I write out to Zarr. I'll report more in a day or two.
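(For context, the write step being debugged here is essentially Xarray's to_zarr, which forces execution of the Cubed plan. A minimal sketch, reusing the hypothetical `rechunked` dataset from the sketch above and a placeholder store path:)

```python
# Writing the rechunked dataset triggers computation of the Cubed plan.
# "rechunked.zarr" is a placeholder store path.
rechunked.to_zarr("rechunked.zarr", mode="w")
```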
Great, very curious to see the details. I think what you're doing here should live in the Cubed repo though: once you have the kerchunk reference files on disk, VirtualiZarr is out of the picture, and all of the rechunking is about using Cubed. I do think this use case would make an important example to have in the Cubed docs, as it's basically showing how the original …
Sounds great. Happy for this to be added as a Cubed example.
Closing this and opening a PR on Cubed: cubed-dev/cubed#520.