WIP: Add rechunking example #197
Conversation
Great to see you trying this! First time someone has tried to use these two libraries together!

I don't think you should import the rechunk primitive from Cubed. Instead, open the kerchunked dataset as an Xarray dataset using cubed-xarray, then call Xarray's chunk method with the desired chunks. That should smooth out the low-level chunk-type considerations for you, and if it doesn't, that's a bug.
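For concreteness, a minimal sketch of that suggested workflow, assuming the combined kerchunk references live in a file called combined.json (a hypothetical path) and that cubed-xarray is installed so its chunk manager is registered:

```python
# Sketch of the suggested approach: open the kerchunk references as an
# Xarray dataset backed by Cubed arrays, then rechunk with .chunk().
# "combined.json" is a hypothetical path; the target chunk sizes below
# are taken from the PR description.
import fsspec
import xarray as xr

fs = fsspec.filesystem("reference", fo="combined.json")
ds = xr.open_dataset(
    fs.get_mapper(""),
    engine="zarr",
    backend_kwargs={"consolidated": False},
    chunked_array_type="cubed",  # cubed-xarray registers this chunk manager
    chunks={},                   # open lazily with the on-disk chunking
)
rechunked = ds.chunk({"Time": 5, "south_north": 25, "west_east": 32})
```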
On Fri, Jul 19, 2024, 12:24 PM, Timothy Hodson wrote:
This PR adds an example script demonstrating how to rechunk a VirtualiZarr dataset with Cubed.

However, this is still a WIP. I'm creating the PR to elicit feedback about what changes might be necessary for the script to run as intended. @TomNicholas and @norlandrhagen might have some thoughts.

After creating the combined virtual dataset, I specify the source chunking before passing it off to Cubed for rechunking:
```python
source_chunks = {'Time': 1, 'south_north': 250, 'west_east': 320}
combined_chunked = combined_ds.chunk(chunks=source_chunks)
combined_chunked.chunks
```
which returns:

```
Frozen({'Time': (1, 1, 1, 1), 'south_north': (250,), 'west_east': (320,), 'interp_levels': (9,), 'soil_layers_stag': (4,)})
```
The virtual dataset contains four files, indicated by `'Time': (1, 1, 1, 1)`.
Then I attempt to rechunk:
```python
from cubed.primitive.rechunk import rechunk

target_chunks = {'Time': 5, 'south_north': 25, 'west_east': 32}
rechunk(
    combined_chunked['TMAX'],  # requires a shape attr, so can't pass the full Dataset
    target_chunks=target_chunks,
    source_array_name='virtual',
    int_array_name='temp',
    allowed_mem=2000,
    reserved_mem=1000,
    target_store="test.zarr",
    # temp_store="s3://cubed-thodson-temp",
)
```
which errors with
```
TypeError: can't multiply sequence by non-int of type 'tuple'
```
Apparently, Cubed won't tolerate the Time chunk tuple `'Time': (1, 1, 1, 1)`. Is there a simple way to convert it to `'Time': (1,)`? Alternatively, I could prepare a PR to Cubed that sets the memory constraint based on the largest chunk size when chunks are variable.
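As a stopgap on the user side, the per-dimension block tuples could be collapsed to a single size per dimension before calling rechunk. A hedged sketch (`uniform_chunks` is a hypothetical helper, not a Cubed or Xarray API):

```python
# Hypothetical helper: reduce Xarray's per-dimension block-size tuples,
# e.g. 'Time': (1, 1, 1, 1), to one int per dimension by taking the
# largest block, which is the form a uniform-chunks API expects.
def uniform_chunks(ds):
    return {dim: max(sizes) for dim, sizes in ds.chunks.items()}

uniform_chunks(combined_chunked)
# -> {'Time': 1, 'south_north': 250, 'west_east': 320,
#     'interp_levels': 9, 'soil_layers_stag': 4}
```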
kerchunk
I also tested this workflow with kerchunk, but I ran into a bug while following the Pythia cookbook example (https://projectpythia.org/kerchunk-cookbook/notebooks/foundations/02_kerchunk_multi_file.html):
```
/home/runner/miniconda3/envs/kerchunk-cookbook/lib/python3.10/site-packages/kerchunk/combine.py:370: UserWarning: Concatenated coordinate 'Time' contains less than expected number of values across the datasets: [0]
  warnings.warn(
```
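For reference, the combine step in that cookbook looks roughly like the sketch below; `reference_jsons` and the dimension arguments are assumptions based on the dataset described above, not the exact cookbook code:

```python
# Approximate shape of the cookbook's combine step that emitted the
# warning above. `reference_jsons` (a list of per-file kerchunk
# reference sets) is a hypothetical name.
from kerchunk.combine import MultiZarrToZarr

mzz = MultiZarrToZarr(
    reference_jsons,
    concat_dims=["Time"],
    identical_dims=["south_north", "west_east"],
)
combined_refs = mzz.translate()
```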
Commit Summary
- 5b3a12f Add rechunking example

File Changes (4 files, https://github.com/zarr-developers/VirtualiZarr/pull/197/files)
- A examples/rechunking/Dockerfile_virtualizarr (59)
- A examples/rechunking/README.md (15)
- A examples/rechunking/cubed-rechunk.py (81)
- A examples/rechunking/requirements.txt (9)
Yes, that seems to work, but I'm still working through several errors when I write out to Zarr. I'll report more in a day or two.
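(For context, the write step being debugged here is essentially Xarray's to_zarr, which forces execution of the Cubed plan. A minimal sketch, reusing the hypothetical `rechunked` dataset from the sketch above and a placeholder store path:)

```python
# Writing the rechunked dataset triggers computation of the Cubed plan.
# "rechunked.zarr" is a placeholder store path.
rechunked.to_zarr("rechunked.zarr", mode="w")
```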
Great, very curious to see the details. I think what you're doing here should live in the Cubed repo though: once you have the kerchunk reference files on disk, VirtualiZarr is out of the picture, and all of the rechunking is about using Cubed. I do think this use case would make an important example to have in the Cubed docs, as it's basically showing how the original …
Sounds great. Happy for this to be added as a Cubed example.
Closing this and opening a PR on Cubed: cubed-dev/cubed#520.