
Maybe Refactor for CodecPipeline + Store #13

Closed
ilan-gold opened this issue Sep 27, 2024 · 9 comments · Fixed by #22

Comments

@ilan-gold
Owner

ilan-gold commented Sep 27, 2024

A good idea from Norman: https://ossci.zulipchat.com/#narrow/stream/423692-Zarr/topic/zarr-python.20v.20tensorstore.20benchmark.3A.20sharded.20.20read.20.2F.20writes/near/447185735 to use https://github.com/zarr-developers/zarr-python/blob/5ca080d0fd21dbb2b6a8b101cc99d86de5039f08/src/zarr/codecs/pipeline.py#L59 as the entry point for rust

I'm not sure exactly how to go about this, since it wasn't really documented at the time I created this library (although it seems a sprint is coming soon: zarr-developers/zarr-python#2215), so I went with whatever looked most familiar in the absence of something clearer. I don't think much would have to change, since the arguments to the rust interface should be basically identical, but it could be a simpler way to integrate...although reopening the array so rust has the right metadata could be tricky. I don't think Store really handles opening arrays, so I don't know how this would actually help with the metadata issue...but maybe there's a trick here I'm missing.

@ilan-gold ilan-gold changed the title Maybe Refactor for CodecPipeline Maybe Refactor for CodecPipeline + Store Sep 27, 2024
@ilan-gold
Owner Author

This would also have the side-effect of getting rid of the horrifying fork....

@ilan-gold
Owner Author

ilan-gold commented Sep 27, 2024

Another thing about this: falling back to a "working" implementation. Right now it's easy: the library just "works" whether or not the rust codepath can be used. But I'm not sure how we would fall back to a different codec pipeline...maybe by instantiating a different pipeline within our pipeline and then calling it conditionally?
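A minimal sketch of that conditional-fallback idea (all names here are illustrative stand-ins, not zarr-python or zarrs APIs): try to build a Rust-backed pipeline from the codec chain, and delegate to a pure-Python pipeline when the chain is unsupported.

```python
# Sketch of the fallback idea; every name here is hypothetical.

class RustUnsupportedError(Exception):
    """Raised when the (imagined) Rust extension can't handle a codec chain."""


class PythonPipeline:
    """Stand-in for zarr-python's default BatchedCodecPipeline."""

    def __init__(self, codecs):
        self.codecs = list(codecs)


def build_rust_pipeline(codecs):
    # Stand-in for the Rust binding; pretend it only supports these codecs.
    supported = {"bytes", "blosc"}
    if not set(codecs) <= supported:
        raise RustUnsupportedError
    return ("rust", list(codecs))


def make_pipeline(codecs):
    # Prefer the Rust-backed pipeline; fall back to pure Python otherwise.
    try:
        return build_rust_pipeline(codecs)
    except RustUnsupportedError:
        return PythonPipeline(codecs)
```

The real version would attempt the zarrs-backed construction in `from_codecs` and hold an inner default pipeline to call conditionally, as suggested above.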

@LDeakin
Collaborator

LDeakin commented Sep 27, 2024

This would also have the side-effect of getting rid of the horrifying fork....

+1. This seems like a good direction. It looks like subbing in a Rust codec backend would just be a matter of changing the default codec pipeline.

although reopening the array so rust has the right metadata could be tricky. I don't think Store really handles opening arrays, so I don't know how this would actually help with the metadata issue...but maybe there's a trick here I'm missing

Scanning over the zarrs-python source I think this is possible without any store interaction or array reopening. It looks like codec pipelines are created from array metadata, and those codecs in ArrayV3Metadata seem to include their metadata. A zarr-python CodecPipeline that wraps a zarrs CodecChain created with from_metadata would probably work. If from_metadata fails, zarrs does not support the codecs and a fallback can be used.

Also, zarrs codecs and the CodecChain do not interface with stores unless using partial decoders. So, I'm not sure wrapping any zarrs stores is necessary. However, there needs to be a bridge for the zarr-python ByteGetter for partial decoding.
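That bridge could look something like this sketch (all names hypothetical, not the real zarr-python or zarrs interfaces): an object satisfying a ByteGetter-like protocol with an async `get`, plus a synchronous shim for the Rust FFI boundary, which is where partial decoding would request byte ranges.

```python
# Illustrative ByteGetter bridge; names are stand-ins, not real APIs.
import asyncio


class InMemoryByteGetter:
    """Stand-in for a zarr-python ByteGetter-style object."""

    def __init__(self, data: bytes):
        self._data = data

    async def get(self, byte_range=None):
        # byte_range is (start, length); None means the whole value.
        if byte_range is None:
            return self._data
        start, length = byte_range
        return self._data[start:start + length]


def fetch_range_for_rust(getter, start, length):
    # If the Rust extension were synchronous at the FFI boundary, one
    # option is to drive the async getter to completion and hand the
    # resulting bytes across.
    return asyncio.run(getter.get((start, length)))
```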

@LDeakin
Collaborator

LDeakin commented Sep 28, 2024

Although, what you have at the moment is probably the best way to squeeze performance out of zarrs. I've not looked into how zarr-python is implemented above the CodecPipeline for operations like merging decoded chunks into a contiguous array, etc.

@ilan-gold
Owner Author

ilan-gold commented Sep 28, 2024

Scanning over the zarrs-python source I think this is possible without any store interaction or array reopening. It looks like codec pipelines are created from array metadata, and those codecs in ArrayV3Metadata seem to include their metadata

I looked into this a bit...unfortunately create_codec_pipeline only passes in the codec itself for construction, and, despite the name, Metadata is just a dataclass for serialization purposes, nothing to do with actual metadata.

Here's a little code snippet:

import zarr
import numpy as np

# NB: LocalStore.open is a coroutine, so this needs to run in an async context
store = await zarr.store.LocalStore.open('foo.zarr', mode="w")
arr = zarr.Array.create(
    store,
    shape=(100,),
    chunks=(10,),
    dtype="i4",
    codecs=[zarr.codecs.BytesCodec(), zarr.codecs.BloscCodec()],
)
arr[:] = np.arange(100)
arr._async_array.codec_pipeline
(arr._async_array.store_path / "c/0").get

I don't see anything about metadata above, and the pipelines are called with the paths attached via the chunk encodings:

https://github.com/zarr-developers/zarr-python/blob/73b884bdabc3518236d5f8435dee8dfe319babd0/src/zarr/core/array.py#L600-L612

i.e., self.store_path / self.metadata.encode_chunk_key(chunk_coords) has its own get method

@LDeakin
Collaborator

LDeakin commented Sep 28, 2024

I looked into this a bit...unfortunately create_codec_pipeline only passes in the codec itself for construction, and, despite the name, Metadata is just a dataclass for serialization purposes, nothing to do with actual metadata.

import json
from collections.abc import Iterable
from typing import Self

import zarr
from zarr.abc.codec import Codec
from zarr.codecs.pipeline import BatchedCodecPipeline
from zarr.registry import register_pipeline

class CustomCodecPipeline(BatchedCodecPipeline):
    @classmethod
    def from_codecs(cls, codecs: Iterable[Codec]) -> Self:
        for codec in codecs:
            print(json.dumps(codec.to_dict()))
        # TODO: build the zarrs CodecChain from this metadata instead
        return super().from_codecs(codecs)

zarr.config.set(codec_pipeline={"path": "__main__.CustomCodecPipeline"})

register_pipeline(CustomCodecPipeline)

arr = zarr.zeros((100,), chunks=(10,), dtype='i4', codecs=[zarr.codecs.BytesCodec(), zarr.codecs.BloscCodec()])

Outputs

{"name": "bytes", "configuration": {"endian": "little"}}
{"name": "blosc", "configuration": {"typesize": 4, "cname": "zstd", "clevel": 5, "shuffle": "shuffle", "blocksize": 0}}

That looks to be enough to create a zarrs CodecChain at least
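Concretely, the per-codec dicts printed above could be collected into one JSON document to hand across the FFI boundary. A sketch, with `SimpleCodec` as a stand-in for zarr-python codec objects (which expose `to_dict()`) and `codec_chain_metadata` as a hypothetical helper:

```python
# Sketch: gather per-codec metadata into the JSON payload that would be
# used to construct a zarrs CodecChain on the Rust side.
import json


class SimpleCodec:
    """Stand-in for a zarr-python codec with a to_dict() method."""

    def __init__(self, name, configuration):
        self.name = name
        self.configuration = configuration

    def to_dict(self):
        return {"name": self.name, "configuration": self.configuration}


def codec_chain_metadata(codecs) -> str:
    # One JSON array, in codec order, matching the output shown above.
    return json.dumps([c.to_dict() for c in codecs])
```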

@ilan-gold
Owner Author

That looks to be enough to create a zarrs CodecChain at least

Definitely! Implicit in that comment was wanting to keep the I/O in rust as well, so we could parallelize there as well as in the codecs, but it's definitely worth experimenting with I/O in python + codecs in rust...or maybe some way to do I/O in rust via a custom Store. I could imagine a Store whose get calls add data to some queue backed by a threadpool, so you'd get the parallelization, with the codecs implemented separately. We'd have to be careful about memory so as not to return anything unwanted to python. And then we'd maybe also want the combining of the decoded chunks in rust as well, but maybe dlpack + an array library is good enough?

@LDeakin
Collaborator

LDeakin commented Sep 29, 2024

Implicit in that comment was wanting to keep the I/O in rust as well, so we could parallelize there as well as in the codecs

Sounds optimal!

Since all I/O and encoding/decoding seems to happen under the Python CodecPipeline, maybe a custom implementation can throw everything at Rust, and no custom store is ever created on the Python side (provided that zarrs has a compatible store). It is not too dissimilar from what you already do on the Rust side.

@ilan-gold
Owner Author

@LDeakin 100% agree. It should be the same rust API, all that changes is the python (in theory). I think we just need to make sure we can decode the chunk key in the pipeline (instead of relying on the file path).
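For reference, zarr v3's default chunk key encoding joins "c" with the chunk grid coordinates using a separator (e.g. "c/0/3"), so recovering coordinates inside the pipeline is a matter of splitting the key. A minimal sketch (the helper name is illustrative):

```python
# Decode a zarr v3 default-encoded chunk key ("c/<i>/<j>/...") back into
# chunk grid coordinates, instead of relying on a file path.
def decode_chunk_key(key: str, separator: str = "/") -> tuple[int, ...]:
    parts = key.split(separator)
    assert parts[0] == "c", "v3 default chunk keys start with 'c'"
    return tuple(int(p) for p in parts[1:])
```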
