
Maybe Refactor for CodecPipeline + Store #13

Closed
ilan-gold opened this issue Sep 27, 2024 · 9 comments · Fixed by #22

Comments

@ilan-gold
Owner

ilan-gold commented Sep 27, 2024

A good idea from Norman: https://ossci.zulipchat.com/#narrow/stream/423692-Zarr/topic/zarr-python.20v.20tensorstore.20benchmark.3A.20sharded.20.20read.20.2F.20writes/near/447185735 to use https://github.com/zarr-developers/zarr-python/blob/5ca080d0fd21dbb2b6a8b101cc99d86de5039f08/src/zarr/codecs/pipeline.py#L59 as the entry point for rust

I'm not sure exactly how to go about this, since it wasn't really documented at the time I created this library (although it seems a sprint is coming soon: zarr-developers/zarr-python#2215), so I went with whatever looked most familiar in the absence of something clearer. I don't think much would have to change, since the arguments to the rust interface should be basically identical, but it could be a simpler way to integrate...although reopening the array so rust has the right metadata could be tricky. I don't think Store really handles opening arrays, so I don't know how this would actually help with the metadata issue...but maybe there's a trick here I'm missing.

@ilan-gold ilan-gold changed the title Maybe Refactor for CodecPipeline Maybe Refactor for CodecPipeline + Store Sep 27, 2024
@ilan-gold
Owner Author

This would also have the side-effect of getting rid of the horrifying fork....

@ilan-gold
Owner Author

ilan-gold commented Sep 27, 2024

Another thing about this: falling back to a "working" implementation. Right now it's easy: the library just "works" whether or not the rust codepath can be used. But I'm not sure how we would fall back to a different codec pipeline...maybe by instantiating a different pipeline within our pipeline and then calling it conditionally?
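A minimal sketch of that conditional-fallback idea (all names here are illustrative stand-ins, not zarr-python or zarrs APIs): try to build a Rust-backed pipeline from the codec chain, and delegate to a pure-Python pipeline when the chain is unsupported.

```python
# Sketch of the fallback idea; every name here is hypothetical.

class RustUnsupportedError(Exception):
    """Raised when the (imagined) Rust extension can't handle a codec chain."""


class PythonPipeline:
    """Stand-in for zarr-python's default BatchedCodecPipeline."""

    def __init__(self, codecs):
        self.codecs = list(codecs)


def build_rust_pipeline(codecs):
    # Stand-in for the Rust binding; pretend it only supports these codecs.
    supported = {"bytes", "blosc"}
    if not set(codecs) <= supported:
        raise RustUnsupportedError
    return ("rust", list(codecs))


def make_pipeline(codecs):
    # Prefer the Rust-backed pipeline; fall back to pure Python otherwise.
    try:
        return build_rust_pipeline(codecs)
    except RustUnsupportedError:
        return PythonPipeline(codecs)
```

The real version would attempt the zarrs-backed construction in `from_codecs` and hold an inner default pipeline to call conditionally, as suggested above.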

@LDeakin
Collaborator

LDeakin commented Sep 27, 2024

This would also have the side-effect of getting rid of the horrifying fork....

+1. This seems like a good direction. It looks like subbing in a Rust codec backend would just be a matter of changing the default codec pipeline.

although reopening the array so rust has the right metadata could be tricky. I don't think Store really handles opening arrays, so I don't know how this would actually help with the metadata issue...but maybe there's a trick here I'm missing

Scanning over the zarrs-python source I think this is possible without any store interaction or array reopening. It looks like codec pipelines are created from array metadata, and those codecs in ArrayV3Metadata seem to include their metadata. A zarr-python CodecPipeline that wraps a zarrs CodecChain created with from_metadata would probably work. If from_metadata fails, zarrs does not support the codecs and a fallback can be used.

Also, zarrs codecs and the CodecChain do not interface with stores unless using partial decoders. So, I'm not sure wrapping any zarrs stores is necessary. However, there needs to be a bridge for the zarr-python ByteGetter for partial decoding.
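That bridge could look something like this sketch (all names hypothetical, not the real zarr-python or zarrs interfaces): an object satisfying a ByteGetter-like protocol with an async `get`, plus a synchronous shim for the Rust FFI boundary, which is where partial decoding would request byte ranges.

```python
# Illustrative ByteGetter bridge; names are stand-ins, not real APIs.
import asyncio


class InMemoryByteGetter:
    """Stand-in for a zarr-python ByteGetter-style object."""

    def __init__(self, data: bytes):
        self._data = data

    async def get(self, byte_range=None):
        # byte_range is (start, length); None means the whole value.
        if byte_range is None:
            return self._data
        start, length = byte_range
        return self._data[start:start + length]


def fetch_range_for_rust(getter, start, length):
    # If the Rust extension were synchronous at the FFI boundary, one
    # option is to drive the async getter to completion and hand the
    # resulting bytes across.
    return asyncio.run(getter.get((start, length)))
```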

@LDeakin
Collaborator

LDeakin commented Sep 28, 2024

Although, what you have at the moment is probably the best way to squeeze performance out of zarrs. I've not looked into how zarr-python is implemented above the CodecPipeline for operations like merging decoded chunks into a contiguous array, etc.

@ilan-gold
Owner Author

ilan-gold commented Sep 28, 2024

Scanning over the zarrs-python source I think this is possible without any store interaction or array reopening. It looks like codec pipelines are created from array metadata, and those codecs in ArrayV3Metadata seem to include their metadata

I looked into this a bit...unfortunately create_codec_pipeline only passes in the codec itself for construction, and, despite the name, Metadata is just a dataclass for serialization purposes, nothing to do with actual metadata.

Here's a little code snippet:

import zarr
import numpy as np

# NB: LocalStore.open is a coroutine, so this needs to run in an async context
store = await zarr.store.LocalStore.open('foo.zarr', mode="w")
arr = zarr.Array.create(
    store,
    shape=(100,),
    chunks=(10,),
    dtype="i4",
    codecs=[zarr.codecs.BytesCodec(), zarr.codecs.BloscCodec()],
)
arr[:] = np.arange(100)
arr._async_array.codec_pipeline
(arr._async_array.store_path / "c/0").get

I don't see anything about metadata above, and the pipelines are called with the paths attached via the chunk encodings:

https://github.com/zarr-developers/zarr-python/blob/73b884bdabc3518236d5f8435dee8dfe319babd0/src/zarr/core/array.py#L600-L612

i.e., self.store_path / self.metadata.encode_chunk_key(chunk_coords) has its own get method

@LDeakin
Collaborator

LDeakin commented Sep 28, 2024

I looked into this a bit...unfortunately create_codec_pipeline only passes in the codec itself for construction, and, despite the name, Metadata is just a dataclass for serialization purposes, nothing to do with actual metadata.

import json
from collections.abc import Iterable
from typing import Self

import zarr
from zarr.abc.codec import Codec
from zarr.codecs.pipeline import BatchedCodecPipeline
from zarr.registry import register_pipeline

class CustomCodecPipeline(BatchedCodecPipeline):
    @classmethod
    def from_codecs(cls, codecs: Iterable[Codec]) -> Self:
        for codec in codecs:
            print(json.dumps(codec.to_dict()))
        # TODO: build the zarrs CodecChain from this metadata instead
        return super().from_codecs(codecs)

zarr.config.set(codec_pipeline={"path": "__main__.CustomCodecPipeline"})

register_pipeline(CustomCodecPipeline)

arr = zarr.zeros((100,), chunks=(10,), dtype='i4', codecs=[zarr.codecs.BytesCodec(), zarr.codecs.BloscCodec()])

Outputs

{"name": "bytes", "configuration": {"endian": "little"}}
{"name": "blosc", "configuration": {"typesize": 4, "cname": "zstd", "clevel": 5, "shuffle": "shuffle", "blocksize": 0}}

That looks to be enough to create a zarrs CodecChain at least
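Concretely, the per-codec dicts printed above could be collected into one JSON document to hand across the FFI boundary. A sketch, with `SimpleCodec` as a stand-in for zarr-python codec objects (which expose `to_dict()`) and `codec_chain_metadata` as a hypothetical helper:

```python
# Sketch: gather per-codec metadata into the JSON payload that would be
# used to construct a zarrs CodecChain on the Rust side.
import json


class SimpleCodec:
    """Stand-in for a zarr-python codec with a to_dict() method."""

    def __init__(self, name, configuration):
        self.name = name
        self.configuration = configuration

    def to_dict(self):
        return {"name": self.name, "configuration": self.configuration}


def codec_chain_metadata(codecs) -> str:
    # One JSON array, in codec order, matching the output shown above.
    return json.dumps([c.to_dict() for c in codecs])
```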

@ilan-gold
Owner Author

That looks to be enough to create a zarrs CodecChain at least

Definitely! Implicit in that comment was wanting to keep the I/O in rust as well, so we could parallelize there as well as in the codecs, but it's definitely worth experimenting with I/O in python + codecs in rust...or maybe some way to do I/O in rust via a custom Store. I could imagine a Store whose get calls add data to some queue backed by a threadpool, so you'd get the parallelization, with the codecs implemented separately. We'd have to be careful about memory so as not to return anything unwanted to python. And then we'd maybe also want the combining of the decoded chunks in rust as well, but maybe dlpack + an array library is good enough?

@LDeakin
Collaborator

LDeakin commented Sep 29, 2024

Implicit in that comment was wanting to keep the I/O in rust as well, so we could parallelize there as well as in the codecs

Sounds optimal!

Since all I/O and encoding/decoding seems to happen under the Python CodecPipeline, maybe a custom implementation can throw everything at Rust, and no custom store is ever created on the Python side (provided that zarrs has a compatible store). It is not too dissimilar from what you already do on the Rust side.

@ilan-gold
Owner Author

@LDeakin 100% agree. It should be the same rust API, all that changes is the python (in theory). I think we just need to make sure we can decode the chunk key in the pipeline (instead of relying on the file path).
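For reference, zarr v3's default chunk key encoding joins "c" with the chunk grid coordinates using a separator (e.g. "c/0/3"), so recovering coordinates inside the pipeline is a matter of splitting the key. A minimal sketch (the helper name is illustrative):

```python
# Decode a zarr v3 default-encoded chunk key ("c/<i>/<j>/...") back into
# chunk grid coordinates, instead of relying on a file path.
def decode_chunk_key(key: str, separator: str = "/") -> tuple[int, ...]:
    parts = key.split(separator)
    assert parts[0] == "c", "v3 default chunk keys start with 'c'"
    return tuple(int(p) for p in parts[1:])
```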
