Refactor to a `CodecPipeline` #22

LDeakin · 2024-11-05T20:18:28Z

Resolves #2
Resolves #6
Resolves #7
Resolves #8
Resolves #13


* (fix): minimum working codec pipeline
  - Add internal `get_chunk_representation`, `retrieve_chunk_bytes`, and `store_chunk_bytes`
  - Separate `retrieve_chunk` and `retrieve_chunk_subset`
  - Separate `store_chunk` and `store_chunk_subset`
  - Add assertions to simple.py
* (fix): handle relative filesystem paths
* (fix): minimal error handling and fix clippy warnings
* (fix): handle relative filesystem paths take 2
* (fix): constant handling
* (fix): add `retrieve_chunks` for parallel read
  - Add config options
  - CodecPipelineImpl interior mutability
* (fix): add `store_chunks` for parallel write
* (fix) convert write value to contiguous array if needed
* (fix): bump zarrs to 21e86cb9
* CI for codec pipeline (#20)
  * (fix): ci for running tests
  * (fix): no need to extract tests
  * (Fix): remove duplicate name
  * (fix): use a submodule...
  * (chore): remove memory store + port zarr codec tests
  * (chore): remove `dlpark`
  * (fix): getattr for itemsize
  * (chore): remove runtime F ordering
  * (chore): skip vlen
  * (feat): parse int

  ---------

  Co-authored-by: Lachlan Deakin <[email protected]>
* (fix): support writing arrays with non-native endianness
* (fix): disable bad/unsupported invalid metadata tests
* (fix): do not store empty chunks
* (fix) remove dead code in codec pipeline
* (fix): move some selection logic from Rust to Python
* (chore): `chunks_desc` cleanup
* (feat): adding concurrency via zarr config
* (chore): remove extra comment
* (fix): refactor chunk info creation + threads->threading + ruff
* (fix): use `or` for `threading.max_workers` getting

---------

Co-authored-by: Lachlan Deakin <[email protected]>
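As a rough illustration of the "do not store empty chunks" item in the description: a chunk is commonly treated as empty when every element equals the fill value, and such a chunk can be skipped (or erased) instead of written. The sketch below is not the code from this PR. It assumes a plain `dict` standing in for the store, and the names `is_empty_chunk` and `store_chunk_sketch` are hypothetical.

```python
def is_empty_chunk(chunk: bytes, fill_value: bytes) -> bool:
    """Return True if `chunk` is exactly `fill_value` repeated end to end."""
    size = len(fill_value)
    if size == 0 or len(chunk) % size != 0:
        return False
    return chunk == fill_value * (len(chunk) // size)


def store_chunk_sketch(
    store: dict,
    key: str,
    chunk: bytes,
    fill_value: bytes,
    store_empty_chunks: bool = False,
) -> None:
    """Write `chunk` under `key`, unless it is empty and empties are not stored."""
    if not store_empty_chunks and is_empty_chunk(chunk, fill_value):
        # An empty chunk is represented by the absence of the key,
        # so any previously stored chunk is erased.
        store.pop(key, None)
        return
    store[key] = chunk
```

Under these assumptions, reading a missing key would be answered with the fill value, so skipping the write loses no information.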

ilan-gold · 2024-11-05T20:49:34Z

@LDeakin see: #23

Going to log off for the night but will check in the morning of course :)


LDeakin · 2024-11-08T05:55:55Z

I think pretty much everything is addressed on the Rust side, apart from the .itemsize() stuff

CI failure: #32

flying-sheep · 2024-11-08T08:01:09Z

The CodecPipelineStore trait looks wonderful, exactly what I had in mind, thanks!

src/lib.rs


…ndexing for read (#30)

* (chore): file structure
* (chore): parametrize tests to get full scope of possibilities
* (chore): xfail tests that fail on zarr-python default pipeline
* (fix) singular
* (fix): check for contiguous index arrays
* (fix): contiguous numpy arrays converted to slices
* (feat): add reading for non-contiguous buffers
* (chore): remove unused imports
* (fix): cleanup unwraps in `retrieve_chunks`
* Refactor full indexing (#34)
* (chore): `make_chunk_info_for_rust` cleanup
* (fix): all tests working except "tests/test_pipeline.py::test_roundtrip[vindex-contiguous_in_chunk_array-contiguous_in_chunk_array]"
* (fix): skip read in `store_chunk_subset_bytes` for full chunks
* (fix): improve dropped index detection + disallow integer write case
* (chore): message more specific
* (fix): use `Exception`
* (chore): erroneous comment
* (chore): `drop_axes` default
* (chore): `drop_axes` param
* (chore): apply review
* (chore): `else` branch
* (chore): add basic nd tests (#35)
* (chore): add basic 3d tests
* (refactor): use `pytest_generate_tests`
* (fix): clarify collapsed dimension behavior
* (chore): clean ups

---------

Co-authored-by: Lachlan Deakin <[email protected]>
Co-authored-by: Philipp A. <[email protected]>
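To make the "contiguous index arrays converted to slices" items concrete, here is a minimal sketch of the idea using plain Python sequences instead of numpy arrays. The helper name `as_slice_if_contiguous` is hypothetical and this is not the project's implementation; it only shows how an index run such as `[2, 3, 4]` can be replaced by the equivalent `slice(2, 5)`.

```python
from typing import Optional, Sequence


def as_slice_if_contiguous(indices: Sequence[int]) -> Optional[slice]:
    """Return an equivalent slice if `indices` is a run i, i+1, ..., else None."""
    if len(indices) == 0:
        return None
    start = indices[0]
    if start < 0:
        # Negative indices are left to the general (non-contiguous) path.
        return None
    for offset, index in enumerate(indices):
        if index != start + offset:
            return None
    return slice(start, start + len(indices))
```

When the helper returns `None`, the caller would fall back to the general path for non-contiguous selections.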

Cargo.toml

ilan-gold · 2024-11-15T10:07:48Z

README.md

+
+
+## `ld/codec_pipeline` branch
+```
+maturin develop -r
+./examples/simple.py
+```


I will make a separate PR for documentation; this isn't really valid anymore.

ilan-gold · 2024-11-15T10:14:12Z

python/zarrs_python/pipeline.py

+                codec_metadata_json,
+                config.get("codec_pipeline.validate_checksums", None),
+                config.get("codec_pipeline.store_empty_chunks", None),
+                config.get("codec_pipeline.concurrent_target", None),


Just so I understand (and perhaps to prompt a change of name):

This concurrency manages the pipeline-level concurrency, i.e., each request made to the pipeline may be for multiple chunks, and we parallelize over those.

Then there is the "outer" concurrency, which says how many requests to the pipeline are made at once (from rayon)?

So should these both be set in CodecPipelineImpl at once? Should they be coordinated?

On the Rust side there is:

- codec concurrency (this is what `concurrent_target` currently sets)
  - via rayon (e.g. sharding codec) or specialised methods in codecs
- chunk concurrency (set by `threading.max_workers`)
  - via rayon

But I think there is more going on in zarr-python / dask that is possibly adding another layer of chunk concurrency.
So yes, it is worth renaming concurrent_target; its usage has diverged from zarrs. Maybe chunk_thread_limit?

Maybe we don't change this just yet; I'll put some time in this weekend to get some zarrs-like auto codec/chunk concurrency.

Ok, we'll see what's up next week then.
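One way to picture the coordination question raised in this thread is a single thread budget divided between the two layers. The sketch below is only an illustration under that assumption. `default_thread_budget` and `split_concurrency` are hypothetical names, and the policy (parallelise over chunks first, hand the remainder to each codec) is a guess, not what `zarrs` or this PR does.

```python
import os
from typing import Optional


def default_thread_budget(max_workers: Optional[int]) -> int:
    """Fall back to the CPU count when no worker limit is configured."""
    # `or` treats both None and 0 as "not set".
    return max_workers or os.cpu_count() or 1


def split_concurrency(thread_budget: int, num_chunks: int) -> tuple[int, int]:
    """Split a thread budget into (chunk_concurrency, codec_concurrency).

    Chunks are parallelised first; any budget left over is given to the
    codecs working on each chunk, so the product never exceeds the budget.
    """
    if thread_budget < 1 or num_chunks < 1:
        raise ValueError("thread_budget and num_chunks must be >= 1")
    chunk_concurrency = min(thread_budget, num_chunks)
    codec_concurrency = max(1, thread_budget // chunk_concurrency)
    return chunk_concurrency, codec_concurrency
```

With a budget of 8 threads, a request for 100 chunks would run 8 chunks at a time with single-threaded codecs, while a request for 2 chunks would run both at once and leave 4 threads for each codec (for example, for the inner chunks of a shard).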

ilan-gold · 2024-11-15T10:17:16Z

src/codec_pipeline_store_filesystem.rs

+impl CodecPipelineStoreFilesystem {
+    pub fn new() -> PyResult<Self> {
+        let store = Arc::new(FilesystemStore::new("/").map_py_err::<PyRuntimeError>()?);
+        let cwd = std::env::current_dir()?


We'll get this Windows check once we move to a full build system, after we merge this PR and go public.
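For context on why a platform check matters here: the snippet roots the store at `/` and captures the current directory, which suggests that relative paths are resolved against the working directory and then expressed relative to the root. A minimal POSIX-only sketch of that mapping follows. The function name `to_store_key` is hypothetical, `cwd` is assumed to be an absolute path, and this is not the Rust implementation.

```python
from pathlib import PurePosixPath


def to_store_key(path: str, cwd: str) -> str:
    """Map a filesystem path to a key relative to a store rooted at "/".

    Relative paths are joined onto `cwd` (assumed absolute); ".." components
    are resolved lexically, without touching the filesystem.
    """
    p = PurePosixPath(path)
    if not p.is_absolute():
        p = PurePosixPath(cwd) / p
    parts: list[str] = []
    for part in p.parts[1:]:  # skip the leading "/" root component
        if part == "..":
            if parts:
                parts.pop()
        elif part != ".":
            parts.append(part)
    return "/".join(parts)
```

The single `/` root has no direct equivalent on Windows, where absolute paths carry a drive letter, so this scheme would need separate handling there.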

* (fix): support zarr 3.0.0b2
* (fix): open store read_only in test_roundtrip_read_only_zarrs
* (fix): pin zarrs revision I think an old version is cached?
* (chore): add Cargo.toml to uv dependency glob
* (fix): unquote uv dependency glob
* (chore): bump `zarrs` to 0.18.0-beta.0

LDeakin and others added 27 commits October 22, 2024 10:37

Add a CodecPipeline stub

b4c23a2

Pass chunk spec

363e565

(fix): return bytes as numpy array (#18)

8fc8fb8

* (fix): return bytes as numpy array
* (fix): reshape after making view

(fix): remove unused code

a632103

(fix): error handling in get_store_and_path

c325c45

(fix): change example/simple.py to a 2D array

ffeb7cc

(fix): pass value to store_chunk_subset

27d920c

(fix): clippy warnings

7833226

(fix): handle missing chunks in retrieve_chunk_subset and cleanup

429cb10

(fix): panics to errors in get_store_and_path

86200ed

(fix): add partial reads/writes to simply.py

3296ed3

(fix): minimum working codec pipeline

b05c73c

- Add internal `get_chunk_representation`, `retrieve_chunk_bytes`, and `store_chunk_bytes`
- Separate `retrieve_chunk` and `retrieve_chunk_subset`
- Separate `store_chunk` and `store_chunk_subset`
- Add assertions to simple.py

(fix): handle relative filesystem paths

8e1f7f3

(fix): minimal error handling and fix clippy warnings

3000d49

(fix): handle relative filesystem paths take 2

f5fd7a2

(fix): constant handling

0e8ea6e

(fix): add retrieve_chunks for parallel read

d8d94a3

- Add config options
- CodecPipelineImpl interior mutability

(fix): add store_chunks for parallel write

8b3c244

(fix) convert write value to contiguous array if needed

a0f7bd4

(fix): bump zarrs to 21e86cb9

fa46adf

(fix): support writing arrays with non-native endianness

f937c77

(fix): disable bad/unsupported invalid metadata tests

2f78a13

(fix): do not store empty chunks

fe265cc

(fix) remove dead code in codec pipeline

9419c3b

(fix): move some selection logic from Rust to Python

bd71529

LDeakin requested review from ilan-gold and flying-sheep November 5, 2024 20:32

flying-sheep and others added 9 commits November 7, 2024 15:37

Improve error mapping code (#28)

6d54238

Co-authored-by: ilan-gold <[email protected]>

(chore): add pyarray_itemsize()

8d552c5

(chore): add safety comments to ndarray_to_* methods

87199d7

Also add `py_untyped_array_to_array_object`

(chore): label fields in ChunksItemRaw

f3854d3

(fix): fix store/retrieve with scalars + refactor

2c77c3b

- Fixed fill value bytes being larger than needed when storing
- Handle reduced dimensionality inputs/outputs
- Uses LDeakin/zarrs@8c1391f
- This inconsistency was picked up by `zarrs` in debug build

(fix): address several unwraps

651cbf2

(chore): cleanup slice_to_range

9c7ea55

(chore): change retrieve_chunk_bytes to return ArrayBytes

1b4be36

(chore): add SAFETY docs and store_chunk_subset_bytes input validation

2b1b1f2

LDeakin force-pushed the ld/codec_pipeline branch from ce84653 to 2b1b1f2 Compare November 8, 2024 02:47

ilan-gold and others added 4 commits November 8, 2024 13:56

(chore): file structure (#27)

276b24b

Co-authored-by: Lachlan Deakin <[email protected]>

(feat): refactor CodecPipelineStore to a trait

4c1ce39

- Move filesystem store implementation to a submodule

(fix): bring back filesystem relative path support

cb94215

Broken with 4c1ce39

(fix): bump requires-python to 3.11

0067bce

This is the minimum set by `zarr` 3.0.0b1

flying-sheep added 2 commits November 8, 2024 21:19

Rust tests (#33)

2d33338

Simplify shape calculation

a859bd8

ilan-gold reviewed Nov 11, 2024

src/lib.rs (outdated, resolved)

LDeakin and others added 2 commits November 15, 2024 06:49

(chore): add rust-cache to CI and use the standard uv action (#32)

e4cdc9c

* (fix): use the standard uv GH action
* cache with pyproject.toml
* ci: remove install rust
* (chore): add rust-cache to CI
* (fix): add note in CI about rust-toolchain action

ilan-gold approved these changes Nov 15, 2024


LDeakin added 2 commits November 15, 2024 21:29

(fix): remove unused rust features and paste

ad06f44

LDeakin enabled auto-merge (rebase) November 15, 2024 11:45

LDeakin disabled auto-merge November 15, 2024 11:45

flying-sheep approved these changes Nov 15, 2024


LDeakin merged commit c315d77 into main Nov 15, 2024
1 check passed

LDeakin deleted the ld/codec_pipeline branch November 15, 2024 18:45

LDeakin mentioned this pull request Dec 29, 2024

Array Ecosystem Interoperability LDeakin/zarrs#113

Open
