Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor to a CodecPipeline #22

Merged
merged 53 commits into from
Nov 15, 2024
Merged

Refactor to a CodecPipeline #22

merged 53 commits into from
Nov 15, 2024

Conversation

LDeakin
Copy link
Collaborator

@LDeakin LDeakin commented Nov 5, 2024

Resolves #2
Resolves #6
Resolves #7
Resolves #8
Resolves #13

LDeakin and others added 27 commits October 22, 2024 10:37
* (fix): return bytes as numpy array

* (fix): reshape after making view
- Add internal `get_chunk_representation`, `retrieve_chunk_bytes`, and `store_chunk_bytes`
- Separate `retrieve_chunk` and `retrieve_chunk_subset`
- Separate `store_chunk` and `store_chunk_subset`
- Add assertions to simple.py
- Add config options
- CodecPipelineImpl interior mutability
* (fix): ci for running tests

* (fix): no need to extract tests

* (Fix): remove duplicate name

* (fix): use a submodule...

* (chore): remove memory store + port zarr codec tests

* (chore): remove `dlpark`

* (fix): getattr for itemsize

* (chore): remove runtime F ordering

* (chore): skip vlen

* (feat): parse int

---------

Co-authored-by: Lachlan Deakin <[email protected]>
* (fix): minimum working codec pipeline

- Add internal `get_chunk_representation`, `retrieve_chunk_bytes`, and `store_chunk_bytes`
- Separate `retrieve_chunk` and `retrieve_chunk_subset`
- Separate `store_chunk` and `store_chunk_subset`
- Add assertions to simple.py

* (fix): handle relative filesystem paths

* (fix): minimal error handling and fix clippy warnings

* (fix): handle relative filesystem paths take 2

* (fix): constant handling

* (fix): add `retrieve_chunks` for parallel read

- Add config options
- CodecPipelineImpl interior mutability

* (fix): add `store_chunks` for parallel write

* (fix) convert write value to contiguous array if needed

* (fix): bump zarrs to 21e86cb9

* CI for codec pipeline (#20)

* (fix): ci for running tests

* (fix): no need to extract tests

* (Fix): remove duplicate name

* (fix): use a submodule...

* (chore): remove memory store + port zarr codec tests

* (chore): remove `dlpark`

* (fix): getattr for itemsize

* (chore): remove runtime F ordering

* (chore): skip vlen

* (feat): parse int

---------

Co-authored-by: Lachlan Deakin <[email protected]>

* (fix): support writing arrays with non-native endianness

* (fix): disable bad/unsupported invalid metadata tests

* (fix): do not store empty chunks

* (fix) remove dead code in codec pipeline

* (fix): move some selection logic from Rust to Python

* (chore): `chunks_desc` cleanup

* (feat): adding concurrency via zarr config

* (chore): remove extra comment

* (fix): refactor chunk info creation + threads->threading + ruff

* (fix): use `or` for `threading.max_workers` getting

---------

Co-authored-by: Lachlan Deakin <[email protected]>
@ilan-gold
Copy link
Owner

@LDeakin see: #23

Going to log off for the night but will check in the morning of course :)

ilan-gold and others added 4 commits November 8, 2024 13:56
Co-authored-by: Lachlan Deakin <[email protected]>
- Move filesystem store implementation to a submodule
This is the minimum set by `zarr` 3.0.0b1
@LDeakin
Copy link
Collaborator Author

LDeakin commented Nov 8, 2024

I think pretty much everything is addressed on the Rust side, apart from the .itemsize() stuff

CI failure: #32

@flying-sheep
Copy link
Collaborator

The CodecPipelineStore trait looks wonderful, exactly what I had in mind, thanks!

src/lib.rs Outdated Show resolved Hide resolved
LDeakin and others added 2 commits November 15, 2024 06:49
* (fix): use the standard uv GH action

* cache with pyproject.toml

* ci: remove install rust

* (chore): add rust-cache to CI

* (fix): add note in CI about rust-toolchain action
…ndexing for read (#30)

* (chore): file structure

* (chore): parametrize tests to get full scope of possibilities

* (chore): xfail tests that fail on zarr-python default pipeline

* (fix) singular

* (fix): check for contiguous index arrays

* (fix): contiguous numpy arrays converted to slices

* (feat): add reading for non-contiguous buffers

* (chore): remove unused imports

* (fix): cleanup unwraps in `retrieve_chunks`

* Refactor full indexing (#34)

* (chore): `make_chunk_info_for_rust` cleanup

* (fix): all tests working except "tests/test_pipeline.py::test_roundtrip[vindex-contiguous_in_chunk_array-contiguous_in_chunk_array]"

* (fix): skip read in `store_chunk_subset_bytes` for full chunks

* (fix): improve dropped index detection + disallow integer write case

* (chore): message more specific

* (fix): use `Exception`

* (chore): erroneous comment

* (chore): `drop_axes` default

* (chore): `drop_axes` param

* (chore): apply review

* (chore): `else` branch

* (chore): add basic nd tests (#35)

* (chore): add basic 3d tests

* (refactor): use `pytest_generate_tests`

* (fix): clarify collapsed dimension behavior

* (chore): clean ups

---------

Co-authored-by: Lachlan Deakin <[email protected]>
Co-authored-by: Philipp A. <[email protected]>
Cargo.toml Outdated Show resolved Hide resolved
Cargo.toml Outdated Show resolved Hide resolved
Comment on lines +23 to +29


## `ld/codec_pipeline` branch
```
maturin develop -r
./examples/simple.py
```
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will make a separate PR documentation - this isn't really valid anymore

codec_metadata_json,
config.get("codec_pipeline.validate_checksums", None),
config.get("codec_pipeline.store_empty_chunks", None),
config.get("codec_pipeline.concurrent_target", None),
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just so I understand (and perhaps to prompt a change of name):

This concurrency manages the pipeline-level concurrency i.e., eaech request made to the pipeline may be for multiple chunks, and on this we parallelize.

Then there is the "outer" concurrency which says how many requests to the pipeline are made at once (from rayon)?

So should these both be set in CodecPipelineImpl at once? Should they be coordinated?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the rust side there is:

  • codec concurrency (this is what concurrent_target currently sets)
    • via rayon (e.g. sharding codec) or specialised methods in codecs
  • chunk concurrency (set by threading.max_workers)
    • via rayon

But I think there is more going on in zarr-python / dask that is possibly adding another layer of chunk concurrency.
So yes, it is worth renaming concurrent_target; its usage has diverged from zarrs. Maybe chunk_thread_limit?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we don't change this just yet; I'll put some time in this weekend to get some zarrs-like auto codec/chunk concurrency.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, we'll see what's up next week then.

impl CodecPipelineStoreFilesystem {
pub fn new() -> PyResult<Self> {
let store = Arc::new(FilesystemStore::new("/").map_py_err::<PyRuntimeError>()?);
let cwd = std::env::current_dir()?
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll get this windows check once we move to a full build system after we merge this PR and go public

* (fix): support zarr 3.0.0b2

* (fix): open store read_only in test_roundtrip_read_only_zarrs

* (fix): pin zarrs revision

I think an old version is cached?

* (chore): add Cargo.toml to uv dependency glob

* (fix): unquote uv dependency glob

* (chore): bump `zarrs` to 0.18.0-beta.0
@LDeakin LDeakin enabled auto-merge (rebase) November 15, 2024 11:45
@LDeakin LDeakin disabled auto-merge November 15, 2024 11:45
@LDeakin LDeakin merged commit c315d77 into main Nov 15, 2024
1 check passed
@LDeakin LDeakin deleted the ld/codec_pipeline branch November 15, 2024 18:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants