-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor to a CodecPipeline
#22
Conversation
* (fix): return bytes as numpy array * (fix): reshape after making view
- Add internal `get_chunk_representation`, `retrieve_chunk_bytes`, and `store_chunk_bytes` - Separate `retrieve_chunk` and `retrieve_chunk_subset` - Separate `store_chunk` and `store_chunk_subset` - Add assertions to simple.py
- Add config options - CodecPipelineImpl interior mutability
* (fix): ci for running tests * (fix): no need to extract tests * (Fix): remove duplicate name * (fix): use a submodule... * (chore): remove memory store + port zarr codec tests * (chore): remove `dlpark` * (fix): getattr for itemsize * (chore): remove runtime F ordering * (chore): skip vlen * (feat): parse int --------- Co-authored-by: Lachlan Deakin <[email protected]>
* (fix): minimum working codec pipeline - Add internal `get_chunk_representation`, `retrieve_chunk_bytes`, and `store_chunk_bytes` - Separate `retrieve_chunk` and `retrieve_chunk_subset` - Separate `store_chunk` and `store_chunk_subset` - Add assertions to simple.py * (fix): handle relative filesystem paths * (fix): minimal error handling and fix clippy warnings * (fix): handle relative filesystem paths take 2 * (fix): constant handling * (fix): add `retrieve_chunks` for parallel read - Add config options - CodecPipelineImpl interior mutability * (fix): add `store_chunks` for parallel write * (fix) convert write value to contiguous array if needed * (fix): bump zarrs to 21e86cb9 * CI for codec pipeline (#20) * (fix): ci for running tests * (fix): no need to extract tests * (Fix): remove duplicate name * (fix): use a submodule... * (chore): remove memory store + port zarr codec tests * (chore): remove `dlpark` * (fix): getattr for itemsize * (chore): remove runtime F ordering * (chore): skip vlen * (feat): parse int --------- Co-authored-by: Lachlan Deakin <[email protected]> * (fix): support writing arrays with non-native endianness * (fix): disable bad/unsupported invalid metadata tests * (fix): do not store empty chunks * (fix) remove dead code in codec pipeline * (fix): move some selection logic from Rust to Python * (chore): `chunks_desc` cleanup * (feat): adding concurrency via zarr config * (chore): remove extra comment * (fix): refactor chunk info creation + threads->threading + ruff * (fix): use `or` for `threading.max_workers` getting --------- Co-authored-by: Lachlan Deakin <[email protected]>
Co-authored-by: ilan-gold <[email protected]>
Also add `py_untyped_array_to_array_object`
- Fixed fill value bytes being larger than needed when storing - Handle reduced dimensionality inputs/outputs - Uses LDeakin/zarrs@8c1391f - This inconsistency was picked up by `zarrs` in debug build
ce84653
to
2b1b1f2
Compare
Co-authored-by: Lachlan Deakin <[email protected]>
- Move filesystem store implementation to a submodule
This is the minimum set by `zarr` 3.0.0b1
I think pretty much everything is addressed on the Rust side, apart from the CI failure: #32 |
The CodecPipelineStore trait looks wonderful, exactly what I had in mind, thanks! |
* (fix): use the standard uv GH action * cache with pyproject.toml * ci: remove install rust * (chore): add rust-cache to CI * (fix): add note in CI about rust-toolchain action
…ndexing for read (#30) * (chore): file structure * (chore): parametrize tests to get full scope of possibilities * (chore): xfail tests that fail on zarr-python default pipeline * (fix) singular * (fix): check for contiguous index arrays * (fix): contiguous numpy arrays converted to slices * (feat): add reading for non-contiguous buffers * (chore): remove unused imports * (fix): cleanup unwraps in `retrieve_chunks` * Refactor full indexing (#34) * (chore): `make_chunk_info_for_rust` cleanup * (fix): all tests working except "tests/test_pipeline.py::test_roundtrip[vindex-contiguous_in_chunk_array-contiguous_in_chunk_array]" * (fix): skip read in `store_chunk_subset_bytes` for full chunks * (fix): improve dropped index detection + disallow integer write case * (chore): message more specific * (fix): use `Exception` * (chore): erroneous comment * (chore): `drop_axes` default * (chore): `drop_axes` param * (chore): apply review * (chore): `else` branch * (chore): add basic nd tests (#35) * (chore): add basic 3d tests * (refactor): use `pytest_generate_tests` * (fix): clarify collapsed dimension behavior * (chore): clean ups --------- Co-authored-by: Lachlan Deakin <[email protected]> Co-authored-by: Philipp A. <[email protected]>
|
||
|
||
## `ld/codec_pipeline` branch | ||
``` | ||
maturin develop -r | ||
./examples/simple.py | ||
``` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will make a separate PR documentation - this isn't really valid anymore
codec_metadata_json, | ||
config.get("codec_pipeline.validate_checksums", None), | ||
config.get("codec_pipeline.store_empty_chunks", None), | ||
config.get("codec_pipeline.concurrent_target", None), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just so I understand (and perhaps to prompt a change of name):
This concurrency manages the pipeline-level concurrency i.e., eaech request made to the pipeline may be for multiple chunks, and on this we parallelize.
Then there is the "outer" concurrency which says how many requests to the pipeline are made at once (from rayon
)?
So should these both be set in CodecPipelineImpl
at once? Should they be coordinated?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the rust side there is:
- codec concurrency (this is what
concurrent_target
currently sets)- via
rayon
(e.g. sharding codec) or specialised methods in codecs
- via
- chunk concurrency (set by
threading.max_workers
)- via
rayon
- via
But I think there is more going on in zarr-python
/ dask
that is possibly adding another layer of chunk concurrency.
So yes, it is worth renaming concurrent_target
; its usage has diverged from zarrs
. Maybe chunk_thread_limit
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we don't change this just yet; I'll put some time in this weekend to get some zarrs
-like auto codec/chunk concurrency.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, we'll see what's up next week then.
impl CodecPipelineStoreFilesystem { | ||
pub fn new() -> PyResult<Self> { | ||
let store = Arc::new(FilesystemStore::new("/").map_py_err::<PyRuntimeError>()?); | ||
let cwd = std::env::current_dir()? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We'll get this windows check once we move to a full build system after we merge this PR and go public
* (fix): support zarr 3.0.0b2 * (fix): open store read_only in test_roundtrip_read_only_zarrs * (fix): pin zarrs revision I think an old version is cached? * (chore): add Cargo.toml to uv dependency glob * (fix): unquote uv dependency glob * (chore): bump `zarrs` to 0.18.0-beta.0
Resolves #2
Resolves #6
Resolves #7
Resolves #8
Resolves #13