
Narrow JSON type, ensure that to_dict always returns a dict, and v2 filter / compressor parsing #2179

Merged · 6 commits into zarr-developers:v3 · Sep 17, 2024

Conversation

@d-v-b (Contributor) commented Sep 12, 2024

A typing / metadata cleanup now that #2163 is in.

narrowing the JSON type

In v3, the JSON type looks like this:

JSON = None | str | int | float | Enum | dict[str, "JSON"] | list["JSON"] | tuple["JSON", ...]

I narrowed this type to

JSON = None | str | int | float | Mapping[str, "JSON"] | tuple["JSON", ...]

  • I don't think we need to support list AND tuple; as tuple is immutable, I got rid of the list variant.
  • I switched dict[str, JSON] to Mapping[str, JSON] out of the same preference for immutability.
  • Allowing Enum is a back door around type safety, so I removed it; instead of serializing enums, the relevant classes serialize the enum values. Maybe we could put Enum back in JSON if there's a way to tell the type checker that our enums only wrap JSON values.

to_dict methods

Along with narrowing the JSON type, I ensured that all instances of Metadata return dict[str, JSON] from their to_dict methods.
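A minimal sketch of the pattern, with hypothetical class names (the real `Metadata` base in zarr-python is more involved):

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Metadata:
    """Simplified sketch: to_dict always returns a dict[str, JSON]-shaped value."""

    def to_dict(self) -> dict[str, object]:
        out: dict[str, object] = {}
        for f in fields(self):
            value = getattr(self, f.name)
            # nested Metadata instances serialize recursively
            out[f.name] = value.to_dict() if isinstance(value, Metadata) else value
        return out


@dataclass(frozen=True)
class ArrayMeta(Metadata):
    zarr_format: int
    shape: tuple[int, ...]
```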

type hints

There were a lot of functions in metadata that accepted Any, when object is a better type hint. I adjusted these accordingly.
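The difference matters for type checkers: with `Any`, any operation on the argument silently type-checks, while with `object` the function body must narrow the value (e.g. via `isinstance`) before using it. A hypothetical example of the pattern:

```python
def parse_shape(data: object) -> tuple[int, ...]:
    """Accept any object, but only return once the input is proven to be a shape."""
    if isinstance(data, (list, tuple)) and all(isinstance(x, int) for x in data):
        return tuple(data)
    raise TypeError(f"Expected an iterable of ints, got {data!r}")
```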

v2 filters and codecs

We weren't doing any validation of filters and compressors for v2 metadata. This PR adds that missing validation, and I found a buggy test as a result of adding these checks. I also changed ArrayV2Metadata.filters and ArrayV2Metadata.compressor to be instances of numcodecs.abc.Codec instead of dicts, bringing ArrayV2Metadata closer to ArrayV3Metadata and simplifying some codec parsing code.

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/tutorial.rst
  • Changes documented in docs/release.rst
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

@d-v-b d-v-b requested review from normanrz and jhamman September 12, 2024 20:50
@dstansby (Contributor) commented Sep 13, 2024

> I don't think we need to support list AND tuple; as tuple is immutable, I got rid of the list variant.

Given dictionaries are mutable and supported, is there another reason not to support lists? If you want to keep the typing simple, perhaps typing.Sequence[JSON] could be used?

@d-v-b (Contributor, Author) commented Sep 13, 2024

> > I don't think we need to support list AND tuple; as tuple is immutable, I got rid of the list variant.
>
> Given dictionaries are mutable and supported, is there another reason not to support lists? If you want to keep the typing simple, perhaps typing.Sequence[JSON] could be used?

I would love an immutable dict, but that's not practical in Python, so we have no choice but to include dicts in this type. And the goal here is not actually keeping the typing simple, but keeping the runtime simple. Picking just one concrete type (tuple) for modelling JSON arrays removes complexity, and it's free because we control all the code that returns instances of the JSON type.
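Since json.loads produces lists, committing to one concrete array type implies a normalization pass somewhere. A possible sketch (the tuplify name is hypothetical):

```python
import json


def tuplify(value: object) -> object:
    """Recursively replace lists with tuples so JSON arrays use a single
    concrete, immutable type."""
    if isinstance(value, list):
        return tuple(tuplify(v) for v in value)
    if isinstance(value, dict):
        return {k: tuplify(v) for k, v in value.items()}
    return value


doc = tuplify(json.loads('{"chunks": [10, 10], "fill_value": null}'))
```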

@dstansby (Contributor) left a comment

Looks good overall to me - I'm not sure about widening the input types on the parsing functions though, as then the benefit of typing is lost. What was the motivation for setting these all to object instead of the narrower types that the functions are designed to parse?

Review threads (resolved): src/zarr/codecs/pipeline.py, src/zarr/core/common.py, src/zarr/core/metadata/v2.py (two threads)
@d-v-b (Contributor, Author) commented Sep 13, 2024

> Looks good overall to me - I'm not sure about widening the input types on the parsing functions though, as then the benefit of typing is lost. What was the motivation for setting these all to object instead of the narrower types that the functions are designed to parse?

The current model is that the parsing functions are designed to handle entirely unknown objects and parse them into something with the correct value and type. So these functions should accept any Python object, but only return a value when parsing is successful, and otherwise raise an exception. From that point of view, placing any constraints on the input type is stealing work from the body of the parsing function :)

@dstansby (Contributor) commented
I'm not sure I understand what the downside is of "stealing work from the function"? The advantage of including the narrower types is finding errors before running code. That's kind of the point of typing - you can pass whatever you want in Python, but by adding type hints it's easier to see where you're passing the wrong objects.

@dstansby (Contributor) commented
And where these parsing functions are used, they are called with objects that already have a narrow type, e.g., here cname already has a narrow type:

def __init__(
    self,
    *,
    typesize: int | None = None,
    cname: BloscCname | str = BloscCname.zstd,
    clevel: int = 5,
    shuffle: BloscShuffle | str | None = None,
    blocksize: int = 0,
) -> None:
    typesize_parsed = parse_typesize(typesize) if typesize is not None else None
    cname_parsed = parse_enum(cname, BloscCname)

So given these aren't being used with objects that have a general type, I think it is better to give the parsing functions narrower types as input, to make sure we're defining the narrow types consistently throughout the codebase?

@d-v-b (Contributor, Author) commented Sep 13, 2024

> I'm not sure I understand what the downside is of "stealing work from the function"? The advantage of including the narrower types is finding errors before running code. That's kind of the point of typing - you can pass whatever you want in Python, but by adding type hints it's easier to see where you're passing the wrong objects.

For context, these parsing functions are designed for handling the contents of JSON documents that we read from external sources. We have no idea what is inside those documents. We can be pretty confident that Python's JSON deserialization will give us dicts, lists, strings, numbers, None, and so on, but the particular structure is unknown.

The job of the parse_x functions is to take such data, about which we know basically nothing, and either parse it into something correct (i.e., an instance of the return type, with possible value checks) or raise an exception. If parse_x is opinionated about the type of its input, then we need to check the input before feeding it to parse_x. But checking the input is exactly the job of parse_x! That's what I mean by "stealing work".

Suppose we do parse_zarr_format(data: Literal[2]) -> Literal[2]. This is the signature of a meaningless function, because if we know that the data is 2 then we don't need any parsing. But if we tried to use this function to deal with unknown data, where would our assurance that data == 2 come from? We would need a function to take that unknown data and check that it's 2, which is what parse_zarr_format is for in the first place.
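The design being argued for can be sketched in a few lines (mirroring the parse_zarr_format example above; the exact signature in zarr-python may differ):

```python
from typing import Literal


def parse_zarr_format(data: object) -> Literal[2]:
    """Accept completely unknown input; either prove it is 2 or raise.

    Typing the input as object (not Literal[2]) keeps the runtime check
    inside the parser, where it belongs.
    """
    if data == 2:
        return 2
    raise ValueError(f"Invalid zarr_format: expected 2, got {data!r}")
```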

@dstansby (Contributor) commented
Ah that makes sense, thanks for the patient explanation 👍 I didn't realise these were used to parse arbitrary JSON.

@d-v-b (Contributor, Author) commented Sep 13, 2024

> Ah that makes sense, thanks for the patient explanation 👍 I didn't realise these were used to parse arbitrary JSON.

No worries! Thanks for looking things over carefully.

In terms of the design of the parse_x functions, I would love an alternative to exceptions for handling data that can't be coerced into the output type, e.g. returning a sum type that wraps either the parsed value or a failure, but that's not idiomatic Python right now.

@d-v-b (Contributor, Author) commented Sep 13, 2024

Continuing this conversation in the main thread:

I removed the inheritance from Metadata, as well as the to_dict and from_dict methods, for the codec pipeline classes, and everything worked 🤷 so those methods were not load-bearing. I also renamed the CodecPipeline.from_list method to from_codecs, because (a) it doesn't just take a list (it takes any iterable), and (b) it's more informative to name the method after the items in the collection rather than the collection itself.
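The renaming rationale can be sketched as follows (a simplified stand-in for the real CodecPipeline, whose constructor and codec types differ):

```python
from collections.abc import Iterable


class CodecPipeline:
    """Toy sketch of the rename: from_codecs instead of from_list."""

    def __init__(self, codecs: tuple[object, ...]) -> None:
        self.codecs = codecs

    @classmethod
    def from_codecs(cls, codecs: Iterable[object]) -> "CodecPipeline":
        # Named after the items, not the container; accepts any iterable.
        return cls(tuple(codecs))
```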

@normanrz your eyes would be appreciated here

@jhamman (Member) left a comment
👏 @d-v-b - fantastic set of simplifications here!

@d-v-b d-v-b merged commit 8c5038a into zarr-developers:v3 Sep 17, 2024
4 of 10 checks passed
@d-v-b d-v-b deleted the refactor/to_dict-returns-dicts branch September 17, 2024 16:55
@TomAugspurger (Contributor) commented
This seems to be causing tests/v3/test_array.py::test_serializable_sync_array to fail. I'm taking a look.

@@ -315,15 +313,9 @@ async def _create_v2(
chunks=chunks,
order=order,
dimension_separator=dimension_separator,
fill_value=0 if fill_value is None else fill_value,
Restoring the 0 if fill_value is None else fill_value does fix the failure in tests/v3/test_array.py::test_serializable_sync_array.

But that said, I like the change here... Setting a default fill value here seems wrong.

Oh, but this is just for v2 metadata, which is allowed to be None. Maybe this isn't the wrong spot to do that.
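The subtlety under discussion is that v2 metadata permits fill_value to be null, so coalescing at array-creation time turns "unspecified" into a concrete default. The line in question is essentially (helper name hypothetical):

```python
def normalize_fill_value(fill_value: object) -> object:
    """v2 metadata allows fill_value to be null (None); coalescing to 0 here
    replaces "unspecified" with a concrete default before building metadata."""
    return 0 if fill_value is None else fill_value
```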
