
Narrow JSON type, ensure that to_dict always returns a dict, and v2 filter / compressor parsing #2179

Merged · 6 commits into zarr-developers:v3 · Sep 17, 2024

Conversation

@d-v-b (Contributor) commented Sep 12, 2024

A typing / metadata cleanup now that #2163 is in.

narrowing the JSON type

In v3, the JSON type looks like this:

JSON = None | str | int | float | Enum | dict[str, "JSON"] | list["JSON"] | tuple["JSON", ...]

I narrowed this type to

JSON = None | str | int | float | Mapping[str, "JSON"] | tuple["JSON", ...]

  • I don't think we need to support list AND tuple; as tuple is immutable, I got rid of the list variant.
  • I switched dict[str, JSON] to Mapping[str, JSON] out of the same preference for immutability.
  • Allowing Enum is a back door around type safety, so I removed it; instead of serializing enums, the relevant classes serialize the enum values. Maybe we could put Enum back in JSON if there's a way to tell the type checker that our enums only wrap JSON values.

to_dict methods

Along with narrowing the JSON type, I ensured that all instances of Metadata return dict[str, JSON] from their to_dict methods.
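A minimal sketch of the pattern, with hypothetical class names (the real `Metadata` base in zarr-python is more involved):

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Metadata:
    """Simplified sketch: to_dict always returns a dict[str, JSON]-shaped value."""

    def to_dict(self) -> dict[str, object]:
        out: dict[str, object] = {}
        for f in fields(self):
            value = getattr(self, f.name)
            # nested Metadata instances serialize recursively
            out[f.name] = value.to_dict() if isinstance(value, Metadata) else value
        return out


@dataclass(frozen=True)
class ArrayMeta(Metadata):
    zarr_format: int
    shape: tuple[int, ...]
```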

type hints

There were a lot of functions in metadata that accepted Any, when object is a better type hint. I adjusted these accordingly.
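The difference matters for type checkers: with `Any`, any operation on the argument silently type-checks, while with `object` the function body must narrow the value (e.g. via `isinstance`) before using it. A hypothetical example of the pattern:

```python
def parse_shape(data: object) -> tuple[int, ...]:
    """Accept any object, but only return once the input is proven to be a shape."""
    if isinstance(data, (list, tuple)) and all(isinstance(x, int) for x in data):
        return tuple(data)
    raise TypeError(f"Expected an iterable of ints, got {data!r}")
```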

v2 filters and codecs

We weren't doing any validation of filters and compressors for v2 metadata. This PR adds that missing validation, and I found a buggy test as a result of adding these checks. I also changed ArrayV2Metadata.filters and ArrayV2Metadata.compressor to be instances of numcodecs.abc.Codec instead of dicts, bringing ArrayV2Metadata closer to ArrayV3Metadata and simplifying some codec parsing code.

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/tutorial.rst
  • Changes documented in docs/release.rst
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

@d-v-b d-v-b requested review from normanrz and jhamman September 12, 2024 20:50
@dstansby (Contributor) commented Sep 13, 2024

> I don't think we need to support list AND tuple; as tuple is immutable, I got rid of the list variant.

Given dictionaries are mutable and supported, is there another reason not to support lists? If you want to keep the typing simple, perhaps typing.Sequence[JSON] could be used?

@d-v-b (Contributor, Author) commented Sep 13, 2024

> > I don't think we need to support list AND tuple; as tuple is immutable, I got rid of the list variant.
>
> Given dictionaries are mutable and supported, is there another reason not to support lists? If you want to keep the typing simple, perhaps typing.Sequence[JSON] could be used?

I would love an immutable dict, but that's not practical in Python, so we have no choice but to include dicts in this type. And the goal here is not actually keeping the typing simple, but keeping the runtime simple. Picking just one concrete type (tuple) for modelling JSON arrays removes complexity, and it's free because we control all the code that returns instances of the JSON type.
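Since json.loads produces lists, committing to one concrete array type implies a normalization pass somewhere. A possible sketch (the tuplify name is hypothetical):

```python
import json


def tuplify(value: object) -> object:
    """Recursively replace lists with tuples so JSON arrays use a single
    concrete, immutable type."""
    if isinstance(value, list):
        return tuple(tuplify(v) for v in value)
    if isinstance(value, dict):
        return {k: tuplify(v) for k, v in value.items()}
    return value


doc = tuplify(json.loads('{"chunks": [10, 10], "fill_value": null}'))
```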

@dstansby (Contributor) left a comment

Looks good overall to me - I'm not sure about widening the input types on the parsing functions though, as then the benefit of typing is lost. What was the motivation for setting these all to object instead of the narrower types that the functions are designed to parse?

Review threads (resolved): src/zarr/codecs/pipeline.py, src/zarr/core/common.py, src/zarr/core/metadata/v2.py (two threads)
@d-v-b (Contributor, Author) commented Sep 13, 2024

> Looks good overall to me - I'm not sure about widening the input types on the parsing functions though, as then the benefit of typing is lost. What was the motivation for setting these all to object instead of the narrower types that the functions are designed to parse?

The current model is that the parsing functions are designed to handle entirely unknown objects and parse them into something with the correct value and type. So these functions should accept any Python object, but only return a value when parsing is successful, and otherwise raise an exception. From that point of view, placing any constraints on the input type is stealing work from the body of the parsing function :)

@dstansby (Contributor) commented
I'm not sure I understand what the downside is of "stealing work from the function"? The advantage of including the narrower types is finding errors before running code. That's kind of the point of typing - you can pass whatever you want in Python, but by adding type hints it's easier to see where you're passing the wrong objects.

@dstansby (Contributor) commented
And where these parsing functions are used, they are called with objects that already have a narrow type, e.g., here cname already has a narrow type:

def __init__(
    self,
    *,
    typesize: int | None = None,
    cname: BloscCname | str = BloscCname.zstd,
    clevel: int = 5,
    shuffle: BloscShuffle | str | None = None,
    blocksize: int = 0,
) -> None:
    typesize_parsed = parse_typesize(typesize) if typesize is not None else None
    cname_parsed = parse_enum(cname, BloscCname)

So given these aren't being used with objects that have a general type, I think it is better to give the parsing functions narrower types as input, to make sure we're defining the narrow types consistently throughout the codebase?

@d-v-b (Contributor, Author) commented Sep 13, 2024

> I'm not sure I understand what the downside is of "stealing work from the function"? The advantage of including the narrower types is finding errors before running code. That's kind of the point of typing - you can pass whatever you want in Python, but by adding type hints it's easier to see where you're passing the wrong objects.

For context, these parsing functions are designed for handling the contents of JSON documents that we read from external sources. We have no idea what is inside those documents. We can be pretty confident that Python's JSON deserialization will give us dicts, lists, strings, numbers, None, and so on, but the particular structure is unknown.

The job of the parse_x functions is to take such data, about which we know basically nothing, and either parse it into something correct (i.e., an instance of the return type, with possible value checks) or raise an exception. If parse_x is opinionated about the type of its input, then we need to check the input before feeding it to parse_x. But checking the input is exactly the job of parse_x! That's what I mean by "stealing work".

Suppose we do parse_zarr_format(data: Literal[2]) -> Literal[2]. This is the signature of a meaningless function, because if we know that the data is 2 then we don't need any parsing. But if we tried to use this function to deal with unknown data, where would our assurance that data == 2 come from? We would need a function to take that unknown data and check that it's 2, which is what parse_zarr_format is for in the first place.
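The design being argued for can be sketched in a few lines (mirroring the parse_zarr_format example above; the exact signature in zarr-python may differ):

```python
from typing import Literal


def parse_zarr_format(data: object) -> Literal[2]:
    """Accept completely unknown input; either prove it is 2 or raise.

    Typing the input as object (not Literal[2]) keeps the runtime check
    inside the parser, where it belongs.
    """
    if data == 2:
        return 2
    raise ValueError(f"Invalid zarr_format: expected 2, got {data!r}")
```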

@dstansby (Contributor) commented
Ah that makes sense, thanks for the patient explanation 👍 I didn't realise these were used to parse arbitrary JSON.

@d-v-b (Contributor, Author) commented Sep 13, 2024

> Ah that makes sense, thanks for the patient explanation 👍 I didn't realise these were used to parse arbitrary JSON.

No worries! Thanks for looking things over carefully.

In terms of the design of the parse_x functions, I would love an alternative to exceptions for handling data that can't be coerced into the output type, e.g. returning a sum type that wraps either the parsed value or a failure, but that's not idiomatic Python right now.

@d-v-b (Contributor, Author) commented Sep 13, 2024

Continuing this conversation in the main thread:

I removed the inheritance from Metadata, as well as the to_dict and from_dict methods, for the codec pipeline classes, and everything worked 🤷 so those methods were not load-bearing. I also renamed the CodecPipeline.from_list method to from_codecs, because (a) it doesn't just take a list (it takes any iterable), and (b) it's more informative to name the method after the items in the collection rather than the collection itself.
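The renaming rationale can be sketched as follows (a simplified stand-in for the real CodecPipeline, whose constructor and codec types differ):

```python
from collections.abc import Iterable


class CodecPipeline:
    """Toy sketch of the rename: from_codecs instead of from_list."""

    def __init__(self, codecs: tuple[object, ...]) -> None:
        self.codecs = codecs

    @classmethod
    def from_codecs(cls, codecs: Iterable[object]) -> "CodecPipeline":
        # Named after the items, not the container; accepts any iterable.
        return cls(tuple(codecs))
```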

@normanrz your eyes would be appreciated here

@jhamman (Member) left a comment
👏 @d-v-b - fantastic set of simplifications here!

@d-v-b d-v-b merged commit 8c5038a into zarr-developers:v3 Sep 17, 2024
4 of 10 checks passed
@d-v-b d-v-b deleted the refactor/to_dict-returns-dicts branch September 17, 2024 16:55
@TomAugspurger (Contributor) commented
This seems to be causing tests/v3/test_array.py::test_serializable_sync_array to fail. I'm taking a look.

@@ -315,15 +313,9 @@ async def _create_v2(
chunks=chunks,
order=order,
dimension_separator=dimension_separator,
fill_value=0 if fill_value is None else fill_value,
Restoring the 0 if fill_value is None else fill_value does fix the failure in tests/v3/test_array.py::test_serializable_sync_array.

But that said, I like the change here... Setting a default fill value here seems wrong.

Oh, but this is just for v2 metadata, which is allowed to be None. Maybe this isn't the wrong spot to do that.
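The subtlety under discussion is that v2 metadata permits fill_value to be null, so coalescing at array-creation time turns "unspecified" into a concrete default. The line in question is essentially (helper name hypothetical):

```python
def normalize_fill_value(fill_value: object) -> object:
    """v2 metadata allows fill_value to be null (None); coalescing to 0 here
    replaces "unspecified" with a concrete default before building metadata."""
    return 0 if fill_value is None else fill_value
```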
