Separate behavior classes from types #2529

agoose77 · 2023-06-16T13:49:03Z

agoose77
Jun 16, 2023
Maintainer

One of the challenges here is the coupling between behavior classes (via the __array__ and __record__ parameters) and array types (__array__ = "string", ...). One motivation is that string classes should be user-customisable without a significant rework of how far behaviors propagate into the codebase; another is, more generally, working out how we think about the __array__ parameter versus other type parameters.

I think it would be helpful to start with describing some use cases:

Implementing a custom behavior for a categorical array
Implementing a custom behavior for a string, e.g. IP address
Implementing a custom behavior for an array with units

To compare "types" between two arrays, we have the following:

NumpyArray — do both arrays have the same primitive?
RecordArray — do both arrays have the same __record__ and are their structures type-comparable?
ListOffsetArray, ..., — do both arrays have the same __array__, and are their contents type-comparable?
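
To make the three comparison rules above concrete, here is a minimal sketch in plain Python. The `Numpy`, `ListOf`, and `Record` classes are hypothetical stand-ins invented for this illustration; they are not the `ak.contents` classes, and the real comparison in Awkward may differ in detail.

```python
from dataclasses import dataclass, field

# Hypothetical toy layout nodes, used only to illustrate the rules above.
@dataclass
class Numpy:
    primitive: str

@dataclass
class ListOf:
    content: object
    parameters: dict = field(default_factory=dict)

@dataclass
class Record:
    contents: dict
    parameters: dict = field(default_factory=dict)

def type_comparable(a, b):
    """Return True if two toy layouts have comparable types."""
    if isinstance(a, Numpy) and isinstance(b, Numpy):
        # NumpyArray: same primitive?
        return a.primitive == b.primitive
    if isinstance(a, Record) and isinstance(b, Record):
        # RecordArray: same __record__ and comparable structure?
        return (
            a.parameters.get("__record__") == b.parameters.get("__record__")
            and a.contents.keys() == b.contents.keys()
            and all(type_comparable(a.contents[k], b.contents[k]) for k in a.contents)
        )
    if isinstance(a, ListOf) and isinstance(b, ListOf):
        # List types: same __array__ and comparable contents?
        return (
            a.parameters.get("__array__") == b.parameters.get("__array__")
            and type_comparable(a.content, b.content)
        )
    return False

string = ListOf(Numpy("uint8"), {"__array__": "string"})
plain = ListOf(Numpy("uint8"))
```

In this sketch a string and a plain list of uint8 are not type-comparable, even though their contents are, because their __array__ values differ.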

It's clear to me that __array__ and __record__ are quite different, beyond their association with different content classes. Unlike __record__, __array__ is used to implement non-behavior customisation that precludes the use of custom behaviors.

I propose we introduce a new __kind__ parameter that can currently be one of ("string", "categorical"). We can introduce constraints upon which contents can be assigned with which values of __kind__. Introducing this parameter would allow us to separate user-provided names from built-in names, permitting custom strings and categoricals, e.g.

my_string = ak.contents.ListOffsetArray(
    ak.index.Index64([0, 3, ..., 9]),
    ak.contents.NumpyArray(
        np.array(..., dtype=np.uint8)
    ),
    parameters={
        "__kind__": "string",
        "__array__": "ip-address"
    }
)

The nominal type precedence is then:

__array__
__kind__

and is used to resolve behaviors. Meanwhile, string-specific features check __kind__, or fall back upon __array__ for legacy strings.

In this new formulation, we have the following interpretation of each parameter:

__array__ — the nominal type of a list (OR, the kind of a "legacy" string or categorical)
__record__ — the nominal type of a record
__kind__ — the built-in nominal type of a list (e.g. string or categorical)

Built-in types (strings, categoricals) look at __kind__, but high-level features like behaviors look at the resolved nominal type, i.e. via the precedence above.
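
As a rough sketch of that split, assuming parameters are a plain dict: `resolve_nominal_type` and `is_string` below are hypothetical helper names for illustration, not existing Awkward functions.

```python
def resolve_nominal_type(parameters):
    """Name used for behavior lookup: __array__ first, then __kind__."""
    for key in ("__array__", "__kind__"):
        value = parameters.get(key)
        if value is not None:
            return value
    return None

def is_string(parameters):
    """String features check __kind__, falling back to __array__ for legacy strings."""
    kind = parameters.get("__kind__")
    if kind is not None:
        return kind == "string"
    return parameters.get("__array__") == "string"

custom = {"__kind__": "string", "__array__": "ip-address"}
legacy = {"__array__": "string"}
```

Here `custom` resolves to the nominal type "ip-address" while still being treated as a string, and `legacy` resolves to "string" through the fallback.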

Units (#2468) are a related problem; we can either implement them as a user-behavior, or as a low-level integration.

If units were implemented using behaviors, e.g.

>>> x = ak.Array(
...     [1, 2, 3, 4],
...     parameters={
...         "__array__": "unit",
...         "__unit__": "s",
...     },
... )
>>> y = ak.Array(
...     [1000, 2000, 3000, 4000],
...     parameters={
...         "__array__": "unit",
...         "__unit__": "ms",
...     },
... )
>>> def add_units(left, right):
...     ...
>>> ak.behavior[np.add, "unit", "unit"] = add_units
>>> x + y
ak.Array(
    [2000, 4000, 6000, 8000],
    parameters={
        "__array__": "unit",
        "__unit__": "ms",
    }
)

then it would not be possible to define a custom class for this array without re-overloading all of the ufunc methods that implement the units system. I can't immediately see what such a custom class would be useful for, but the behavior class doesn't feel strongly related to the units system. Just as string features check __kind__ in this new system, I think unit conversion should happen in the ufunc dispatch and look exclusively at __unit__, i.e.

Do all arrays have the same nominal type (__array__ → __kind__)?
Convert all arrays to a common unit, if any arrays have a unit (or error if not possible)
Invoke any behavior overload for the ufunc
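
The three steps above could be sketched roughly as follows. This is plain Python over `(values, parameters)` pairs rather than real Awkward arrays; `UNIT_IN_MS`, `to_common_unit`, and `dispatch` are hypothetical names, the hard-coded unit table stands in for a real units registry, and choosing the finest unit as the common one is an assumption of this sketch.

```python
import operator

# Hypothetical unit table (size of each unit in milliseconds).
UNIT_IN_MS = {"s": 1000, "ms": 1}

def nominal_type(parameters):
    """Resolve the nominal type with the precedence __array__, then __kind__."""
    return parameters.get("__array__", parameters.get("__kind__"))

def to_common_unit(operands):
    """Rescale (values, parameters) pairs to the finest unit among them."""
    units = [params.get("__unit__") for _, params in operands]
    if all(unit is None for unit in units):
        return list(operands)
    if any(unit not in UNIT_IN_MS for unit in units):
        raise ValueError(f"cannot find a common unit for {units}")
    common = min(units, key=UNIT_IN_MS.__getitem__)
    return [
        ([v * (UNIT_IN_MS[unit] // UNIT_IN_MS[common]) for v in values],
         {**params, "__unit__": common})
        for (values, params), unit in zip(operands, units)
    ]

def dispatch(ufunc, left, right, behavior=None):
    """Apply a binary operation following the three steps listed above."""
    behavior = behavior or {}
    # Step 1: all operands must share a nominal type.
    names = {nominal_type(params) for _, params in (left, right)}
    if len(names) != 1:
        raise TypeError(f"operands have different nominal types: {names}")
    (name,) = names
    # Step 2: convert to a common unit, looking only at __unit__.
    left, right = to_common_unit([left, right])
    # Step 3: invoke a behavior overload if one is registered.
    overload = behavior.get((ufunc, name, name))
    if overload is not None:
        return overload(left, right)
    values = [ufunc(a, b) for a, b in zip(left[0], right[0])]
    return values, dict(left[1])

x = ([1, 2, 3, 4], {"__unit__": "s"})
y = ([1000, 2000, 3000, 4000], {"__unit__": "ms"})
result = dispatch(operator.add, x, y)
```

Because unit conversion happens before the overload lookup, a user-defined overload registered under a custom nominal type receives operands that are already in a common unit, so it does not have to reimplement the units system.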

Ultimately, all of this represents a different mindset; behaviours are for users, and we should only use behaviors internally iff. the new feature(s) aren't intended to compose with other features. I feel that strings are orthogonal to custom user-types, and that units are also orthogonal to custom user-types.

Tagging @jpivarski for visibility

jpivarski · 2023-06-16T22:08:54Z

jpivarski
Jun 16, 2023
Maintainer

Yeah, __array__ and __record__ are becoming more different all the time. Most of the uses of __array__ that we're thinking about are very special, in ways that __record__ overloading is not.

When we've talked about implementing __units__, I had been thinking of it as completely separate from __array__ or an __array__ replacement. It only applies to array-like things (not record-like things); in particular, __units__ only applies to PrimitiveType (NumpyArray). I don't think it's a special kind of __array__ overload—it's extremely special. (We know how to implement it for all cases: look up some info in a Pint registry and scalar-multiply.)

Categorical is also its own thing, applying only to is_indexed node types (IndexedArray and IndexedOptionArray).

The special behaviors we're pulling out of generic __array__ overloads are modifications to strings and bytestrings. That is, we're reducing the generality of __array__ overloading to just string/bytestring overloading. That's still very general; it's what a database would call an opaque binary blob type (overloading bytestring, specifically).

I see that you're trying to pull together similar things to give them a common implementation, but

__array__ overloading is not an active part of the ecosystem. (I think it's never been done, apart from defining strings.)
The other cases, units and categorical, are not very much like string/bytestring overloading.
We're somewhat constrained by backward compatibility: categorical and string/bytestring are implemented in a particular way, and I don't want to complicate things by having to support both a legacy way and a new way.

Why can't they be three special cases? They all apply to different sets of nodes:

units: only is_numpy
categorical: only is_indexed of anything
strings/bytestrings: only is_list of is_numpy

They seem to be three different things to me.
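
A minimal sketch of that partition, assuming a node is described by a plain dict with a "kind" key ("numpy", "indexed", "list", ...) and an optional "content". The function name and the dict encoding are hypothetical and only mirror the is_numpy / is_indexed / is_list properties mentioned above.

```python
def applicable_special_cases(node):
    """List which of the three special cases could apply to a toy node."""
    cases = []
    if node["kind"] == "numpy":
        cases.append("units")
    if node["kind"] == "indexed":
        # Categorical applies to an indexed node over any content.
        cases.append("categorical")
    if node["kind"] == "list" and node["content"]["kind"] == "numpy":
        cases.append("string/bytestring")
    return cases

numbers = {"kind": "numpy"}
chars = {"kind": "list", "content": numbers}
categories = {"kind": "indexed", "content": chars}
```

In this encoding no node qualifies for more than one case, which is the sense in which the three are disjoint.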

0 replies

agoose77 · 2023-06-17T22:01:16Z

agoose77
Jun 17, 2023
Maintainer Author

__array__ overloading is not an active part of the ecosystem. (I think it's never been done, apart from defining strings.)

I've used this (as a user), but I agree that it's uncommon.

Why can't they be three special cases? They all apply to different sets of nodes:

I think they mostly should be — at least, that's what I propose here.

The other cases, units and categorical, are not very much like string/bytestring overloading.

__unit__ is special (and yes, per NumpyArray). What unifies categorical and string is the fact that these are abstractions over existing layout nodes. That's what __kind__ signifies: built-in abstractions over the existing layout nodes.

I'd like for users to be able to add their own methods to these things via behaviors. That means either

moving these attributes to the layout type itself (e.g. a new initialiser argument is_string etc.),
creating a new reserved property for these abstractions (__kind__), or
creating a new reserved property for non-record type names (e.g. __name__).

In particular, I think units, strings, and categoricals have group-properties that should transcend their nominal type. That's what a custom string should have; now that strings are built in, users should have a way to add their own named strings without needing to reimplement everything.

4 replies

jpivarski Jun 19, 2023
Maintainer

Okay, I see the logic of that (__kind__). But would existing strings need to change? I really, really don't want to have to explain that we have new-style strings and legacy strings (and categoricals, though I think strings are a lot more likely).

agoose77 Jun 19, 2023
Maintainer Author

I am also not against making __array__ be __kind__, and adding a new __name__. In fact, that might be a better choice of names than __kind__. In either case, we can leave existing non-named strings as {"__array__": "string"} — only custom strings / categoricals would need to change. With that in mind, for symmetry it might be preferable to introduce __name__, rather than __kind__, so that custom strings look very similar to un-named strings.

Adapting the statement above:

In this new formulation, we have the following interpretation & precedence of each parameter:

__name__: the user-provided nominal type of a list
__array__: the built-in semantic type of a list, e.g. string or categorical
__record__: the nominal type of a record

This would be an unnamed string:

my_string = ak.contents.ListOffsetArray(
    ak.index.Index64([0, 3, ..., 9]),
    ak.contents.NumpyArray(
        np.array(..., dtype=np.uint8)
    ),
    parameters={
        "__array__": "string"
    }
)

whilst this one has a name:

my_string = ak.contents.ListOffsetArray(
    ak.index.Index64([0, 3, ..., 9]),
    ak.contents.NumpyArray(
        np.array(..., dtype=np.uint8)
    ),
    parameters={
        "__array__": "string",
        "__name__": "ip-address"
    }
)

jpivarski Jun 20, 2023
Maintainer

This is good because it extends from the current state. The meaning of "name" is perhaps too broad; maybe __overload__, __override__, or __behavior__? It raises the question of why __record__ is different, since that also overloads/overrides behavior. But it is different: when you set __record__, it could assign an ak.Record subclass if the level is at a record, but it could also assign an ak.Array subclass if it's some level above that record.

agoose77 Jun 20, 2023
Maintainer Author

__behavior__ is perhaps too restrictive, too; it's not just a key in the behavior lookup, it is also a nominal type. An intended consequence of this is that merging string against ip-address would be considered invalid.

Thus, I think __record__ is justified in not having the name __behavior__: it affects both the choice of behavior class and the set of operations that one can perform against it (i.e., whether two record arrays can be concatenated together without a union is determined by whether they share the same name). I'd like the same to be true for strings and categoricals: their name is both a nominal type and the lookup key for the behavior class. This change would mean that __array__ becomes a more restricted parameter that is used only to tell Awkward whether a list is a string or an indexed node is a categorical.
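
A small sketch of how a name could act as both nominal type and behavior key under this proposal. `nominal_name`, `behavior_class`, and `mergeable` are hypothetical helpers over plain parameter dicts, the `IPAddress` class is an invented placeholder, and the fallback from __name__ to __array__ for unnamed strings is an assumption based on the precedence listed earlier in this thread.

```python
class IPAddress:
    """Placeholder for a user-defined behavior class."""

def nominal_name(parameters):
    """User-provided __name__ if present, otherwise the built-in __array__."""
    return parameters.get("__name__", parameters.get("__array__"))

def behavior_class(parameters, behavior):
    """Look up the behavior class by nominal name; None means the default class."""
    return behavior.get(nominal_name(parameters))

def mergeable(left, right):
    """Lists merge without a union only if built-in type and name both agree."""
    return (
        left.get("__array__") == right.get("__array__")
        and nominal_name(left) == nominal_name(right)
    )

behavior = {"ip-address": IPAddress}
unnamed = {"__array__": "string"}
ip = {"__array__": "string", "__name__": "ip-address"}
```

Both `unnamed` and `ip` are strings as far as __array__ is concerned, but only `ip` picks up the custom class, and merging the two would be rejected in this sketch because their names differ.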
