Sub-byte custom element types #63

hawkinsp · 2023-05-03T02:35:21Z

I'm wondering if anyone has considered sub-byte user types as part of the API design. The assumption that every element occupies a whole number of bytes seems to be baked into the current user dtype design.

For example, consider a packed int4 type, whose elements are nibbles (half-bytes). It doesn't seem possible to express such a type in the user dtype API, e.g., the type descriptor requires a byte size for each element, and ufuncs require byte strides. I was toying with the idea of trying to write such a type, but it doesn't seem to be possible without API changes.
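To make the packed layout concrete, here is a minimal sketch of storing signed 4-bit values two per byte in a plain `uint8` array. The helper names `pack_int4` and `unpack_int4` and the low-nibble-first ordering are assumptions made for illustration; they are not part of NumPy or of any user dtype API.

```python
import numpy as np

def pack_int4(values):
    """Pack signed 4-bit integers (-8..7) two per byte, low nibble first.

    Hypothetical helper: the nibble order is an arbitrary choice.
    """
    v = np.asarray(values, dtype=np.int8)
    if v.ndim != 1:
        raise ValueError("expected a 1-D sequence")
    if np.any((v < -8) | (v > 7)):
        raise ValueError("values must fit in int4 (-8..7)")
    if v.size % 2:
        v = np.append(v, np.int8(0))  # pad to an even element count
    nibbles = v.astype(np.uint8) & 0x0F  # keep the two's-complement low 4 bits
    return nibbles[0::2] | (nibbles[1::2] << 4)

def unpack_int4(packed, count):
    """Inverse of pack_int4: recover `count` signed 4-bit values as int8."""
    p = np.asarray(packed, dtype=np.uint8)
    out = np.empty(p.size * 2, dtype=np.uint8)
    out[0::2] = p & 0x0F
    out[1::2] = p >> 4
    signed = out.astype(np.int8)
    signed[signed >= 8] -= 16  # sign-extend nibbles 8..15 to -8..-1
    return signed[:count]

values = [-8, -1, 0, 7, 3]
packed = pack_int4(values)
restored = unpack_int4(packed, len(values))
```

Note that element `i` lives at byte `i // 2`, which is exactly the information a per-element byte size and byte strides cannot express.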

I see this was mentioned briefly in NEP 41 (int2) but there is no further discussion I can see. Has there been any more thinking along those lines?


seberg · 2023-05-03T06:52:32Z

Well, there was always the issue of ABI. Now, we can change the ABI for the DType at least in 2.0, and I am planning to do so to make additions easier. So representing a bit-sized dtype should be relatively straightforward with that. Maybe as:

- A flag or even a negative size or so.
- Keep byte-size, but additionally indicate a bit size.

The bigger problem is how you actually work with it. We pass in dtypes and some state to ufuncs, casts, etc. Would a dtype indicating bit-sized elements always get passed bit pointers and bit strides?

Right now, we pass around pointers such as char *. On 32-bit systems, I think you would have to change this to a 64-bit number (strides/array sizes seem acceptable at 32 bits, but a 32-bit pointer won't do for addressing individual bits).
We could do that, but would it be annoying for normal byte loops?
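As a rough illustration of why a 32-bit value is too small for a bit pointer, the sketch below encodes a bit address as `byte_address * 8 + bit_offset`. This encoding and the function names are assumptions for the example only; nothing here reflects an actual NumPy design.

```python
def to_bit_address(byte_address, bit_offset=0):
    """Combine a byte address and a bit offset (0..7) into one integer.

    Hypothetical encoding: bit_address = byte_address * 8 + bit_offset.
    """
    if not 0 <= bit_offset < 8:
        raise ValueError("bit_offset must be in 0..7")
    return byte_address * 8 + bit_offset

def from_bit_address(bit_address):
    """Split a bit address back into (byte_address, bit_offset)."""
    return divmod(bit_address, 8)

def element_bit_address(base_bit_address, index, bit_stride):
    """Bit address of element `index` given a stride measured in bits."""
    return base_bit_address + index * bit_stride

# The highest bit of the highest byte in a 32-bit address space.
max_bit_address = to_bit_address(2**32 - 1, 7)
bits_needed = max_bit_address.bit_length()

# Third element (index 2) of a packed int4 array starting at byte 1000.
third = from_bit_address(element_bit_address(to_bit_address(1000), 2, 4))
```

Under this encoding a full 32-bit byte address space needs 35 bits, so the value would have to be widened to 64 bits, while a bit stride of 4 for int4 still fits easily in a 32-bit stride.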

So a lot of questions, and I don't have all the answers:

- Let's say we support storing such a dtype directly in the NumPy array: how would you do that? Still use a normal pointer that remains valid, but indicate the bit offset somewhere (i.e. as a last stride)?
- The main question: how do we make casts/ufunc signatures work with it decently? Let's not worry about NumPy itself there!
  - I would like it if your int4 would very transparently work as an Int4(bitsize=8) so that you could store it into a byte-strided array easily. But if that is meant as a convenience, it may be that your dtype would need both bit and byte loops?
  - Or would you actually have bit and byte loops with distinct signatures? We would just have a separate mechanism to fetch the bit-sized versions (or a flag asking for it?). (We could indicate it additionally, so you could have a single loop in principle.)

I also don't think you could update all of NumPy functionality on a reasonable timeline without hiring 1-2 people dedicated to it. So we could do additions to the array object to allow it in principle and even get it to work with most things. But I doubt you will get it to work with everything quickly (which is fine by me, you could just add a NotImplementedError).

Anyway, questions... The interesting part might be if we can formulate anything that affects the array object or the inner-loop signatures for ufuncs/casts.

hawkinsp · 2023-05-12T20:27:19Z

jax-ml/ml_dtypes#71 is a prototype of adding a padded int4 NumPy type to ml_dtypes. Since we cannot represent sub-byte types, the best we can do is use an int4 type where each element is padded up to a byte.

This is good enough for my use case, although the padding makes me a little bit sad.
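The trade-off can be sketched with plain NumPy, using an `int8` array as a stand-in for a padded int4 type (one 4-bit value per byte). This stand-in is an assumption for illustration and does not use or describe the ml_dtypes implementation.

```python
import numpy as np

# Stand-in for a padded int4 array: each 4-bit value occupies a whole byte.
padded = np.array([-8, -1, 0, 7, 3], dtype=np.int8)

# Because every element starts on a byte boundary, ordinary byte strides
# work and strided slices are views rather than copies.
odd = padded[1::2]
is_view = np.shares_memory(odd, padded)

# Storage cost: one byte per element when padded, versus 4 bits per
# element (rounded up to whole bytes) for a hypothetical packed layout.
padded_nbytes = padded.nbytes
packed_nbytes = (padded.size * 4 + 7) // 8
```

The padded form costs roughly twice the memory, but it fits the existing byte-size and byte-stride machinery without API changes.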

