Implement first-class List type #60629
base: main
Conversation
Does this assume moving forward with the logical type system PDEP? i.e. a List type backed by multiple (in theory) implementations?
PDEP-14 would be nice, but I don't think it's required here. If we do not revert PDEP-10, then we can assume pyarrow is required and just build off of that. This can fit logically into the extension type system. We may just want to start referring to that as something else besides "numpy_nullable," but there is an issue already open for that: #59032
Yeah, I'd really rather avoid the changes this makes in that part of the code. Will comment in-line and see if we can find alternatives.
pandas/core/internals/blocks.py
Outdated
try:
    return self.values.dtype
except AttributeError:  # PyArrow fallback
    return self.values.type
This doesn't make sense to me. self.values should be the EA, and the EA.dtype should be the right thing here.
Ah OK thanks. I think this is a holdover from an intermediate state and I didn't recognize the requirement here. Reverting this fixes a lot of the other comments you've made here as well - thanks!
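The point being made — that a block can always delegate to its array's own dtype, with no pyarrow fallback — can be sketched with a minimal stand-in. All class names below are illustrative toys, not pandas internals:

```python
# Toy illustration: an extension array carries its own dtype, so a
# block wrapping it can simply return values.dtype instead of
# falling back to a pyarrow .type attribute. These classes are
# stand-ins, not the pandas implementations.

class ListDtype:
    name = "list"

    def __repr__(self) -> str:
        return self.name


class ListArray:
    """Toy extension-array-like wrapper around list data."""

    def __init__(self, data):
        self._data = list(data)
        self._dtype = ListDtype()

    @property
    def dtype(self):
        # The array owns its dtype; callers never touch a raw
        # pyarrow type object.
        return self._dtype


class Block:
    def __init__(self, values):
        self.values = values

    @property
    def dtype(self):
        # No try/except AttributeError needed: for an extension
        # array, values.dtype is always the right thing.
        return self.values.dtype


blk = Block(ListArray([[1, 2], [3]]))
print(blk.dtype)  # list
```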
pandas/core/internals/blocks.py
Outdated
if dtype:
    klass = get_block_type(dtype)
else:
    klass = get_block_type(values.dtype)
as above, values.dtype should be the ListDtype already. I don't see why passing dtype separately is necessary.
pandas/core/series.py
Outdated
@@ -505,7 +505,7 @@ def __init__(
         data = data.copy()
     else:
         data = sanitize_array(data, index, dtype, copy)
-    data = SingleBlockManager.from_array(data, index, refs=refs)
+    data = SingleBlockManager.from_array(data, dtype, index, refs=refs)
if dtype is your ListDtype, then data.dtype should be ListDtype at this point, so the new argument should be unnecessary
pandas/io/formats/format.py
Outdated
@@ -1103,7 +1103,11 @@ def format_array(
     List[str]
     """
     fmt_klass: type[_GenericArrayFormatter]
-    if lib.is_np_dtype(values.dtype, "M"):
+    if hasattr(values, "type") and values.type == "null":
can we do something more explicit than hasattr checks? i.e. isinstance(dtype, ListDtype) or whatever?
    return ListArray(data)

class TestListArray(BaseConstructorsTests): ...
i think we moved away from this pattern to just ExtensionTests
Yea - I think we can move to that as this gets more production ready. I just wanted to start with something really small while in draft mode
The thing I've always assumed would be a PITA with a ListDtype is

vs

Have you given any thought to that?
For a List data type the first option wouldn't be possible, since those are scalar values. So I think the latter is correct; if you wanted unpacking I think you'd need to provide a list of lists.
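The distinction being drawn — each list is one scalar element, so a list of lists yields one row per inner list rather than an unpacked expansion — can be sketched in plain Python (a toy helper, not the pandas constructor):

```python
def rows_for_list_dtype(data):
    # Under a hypothetical List dtype, each inner list is a single
    # scalar value: a list of lists produces one row per inner list,
    # never an "unpacked" expansion of the inner elements.
    if not isinstance(data, list) or not all(
        isinstance(item, list) for item in data
    ):
        raise TypeError("expected a list of list scalars")
    return list(data)


rows = rows_for_list_dtype([[1, 2], [3, 4]])
assert len(rows) == 2     # two scalar rows...
assert rows[0] == [1, 2]  # ...each holding one whole list
```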
Compare ad6fa08 to e25c0d4
@@ -460,6 +461,8 @@ def treat_as_nested(data) -> bool:
         len(data) > 0
         and is_list_like(data[0])
         and getattr(data[0], "ndim", 1) == 1
+        # TODO(wayd): hack so pyarrow list elements don't expand
+        and not isinstance(data[0], pa.ListScalar)
I think having is_list_like return False for a pyarrow scalar is less hacky?
That's probably true in this particular case, although I'm not sure how it will generalize to all uses of is_list_like. Will do more research.
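The alternative under discussion — teaching the list-likeness check itself to treat certain iterable scalar wrappers as non-list-like, instead of special-casing at each call site — can be sketched like this. Both the helper and the scalar class are stand-ins, not the pandas or pyarrow implementations:

```python
# FakeListScalar mimics how pa.ListScalar wraps a list value: it is
# conceptually a scalar, yet it supports iteration.

class FakeListScalar:
    def __init__(self, value):
        self.value = list(value)

    def __iter__(self):
        return iter(self.value)

    def __len__(self):
        return len(self.value)


# Types deliberately treated as scalars even though they iterate.
_SCALAR_TYPES = (str, bytes, FakeListScalar)


def is_list_like(obj) -> bool:
    # Iterable, but not one of the excluded scalar wrapper types.
    return hasattr(obj, "__iter__") and not isinstance(obj, _SCALAR_TYPES)


assert is_list_like([1, 2])
assert not is_list_like(FakeListScalar([1, 2]))
assert not is_list_like("abc")
```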
@@ -494,7 +495,7 @@ def __init__(
         if not is_list_like(data):
             data = [data]
             index = default_index(len(data))
-        elif is_list_like(data):
+        elif is_list_like(data) and not isinstance(dtype, ListDtype):
What about nested list?
Yea this is a tough one to handle. I'm not sure if something like:
pd.Series([1, 2, 3], index=range(3), dtype=pd.ListDtype())
should raise or broadcast. I think the tests currently want it to broadcast, but we could override that expectation for this array
Quick POC for now. There's a lot to do here but hoping to work in pieces. This currently assumes pyarrow is installed.
The blocks / formatting stuff is not super familiar to me, so hoping @mroeschke or @jbrockmendel might have some ideas on how to better approach it. I think the main problem I am having is that the Block seems to want to infer the type from the values contained. That works for NumPy, but doesn't work with PyArrow, for example when you have an array of all nulls that is separately paired with a type of list[string].
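The inference problem described above can be illustrated without any pandas internals: from the values alone, an all-null column is ambiguous, so the dtype has to travel alongside the data. The helpers below are a hypothetical sketch of that idea:

```python
def infer_dtype(values):
    """Toy value-based inference, standing in for what a Block does."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        # Every element is null: the values alone cannot tell us
        # whether this was meant to be list[string], int64, etc.
        return None
    if all(isinstance(v, list) for v in non_null):
        return "list"
    return "object"


def resolve_dtype(values, declared=None):
    # Carrying the declared dtype separately resolves the ambiguity
    # that pure value-based inference cannot.
    return declared if declared is not None else infer_dtype(values)


assert infer_dtype([None, None]) is None  # ambiguous on its own
assert resolve_dtype([None, None], "list[string]") == "list[string]"
assert infer_dtype([[1], None]) == "list"
```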