Implement first-class List type #60629

Draft
wants to merge 6 commits into
base: main

Conversation

Member

@WillAyd WillAyd commented Dec 30, 2024

Quick POC for now. There's a lot to do here but hoping to work in pieces. This currently assumes pyarrow is installed.

The blocks / formatting stuff is not super familiar to me so hoping @mroeschke or @jbrockmendel might have some ideas on how to better approach. I think the main problem I am having is the Block seems to want to infer the type from the values contained. That works for NumPy, but doesn't work with PyArrow, for example when you have an array of all nulls that is separately paired with a type of list[string].

@mroeschke
Member

Does this assume moving forward with the logical type system PDEP? i.e. List type backed by multiple (in theory) implementations

@WillAyd
Member Author

WillAyd commented Dec 30, 2024

PDEP-14 would be nice but I don't think it's required here. If we do not revert PDEP-10, then we can assume pyarrow is required and just build off of that. This can fit logically into the extension type system.

We may just want to start referring to that as something else besides "numpy_nullable", but there is an issue already open for that: #59032

@jbrockmendel
Member

> The blocks / formatting stuff is not super familiar to me so hoping @mroeschke or @jbrockmendel might have some ideas on how to better approach. I think the main problem I am having is the Block seems to want to infer the type from the values contained.

Yah I'd really rather avoid the changes this makes in that part of the code. Will comment in-line and see if we can find alternatives.

try:
    return self.values.dtype
except AttributeError:  # PyArrow fallback
    return self.values.type
Member

This doesn't make sense to me. self.values should be the EA, and the EA.dtype should be the right thing here.

Member Author

Ah OK thanks. I think this is a holdover from an intermediate state and I didn't recognize the requirement here. Reverting this fixes a lot of the other comments you've made here as well - thanks!
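
The point made in this exchange, that the array held by the block should itself report the right dtype, can be sketched in plain Python. `ToyListDtype`, `ToyListArray`, and `block_dtype` are hypothetical stand-ins invented for illustration; they are not pandas classes or the PR's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyListDtype:
    """Hypothetical stand-in for a ListDtype; records the element type."""

    value_type: str = "string"


class ToyListArray:
    """Hypothetical stand-in for an extension array wrapping list data."""

    def __init__(self, values, dtype: ToyListDtype):
        self._values = list(values)
        self._dtype = dtype

    @property
    def dtype(self) -> ToyListDtype:
        # The array, not its contents, is the source of truth for the type.
        return self._dtype


def block_dtype(values):
    # No try/except fallback: every array handed to the block exposes .dtype.
    return values.dtype


all_null = ToyListArray([None, None], ToyListDtype("string"))
print(block_dtype(all_null))  # ToyListDtype(value_type='string')
```

Because the wrapper carries the dtype, an all-null array still reports its list type, and the caller needs neither a fallback nor a separately passed dtype.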

if dtype:
    klass = get_block_type(dtype)
else:
    klass = get_block_type(values.dtype)
Member

as above, values.dtype should be the ListDtype already. I don't see why passing dtype separately is necessary.

@@ -505,7 +505,7 @@ def __init__(
                 data = data.copy()
         else:
             data = sanitize_array(data, index, dtype, copy)
-            data = SingleBlockManager.from_array(data, index, refs=refs)
+            data = SingleBlockManager.from_array(data, dtype, index, refs=refs)
Member

if dtype is your ListDtype, then data.dtype should be ListDtype at this point so the new argument should be unnecessary

@@ -1103,7 +1103,11 @@ def format_array(
List[str]
"""
fmt_klass: type[_GenericArrayFormatter]
if lib.is_np_dtype(values.dtype, "M"):
if hasattr(values, "type") and values.type == "null":
Member

can we do something more explicit than hasattr checks? i.e. isinstance(dtype, ListDtype) or whatever?
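
A small sketch of why the explicit check is preferable. All class and function names here are hypothetical, made up for illustration; the only thing taken from the diff is the shape of the `hasattr` test.

```python
class ToyListDtype:
    """Hypothetical stand-in for the ListDtype discussed in the review."""


class ToyListArray:
    def __init__(self):
        self.dtype = ToyListDtype()


class Unrelated:
    """Not list-typed, but happens to expose a 'type' attribute."""

    def __init__(self):
        self.dtype = "int64"
        self.type = "null"


def duck_check(values) -> bool:
    # Same shape as the hasattr test in the diff: matches on attribute names.
    return hasattr(values, "type") and values.type == "null"


def explicit_check(values) -> bool:
    # Dispatch on the dtype's class instead of on incidental attributes.
    return isinstance(values.dtype, ToyListDtype)


print(duck_check(Unrelated()))         # True (false positive)
print(explicit_check(Unrelated()))     # False
print(explicit_check(ToyListArray()))  # True
```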


return ListArray(data)
class TestListArray(BaseConstructorsTests): ...
Member

i think we moved away from this pattern to just ExtensionTests

Member Author

Yea - I think we can move to that as this gets more production ready. I just wanted to start with something really small while in draft mode

@jbrockmendel
Member

the thing I've always assumed would be a PITA with a ListDtype is __setitem__ distinguishing whether df.iloc[x, :3] = [a, b, c] is to behave like

df.iloc[x, 0] = a
df.iloc[x, 1] = b
df.iloc[x, 2] = c

vs

df.iloc[x, 0] = [a, b, c]
df.iloc[x, 1] = [a, b, c]
df.iloc[x, 2] = [a, b, c]

have you given any thought to that?

@WillAyd
Member Author

WillAyd commented Dec 31, 2024

For a List data type the first option wouldn't be possible, since those are scalar values. So I think the latter is correct; if you wanted unpacking I think you'd need to provide a list of lists.
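
The two readings of the assignment, and the "list of lists" escape hatch, can be sketched with a toy function over a plain Python list. `set_row_slice` and its `list_dtype` flag are hypothetical and only model the semantics under discussion; this is not how pandas implements `__setitem__`.

```python
def set_row_slice(row, value, *, list_dtype):
    """Return the new contents of `row` after assigning `value` across it.

    list_dtype=False: `value` is unpacked, one element per column.
    list_dtype=True:  `value` is one list scalar, repeated in every column,
                      unless it is a list of lists (one list per column).
    """
    if not list_dtype:
        if len(value) != len(row):
            raise ValueError("cannot unpack: length mismatch")
        return list(value)
    if value and all(isinstance(v, list) for v in value):
        if len(value) != len(row):
            raise ValueError("need one list per column")
        return [list(v) for v in value]
    return [list(value) for _ in row]


row = [None, None, None]
unpacked = set_row_slice(row, ["a", "b", "c"], list_dtype=False)
repeated = set_row_slice(row, ["a", "b", "c"], list_dtype=True)
per_column = set_row_slice(row, [["a"], ["b"], ["c"]], list_dtype=True)

print(unpacked)    # ['a', 'b', 'c']
print(repeated)    # [['a', 'b', 'c'], ['a', 'b', 'c'], ['a', 'b', 'c']]
print(per_column)  # [['a'], ['b'], ['c']]
```

Note that the list-of-lists rule in this toy is itself ambiguous for a nested list type such as list[list[string]], which is part of why the question is hard.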

@WillAyd WillAyd force-pushed the implement-list-type branch from ad6fa08 to e25c0d4 on December 31, 2024 19:06

@@ -460,6 +461,8 @@ def treat_as_nested(data) -> bool:
         len(data) > 0
         and is_list_like(data[0])
         and getattr(data[0], "ndim", 1) == 1
+        # TODO(wayd): hack so pyarrow list elements don't expand
+        and not isinstance(data[0], pa.ListScalar)
Member

I think having is_list_like return False for pyarrow scalars is less hacky?

Member Author

That's probably true in this particular case, although I'm not sure how it will generalize to all uses of is_list_like. Will do more research
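
The trade-off in this exchange can be sketched without pandas or pyarrow. `ToyListScalar`, `toy_is_list_like`, and `toy_treat_as_nested` are simplified, hypothetical stand-ins; they are not pandas' implementations, and the `scalar_types` parameter is invented for this sketch.

```python
from collections.abc import Iterable


class ToyListScalar:
    """Hypothetical stand-in for a list scalar: iterable, yet a single value."""

    def __init__(self, items):
        self._items = list(items)

    def __iter__(self):
        return iter(self._items)


def toy_is_list_like(obj, *, scalar_types=()):
    """Simplified check: iterable, not a string, not an excluded scalar type."""
    if isinstance(obj, (str, bytes)) or isinstance(obj, scalar_types):
        return False
    return isinstance(obj, Iterable)


def toy_treat_as_nested(data, *, scalar_types=()):
    # Only the first element is inspected, mirroring the condition in the diff.
    return len(data) > 0 and toy_is_list_like(data[0], scalar_types=scalar_types)


data = [ToyListScalar([1, 2]), ToyListScalar([3])]
print(toy_treat_as_nested(data))                                 # True
print(toy_treat_as_nested(data, scalar_types=(ToyListScalar,)))  # False
```

Moving the exclusion into the list-like check removes the special case from `treat_as_nested`, but it also changes the answer for every other caller of that check, which is the generalization concern raised in the reply.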

@@ -494,7 +495,7 @@ def __init__(
                 if not is_list_like(data):
                     data = [data]
                 index = default_index(len(data))
-            elif is_list_like(data):
+            elif is_list_like(data) and not isinstance(dtype, ListDtype):
Member

What about nested list?

Member Author

Yea this is a tough one to handle. I'm not sure if something like:

pd.Series([1, 2, 3], index=range(3), dtype=pd.ListDtype())

should raise or broadcast. I think the tests currently want it to broadcast, but we could override that expectation for this array
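
The two candidate behaviours for that constructor call can be modelled with a toy function. `build_list_column` and its `on_flat` parameter are hypothetical, and the sketch takes no position on which behaviour pandas should adopt.

```python
def build_list_column(data, index, *, on_flat="broadcast"):
    """Build one list value per index label from `data`.

    A list of lists supplies one value per row. A flat list is either
    repeated for every row ("broadcast") or rejected ("raise").
    """
    n = len(index)
    if all(isinstance(x, list) for x in data):
        if len(data) != n:
            raise ValueError("length of data does not match length of index")
        return [list(x) for x in data]
    if on_flat == "raise":
        raise TypeError("expected a list of lists for a list-typed column")
    return [list(data) for _ in range(n)]


broadcasted = build_list_column([1, 2, 3], range(3))
nested = build_list_column([[1], [2], [3]], range(3))

print(broadcasted)  # [[1, 2, 3], [1, 2, 3], [1, 2, 3]]
print(nested)       # [[1], [2], [3]]
```

With `on_flat="raise"`, the flat input `[1, 2, 3]` raises a `TypeError` instead of being repeated.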

3 participants