Implement first-class List type #60629
base: main
Conversation
Does this assume moving forward with the logical type system PDEP? i.e. a List type backed by multiple (in theory) implementations?
PDEP-14 would be nice, but I don't think it's required here. If we do not revert PDEP-10, then we can assume pyarrow is required and just build off of that. This can fit logically into the extension type system. We may just want to start referring to that as something else besides "numpy_nullable," but there is an issue already open for that: #59032
Yeah, I'd really rather avoid the changes this makes in that part of the code. Will comment in-line and see if we can find alternatives.
pandas/core/internals/blocks.py
Outdated
try:
    return self.values.dtype
except AttributeError:  # PyArrow fallback
    return self.values.type
This doesn't make sense to me. self.values should be the EA, and the EA.dtype should be the right thing here.
Ah OK thanks. I think this is a holdover from an intermediate state and I didn't recognize the requirement here. Reverting this fixes a lot of the other comments you've made here as well - thanks!
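The point being made — that a block can always delegate to its array's own dtype, with no pyarrow fallback — can be sketched with a minimal stand-in. All class names below are illustrative toys, not pandas internals:

```python
# Toy illustration: an extension array carries its own dtype, so a
# block wrapping it can simply return values.dtype instead of
# falling back to a pyarrow .type attribute. These classes are
# stand-ins, not the pandas implementations.

class ListDtype:
    name = "list"

    def __repr__(self) -> str:
        return self.name


class ListArray:
    """Toy extension-array-like wrapper around list data."""

    def __init__(self, data):
        self._data = list(data)
        self._dtype = ListDtype()

    @property
    def dtype(self):
        # The array owns its dtype; callers never touch a raw
        # pyarrow type object.
        return self._dtype


class Block:
    def __init__(self, values):
        self.values = values

    @property
    def dtype(self):
        # No try/except AttributeError needed: for an extension
        # array, values.dtype is always the right thing.
        return self.values.dtype


blk = Block(ListArray([[1, 2], [3]]))
print(blk.dtype)  # list
```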
pandas/core/internals/blocks.py
Outdated
if dtype:
    klass = get_block_type(dtype)
else:
    klass = get_block_type(values.dtype)
as above, values.dtype should be the ListDtype already. I don't see why passing dtype separately is necessary.
pandas/core/series.py
Outdated
@@ -505,7 +505,7 @@ def __init__(
         data = data.copy()
     else:
         data = sanitize_array(data, index, dtype, copy)
-    data = SingleBlockManager.from_array(data, index, refs=refs)
+    data = SingleBlockManager.from_array(data, dtype, index, refs=refs)
if dtype is your ListDtype, then data.dtype should be ListDtype at this point, so the new argument should be unnecessary
pandas/io/formats/format.py
Outdated
@@ -1103,7 +1103,11 @@ def format_array(
     List[str]
     """
     fmt_klass: type[_GenericArrayFormatter]
-    if lib.is_np_dtype(values.dtype, "M"):
+    if hasattr(values, "type") and values.type == "null":
can we do something more explicit than hasattr checks? i.e. isinstance(dtype, ListDtype) or whatever?
    return ListArray(data)

class TestListArray(BaseConstructorsTests): ...
i think we moved away from this pattern to just ExtensionTests
Yea - I think we can move to that as this gets more production ready. I just wanted to start with something really small while in draft mode
The thing I've always assumed would be a PITA with a ListDtype is

vs

Have you given any thought to that?
For a List data type the first option wouldn't be possible, since those are scalar values. So I think the latter is correct; if you wanted unpacking I think you'd need to provide a list of lists.
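The distinction being drawn — each list is one scalar element, so a list of lists yields one row per inner list rather than an unpacked expansion — can be sketched in plain Python (a toy helper, not the pandas constructor):

```python
def rows_for_list_dtype(data):
    # Under a hypothetical List dtype, each inner list is a single
    # scalar value: a list of lists produces one row per inner list,
    # never an "unpacked" expansion of the inner elements.
    if not isinstance(data, list) or not all(
        isinstance(item, list) for item in data
    ):
        raise TypeError("expected a list of list scalars")
    return list(data)


rows = rows_for_list_dtype([[1, 2], [3, 4]])
assert len(rows) == 2     # two scalar rows...
assert rows[0] == [1, 2]  # ...each holding one whole list
```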
Compare ad6fa08 to e25c0d4
@@ -460,6 +461,8 @@ def treat_as_nested(data) -> bool:
         len(data) > 0
         and is_list_like(data[0])
         and getattr(data[0], "ndim", 1) == 1
+        # TODO(wayd): hack so pyarrow list elements don't expand
+        and not isinstance(data[0], pa.ListScalar)
I think having is_list_like return False for a pyarrow scalar is less hacky?
That's probably true in this particular case, although I'm not sure how it will generalize to all uses of is_list_like. Will do more research.
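The alternative under discussion — teaching the list-likeness check itself to treat certain iterable scalar wrappers as non-list-like, instead of special-casing at each call site — can be sketched like this. Both the helper and the scalar class are stand-ins, not the pandas or pyarrow implementations:

```python
# FakeListScalar mimics how pa.ListScalar wraps a list value: it is
# conceptually a scalar, yet it supports iteration.

class FakeListScalar:
    def __init__(self, value):
        self.value = list(value)

    def __iter__(self):
        return iter(self.value)

    def __len__(self):
        return len(self.value)


# Types deliberately treated as scalars even though they iterate.
_SCALAR_TYPES = (str, bytes, FakeListScalar)


def is_list_like(obj) -> bool:
    # Iterable, but not one of the excluded scalar wrapper types.
    return hasattr(obj, "__iter__") and not isinstance(obj, _SCALAR_TYPES)


assert is_list_like([1, 2])
assert not is_list_like(FakeListScalar([1, 2]))
assert not is_list_like("abc")
```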
@@ -494,7 +495,7 @@ def __init__(
         if not is_list_like(data):
             data = [data]
             index = default_index(len(data))
-        elif is_list_like(data):
+        elif is_list_like(data) and not isinstance(dtype, ListDtype):
What about nested list?
Yea this is a tough one to handle. I'm not sure if something like:
pd.Series([1, 2, 3], index=range(3), dtype=pd.ListDtype())
should raise or broadcast. I think the tests currently want it to broadcast, but we could override that expectation for this array
Quick POC for now. There's a lot to do here but hoping to work in pieces. This currently assumes pyarrow is installed.
The blocks / formatting stuff is not super familiar to me, so hoping @mroeschke or @jbrockmendel might have some ideas on how to better approach it. I think the main problem I am having is that the Block seems to want to infer the type from the values contained. That works for NumPy, but doesn't work with PyArrow, for example when you have an array of all nulls that is separately paired with a type of list[string].
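The inference problem described above can be illustrated without any pandas internals: from the values alone, an all-null column is ambiguous, so the dtype has to travel alongside the data. The helpers below are a hypothetical sketch of that idea:

```python
def infer_dtype(values):
    """Toy value-based inference, standing in for what a Block does."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        # Every element is null: the values alone cannot tell us
        # whether this was meant to be list[string], int64, etc.
        return None
    if all(isinstance(v, list) for v in non_null):
        return "list"
    return "object"


def resolve_dtype(values, declared=None):
    # Carrying the declared dtype separately resolves the ambiguity
    # that pure value-based inference cannot.
    return declared if declared is not None else infer_dtype(values)


assert infer_dtype([None, None]) is None  # ambiguous on its own
assert resolve_dtype([None, None], "list[string]") == "list[string]"
assert infer_dtype([[1], None]) == "list"
```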