-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: add Series.struct accessor for ArrowDtype[struct] #54977
Conversation
66ff669
to
3996cb3
Compare
1fd8d43
to
ae4088a
Compare
00dff2c
to
adc9e26
Compare
def _validate(self, data): | ||
dtype = data.dtype | ||
if not isinstance(dtype, ArrowDtype): | ||
raise AttributeError(self._validation_msg) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO I think TypeError
would be more appropriate here.
Also could we include the incoming type in the error message like "f..., not {dtype}"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think AttributeError was likely preventing this test failure.
pandas/tests/series/test_api.py:175:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../../../micromamba/envs/test/lib/python3.9/inspect.py:351: in getmembers
value = getattr(object, key)
pandas/core/accessor.py:224: in __get__
accessor_obj = self._accessor(obj)
pandas/core/arrays/arrow/accessors.py:38: in __init__
self._validate(data)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <pandas.core.arrays.arrow.accessors.StructAccessor object at 0x7ff513c9cbe0>
data = Series([], dtype: object)
def _validate(self, data):
dtype = data.dtype
if not isinstance(dtype, ArrowDtype):
> raise TypeError(self._validation_msg.format(dtype=dtype))
E TypeError: Can only use the '.struct' accessor with 'struct[pyarrow]' dtype, not object.
pandas/core/arrays/arrow/accessors.py:43: TypeError
b19ab49
to
2f64ded
Compare
@mroeschke Thanks for the feedback! I believe I've now addressed all of your comments. |
e3dd736
to
bbf9c72
Compare
doc/source/whatsnew/v2.2.0.rst
Outdated
:meth:`Series.struct.explode` converts PyArrow structured data to a pandas | ||
DataFrame. (:issue:`54938`) | ||
|
||
.. code-block:: ipython |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you use a .. ipython:: python
directive here instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. Done. While I was at it, I updated the sample to be more meaningful, per https://pandas.pydata.org/docs/development/contributing_docstring.html#conventions-for-the-examples
This example is taken from the "file" struct schema in https://packaging.python.org/en/latest/guides/analyzing-pypi-package-downloads/#data-schema
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some small comments otherwise looks good!
Features: * Series.struct.dtypes -- see dtypes and field names * Series.struct.field(name_or_index) -- extract a field as a Series * Series.struct.explode() -- convert all fields into a DataFrame
1524dfb
to
d47b6f9
Compare
names = [struct.name for struct in pa_type] | ||
return Series(types, index=Index(names)) | ||
|
||
def field(self, name_or_index: str | int) -> Series: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see that https://arrow.apache.org/docs/python/generated/pyarrow.compute.struct_field.html says you can do things like struct_field(array, indices=['a', 'b'])
to do nested lookups like a.b
. Any reason to disallow that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good idea. Extracting the name to use for the Series with such inputs is turning out to be non-trivial though. Perhaps best left as a follow-up enhancement?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A follow-up makes sense to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Will keep open for a few day just in case there's other feedback
@mroeschke OK to merge now? |
Nice thanks @tswast |
Features:
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.Closes #54938