ENH: add Series.struct accessor for ArrowDtype[struct] #54977

Merged
mroeschke merged 6 commits into pandas-dev:main from issue54938-struct-accessor on Sep 18, 2023

Conversation

tswast
Contributor

@tswast tswast commented Sep 3, 2023

Features:

  • Series.struct.dtypes -- see dtypes and field names
  • Series.struct.field(name_or_index) -- extract a field as a Series
  • Series.struct.to_frame() -- convert all fields into a DataFrame

Closes #54938

@tswast tswast force-pushed the issue54938-struct-accessor branch from 66ff669 to 3996cb3 Compare September 4, 2023 03:10
@tswast tswast marked this pull request as ready for review September 4, 2023 03:10
@tswast tswast force-pushed the issue54938-struct-accessor branch 2 times, most recently from 1fd8d43 to ae4088a Compare September 4, 2023 12:16
@tswast tswast force-pushed the issue54938-struct-accessor branch 2 times, most recently from 00dff2c to adc9e26 Compare September 5, 2023 11:18
@tswast tswast added the Enhancement, Arrow (pyarrow functionality), and Accessors (accessor registration mechanism; not .str, .dt, .cat) labels Sep 6, 2023
@tswast tswast added this to the 2.2 milestone Sep 6, 2023
    def _validate(self, data):
        dtype = data.dtype
        if not isinstance(dtype, ArrowDtype):
            raise AttributeError(self._validation_msg)
Member

IMO TypeError would be more appropriate here.

Also, could we include the incoming type in the error message, like f"..., not {dtype}"?

Contributor Author

I think raising AttributeError was likely what prevented this test failure:

pandas/tests/series/test_api.py:175: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
../../../micromamba/envs/test/lib/python3.9/inspect.py:351: in getmembers
    value = getattr(object, key)
pandas/core/accessor.py:224: in __get__
    accessor_obj = self._accessor(obj)
pandas/core/arrays/arrow/accessors.py:38: in __init__
    self._validate(data)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <pandas.core.arrays.arrow.accessors.StructAccessor object at 0x7ff513c9cbe0>
data = Series([], dtype: object)

    def _validate(self, data):
        dtype = data.dtype
        if not isinstance(dtype, ArrowDtype):
>           raise TypeError(self._validation_msg.format(dtype=dtype))
E           TypeError: Can only use the '.struct' accessor with 'struct[pyarrow]' dtype, not object.

pandas/core/arrays/arrow/accessors.py:43: TypeError
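
The traceback above can be reproduced in miniature without pandas. This is only a sketch: `ValidatingAccessor`, `RaisesAttributeError`, and `RaisesTypeError` are hypothetical stand-ins, not pandas classes. It relies on the fact that `inspect.getmembers` catches `AttributeError` raised by `getattr` but lets other exception types propagate.

```python
import inspect


class ValidatingAccessor:
    """Hypothetical stand-in for an accessor descriptor that validates on access."""

    def __init__(self, exc_type):
        self._exc_type = exc_type

    def __get__(self, obj, owner=None):
        if obj is None:
            return self  # class-level access returns the descriptor itself
        # Mimic a failed _validate call, raising the configured exception type.
        raise self._exc_type(
            "Can only use the '.struct' accessor with 'struct[pyarrow]' dtype"
        )


class RaisesAttributeError:
    struct = ValidatingAccessor(AttributeError)


class RaisesTypeError:
    struct = ValidatingAccessor(TypeError)


# getmembers catches the AttributeError from getattr, so this call completes.
members = inspect.getmembers(RaisesAttributeError())

# A TypeError is not caught by getmembers, so it reaches the caller.
try:
    inspect.getmembers(RaisesTypeError())
    type_error_propagated = False
except TypeError:
    type_error_propagated = True
```

This suggests why a test that walks the members of a Series with `inspect.getmembers` passes when validation raises AttributeError and fails when it raises TypeError.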

@tswast tswast force-pushed the issue54938-struct-accessor branch from b19ab49 to 2f64ded Compare September 8, 2023 23:26
@tswast tswast requested a review from mroeschke September 8, 2023 23:26
@tswast
Contributor Author

tswast commented Sep 8, 2023

@mroeschke Thanks for the feedback! I believe I've now addressed all of your comments.

@tswast tswast force-pushed the issue54938-struct-accessor branch from e3dd736 to bbf9c72 Compare September 9, 2023 02:41
:meth:`Series.struct.explode` converts PyArrow structured data to a pandas
DataFrame. (:issue:`54938`)

.. code-block:: ipython
Member

Could you use a .. ipython:: python directive here instead?

Contributor Author

Good catch. Done. While I was at it, I updated the sample to be more meaningful, per https://pandas.pydata.org/docs/development/contributing_docstring.html#conventions-for-the-examples

This example is taken from the "file" struct schema in https://packaging.python.org/en/latest/guides/analyzing-pypi-package-downloads/#data-schema

Member

@mroeschke mroeschke left a comment

Some small comments; otherwise looks good!

Features:

* Series.struct.dtypes -- see dtypes and field names
* Series.struct.field(name_or_index) -- extract a field as a Series
* Series.struct.explode() -- convert all fields into a DataFrame
@tswast tswast force-pushed the issue54938-struct-accessor branch from 1524dfb to d47b6f9 Compare September 11, 2023 19:47
@tswast tswast requested a review from mroeschke September 11, 2023 19:48
        names = [struct.name for struct in pa_type]
        return Series(types, index=Index(names))

    def field(self, name_or_index: str | int) -> Series:
Contributor

I see that https://arrow.apache.org/docs/python/generated/pyarrow.compute.struct_field.html says you can do things like struct_field(array, indices=['a', 'b']) to do nested lookups like a.b. Any reason to disallow that?

Contributor Author

That's a good idea. Extracting the name to use for the Series with such inputs is turning out to be non-trivial though. Perhaps best left as a follow-up enhancement?

Contributor

A follow-up makes sense to me.

Member

@mroeschke mroeschke left a comment

LGTM. Will keep open for a few days just in case there's other feedback.

@TomAugspurger
Contributor

@mroeschke OK to merge now?

@mroeschke mroeschke merged commit 36aa531 into pandas-dev:main Sep 18, 2023
32 of 33 checks passed
@mroeschke
Member

Nice, thanks @tswast

@tswast tswast deleted the issue54938-struct-accessor branch September 18, 2023 16:50
hedeershowk pushed a commit to hedeershowk/pandas that referenced this pull request Sep 20, 2023
Development

Successfully merging this pull request may close these issues.

ENH: Series.struct accessor with Series.struct.field("sub-column name") for ArrowDtype
3 participants