ENH: Series.struct accessor with Series.struct.field("sub-column name") for ArrowDtype #54938

Closed
1 of 3 tasks
tswast opened this issue Sep 1, 2023 · 2 comments · Fixed by #54977
Labels: Accessors (accessor registration mechanism, not .str, .dt, .cat), Arrow (pyarrow functionality), Enhancement


tswast commented Sep 1, 2023

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

When I have a Series whose dtype is ArrowDtype(struct(...)), I'd like to be able to extract sub-fields from it.

For example, I have a pandas Series with dtype ArrowDtype(pyarrow.struct([("int_col", pyarrow.int64()), ("string_col", pyarrow.string())])). I'd like to extract just the int_col field from this Series as another Series.

Feature Description

Add a struct accessor, available on any Series whose dtype is ArrowDtype(struct(...)). The accessor provides a field() method that returns a Series containing only the specified sub-field.

series = pandas.Series(struct_array, dtype=pandas.ArrowDtype(struct_type))

int_series = series.struct.field("int_col")

Alternative Solutions

I can currently do this via pyarrow.compute.struct_field on the underlying pyarrow array:

import pandas
import pyarrow
import pyarrow.compute  # import the submodule explicitly; struct_field lives here

struct_type = pyarrow.struct([
    ("int_col", pyarrow.int64()),
    ("string_col", pyarrow.string()),
])
struct_array = pyarrow.array([
    {"int_col": 1, "string_col": "a"},
    {"int_col": 2, "string_col": "b"},
    {"int_col": 3, "string_col": "c"},
], type=struct_type)

series = pandas.Series(struct_array, dtype=pandas.ArrowDtype(struct_type))

# Look up the field's position, extract it with pyarrow, then wrap it in a new Series.
int_col_index = struct_array.type.get_field_index("int_col")
int_col_series = pandas.Series(
    pyarrow.compute.struct_field(struct_array, [int_col_index]),
    dtype=pandas.ArrowDtype(struct_array.type[int_col_index].type))

Additional Context

This issue is particularly relevant when working with data sources that support struct fields, such as BigQuery or Parquet.

tswast added the Enhancement and Needs Triage labels on Sep 1, 2023
rhshadrach added the Arrow label on Sep 1, 2023
phofl added the Accessors label and removed the Needs Triage label on Sep 2, 2023

tswast commented Sep 2, 2023

Another potentially useful method would be Series.struct.to_frame() to create a DataFrame with all sub-fields as columns.


tswast commented Sep 3, 2023

It still needs some tests, docstrings, examples, and a whatsnew entry, but I'm quite pleased with the rapid progress in draft PR #54977, if anyone wants to take a peek before it's ready.

It's been fun playing with it in some manual testing:

In [1]: import pyarrow
   ...: struct_type = pyarrow.struct([
   ...:     ("int_col", pyarrow.int64()),
   ...:     ("string_col", pyarrow.string()),
   ...: ])
   ...: struct_array = pyarrow.array([
   ...:     {"int_col": 1, "string_col": "a"},
   ...:     {"int_col": 2, "string_col": "b"},
   ...:     {"int_col": 3, "string_col": "c"},
   ...: ], type=struct_type)
   ...: 
   ...: import pandas
   ...: series = pandas.Series(struct_array, dtype=pandas.ArrowDtype(struct_type))

In [3]: series.struct.dtypes
Out[3]: 
int_col        int64[pyarrow]
string_col    string[pyarrow]
dtype: object

In [4]: series.struct.field("string_col")
Out[4]: 
0    a
1    b
2    c
Name: string_col, dtype: string[pyarrow]

In [5]: series.struct.to_frame()
Out[5]: 
   int_col string_col
0        1          a
1        2          b
2        3          c
