-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: Aggregation on arrow array return same type #53717
Conversation
result = df.groupby("category").agg(lambda x: x.sum() / x.count()) | ||
expected = DataFrame( | ||
[[0.5, 0.5], [0.5, 0.5]], | ||
columns=["bool_numpy", "bool_arrow"], | ||
index=Index(["A", "B"], name="category"), | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't the bool_arrow result here be double[pyarrow]
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, it is float64. The result DataFrame's bool_arrow is also float64. So in age_series function, if preserve_type is False, the pyarrow result would be same as numpy result.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm asking what it should be. If I add a collection of bool[pyarrow]
, I expect to get int64[pyarrow]
. Likewise, if I divide two int64[pyarrow]
, I expect to get double[pyarrow]
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, yeah. It should be double[pyarrow]. So we need to cast the numpy result to pyarrow types. I am thinking the current cast function "maybe_cast_pointwise_result" need to cast numpy result array to corresponding pyarrow array, not original dtype.
73ded8b
to
6028670
Compare
pandas/core/groupby/ops.py
Outdated
if ( | ||
len(obj) > 0 | ||
and not isinstance(obj._values, np.ndarray) | ||
and not isinstance(obj._values, ArrowExtensionArray) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this a great long term solution. I think this needs a more general solution so that this just doesn't apply to agg_series
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, a general solution should be in cast function to handle the pyarrow situation. I will update the pr later.
d979c10
to
aef1307
Compare
would #53089 address this? in general we want to get logic out of maybe_cast_pointwise_result |
@jbrockmendel , I have tried pr53089. It can't fix issue 53030. But I can try to work on pr53089 to fix issue 53030. |
|
||
|
||
@pytest.mark.skipif( | ||
not typing.TYPE_CHECKING, reason="TYPE_CHECKING must be True to import pyarrow" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the reason here doesn't seem right to me. are you sure?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have updated it to make the reason more clear. Please review it.
2450800
to
5965018
Compare
cls = pyarrow_type.construct_array_type() | ||
result = _maybe_cast_to_extension_array(cls, result) | ||
else: | ||
cls = dtype.construct_array_type() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what cases get here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The convert_dtypes(result, dtype_backend="pyarrow") function sometimes will return a np.dtype. So I used current codes to handle this situation. Let me do more investigation to get root cause and better solution.
|
||
|
||
@pytest.mark.skipif( | ||
not typing.TYPE_CHECKING, reason="let pyarrow to be imported in dtypes.py" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i dont understand the reason at all. what breaks if you remove this "skipif" entirely?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In some environment, this test case will fail. The error log is as below:
pandas/tests/groupby/aggregate/test_aggregate.py:1619:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pandas/core/generic.py:6487: in astype
elif is_extension_array_dtype(dtype) and self.ndim > 1:
pandas/core/dtypes/common.py:1319: in is_extension_array_dtype
return registry.find(dtype) is not None
pandas/core/dtypes/base.py:522: in find
return dtype_type.construct_from_string(dtype)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
cls = <class 'pandas.core.dtypes.dtypes.ArrowDtype'>, string = 'bool[pyarrow]'
@classmethod
def construct_from_string(cls, string: str) -> ArrowDtype:
"""
Construct this type from a string.
Parameters
----------
string : str
string should follow the format f"{pyarrow_type}[pyarrow]"
e.g. int64[pyarrow]
"""
if not isinstance(string, str):
raise TypeError(
f"'construct_from_string' expects a string, got {type(string)}"
)
if not string.endswith("[pyarrow]"):
raise TypeError(f"'{string}' must end with '[pyarrow]'")
if string == "string[pyarrow]":
# Ensure Registry.find skips ArrowDtype to use StringDtype instead
raise TypeError("string[pyarrow] should be constructed by StringDtype")
base_type = string[:-9] # get rid of "[pyarrow]"
try:
> pa_dtype = pa.type_for_alias(base_type)
E NameError: name 'pa' is not defined
So I added skipif to check if TYPE_CHECKING is true then pyarrow will be imported in core/dtypes/dtypes.py.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the decorator you want for this test is @skip_if_no_pyarrow
466c502
to
d1a8378
Compare
@jbrockmendel , I have done some debug, found something about convert_dtypes function.
At this time, inferred_dtype is object dtype. |
what if you do infer_objects() followed by convert_dtypes? |
I have tried using infer_objects with following codes:
It doesn't work. The dtype of result still can be object after infer_objects. |
Signed-off-by: Liang Yan <[email protected]>
Signed-off-by: Liang Yan <[email protected]>
Signed-off-by: Liang Yan <[email protected]>
Signed-off-by: Liang Yan <[email protected]>
Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen. |
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.