Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Aggregation on arrow array return same type #53717

Closed
wants to merge 4 commits into from

Conversation

liang3zy22
Copy link
Contributor

Comment on lines 1616 to 1651
result = df.groupby("category").agg(lambda x: x.sum() / x.count())
expected = DataFrame(
[[0.5, 0.5], [0.5, 0.5]],
columns=["bool_numpy", "bool_arrow"],
index=Index(["A", "B"], name="category"),
)
Copy link
Member

@rhshadrach rhshadrach Jun 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't the bool_arrow result here be double[pyarrow]?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it is float64. The result DataFrame's bool_arrow is also float64. So in age_series function, if preserve_type is False, the pyarrow result would be same as numpy result.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm asking what it should be. If I add a collection of bool[pyarrow], I expect to get int64[pyarrow]. Likewise, if I divide two int64[pyarrow], I expect to get double[pyarrow].

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, yeah. It should be double[pyarrow]. So we need to cast the numpy result to pyarrow types. I am thinking the current cast function "maybe_cast_pointwise_result" need to cast numpy result array to corresponding pyarrow array, not original dtype.

@liang3zy22 liang3zy22 force-pushed the gh53030 branch 2 times, most recently from 73ded8b to 6028670 Compare June 20, 2023 02:13
if (
len(obj) > 0
and not isinstance(obj._values, np.ndarray)
and not isinstance(obj._values, ArrowExtensionArray)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this a great long term solution. I think this needs a more general solution so that this just doesn't apply to agg_series

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, a general solution should be in cast function to handle the pyarrow situation. I will update the pr later.

@liang3zy22 liang3zy22 force-pushed the gh53030 branch 4 times, most recently from d979c10 to aef1307 Compare June 22, 2023 00:02
@jbrockmendel
Copy link
Member

would #53089 address this? in general we want to get logic out of maybe_cast_pointwise_result

@liang3zy22
Copy link
Contributor Author

would #53089 address this? in general we want to get logic out of maybe_cast_pointwise_result

@jbrockmendel , I have tried pr53089. It can't fix issue 53030. But I can try to work on pr53089 to fix issue 53030.



@pytest.mark.skipif(
not typing.TYPE_CHECKING, reason="TYPE_CHECKING must be True to import pyarrow"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the reason here doesn't seem right to me. are you sure?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have updated it to make the reason more clear. Please review it.

@liang3zy22 liang3zy22 force-pushed the gh53030 branch 2 times, most recently from 2450800 to 5965018 Compare July 1, 2023 07:53
cls = pyarrow_type.construct_array_type()
result = _maybe_cast_to_extension_array(cls, result)
else:
cls = dtype.construct_array_type()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what cases get here?

Copy link
Contributor Author

@liang3zy22 liang3zy22 Jul 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The convert_dtypes(result, dtype_backend="pyarrow") function sometimes will return a np.dtype. So I used current codes to handle this situation. Let me do more investigation to get root cause and better solution.



@pytest.mark.skipif(
not typing.TYPE_CHECKING, reason="let pyarrow to be imported in dtypes.py"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i dont understand the reason at all. what breaks if you remove this "skipif" entirely?

Copy link
Contributor Author

@liang3zy22 liang3zy22 Jul 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In some environment, this test case will fail. The error log is as below:


pandas/tests/groupby/aggregate/test_aggregate.py:1619: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
pandas/core/generic.py:6487: in astype
    elif is_extension_array_dtype(dtype) and self.ndim > 1:
pandas/core/dtypes/common.py:1319: in is_extension_array_dtype
    return registry.find(dtype) is not None
pandas/core/dtypes/base.py:522: in find
    return dtype_type.construct_from_string(dtype)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

cls = <class 'pandas.core.dtypes.dtypes.ArrowDtype'>, string = 'bool[pyarrow]'

    @classmethod
    def construct_from_string(cls, string: str) -> ArrowDtype:
        """
        Construct this type from a string.
    
        Parameters
        ----------
        string : str
            string should follow the format f"{pyarrow_type}[pyarrow]"
            e.g. int64[pyarrow]
        """
        if not isinstance(string, str):
            raise TypeError(
                f"'construct_from_string' expects a string, got {type(string)}"
            )
        if not string.endswith("[pyarrow]"):
            raise TypeError(f"'{string}' must end with '[pyarrow]'")
        if string == "string[pyarrow]":
            # Ensure Registry.find skips ArrowDtype to use StringDtype instead
            raise TypeError("string[pyarrow] should be constructed by StringDtype")
    
        base_type = string[:-9]  # get rid of "[pyarrow]"
        try:
>           pa_dtype = pa.type_for_alias(base_type)
E           NameError: name 'pa' is not defined

So I added skipif to check if TYPE_CHECKING is true then pyarrow will be imported in core/dtypes/dtypes.py.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the decorator you want for this test is @skip_if_no_pyarrow

@liang3zy22 liang3zy22 force-pushed the gh53030 branch 2 times, most recently from 466c502 to d1a8378 Compare July 9, 2023 09:49
@liang3zy22
Copy link
Contributor Author

@jbrockmendel , I have done some debug, found something about convert_dtypes function.
The convert_dtypes function's first parameter input_array could be np.ndarray with object dtype. Then the line:
inferred_dtype = lib.infer_dtype(input_array)
will be executed. Since lib.infer_dtype returns a str, then inferred_dtype also is str. So following lines will be executed:

if isinstance(inferred_dtype, str):
    # If we couldn't do anything else, then we retain the dtype
    inferred_dtype = input_array.dtype

At this time, inferred_dtype is object dtype.
In the 'pyarrow' backend codes part, all if condition failed. So this line is executed:
base_dtype = inferred_dtype
base_dtype is object dtype also. Then after pa_type = to_pyarrow_type(base_dtype), pa_type is None since base_dtype is object.
So in this situation, convert_dtypes will return object dtype as long as input_array is np.ndarray with object dtype.

@jbrockmendel
Copy link
Member

what if you do infer_objects() followed by convert_dtypes?

@liang3zy22 liang3zy22 changed the title BUG: Aggregation on arrow array return same type. BUG: Aggregation on arrow array return same type Jul 13, 2023
@liang3zy22
Copy link
Contributor Author

I have tried using infer_objects with following codes:

            if result.dtype == object:
                df = pandas.DataFrame(result)
                result = df.infer_objects().values
            pyarrow_type = convert_dtypes(result, dtype_backend="pyarrow")

It doesn't work. The dtype of result still can be object after infer_objects.

@gfyoung gfyoung added the Apply Apply, Aggregate, Transform, Map label Aug 17, 2023
@jbrockmendel jbrockmendel added the pyarrow dtype retention op with pyarrow dtype -> expect pyarrow result label Aug 31, 2023
@mroeschke
Copy link
Member

Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen.

@mroeschke mroeschke closed this Oct 22, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Apply Apply, Aggregate, Transform, Map Groupby pyarrow dtype retention op with pyarrow dtype -> expect pyarrow result
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: Groupby-aggregate on a boolean column returns a different datatype with pyarrow than with numpy
5 participants