Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: EA._from_scalars #53089

Merged
merged 18 commits into from
Oct 16, 2023
Merged

Conversation

jbrockmendel
Copy link
Member

@jbrockmendel jbrockmendel commented May 4, 2023

xref #33254 _from_sequence is generally not very strict. This implements _from_scalars for a few EAs as a POC to get rid of an ugly kluge in maybe_cast_pointwise_result. The changed behavior in test_resample_categorical_data_with_timedeltaindex looks like a clear improvement to me.

cc @jorisvandenbossche

Things to work out before this leaves the "WIP" zone:

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

decide whether dtype should be required.

The reason to not do this would be that when starting with an int64[pyarrow], you could end up with float64[pyarrow]?
However, given that this is essentially a full type system (all possible types are available in ArrowDtype), that would essentially make this non-strict for those?

(so I would maybe start with requiring a dtype)

There are also some corner cases listed in my original PR: #38315 (comment) (need to read through that again myself as well)

Comment on lines 488 to 489
if hasattr(cls, "_from_scalars"):
# TODO: get this everywhere!
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can probably add a simple _from_scalars in the base class that just calls _from_sequence. We will need some fallback like that for external EAs anyway (unless we keep this hasattr check, but since we fallback to _from_sequence below anyway, can better do this in _from_scalars I think)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed this makes sense.

ATM _from_sequence can raise anything, so we catch anything, which isn't an ideal pattern. Thoughts on documenting _from_scalars as only raising ValueError/TypeError and eventually enforcing that?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thoughts on documenting _from_scalars as only raising ValueError/TypeError and eventually enforcing that?

Yes, I think that's certainly fine

@github-actions
Copy link
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@liang3zy22
Copy link
Contributor

liang3zy22 commented Jun 29, 2023

@jbrockmendel , I have done some investigation based on your PR. I have found one issue:
For pyarrow type, preserve_dtype would be True in agg_series. Then maybe_cast_pointwise_result should be called.
The maybe_cast_pointwise_result function have one default parameter, same_type to be True. So the result would be cast to the original pyarrow types. It caused issue 53030. I have changed the same_type parameter to be False by default. The issue 53030 is fixed, but some pytest test cases failed.
So in what situation should cast the result to original pyarrow types? And in what situation should cast the result to the corresponding pyarrow types?

@jbrockmendel
Copy link
Member Author

@jorisvandenbossche i think the last round of comments and OP issues have been addressed. any other thoughts?

@jbrockmendel jbrockmendel changed the title WIP/ENH: EA._from_scalars ENH: EA._from_scalars Jul 18, 2023
@@ -280,6 +283,38 @@ def _from_sequence(cls, scalars, *, dtype: Dtype | None = None, copy: bool = Fal
"""
raise AbstractMethodError(cls)

@classmethod
def _from_scalars(cls, scalars, *, dtype: DtypeObj) -> Self:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking aloud. Once we allow EA authors to specify their "is_scalar" passing equivalent, should that be incorporated here somehow?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Assuming at some point we add to the EADtype something like a _is_recognized_scalar method, the default implementation of that method might look like (assuming away NAs for the moment)

def _is_recognized_scalar(self, scalar) -> bool:
    cls = self.construct_array_class()
    try:
        cls._from_scalars([scalar], dtype=self)
    except TypeError:
        return False
    return True

We could alternately implement a default from_scalars in terms of _is_recognized_scalar. _find_compatible_dtype from #53106 could also be the primitive from which the other two are constructed (though i dont think it can be constructed from either).

@jbrockmendel jbrockmendel added Enhancement ExtensionArray Extending pandas with custom dtypes or arrays. and removed Stale labels Aug 3, 2023
@jbrockmendel
Copy link
Member Author

If there are no further comments, I'm planning to merge this at the end of the week.

@jbrockmendel jbrockmendel merged commit 746e5ee into pandas-dev:main Oct 16, 2023
33 checks passed
@jbrockmendel jbrockmendel deleted the enh-from_scalars branch October 16, 2023 18:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement ExtensionArray Extending pandas with custom dtypes or arrays.
Projects
None yet
4 participants