ENH: EA._from_scalars #53089

jbrockmendel · 2023-05-04T23:24:00Z

closes API: EA interface - strictness of _from_sequence #33254
closes API: inconsistency in casting back to its EA dtype (try_cast_to_ea) #31108 (someone double-check me on this)

xref #33254 _from_sequence is generally not very strict. This implements _from_scalars for a few EAs as a POC to get rid of an ugly kluge in maybe_cast_pointwise_result. The changed behavior in test_resample_categorical_data_with_timedeltaindex looks like a clear improvement to me.
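
To make the strictness distinction concrete, here is a minimal toy sketch (not pandas code). `ToyIntArray` and both of its constructors are hypothetical stand-ins: the lenient one coerces whatever it can, while the strict one accepts only scalars that already belong to the dtype.

```python
class ToyIntArray:
    """Hypothetical stand-in for an integer ExtensionArray (not a pandas class)."""

    def __init__(self, values):
        self.values = list(values)

    @classmethod
    def _from_sequence(cls, scalars):
        # Lenient: coerce anything int() accepts, e.g. "1" or 2.0.
        return cls(int(x) for x in scalars)

    @classmethod
    def _from_scalars(cls, scalars):
        # Strict: every scalar must already be an int (bool excluded).
        scalars = list(scalars)
        for x in scalars:
            if isinstance(x, bool) or not isinstance(x, int):
                raise TypeError(f"{x!r} is not a valid scalar for ToyIntArray")
        return cls(scalars)


lenient = ToyIntArray._from_sequence(["1", 2.0, 3])
print(lenient.values)  # [1, 2, 3]

try:
    ToyIntArray._from_scalars(["1", 2.0, 3])
except TypeError as err:
    print("rejected:", err)
```

A pointwise-cast helper that wants to know whether results "really are" of a given dtype needs the second behavior, which is presumably why the lenient constructor forced a workaround.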

cc @jorisvandenbossche

Things to work out before this leaves the "WIP" zone:

implement/document it on the base class, none of this "hasattr" shenanigans
decide whether dtype should be required. Not specifying dtype has different implications for different EA subclasses, which is not ideal. I think part of the motivation is to be able to preserve pyarrow/masked "flavors" when doing pointwise ops.
see if this fixes BUG: groupby.agg with udf incorrectly inferring pyarrow timestamp dtype #49163 and BUG: UDFs with apply returning nanoseconds when compared to native binops that return seconds #52411. Update: nope!

jorisvandenbossche

> decide whether dtype should be required.

The reason to not do this would be that when starting with an int64[pyarrow], you could end up with float64[pyarrow]?
However, given that this is essentially a full type system (all possible types are available in ArrowDtype), that would essentially make this non-strict for those?

(so I would maybe start with requiring a dtype)

There are also some corner cases listed in my original PR: #38315 (comment) (need to read through that again myself as well)
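
A toy sketch of the trade-off raised here, with dtype names as plain strings and hypothetical helpers `infer_arrow_like_dtype` and `from_scalars` (nothing below is pandas or pyarrow API). When `dtype` is optional and the family covers every type, inference always succeeds, so nothing is ever rejected; requiring `dtype` is what makes the check strict.

```python
def infer_arrow_like_dtype(scalars):
    # Hypothetical inference within a "full" type family: always finds something.
    if all(isinstance(x, int) and not isinstance(x, bool) for x in scalars):
        return "int64[pyarrow]"
    if all(isinstance(x, (int, float)) and not isinstance(x, bool) for x in scalars):
        return "float64[pyarrow]"
    return "string[pyarrow]"


def from_scalars(scalars, dtype=None):
    inferred = infer_arrow_like_dtype(scalars)
    if dtype is None:
        # dtype optional: the "flavor" is preserved, but nothing is rejected.
        return inferred, list(scalars)
    if inferred != dtype:
        # dtype required: scalars that do not match are refused.
        raise TypeError(f"scalars look like {inferred}, not {dtype}")
    return dtype, list(scalars)


means = [1.5, 2.5]  # e.g. pointwise results computed from an int64[pyarrow] column
print(from_scalars(means))  # ('float64[pyarrow]', [1.5, 2.5])

try:
    from_scalars(means, dtype="int64[pyarrow]")
except TypeError as err:
    print("rejected:", err)
```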

jorisvandenbossche · 2023-05-15T16:05:11Z

pandas/core/dtypes/cast.py

+    if hasattr(cls, "_from_scalars"):
+        # TODO: get this everywhere!


You can probably add a simple _from_scalars in the base class that just calls _from_sequence. We will need some fallback like that for external EAs anyway (unless we keep this hasattr check, but since we fall back to _from_sequence below anyway, it is better to do this in _from_scalars, I think).

jbrockmendel · 2023-05-15

Agreed this makes sense.

ATM _from_sequence can raise anything, so we catch anything, which isn't an ideal pattern. Thoughts on documenting _from_scalars as only raising ValueError/TypeError and eventually enforcing that?

jorisvandenbossche · 2023-05-16

> Thoughts on documenting _from_scalars as only raising ValueError/TypeError and eventually enforcing that?

Yes, I think that's certainly fine
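
The two points agreed on in this thread can be sketched together. This is a toy illustration, not the code in the PR: `ToyBaseArray`, `ToyUpperArray`, and `cast_pointwise_result` are hypothetical names. The base class gets a default `_from_scalars` that delegates to `_from_sequence`, so the caller needs no `hasattr` check and catches only `ValueError`/`TypeError`.

```python
class ToyBaseArray:
    """Hypothetical base class standing in for ExtensionArray."""

    def __init__(self, values):
        self.values = list(values)

    @classmethod
    def _from_sequence(cls, scalars, *, dtype=None):
        raise NotImplementedError

    @classmethod
    def _from_scalars(cls, scalars, *, dtype):
        # Default fallback: third-party arrays that only define
        # _from_sequence keep working without any changes.
        return cls._from_sequence(scalars, dtype=dtype)


class ToyUpperArray(ToyBaseArray):
    """Hypothetical external array that predates _from_scalars."""

    @classmethod
    def _from_sequence(cls, scalars, *, dtype=None):
        # str.upper raises TypeError when called with a non-str argument.
        return cls(str.upper(x) for x in scalars)


def cast_pointwise_result(result, cls, dtype):
    # No hasattr check: every array class has _from_scalars.
    try:
        return cls._from_scalars(result, dtype=dtype)
    except (ValueError, TypeError):
        # Documented failure modes only; anything else propagates.
        return list(result)


ok = cast_pointwise_result(["a", "b"], ToyUpperArray, dtype="upper")
print(ok.values)  # ['A', 'B']
print(cast_pointwise_result([1, 2], ToyUpperArray, dtype="upper"))  # [1, 2]
```

Narrowing the `except` clause means an unexpected error inside a constructor surfaces instead of being silently swallowed by the fallback path.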

github-actions · 2023-06-16T00:05:36Z

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

liang3zy22 · 2023-06-29T02:54:09Z

@jbrockmendel, I did some investigation based on your PR and found one issue:
For pyarrow types, preserve_dtype is True in agg_series, so maybe_cast_pointwise_result gets called.
maybe_cast_pointwise_result has a parameter, same_type, that defaults to True, so the result is cast back to the original pyarrow type. That caused issue 53030. Changing the default of same_type to False fixes issue 53030, but some pytest test cases then fail.
So in which situations should the result be cast to the original pyarrow type, and in which should it be cast to the corresponding pyarrow type?

jbrockmendel · 2023-07-11T17:27:06Z

@jorisvandenbossche i think the last round of comments and OP issues have been addressed. any other thoughts?

mroeschke · 2023-07-31T22:22:47Z

pandas/core/arrays/base.py

@@ -280,6 +283,38 @@ def _from_sequence(cls, scalars, *, dtype: Dtype | None = None, copy: bool = Fal
        """
        raise AbstractMethodError(cls)

+    @classmethod
+    def _from_scalars(cls, scalars, *, dtype: DtypeObj) -> Self:


Thinking aloud. Once we allow EA authors to specify their "is_scalar" passing equivalent, should that be incorporated here somehow?

jbrockmendel · 2023-07-31

Yes. Assuming at some point we add to the EADtype something like a _is_recognized_scalar method, the default implementation of that method might look like this (assuming away NAs for the moment):

    def _is_recognized_scalar(self, scalar) -> bool:
        cls = self.construct_array_type()
        try:
            cls._from_scalars([scalar], dtype=self)
        except TypeError:
            return False
        return True

We could alternatively implement a default _from_scalars in terms of _is_recognized_scalar. _find_compatible_dtype from #53106 could also be the primitive from which the other two are constructed (though I don't think it can be constructed from either).
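
A runnable toy version of that default, under stated assumptions: `ToyStrictIntArray` and `ToyIntDtype` are hypothetical, NA handling is ignored as in the comment, and `_is_recognized_scalar` is the proposed name rather than an existing pandas method. This sketch also catches `ValueError`, on the assumption that `_from_scalars` is documented to raise only `ValueError`/`TypeError`.

```python
class ToyStrictIntArray:
    """Hypothetical array whose _from_scalars rejects non-int scalars."""

    def __init__(self, values):
        self.values = list(values)

    @classmethod
    def _from_scalars(cls, scalars, *, dtype):
        scalars = list(scalars)
        for x in scalars:
            if isinstance(x, bool) or not isinstance(x, int):
                raise TypeError(f"{x!r} is not valid for {dtype.name}")
        return cls(scalars)


class ToyIntDtype:
    """Hypothetical dtype paired with ToyStrictIntArray."""

    name = "toy_int"

    def construct_array_type(self):
        return ToyStrictIntArray

    def _is_recognized_scalar(self, scalar) -> bool:
        # A scalar is recognized if a length-1 strict construction succeeds.
        cls = self.construct_array_type()
        try:
            cls._from_scalars([scalar], dtype=self)
        except (TypeError, ValueError):
            return False
        return True


dtype = ToyIntDtype()
print(dtype._is_recognized_scalar(3))    # True
print(dtype._is_recognized_scalar("3"))  # False
```

Building the check on top of the strict constructor keeps a single source of truth for what counts as a valid scalar, at the cost of constructing a throwaway array per call.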

jbrockmendel · 2023-10-09T22:59:14Z

If there are no further comments, I'm planning to merge this at the end of the week.

jbrockmendel added 2 commits May 4, 2023 12:36

ENH: BaseStringArray._from_scalars

057586c

WIP: EA._from_scalars

edc8b9b

jorisvandenbossche reviewed May 15, 2023

jbrockmendel added 2 commits May 15, 2023 13:35

Merge branch 'main' into enh-from_scalars

5eb892f

ENH: implement EA._from_scalars

e9853c8

github-actions bot added the Stale label Jun 16, 2023

jbrockmendel mentioned this pull request Jun 22, 2023

BUG: Aggregation on arrow array return same type #53717 (closed)

jbrockmendel added 6 commits June 30, 2023 16:41

Merge branch 'main' into enh-from_scalars

55fa286

Merge branch 'main' into enh-from_scalars

5f94507

Fix StringDtype/CategoricalDtype combine

8c1cfce

Merge branch 'main' into enh-from_scalars

8e2a593

mypy fixup

5238141

Merge branch 'main' into enh-from_scalars

b2ab1b9

jbrockmendel changed the title from "WIP/ENH: EA._from_scalars" to "ENH: EA._from_scalars" Jul 18, 2023

jbrockmendel added 3 commits July 27, 2023 14:24

Merge branch 'main' into enh-from_scalars

90b373e

Merge branch 'main' into enh-from_scalars

3d55300

Merge branch 'main' into enh-from_scalars

7c85184

mroeschke reviewed Jul 31, 2023

Merge branch 'main' into enh-from_scalars

5993da8

jbrockmendel added the Enhancement and ExtensionArray (Extending pandas with custom dtypes or arrays) labels and removed the Stale label Aug 3, 2023

mroeschke mentioned this pull request Aug 21, 2023

Enable many tests for complex numbers #54441 (closed)

jbrockmendel added 3 commits August 29, 2023 10:00

Merge branch 'main' into enh-from_scalars

fb1335b

Merge branch 'main' into enh-from_scalars

a37de85

Merge branch 'main' into enh-from_scalars

147366b

Merge branch 'main' into enh-from_scalars

685ee12

jbrockmendel merged commit 746e5ee into pandas-dev:main Oct 16, 2023
33 checks passed

jbrockmendel deleted the enh-from_scalars branch October 16, 2023 18:50
