-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: Move factorizing for merge to EA interface #54975
Conversation
pandas/core/arrays/datetimelike.py
Outdated
@@ -1703,6 +1704,15 @@ def _groupby_op( | |||
res_values = res_values.view(self._ndarray.dtype) | |||
return self._from_backing_data(res_values) | |||
|
|||
def _factorize_with_other( | |||
self, | |||
other: DatetimeLikeArrayMixin, # type: ignore[override] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is other Self here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh yes, that's convenient
pandas/core/arrays/base.py
Outdated
) -> tuple[npt.NDArray[np.intp], npt.NDArray[np.intp], int]: | ||
lk, _ = self._values_for_factorize() | ||
rk, _ = other._values_for_factorize() | ||
return factorize_with_rizer(lk, rk, sort) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rizer is an awful name. can we improve that while we're here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
renamed to factorizer and renamed the method to factorize_arrays
@@ -2399,133 +2374,27 @@ def _factorize_keys( | |||
if ( | |||
isinstance(lk.dtype, DatetimeTZDtype) and isinstance(rk.dtype, DatetimeTZDtype) | |||
) or (lib.is_np_dtype(lk.dtype, "M") and lib.is_np_dtype(rk.dtype, "M")): | |||
# Extract the ndarray (UTC-localized) values | |||
# Note: we dont need the dtypes to match, as these can still be compared | |||
lk, rk = cast("DatetimeArray", lk)._ensure_matching_resos(rk) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should the ensure_matching_resos be done in the EA method similar to how the Categorical one does encode_with_whatever?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i see, never mind
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we have tests with timedelta64 and mismatched resos?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't know
pandas/core/arrays/categorical.py
Outdated
def _factorize_with_other( | ||
self, other: Categorical, sort: bool = False # type: ignore[override] | ||
) -> tuple[npt.NDArray[np.intp], npt.NDArray[np.intp], int]: | ||
other = self._encode_with_my_categories(other) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think that factorize_with_other assumes matching dtypes. if correct, comment/docstring/assert to that effect?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes that's correct, have to add docs in general if we want to move forward here
Generally looks good, mostly bikeshedding names |
# Conflicts: # pandas/core/reshape/merge.py
pandas/core/arrays/arrow/array.py
Outdated
) | ||
length = len(dc.dictionary) | ||
|
||
llab, rlab, count = ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitpick: can this be split into three lines. it takes effort to see which lines correspond to which output
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suer
a handful of nitpicks, otherwise im very happy to see this refactor |
This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this. |
# Conflicts: # doc/source/reference/extensions.rst # pandas/core/reshape/merge.py
Sorry this fell off my radar, updated |
@MarcoGorelli is there a way to shut up the build about the missing example? There are a bunch of ea methods without examples... |
+1. Many of the examples are not particularly useful (I know because I wrote them to get the CI to shut up) |
ci/code_checks.sh
Outdated
@@ -1,3 +1,6 @@ | |||
|
|||
|
|||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
whitespace here intentional?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no...
thx
yeah I'd expect pasting into pandas/scripts/validate_docstrings.py Lines 44 to 58 in 13e034a
to work |
# Conflicts: # pandas/core/reshape/merge.py
Looks like this has gone stale, so I'm bumping off the milestone |
Looks close to merge but has gone stale. Going to close to clear the queue but feel free to reopen when you have time to circle back |
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.cc @jbrockmendel thoughts here? This would still need docs obviously