ENH: Move factorizing for merge to EA interface #54975

Closed
wants to merge 19 commits into from

phofl
Member

@phofl phofl commented Sep 2, 2023

  • closes #xxxx (Replace xxxx with the GitHub issue number)
  • Tests added and passed if fixing a bug or adding a new feature
  • All code checks passed.
  • Added type annotations to new arguments/methods/functions.
  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

cc @jbrockmendel thoughts here? This would still need docs obviously

@mroeschke mroeschke added Reshaping Concat, Merge/Join, Stack/Unstack, Explode ExtensionArray Extending pandas with custom dtypes or arrays. labels Sep 5, 2023
@@ -1703,6 +1704,15 @@ def _groupby_op(
        res_values = res_values.view(self._ndarray.dtype)
        return self._from_backing_data(res_values)

    def _factorize_with_other(
        self,
        other: DatetimeLikeArrayMixin,  # type: ignore[override]
Member

is other Self here?

Member Author

Oh yes, that's convenient

    ) -> tuple[npt.NDArray[np.intp], npt.NDArray[np.intp], int]:
        lk, _ = self._values_for_factorize()
        rk, _ = other._values_for_factorize()
        return factorize_with_rizer(lk, rk, sort)
Member

rizer is an awful name. can we improve that while we're here?

Member Author

renamed to factorizer and renamed the method to factorize_arrays
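
For readers who have not seen the helper being renamed here, the following is a minimal sketch of what jointly factorizing two key arrays means. It uses only the public `pd.factorize`; the function name `factorize_pair` is hypothetical, and this is not the hashtable-based code in the PR. It only illustrates the `(left codes, right codes, number of uniques)` contract suggested by the return annotation in the diff.

```python
import numpy as np
import pandas as pd


def factorize_pair(lk, rk, sort=False):
    """Encode two key arrays against one shared set of uniques.

    Hypothetical helper for illustration; assumes lk and rk share a dtype.
    """
    lk = np.asarray(lk)
    rk = np.asarray(rk)
    # Factorize the concatenation so equal keys on either side get equal codes.
    codes, uniques = pd.factorize(np.concatenate([lk, rk]), sort=sort)
    codes = codes.astype(np.intp, copy=False)
    return codes[: len(lk)], codes[len(lk):], len(uniques)


llab, rlab, count = factorize_pair([10, 20, 10], [20, 30])
print(llab, rlab, count)
```

Rows from the two sides that carry the same code are the rows a merge would match.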

@@ -2399,133 +2374,27 @@ def _factorize_keys(
    if (
        isinstance(lk.dtype, DatetimeTZDtype) and isinstance(rk.dtype, DatetimeTZDtype)
    ) or (lib.is_np_dtype(lk.dtype, "M") and lib.is_np_dtype(rk.dtype, "M")):
        # Extract the ndarray (UTC-localized) values
        # Note: we dont need the dtypes to match, as these can still be compared
        lk, rk = cast("DatetimeArray", lk)._ensure_matching_resos(rk)
Member

should the ensure_matching_resos be done in the EA method similar to how the Categorical one does encode_with_whatever?

Member

i see, never mind

Member

do we have tests with timedelta64 and mismatched resos?

Member Author

Don't know
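
As a rough illustration of what a mismatched-resolution case looks like, here is a NumPy-only sketch. It does not call `_ensure_matching_resos` and is not a pandas test; it only shows the property such a test would presumably want to pin down: equal durations stored at different units must compare equal once both sides are cast to a common unit.

```python
import numpy as np

# Same durations expressed at two different resolutions.
left = np.array([1, 2, 3], dtype="timedelta64[s]")
right = np.array([1000, 3000, 5000], dtype="timedelta64[ms]")

# Promote to the common (finer) unit so the int64 views are comparable.
common = np.result_type(left.dtype, right.dtype)
left_i8 = left.astype(common).view("i8")
right_i8 = right.astype(common).view("i8")

# Keys present on both sides, as integers in the common unit.
shared = np.intersect1d(left_i8, right_i8)
print(common, shared)
```

Comparing the raw integer views without the cast would match nothing here, which is the failure mode a timedelta64 test with mismatched resolutions would guard against.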

    def _factorize_with_other(
        self, other: Categorical, sort: bool = False  # type: ignore[override]
    ) -> tuple[npt.NDArray[np.intp], npt.NDArray[np.intp], int]:
        other = self._encode_with_my_categories(other)
Member

i think that factorize_with_other assumes matching dtypes. if correct, comment/docstring/assert to that effect?

Member Author

Yes that's correct, have to add docs in general if we want to move forward here
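
One way the matching-dtype assumption could be spelled out is sketched below. `ToyKeys` is a hypothetical stand-in and not a pandas ExtensionArray; the explicit `TypeError` is just one option next to the comment, docstring, or assert suggested above. Note that this toy version always encodes against sorted uniques because it relies on `np.unique`.

```python
import numpy as np


class ToyKeys:
    """Hypothetical stand-in for an array type; not a pandas ExtensionArray."""

    def __init__(self, values):
        self._data = np.asarray(values)

    @property
    def dtype(self):
        return self._data.dtype

    def factorize_with_other(self, other):
        """Jointly encode self and other.

        Assumes ``other`` has the same dtype as ``self``; callers are
        expected to cast both sides to a common dtype first.
        """
        if other.dtype != self.dtype:
            raise TypeError(f"dtype mismatch: {self.dtype} vs {other.dtype}")
        combined = np.concatenate([self._data, other._data])
        # np.unique returns sorted uniques plus the code of each input element.
        uniques, inverse = np.unique(combined, return_inverse=True)
        codes = inverse.astype(np.intp)
        n = len(self._data)
        return codes[:n], codes[n:], len(uniques)


llab, rlab, count = ToyKeys([3, 1, 3]).factorize_with_other(ToyKeys([1, 2]))
print(llab, rlab, count)
```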

@jbrockmendel
Member

Generally looks good, mostly bikeshedding names

@phofl phofl added this to the 2.2 milestone Sep 7, 2023
)
length = len(dc.dictionary)

llab, rlab, count = (
Member

nitpick: can this be split into three lines? it takes effort to see which lines correspond to which output

Member Author

sure
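
The right-hand side of the assignment is not shown in this hunk, so the following is only a generic sketch of the style change being requested, with made-up values. It contrasts a single tuple assignment with one assignment per output.

```python
import numpy as np

# Made-up stand-in data; the real expressions in the PR are not shown here.
codes = np.array([0, 1, 0, 2, 1], dtype=np.intp)
n_left = 3

# Before: one tuple assignment; the reader has to match positions by eye.
llab, rlab, count = (
    codes[:n_left],
    codes[n_left:],
    int(codes.max()) + 1,
)

# After: one assignment per output, so each name sits next to its expression.
llab2 = codes[:n_left]
rlab2 = codes[n_left:]
count2 = int(codes.max()) + 1
```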

@jbrockmendel
Member

a handful of nitpicks, otherwise im very happy to see this refactor

@github-actions
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Oct 15, 2023
phofl added 3 commits October 15, 2023 15:47
# Conflicts:
#	doc/source/reference/extensions.rst
#	pandas/core/reshape/merge.py
@phofl
Member Author

phofl commented Oct 15, 2023

Sorry this fell off my radar, updated

@phofl phofl requested a review from mroeschke as a code owner October 15, 2023 13:58
@phofl
Member Author

phofl commented Oct 15, 2023

@MarcoGorelli is there a way to shut up the build about the missing example? There are a bunch of ea methods without examples...

@jbrockmendel
Member

is there a way to shut up the build about the missing example? There are a bunch of ea methods without examples...

+1. Many of the examples are not particularly useful (I know because I wrote them to get the CI to shut up)

@@ -1,3 +1,6 @@



Member

whitespace here intentional?

Member Author

no...

thx

@MarcoGorelli
Member

yeah I'd expect pasting into

IGNORE_VALIDATION = {
    "Styler.env",
    "Styler.template_html",
    "Styler.template_html_style",
    "Styler.template_html_table",
    "Styler.template_latex",
    "Styler.template_string",
    "Styler.loader",
    "errors.InvalidComparison",
    "errors.LossySetitemError",
    "errors.NoBufferPresent",
    "errors.IncompatibilityWarning",
    "errors.PyperclipException",
    "errors.PyperclipWindowsException",
}

to work
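
A minimal sketch of how an ignore set like this could be consumed is shown below. The filtering function `names_to_validate` and the extra set entry are assumptions made for illustration; the actual validation script may apply the set differently, and the real dotted name for the new method may not match.

```python
IGNORE_VALIDATION = {
    "Styler.env",
    "errors.InvalidComparison",
    # Hypothetical entry for the new method; the real dotted name may differ.
    "api.extensions.ExtensionArray._factorize_with_other",
}


def names_to_validate(all_names):
    """Return the documented names that are not on the ignore list."""
    return [name for name in all_names if name not in IGNORE_VALIDATION]


remaining = names_to_validate(
    [
        "Styler.env",
        "DataFrame.merge",
        "api.extensions.ExtensionArray._factorize_with_other",
    ]
)
print(remaining)
```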

@lithomas1 lithomas1 removed this from the 2.2 milestone Dec 22, 2023
@lithomas1
Member

Looks like this has gone stale, so I'm bumping off the milestone

@mroeschke
Member

Looks close to merge but has gone stale. Going to close to clear the queue but feel free to reopen when you have time to circle back

@mroeschke mroeschke closed this Apr 23, 2024
Labels
ExtensionArray Extending pandas with custom dtypes or arrays. Reshaping Concat, Merge/Join, Stack/Unstack, Explode Stale
5 participants