
REF: check monotonicity inside _can_use_libjoin #55342

Merged
merged 12 commits into from
Dec 27, 2023
26 changes: 13 additions & 13 deletions pandas/core/indexes/base.py
@@ -3334,9 +3334,7 @@ def _union(self, other: Index, sort: bool | None):

if (
sort in (None, True)
and self.is_monotonic_increasing
and other.is_monotonic_increasing
and not (self.has_duplicates and other.has_duplicates)
and (self.is_unique or other.is_unique)
Member

I think the order here can make a difference. is_monotonic_increasing will provide is_unique "for free" if the index is both monotonic and unique and it hasn't already been cached.

import pandas as pd
import numpy as np

arr = np.arange(1_000_000)

%timeit idx=pd.Index(arr); idx.is_unique and idx.is_monotonic_increasing
# 68.7 ms ± 5.13 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit idx=pd.Index(arr); idx.is_monotonic_increasing and idx.is_unique
# 2.91 ms ± 121 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Member Author

Good catch. I had expected the cache for all of these to get populated at the same time inside IndexEngine; will take a closer look.

Member Author

OK, looks like engine.is_monotonic_increasing will populate the engine.is_unique cache but not vice versa. All of the paths within the engine that check is_unique do that check after checking self.is_monotonic_increasing. I'm inclined to update the engine code so that the unique check always populates the is_monotonic_increasing cache, in order to 1) make the perf not dependent on the order these are accessed and 2) avoid populating engine.mapping in these cases (which allocates memory).

Member

I'm inclined to update the engine code so that the unique check always populates the is_monotonic_increasing cache, in order to 1) make the perf not dependent on the order these are accessed and 2) avoid populating engine.mapping in these cases (which allocates memory).

that sounds good to me
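
A minimal sketch of the caching idea discussed in this thread, using a hypothetical ToyEngine rather than the actual pandas IndexEngine: a single monotonicity scan also settles uniqueness for monotonic data, and the is_unique property routes through that same scan, so performance does not depend on which flag is accessed first and no hash table has to be built for monotonic values.

import numpy as np


class ToyEngine:
    # Illustrative stand-in for an engine with lazily cached flags; this is
    # NOT the pandas IndexEngine, and the attribute names are hypothetical.
    def __init__(self, values):
        self._values = np.asarray(values)
        self._cached_monotonic = None
        self._cached_unique = None

    def _scan(self):
        # One O(n) pass decides monotonicity, and for monotonic data it also
        # settles uniqueness (strictly increasing means no duplicates), so
        # both caches can be filled from the same scan.
        diffs = np.diff(self._values)
        self._cached_monotonic = bool((diffs >= 0).all())
        if self._cached_monotonic:
            self._cached_unique = bool((diffs > 0).all())

    @property
    def is_monotonic_increasing(self):
        if self._cached_monotonic is None:
            self._scan()
        return self._cached_monotonic

    @property
    def is_unique(self):
        if self._cached_unique is None:
            # Route through the cheap scan first so the access order does not
            # change performance and no hash table (the analogue of
            # engine.mapping) has to be built for the monotonic case.
            self._scan()
            if self._cached_unique is None:
                # Non-monotonic data: fall back to an explicit uniqueness check.
                self._cached_unique = len(np.unique(self._values)) == len(self._values)
        return self._cached_unique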

and self._can_use_libjoin
and other._can_use_libjoin
):
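
A side note on the condition change shown in this hunk: Index.has_duplicates is the negation of Index.is_unique, so the new (self.is_unique or other.is_unique) clause is logically equivalent to the old not (self.has_duplicates and other.has_duplicates). A short standalone truth-table check (plain Python, not pandas internals):

from itertools import product

# not ((not u1) and (not u2)) == (u1 or u2) by De Morgan's law
for self_unique, other_unique in product([True, False], repeat=2):
    old = not ((not self_unique) and (not other_unique))
    new = self_unique or other_unique
    assert old == new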
@@ -3488,12 +3486,7 @@ def _intersection(self, other: Index, sort: bool = False):
"""
intersection specialized to the case with matching dtypes.
"""
if (
self.is_monotonic_increasing
and other.is_monotonic_increasing
and self._can_use_libjoin
and other._can_use_libjoin
):
if self._can_use_libjoin and other._can_use_libjoin:
try:
res_indexer, indexer, _ = self._inner_indexer(other)
except TypeError:
@@ -4613,8 +4606,7 @@ def join(

if (
not isinstance(self.dtype, CategoricalDtype)
and self.is_monotonic_increasing
and other.is_monotonic_increasing
# test_join_with_categorical_index requires excluding CategoricalDtype
and self._can_use_libjoin
and other._can_use_libjoin
and (self.is_unique or other.is_unique)
@@ -4921,7 +4913,10 @@ def _get_leaf_sorter(labels: list[np.ndarray]) -> npt.NDArray[np.intp]:
def _join_monotonic(
self, other: Index, how: JoinHow = "left"
) -> tuple[Index, npt.NDArray[np.intp] | None, npt.NDArray[np.intp] | None]:
# We only get here with matching dtypes and both monotonic increasing
# We only get here with (caller is responsible for ensuring):
# 1) matching dtypes
# 2) both monotonic increasing
# 3) other.is_unique or self.is_unique
assert other.dtype == self.dtype
assert self._can_use_libjoin and other._can_use_libjoin

@@ -5003,6 +4998,10 @@ def _can_use_libjoin(self) -> bool:
making a copy. If we cannot, this negates the performance benefit
of using libjoin.
"""
if not self.is_monotonic_increasing:
# The libjoin functions all assume monotonicity.
return False

if type(self) is Index:
# excludes EAs, but include masks, we get here with monotonic
# values only, meaning no NA
@@ -5013,7 +5012,8 @@ def _can_use_libjoin(self) -> bool:
)
# Exclude index types where the conversion to numpy converts to object dtype,
# which negates the performance benefit of libjoin
# TODO: exclude RangeIndex? Seems to break test_concat_datetime_timezone
# TODO: exclude RangeIndex (which allocates memory)?
# Doing so seems to break test_concat_datetime_timezone
return not isinstance(self, (ABCIntervalIndex, ABCMultiIndex))

# --------------------------------------------------------------------
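
For illustration, a small standalone check of the behaviour this change introduces (assuming a pandas build that includes this PR; _can_use_libjoin is a private attribute):

import numpy as np
import pandas as pd

# With the monotonicity check folded into _can_use_libjoin, a non-monotonic
# Index reports False here directly, so callers such as _union and
# _intersection no longer need their own is_monotonic_increasing guards
# before taking the libjoin fast path.
monotonic = pd.Index(np.arange(5))
shuffled = pd.Index([3, 1, 4, 1, 5])

print(monotonic._can_use_libjoin)  # expected: True
print(shuffled._can_use_libjoin)   # expected: False (not monotonic increasing)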