
PERF: Surprisingly slow nlargest with duplicates in the index #55767

Open
2 of 3 tasks
paddymul opened this issue Oct 30, 2023 · 5 comments
Labels
Performance (Memory or execution speed performance)
Milestone

Comments

@paddymul
Contributor

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this issue exists on the latest version of pandas.

  • I have confirmed this issue exists on the main branch of pandas.

Reproducible Example

import numpy as np
import pandas as pd

N = 1500
slow_df = pd.DataFrame({'a': np.arange(N)}, index=[pd.NA] * N)
%timeit slow_df['a'].nlargest()

fast_df = pd.DataFrame({'a': np.arange(N)})
%timeit fast_df['a'].nlargest()

results in

448 ms ± 9.63 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
105 µs ± 850 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

So it is roughly 4,000 times slower with a non-unique index. This was very surprising.

Installed Versions

INSTALLED VERSIONS

commit : e86ed37
python : 3.11.5.final.0
python-bits : 64
OS : Darwin
OS-release : 20.6.0
Version : Darwin Kernel Version 20.6.0: Fri Dec 16 00:34:59 PST 2022; root:xnu-7195.141.49~1/RELEASE_ARM64_T8101
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.1.1
numpy : 1.26.0
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 68.0.0
pip : 23.2.1
Cython : None
pytest : 7.4.2
hypothesis : None
sphinx : 7.2.6
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.16.1
pandas_datareader : None
bs4 : 4.12.2
bottleneck : None
dataframe-api-compat: None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.8.0
numba : None
numexpr : None
odfpy : None
openpyxl : 3.1.2
pandas_gbq : None
pyarrow : 11.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.11.3
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None

Prior Performance

No response

@paddymul paddymul added the Needs Triage (Issue that has not been reviewed by a pandas team member) and Performance (Memory or execution speed performance) labels Oct 30, 2023
@paddymul
Contributor Author

Some additional examples that I tried. It's not just pd.NA that causes the problem, and a random value array isn't significantly slower than an ordered one. Also, the performance scales badly (big-O unknown) with the number of duplicates, not the total size (look at med_df).

N = 1500
N_DOUBLE = 3000
N_HALF = 750
slow_df = pd.DataFrame({'a': np.arange(N)}, index=[pd.NA] * N)
print("slow_df")
%timeit slow_df['a'].nlargest()

slow_df2 = pd.DataFrame({'a': np.arange(N_DOUBLE)}, index=np.concatenate([[pd.NA] * N, np.arange(N)]))
print("slow_df2")
%timeit slow_df2['a'].nlargest()

slow_df3 = pd.DataFrame({'a':  np.random.rand(N_DOUBLE)}, index=np.concatenate([[pd.NA] * N, np.arange(N)]))
print("slow_df3")
%timeit slow_df3['a'].nlargest()

slow_df4 = pd.DataFrame({'a':  np.random.rand(N_DOUBLE)}, index=np.concatenate([[1] * N, np.arange(N)]))
print("slow_df4")
%timeit slow_df4['a'].nlargest()

med_df = pd.DataFrame({'a':  np.random.rand(N)}, index=np.concatenate([[1] * N_HALF, np.arange(N_HALF)]))
print("med_df")
%timeit med_df['a'].nlargest()


fast_df = pd.DataFrame({'a': np.arange(N)})
print("fast_df")
%timeit fast_df['a'].nlargest()

fast_df2 = pd.DataFrame({'a': np.random.rand(N)})
print("fast_df2")
%timeit fast_df2['a'].nlargest()

results in

slow_df
445 ms ± 20.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
slow_df2
444 ms ± 13.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
slow_df3
439 ms ± 13.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
slow_df4
449 ms ± 9.44 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
med_df
18.4 ms ± 213 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
fast_df
105 µs ± 616 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
fast_df2
121 µs ± 553 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

@paddymul
Contributor Author

Profiling output:

N = 1500
slow_df = pd.DataFrame({'a': np.arange(N)}, index=[pd.NA] * N)
slow_ser = slow_df['a']
%prun  -s cumtime slow_ser.nlargest()

results in

         4063 function calls (4053 primitive calls) in 0.669 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.669    0.669 {built-in method builtins.exec}
        1    0.000    0.000    0.669    0.669 <string>:1(<module>)
        1    0.000    0.000    0.669    0.669 series.py:4006(nlargest)
        1    0.000    0.000    0.669    0.669 selectn.py:55(nlargest)
        1    0.000    0.000    0.669    0.669 selectn.py:90(compute)
        1    0.000    0.000    0.668    0.668 series.py:5047(drop)
        1    0.000    0.000    0.668    0.668 generic.py:4680(drop)
        1    0.004    0.004    0.668    0.668 generic.py:4719(_drop_axis)
        1    0.000    0.000    0.658    0.658 base.py:6076(get_indexer_for)
        1    0.000    0.000    0.658    0.658 base.py:6036(get_indexer_non_unique)
        1    0.181    0.181    0.657    0.657 {method 'get_indexer_non_unique' of 'pandas._libs.index.IndexEngine' objects}
      225    0.475    0.002    0.476    0.002 fromnumeric.py:1407(resize)
        1    0.000    0.000    0.003    0.003 common.py:261(index_labels_to_array)
        3    0.000    0.000    0.003    0.001 common.py:228(asarray_tuplesafe)
       10    0.003    0.000    0.003    0.000 {built-in method numpy.asarray}
        1    0.000    0.000    0.001    0.001 base.py:6467(isin)
        2    0.000    0.000    0.001    0.000 base.py:477(__new__)
        1    0.001    0.001    0.001    0.001 algorithms.py:457(isin)
        2    0.000    0.000    0.001    0.000 base.py:7512(ensure_index)
      225    0.000    0.000    0.001    0.000 fromnumeric.py:200(reshape)
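The profile shows 225 calls to np.resize inside get_indexer_non_unique accounting for most of the wall time. A toy sketch (illustrative only, not the pandas internals themselves) of why growing a result buffer with repeated np.resize is quadratic in the number of matches:

```python
import numpy as np

def grow_by_resize(n_chunks: int, chunk: int = 100) -> int:
    """Grow an array chunk by chunk via np.resize; return total elements copied."""
    out = np.empty(0, dtype=np.intp)
    copied = 0
    for _ in range(n_chunks):
        copied += out.size              # np.resize copies the existing buffer
        out = np.resize(out, out.size + chunk)
    return copied

# 10x more chunks -> roughly 100x more copying, i.e. quadratic growth
assert grow_by_resize(200) > 100 * grow_by_resize(20)
```

With every duplicate label producing more matches to append, this copy-on-grow pattern makes the total work scale with the square of the duplicate count, matching the med_df vs slow_df timings above.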

@paddymul
Contributor Author

pandas/core/methods/selectn.py#L101-L102

        dropped = self.obj.dropna()
        nan_index = self.obj.drop(dropped.index)

are the offending lines, particularly the nan_index computation.

Would it work, and be faster, to do the following:

  1. compute a temporary integer based index
  2. run smallest/largest on the new series
  3. grab the actual index values corresponding with the integer based index
  4. use those index values in the return series?
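The steps above can be sketched with a hypothetical helper (the name nlargest_by_position is illustrative, not a pandas API):

```python
import numpy as np
import pandas as pd

def nlargest_by_position(ser: pd.Series, n: int = 5) -> pd.Series:
    # Steps 1-2: swap in a temporary RangeIndex and run nlargest on that
    positions = ser.reset_index(drop=True).nlargest(n).index
    # Steps 3-4: grab the original labels/values at those integer positions
    return ser.iloc[positions]

N = 1500
slow_ser = pd.Series(np.arange(N), index=[pd.NA] * N, name="a")
result = nlargest_by_position(slow_ser)
assert list(result.values) == [1499, 1498, 1497, 1496, 1495]
assert result.index.isna().all()
```

Because everything is positional, no label lookups against the non-unique index are ever needed.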

https://github.com/pandas-dev/pandas/blob/main/pandas/core/methods/selectn.py#L101-L102

@lukemanley
Member

Thanks for the report. I think you could probably change those two lines to something like this to avoid the bottleneck:

mask = self.obj.isna()
dropped = self.obj[~mask]
nan_index = self.obj[mask]

If you would like to open a PR go for it. Otherwise I can give this a try.
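A quick sanity check of the suggested mask-based split on the slow case (a sketch only; the real change would go in selectn.py):

```python
import numpy as np
import pandas as pd

N = 1500
vals = np.arange(N, dtype="float64")
vals[::10] = np.nan                      # sprinkle in some actual NaN values
ser = pd.Series(vals, index=[pd.NA] * N, name="a")

# Boolean masking selects by position, so the non-unique index labels
# are never looked up (unlike dropna()/drop()).
mask = ser.isna()
dropped = ser[~mask]
nan_index = ser[mask]

assert len(dropped) + len(nan_index) == N
assert not dropped.isna().any()
assert nan_index.isna().all()
```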

@lukemanley lukemanley removed the Needs Triage (Issue that has not been reviewed by a pandas team member) label Nov 4, 2023
@lukemanley lukemanley added this to the 2.2 milestone Nov 4, 2023
@paddymul
Contributor Author

paddymul commented Nov 4, 2023

I would like to try to tackle this. I'll try to get to this early in the coming week.

@lithomas1 lithomas1 modified the milestones: 2.2, 2.2.1 Jan 20, 2024
@lukemanley lukemanley modified the milestones: 2.2.1, 3.0 Jan 30, 2024