
PERF: Surprisingly slow nlargest with duplicates in the index #55767

Open
2 of 3 tasks
paddymul opened this issue Oct 30, 2023 · 5 comments
Labels
Performance (Memory or execution speed performance)
Milestone

Comments

@paddymul
Contributor

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this issue exists on the latest version of pandas.

  • I have confirmed this issue exists on the main branch of pandas.

Reproducible Example

import numpy as np
import pandas as pd

N = 1500
slow_df = pd.DataFrame({'a': np.arange(N)}, index=[pd.NA] * N)
%timeit slow_df['a'].nlargest()

fast_df = pd.DataFrame({'a': np.arange(N)})
%timeit fast_df['a'].nlargest()

results in

448 ms ± 9.63 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
105 µs ± 850 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

So it is roughly 4,000 times slower with a non-unique index. This was very surprising.

Installed Versions

INSTALLED VERSIONS

commit : e86ed37
python : 3.11.5.final.0
python-bits : 64
OS : Darwin
OS-release : 20.6.0
Version : Darwin Kernel Version 20.6.0: Fri Dec 16 00:34:59 PST 2022; root:xnu-7195.141.49~1/RELEASE_ARM64_T8101
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.1.1
numpy : 1.26.0
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 68.0.0
pip : 23.2.1
Cython : None
pytest : 7.4.2
hypothesis : None
sphinx : 7.2.6
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.16.1
pandas_datareader : None
bs4 : 4.12.2
bottleneck : None
dataframe-api-compat: None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.8.0
numba : None
numexpr : None
odfpy : None
openpyxl : 3.1.2
pandas_gbq : None
pyarrow : 11.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.11.3
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None

Prior Performance

No response

@paddymul paddymul added the Needs Triage (Issue that has not been reviewed by a pandas team member) and Performance (Memory or execution speed performance) labels Oct 30, 2023
@paddymul
Contributor Author

Some additional examples that I tried. It's not just pd.NA that causes the problem, and a random value array isn't significantly slower than an ordered one. Also, the performance scales badly (big-O unknown) with the number of duplicates, not the total size (look at med_df).

N = 1500
N_DOUBLE = 3000
N_HALF = 750
slow_df = pd.DataFrame({'a': np.arange(N)}, index=[pd.NA] * N)
print("slow_df")
%timeit slow_df['a'].nlargest()

slow_df2 = pd.DataFrame({'a': np.arange(N_DOUBLE)}, index=np.concatenate([[pd.NA] * N, np.arange(N)]))
print("slow_df2")
%timeit slow_df2['a'].nlargest()

slow_df3 = pd.DataFrame({'a':  np.random.rand(N_DOUBLE)}, index=np.concatenate([[pd.NA] * N, np.arange(N)]))
print("slow_df3")
%timeit slow_df3['a'].nlargest()

slow_df4 = pd.DataFrame({'a':  np.random.rand(N_DOUBLE)}, index=np.concatenate([[1] * N, np.arange(N)]))
print("slow_df4")
%timeit slow_df4['a'].nlargest()

med_df = pd.DataFrame({'a':  np.random.rand(N)}, index=np.concatenate([[1] * N_HALF, np.arange(N_HALF)]))
print("med_df")
%timeit med_df['a'].nlargest()


fast_df = pd.DataFrame({'a': np.arange(N)})
print("fast_df")
%timeit fast_df['a'].nlargest()

fast_df2 = pd.DataFrame({'a': np.random.rand(N)})
print("fast_df2")
%timeit fast_df2['a'].nlargest()

results in

slow_df
445 ms ± 20.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
slow_df2
444 ms ± 13.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
slow_df3
439 ms ± 13.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
slow_df4
449 ms ± 9.44 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
med_df
18.4 ms ± 213 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
fast_df
105 µs ± 616 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
fast_df2
121 µs ± 553 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

@paddymul
Contributor Author

Profiling output:

N = 1500
slow_df = pd.DataFrame({'a': np.arange(N)}, index=[pd.NA] * N)
slow_ser = slow_df['a']
%prun  -s cumtime slow_ser.nlargest()

results in

         4063 function calls (4053 primitive calls) in 0.669 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.669    0.669 {built-in method builtins.exec}
        1    0.000    0.000    0.669    0.669 <string>:1(<module>)
        1    0.000    0.000    0.669    0.669 series.py:4006(nlargest)
        1    0.000    0.000    0.669    0.669 selectn.py:55(nlargest)
        1    0.000    0.000    0.669    0.669 selectn.py:90(compute)
        1    0.000    0.000    0.668    0.668 series.py:5047(drop)
        1    0.000    0.000    0.668    0.668 generic.py:4680(drop)
        1    0.004    0.004    0.668    0.668 generic.py:4719(_drop_axis)
        1    0.000    0.000    0.658    0.658 base.py:6076(get_indexer_for)
        1    0.000    0.000    0.658    0.658 base.py:6036(get_indexer_non_unique)
        1    0.181    0.181    0.657    0.657 {method 'get_indexer_non_unique' of 'pandas._libs.index.IndexEngine' objects}
      225    0.475    0.002    0.476    0.002 fromnumeric.py:1407(resize)
        1    0.000    0.000    0.003    0.003 common.py:261(index_labels_to_array)
        3    0.000    0.000    0.003    0.001 common.py:228(asarray_tuplesafe)
       10    0.003    0.000    0.003    0.000 {built-in method numpy.asarray}
        1    0.000    0.000    0.001    0.001 base.py:6467(isin)
        2    0.000    0.000    0.001    0.000 base.py:477(__new__)
        1    0.001    0.001    0.001    0.001 algorithms.py:457(isin)
        2    0.000    0.000    0.001    0.000 base.py:7512(ensure_index)
      225    0.000    0.000    0.001    0.000 fromnumeric.py:200(reshape)
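The profile shows 225 calls to np.resize inside get_indexer_non_unique accounting for most of the wall time. A toy sketch (illustrative only, not the pandas internals themselves) of why growing a result buffer with repeated np.resize is quadratic in the number of matches:

```python
import numpy as np

def grow_by_resize(n_chunks: int, chunk: int = 100) -> int:
    """Grow an array chunk by chunk via np.resize; return total elements copied."""
    out = np.empty(0, dtype=np.intp)
    copied = 0
    for _ in range(n_chunks):
        copied += out.size              # np.resize copies the existing buffer
        out = np.resize(out, out.size + chunk)
    return copied

# 10x more chunks -> roughly 100x more copying, i.e. quadratic growth
assert grow_by_resize(200) > 100 * grow_by_resize(20)
```

With every duplicate label producing more matches to append, this copy-on-grow pattern makes the total work scale with the square of the duplicate count, matching the med_df vs slow_df timings above.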

@paddymul
Contributor Author

pandas/core/methods/selectn.py#L101-L102

        dropped = self.obj.dropna()
        nan_index = self.obj.drop(dropped.index)

are the offending lines, particularly the nan_index computation.

Would it work, and be faster, to do the following:

  1. compute a temporary integer based index
  2. run smallest/largest on the new series
  3. grab the actual index values corresponding with the integer based index
  4. use those index values in the return series?
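The steps above can be sketched with a hypothetical helper (the name nlargest_by_position is illustrative, not a pandas API):

```python
import numpy as np
import pandas as pd

def nlargest_by_position(ser: pd.Series, n: int = 5) -> pd.Series:
    # Steps 1-2: swap in a temporary RangeIndex and run nlargest on that
    positions = ser.reset_index(drop=True).nlargest(n).index
    # Steps 3-4: grab the original labels/values at those integer positions
    return ser.iloc[positions]

N = 1500
slow_ser = pd.Series(np.arange(N), index=[pd.NA] * N, name="a")
result = nlargest_by_position(slow_ser)
assert list(result.values) == [1499, 1498, 1497, 1496, 1495]
assert result.index.isna().all()
```

Because everything is positional, no label lookups against the non-unique index are ever needed.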

https://github.com/pandas-dev/pandas/blob/main/pandas/core/methods/selectn.py#L101-L102

@lukemanley
Member

Thanks for the report. I think you could probably change those two lines to something like this to avoid the bottleneck:

mask = self.obj.isna()
dropped = self.obj[~mask]
nan_index = self.obj[mask]

If you would like to open a PR go for it. Otherwise I can give this a try.
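A quick sanity check of the suggested mask-based split on the slow case (a sketch only; the real change would go in selectn.py):

```python
import numpy as np
import pandas as pd

N = 1500
vals = np.arange(N, dtype="float64")
vals[::10] = np.nan                      # sprinkle in some actual NaN values
ser = pd.Series(vals, index=[pd.NA] * N, name="a")

# Boolean masking selects by position, so the non-unique index labels
# are never looked up (unlike dropna()/drop()).
mask = ser.isna()
dropped = ser[~mask]
nan_index = ser[mask]

assert len(dropped) + len(nan_index) == N
assert not dropped.isna().any()
assert nan_index.isna().all()
```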

@lukemanley lukemanley removed the Needs Triage (Issue that has not been reviewed by a pandas team member) label Nov 4, 2023
@lukemanley lukemanley added this to the 2.2 milestone Nov 4, 2023
@paddymul
Contributor Author

paddymul commented Nov 4, 2023

I would like to try to tackle this. I'll try to get to this early in the coming week.

@lithomas1 lithomas1 modified the milestones: 2.2, 2.2.1 Jan 20, 2024
@lukemanley lukemanley modified the milestones: 2.2.1, 3.0 Jan 30, 2024