ENH: 54714 repeated sampling improvement #54780

advaithsrao · 2023-08-27T17:41:17Z

This PR contains enhanced functionality for df.sample() and df.groupby().sample(), with the addition of n_samples.

Currently, there is no way to take repeated samples to create a sampling distribution using pandas alone. Currently, one would need to create a list comprehension to perform the same, and numpy.random.choice has been preferred for the same as it takes ~700ms compared to the pandas sample method taking ~25s for 1M values.

The new proposed method takes ~ 25ms for the same and performs the task 1000x faster than the current sample feature.

closes ENH: Include a parameter for repeated sampling in df.sample #54714
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

…bject

…ples

advaithsrao · 2023-08-27T17:44:38Z

@joelostblom

…mpling improvement enhancements

mroeschke · 2023-08-28T15:33:01Z

Is there a way to improve performance without adding a new keyword?

advaithsrao · 2023-08-28T18:29:38Z

@mroeschke The goal here is to improve performance when the user wants to run multiple samples on their data. Running a for loop with pd.sample() for n different samples seems to take a lot of time. What the change does is it asks the user for number of samples, gets a flat sample, and reshapes it accordingly. This way we save massive time.

I think the new keyword is needed, else it would not work as majority of the execution time is taken by pd.sample() function

mroeschke · 2023-08-28T18:33:15Z

pandas has a fairly high bar for adding new keywords to APIs since it's already large and would need more buy in from the core team before moving forward here, so investigating performance based on #54714 (comment) would be acceptable first

rhshadrach · 2023-08-29T04:15:07Z

Agreed @mroeschke - I would like to understand why we can't make the current method more performant before entertaining adding a keyword for this use case.

joelostblom · 2023-08-29T14:01:39Z

Thanks for starting this @advaithsrao and thanks for the comments @rhshadrach and @mroeschke.

While the performance of sample could be improved in multiple ways, one of the main advantages with adding a new parameter is the clarity of the syntax. I teach introductory statistics and data science courses using pandas, and it is quite complicated for a novice of programming to understand what is going on in the current syntax for repeated sampling:

pd.concat([
    df.sample(sample_size).assign(replicate=rep)
    for rep in range(replicates)
])

versus the suggested improvement:

df.sample(sample_size, n_samples=replicates)

This gives us less versatility in how we teach with pandas since we need to introduce several programing concepts (concatenation, list comprehension/looping, and column assignment) for students to be able to understand how to take a repeated sample. Considering that we want to use pandas in statistics and data science courses that target students who are programming novices, the current behavior is a limitation to us. We teach some of these courses in R currently, where there is a dedicated parameter for repeated sampling:

rep_sample_n(df, size = sample_size, reps = replicates)

In our experience, having a similar syntax in Python would be helpful for both instructors and students.

joelostblom · 2023-08-29T14:02:25Z

doc/source/whatsnew/v2.1.0.rst

+Sample methods in DataFrame and Groupby will now include a ``n_sampling`` parameter
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Addition of ``n_sampling`` parameter to :meth:`DataFrame.sample` and :meth:`Groupby.sample`. (:issue:`54714`)


Suggested change

Addition of ``n_sampling`` parameter to :meth:`DataFrame.sample` and :meth:`Groupby.sample`. (:issue:`54714`)

Addition of ``n_samples`` parameter to :meth:`DataFrame.sample` and :meth:`Groupby.sample`. (:issue:`54714`)

joelostblom · 2023-08-29T14:02:36Z

doc/source/whatsnew/v2.1.0.rst

@@ -14,6 +14,19 @@ including other versions of pandas.
 Enhancements
 ~~~~~~~~~~~~

+.. _whatsnew_210.enhancements.sampling_improvements:
+
+Sample methods in DataFrame and Groupby will now include a ``n_sampling`` parameter


Suggested change

Sample methods in DataFrame and Groupby will now include a ``n_sampling`` parameter

Sample methods in DataFrame and Groupby will now include a ``n_samples`` parameter

joelostblom · 2023-08-29T14:04:45Z

pandas/core/generic.py

+
+            sampled_indices = np.reshape(sampled_indices, (n_samples, size))
+
+            result = result.assign(sample_no=sample_no)


I think replicate or sample could be more suitable names for this new column, what do you think?

joelostblom · 2023-08-29T14:05:23Z

pandas/core/groupby/groupby.py

+        >>> df.groupby("a").sample(
+        ...   n=1,
+        ...   n_samples = 2,
+        ...   random_state=1
+        ... )
+                a  b
+        4   black  4
+        5   black  5
+        2    blue  2
+        3    blue  3
+        1     red  1
+        0     red  0


Suggested change

>>> df.groupby("a").sample(

... n=1,

... n_samples = 2,

... random_state=1

... )

a b

4 black 4

5 black 5

2 blue 2

3 blue 3

1 red 1

0 red 0

>>> df.groupby("a").sample(

... n=1,

... n_samples=2,

... random_state=1

... )

a b

4 black 4

5 black 5

2 blue 2

3 blue 3

1 red 1

0 red 0

Shouldn't this output also include a column with the sample number?

@joelostblom you are right. I need to make that change
Actually I do not think no_samples parameter is required @mroeschke . We can use something like
@joelostblom

I think calling the pd.sample() function inside a for loop takes time, but can we not flatten into a sample_size of sample_size * replicates, and then reshape the final df like this implementation above @joelostblom

the only concern would be that we are using a single sample distribution for one single huge sample, then reshape it

For me it would be preferred if the sample number was returned by default. It is easier to drop that column than to create the sample_no solution via a separate step using np.repeat as you show above (that could would be hard to grok for a novice).

Regarding creating a big flattened array, how would that work with replace=False?

rhshadrach · 2023-08-31T21:10:06Z

This gives us less versatility in how we teach with pandas since we need to introduce several programing concepts (concatenation, list comprehension/looping, and column assignment) for students to be able to understand how to take a repeated sample.

I appreciate that it does put additional burden on the novice programmer. However when making decisions on expanding the API and the maintenance burden that comes along with it, I don't think we can sustainably add features for the purpose of allowing users to avoid learning common concepts. And I do believe these are common enough.

mroeschke · 2023-09-18T16:58:37Z

Thanks for the PR so far, but I think this feature needs more buy-in from other core devs before moving forward here so closing for now.

advaithsrao added 6 commits August 27, 2023 12:53

ENH-54714: Added sample with n_samples to df.sample method

dbea3d2

ENH-54714: Added n_samples to the core sample function

6fca54f

ENH-54714: Added n_samples to sample method under the groupby class o…

81540d7

…bject

ENH-54714: Added testing for sample function under groupby with n_sam…

a584550

…ples

ENH-54714: Added sampling with n_samples use cases to indexing.rst doc

f8ec4ec

ENH-54714: Fixed issue with whatsnew.rst file

94ed9fc

advaithsrao requested a review from rhshadrach as a code owner August 27, 2023 17:41

advaithsrao changed the title ~~Enh 54714 repeated sampling improvement~~ ENH: 54714 repeated sampling improvement Aug 27, 2023

ENH-54714: Fixed test issues and updated v2.1 whatsnew doc for the sa…

52f44ec

…mpling improvement enhancements

Merge branch 'main' into ENH-54714-Repeated-Sampling-Improvement

9e67abe

joelostblom reviewed Aug 29, 2023

View reviewed changes

mroeschke closed this Sep 18, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ENH: 54714 repeated sampling improvement #54780

ENH: 54714 repeated sampling improvement #54780

advaithsrao commented Aug 27, 2023 •

edited

Loading

advaithsrao commented Aug 27, 2023

mroeschke commented Aug 28, 2023

advaithsrao commented Aug 28, 2023

mroeschke commented Aug 28, 2023

rhshadrach commented Aug 29, 2023

joelostblom commented Aug 29, 2023

joelostblom Aug 29, 2023

joelostblom Aug 29, 2023

joelostblom Aug 29, 2023

joelostblom Aug 29, 2023

advaithsrao Aug 30, 2023

advaithsrao Aug 30, 2023

joelostblom Aug 31, 2023

rhshadrach commented Aug 31, 2023 •

edited

Loading

mroeschke commented Sep 18, 2023

	Addition of ``n_sampling`` parameter to :meth:`DataFrame.sample` and :meth:`Groupby.sample`. (:issue:`54714`)
	Addition of ``n_samples`` parameter to :meth:`DataFrame.sample` and :meth:`Groupby.sample`. (:issue:`54714`)

	Sample methods in DataFrame and Groupby will now include a ``n_sampling`` parameter
	Sample methods in DataFrame and Groupby will now include a ``n_samples`` parameter


		sampled_indices = np.reshape(sampled_indices, (n_samples, size))

		result = result.assign(sample_no=sample_no)

ENH: 54714 repeated sampling improvement #54780

ENH: 54714 repeated sampling improvement #54780

Conversation

advaithsrao commented Aug 27, 2023 • edited Loading

advaithsrao commented Aug 27, 2023

mroeschke commented Aug 28, 2023

advaithsrao commented Aug 28, 2023

mroeschke commented Aug 28, 2023

rhshadrach commented Aug 29, 2023

joelostblom commented Aug 29, 2023

joelostblom Aug 29, 2023

Choose a reason for hiding this comment

joelostblom Aug 29, 2023

Choose a reason for hiding this comment

joelostblom Aug 29, 2023

Choose a reason for hiding this comment

joelostblom Aug 29, 2023

Choose a reason for hiding this comment

advaithsrao Aug 30, 2023

Choose a reason for hiding this comment

advaithsrao Aug 30, 2023

Choose a reason for hiding this comment

joelostblom Aug 31, 2023

Choose a reason for hiding this comment

rhshadrach commented Aug 31, 2023 • edited Loading

mroeschke commented Sep 18, 2023

advaithsrao commented Aug 27, 2023 •

edited

Loading

rhshadrach commented Aug 31, 2023 •

edited

Loading