Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: 54714 repeated sampling improvement #54780

Conversation

advaithsrao
Copy link

@advaithsrao advaithsrao commented Aug 27, 2023

This PR contains enhanced functionality for df.sample() and df.groupby().sample(), with the addition of n_samples.

Currently, there is no way to take repeated samples to create a sampling distribution using pandas alone. Currently, one would need to create a list comprehension to perform the same, and numpy.random.choice has been preferred for the same as it takes ~700ms compared to the pandas sample method taking ~25s for 1M values.

The new proposed method takes ~ 25ms for the same and performs the task 1000x faster than the current sample feature.

Screenshot 2023-08-27 at 1 40 11 PM

@advaithsrao advaithsrao changed the title Enh 54714 repeated sampling improvement ENH: 54714 repeated sampling improvement Aug 27, 2023
@advaithsrao
Copy link
Author

@joelostblom

@mroeschke
Copy link
Member

Is there a way to improve performance without adding a new keyword?

@advaithsrao
Copy link
Author

@mroeschke The goal here is to improve performance when the user wants to run multiple samples on their data. Running a for loop with pd.sample() for n different samples seems to take a lot of time. What the change does is it asks the user for number of samples, gets a flat sample, and reshapes it accordingly. This way we save massive time.

I think the new keyword is needed, else it would not work as majority of the execution time is taken by pd.sample() function

@mroeschke
Copy link
Member

pandas has a fairly high bar for adding new keywords to APIs since it's already large and would need more buy in from the core team before moving forward here, so investigating performance based on #54714 (comment) would be acceptable first

@rhshadrach
Copy link
Member

Agreed @mroeschke - I would like to understand why we can't make the current method more performant before entertaining adding a keyword for this use case.

@joelostblom
Copy link
Contributor

Thanks for starting this @advaithsrao and thanks for the comments @rhshadrach and @mroeschke.

While the performance of sample could be improved in multiple ways, one of the main advantages with adding a new parameter is the clarity of the syntax. I teach introductory statistics and data science courses using pandas, and it is quite complicated for a novice of programming to understand what is going on in the current syntax for repeated sampling:

pd.concat([
    df.sample(sample_size).assign(replicate=rep)
    for rep in range(replicates)
])

versus the suggested improvement:

df.sample(sample_size, n_samples=replicates)

This gives us less versatility in how we teach with pandas since we need to introduce several programing concepts (concatenation, list comprehension/looping, and column assignment) for students to be able to understand how to take a repeated sample. Considering that we want to use pandas in statistics and data science courses that target students who are programming novices, the current behavior is a limitation to us. We teach some of these courses in R currently, where there is a dedicated parameter for repeated sampling:

rep_sample_n(df, size = sample_size, reps = replicates)

In our experience, having a similar syntax in Python would be helpful for both instructors and students.

Sample methods in DataFrame and Groupby will now include a ``n_sampling`` parameter
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Addition of ``n_sampling`` parameter to :meth:`DataFrame.sample` and :meth:`Groupby.sample`. (:issue:`54714`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Addition of ``n_sampling`` parameter to :meth:`DataFrame.sample` and :meth:`Groupby.sample`. (:issue:`54714`)
Addition of ``n_samples`` parameter to :meth:`DataFrame.sample` and :meth:`Groupby.sample`. (:issue:`54714`)

@@ -14,6 +14,19 @@ including other versions of pandas.
Enhancements
~~~~~~~~~~~~

.. _whatsnew_210.enhancements.sampling_improvements:

Sample methods in DataFrame and Groupby will now include a ``n_sampling`` parameter
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Sample methods in DataFrame and Groupby will now include a ``n_sampling`` parameter
Sample methods in DataFrame and Groupby will now include a ``n_samples`` parameter


sampled_indices = np.reshape(sampled_indices, (n_samples, size))

result = result.assign(sample_no=sample_no)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think replicate or sample could be more suitable names for this new column, what do you think?

Comment on lines +5635 to +5646
>>> df.groupby("a").sample(
... n=1,
... n_samples = 2,
... random_state=1
... )
a b
4 black 4
5 black 5
2 blue 2
3 blue 3
1 red 1
0 red 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
>>> df.groupby("a").sample(
... n=1,
... n_samples = 2,
... random_state=1
... )
a b
4 black 4
5 black 5
2 blue 2
3 blue 3
1 red 1
0 red 0
>>> df.groupby("a").sample(
... n=1,
... n_samples=2,
... random_state=1
... )
a b
4 black 4
5 black 5
2 blue 2
3 blue 3
1 red 1
0 red 0

Shouldn't this output also include a column with the sample number?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@joelostblom you are right. I need to make that change
Actually I do not think no_samples parameter is required @mroeschke . We can use something like
image @joelostblom

I think calling the pd.sample() function inside a for loop takes time, but can we not flatten into a sample_size of sample_size * replicates, and then reshape the final df like this implementation above @joelostblom

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the only concern would be that we are using a single sample distribution for one single huge sample, then reshape it

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For me it would be preferred if the sample number was returned by default. It is easier to drop that column than to create the sample_no solution via a separate step using np.repeat as you show above (that could would be hard to grok for a novice).

Regarding creating a big flattened array, how would that work with replace=False?

@rhshadrach
Copy link
Member

rhshadrach commented Aug 31, 2023

This gives us less versatility in how we teach with pandas since we need to introduce several programing concepts (concatenation, list comprehension/looping, and column assignment) for students to be able to understand how to take a repeated sample.

I appreciate that it does put additional burden on the novice programmer. However when making decisions on expanding the API and the maintenance burden that comes along with it, I don't think we can sustainably add features for the purpose of allowing users to avoid learning common concepts. And I do believe these are common enough.

@mroeschke
Copy link
Member

Thanks for the PR so far, but I think this feature needs more buy-in from other core devs before moving forward here so closing for now.

@mroeschke mroeschke closed this Sep 18, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: Include a parameter for repeated sampling in df.sample
4 participants