ENH: 54714 repeated sampling improvement #54780

Conversation
Is there a way to improve performance without adding a new keyword?
@mroeschke The goal here is to improve performance when the user wants to draw multiple samples from their data. Running a for loop with pd.sample() for n different samples takes a lot of time. The change asks the user for the number of samples, draws a single flat sample, and reshapes it accordingly, which saves a large amount of time. I think the new keyword is needed; without it this would not work, since the majority of the execution time is spent inside the pd.sample() function.
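A minimal sketch of the flatten-and-reshape idea described above, using NumPy directly (the variable names and the use of `numpy.random.default_rng` are illustrative assumptions, not the PR's actual implementation):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"value": np.arange(1_000_000)})
sample_size, n_samples = 100, 50

# Instead of calling df.sample(sample_size) once per sample in a Python loop,
# draw all row positions in one flat call and reshape them so that row i of
# `positions` holds the positions belonging to the i-th repeated sample.
rng = np.random.default_rng(1)
flat = rng.choice(len(df), size=n_samples * sample_size, replace=True)
positions = flat.reshape(n_samples, sample_size)

# A single .iloc call then materialises every sample at once.
all_samples = df.iloc[positions.ravel()]
```

The cost of the random draw and of the row lookup is paid once rather than once per sample, which is where the claimed speed-up comes from.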
pandas has a fairly high bar for adding new keywords to APIs since the API is already large, and this would need more buy-in from the core team before moving forward here, so investigating performance based on #54714 (comment) would be an acceptable first step.
Agreed @mroeschke - I would like to understand why we can't make the current method more performant before entertaining adding a keyword for this use case.
Thanks for starting this @advaithsrao and thanks for the comments @rhshadrach and @mroeschke. While the performance difference matters, the bigger point for us is the readability of

    pd.concat([
        df.sample(sample_size).assign(replicate=rep)
        for rep in range(replicates)
    ])

versus the suggested improvement:

    df.sample(sample_size, n_samples=replicates)

The current behavior gives us less versatility in how we teach with pandas, since we need to introduce several programming concepts (concatenation, list comprehension/looping, and column assignment) before students can understand how to take a repeated sample. Considering that we want to use pandas in statistics and data science courses that target students who are programming novices, the current behavior is a limitation for us. We teach some of these courses in R currently, where there is a dedicated parameter for repeated sampling. In our experience, having a similar syntax in Python would be helpful for both instructors and students.
Sample methods in DataFrame and Groupby will now include a ``n_sampling`` parameter
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Addition of ``n_sampling`` parameter to :meth:`DataFrame.sample` and :meth:`Groupby.sample`. (:issue:`54714`)
Suggested change:
- Addition of ``n_sampling`` parameter to :meth:`DataFrame.sample` and :meth:`Groupby.sample`. (:issue:`54714`)
+ Addition of ``n_samples`` parameter to :meth:`DataFrame.sample` and :meth:`Groupby.sample`. (:issue:`54714`)
@@ -14,6 +14,19 @@ including other versions of pandas.
Enhancements
~~~~~~~~~~~~

.. _whatsnew_210.enhancements.sampling_improvements:

Sample methods in DataFrame and Groupby will now include a ``n_sampling`` parameter
Suggested change:
- Sample methods in DataFrame and Groupby will now include a ``n_sampling`` parameter
+ Sample methods in DataFrame and Groupby will now include a ``n_samples`` parameter
sampled_indices = np.reshape(sampled_indices, (n_samples, size))

result = result.assign(sample_no=sample_no)
I think `replicate` or `sample` could be more suitable names for this new column, what do you think?
>>> df.groupby("a").sample(
...     n=1,
...     n_samples = 2,
...     random_state=1
... )
       a  b
4  black  4
5  black  5
2   blue  2
3   blue  3
1    red  1
0    red  0
Suggested change:
- ...     n_samples = 2,
+ ...     n_samples=2,
Shouldn't this output also include a column with the sample number?
@joelostblom you are right, I need to make that change.
Actually I do not think the n_samples parameter is required @mroeschke. We can use something like the approach sketched below.

@joelostblom I think calling the pd.sample() function inside a for loop is what takes the time, but can we not flatten into a single sample of size sample_size * replicates and then reshape the final DataFrame, as in the implementation above?
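The snippet this comment originally pointed to is not preserved on this page; the following is a hedged sketch of what a no-new-keyword approach could look like with the existing API (the helper name `repeated_sample_existing_api` and the `sample_no` column name are illustrative, not code from the PR):

```python
import numpy as np
import pandas as pd

def repeated_sample_existing_api(df, sample_size, replicates, random_state=None):
    # One big flat draw with the existing API: a single df.sample call of
    # size sample_size * replicates instead of `replicates` separate calls.
    flat = df.sample(
        n=sample_size * replicates,
        replace=True,
        random_state=random_state,
    )
    # Label each consecutive block of `sample_size` rows with its sample
    # number: np.repeat yields [0, 0, ..., 1, 1, ..., replicates - 1, ...].
    return flat.assign(sample_no=np.repeat(np.arange(replicates), sample_size))
```

Note that a single flat draw only matches the semantics of sampling with replacement; with `replace=False` a row could never appear in two different replicates, which is the question raised further down in the thread.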
The only concern would be that we would be drawing one single huge sample from a single sampling distribution and then reshaping it, rather than drawing each sample separately.
For me it would be preferred if the sample number was returned by default. It is easier to drop that column than to create the sample_no column via a separate step using np.repeat as you show above (that code would be hard to grok for a novice).

Regarding creating a big flattened array, how would that work with replace=False?
I appreciate that it does put an additional burden on the novice programmer. However, when making decisions on expanding the API and the maintenance burden that comes along with it, I don't think we can sustainably add features for the purpose of allowing users to avoid learning common concepts. And I do believe these are common enough.
Thanks for the PR so far, but I think this feature needs more buy-in from other core devs before moving forward here, so closing for now.
This PR contains enhanced functionality for df.sample() and df.groupby().sample(), with the addition of n_samples.
Currently there is no way to take repeated samples to create a sampling distribution using pandas alone; one would need a list comprehension to do so, and numpy.random.choice has been preferred because it takes ~700 ms compared to ~25 s for the pandas sample method on 1M values. The newly proposed method takes ~25 ms for the same task, roughly 1000x faster than the current df.sample.
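A hedged sketch of how the loop-based approach and the flattened NumPy approach could be compared; the sizes shown and the timings they produce are illustrative, not the measurements quoted above:

```python
import time

import numpy as np
import pandas as pd

df = pd.DataFrame({"value": np.arange(1_000_000)})
sample_size, replicates = 1_000, 100

# Loop-based pandas approach: one df.sample call per replicate.
start = time.perf_counter()
looped = pd.concat(
    df.sample(sample_size, replace=True).assign(replicate=rep)
    for rep in range(replicates)
)
print("loop of df.sample:", time.perf_counter() - start)

# Flattened NumPy approach: draw every row position in a single call.
start = time.perf_counter()
positions = np.random.choice(len(df), size=sample_size * replicates, replace=True)
flat = df.iloc[positions].assign(
    replicate=np.repeat(np.arange(replicates), sample_size)
)
print("flat numpy.random.choice:", time.perf_counter() - start)
```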
closes #54714
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.