ENH: pandas mutate, add R's mutate functionality to enable users to easily create new columns in data frames #56499
Comments
We have a solution for the lambdas within assign in the pipeline, so mutate is unlikely to get added.
Seems interesting but unless I am misunderstanding, these two things contradict one another
So any constants assigned in scope of the context manager become columns also? Or is it only assignments where the RHS is a |
I don't think they necessarily contradict each other, because:
import pandas as pd

df: pd.DataFrame = <some df>
some_constant: int = 42

def make_new_col(df) -> pd.Series:
    # This will of course not be assigned to the dataframe
    locally_defined_constant: int = 101
    return df['new_col'] * locally_defined_constant

with pd.mutate(df):
    # Assigning a pd.Series directly adds it to the columns
    new_col = pd.Series(["a", "b", ...])
    constant_series = 42  # This will become a pd.Series having the constant value 42
    other_col = make_new_col(df)

And @phofl, I think this context manager is not meant to replace Further, I think this would also fit nicely in with pandas' notion of series and dataframes and users could more natively modify a dataframe's "scope". We would just have to make sure to clean up the newly assigned columns from the |
Ah, thank you for the clarification. I for one find I like the proposal more now.
Yes, I agree this context manager would be slightly 'hacky' but I think it would boost productivity and improve code readability much more, which I think is more important. Imagine a user sets a breakpoint the first line within the context manager. If we were to clean up previous This makes it much more obvious and explicit to understand column assignment in detail than using (1) chained |
Hey, so is there any chance this feature will be added to pandas? I think it would be a great benefit for the package as a whole as it makes code more readable and easier to understand. It would also make pandas a more attractive choice compared to R in feature engineering that usually involves many (chained) assigns. Would be great to get some feedback!
Feature Type
Adding new functionality to pandas
Changing existing functionality in pandas
Removing existing functionality in pandas
Problem Description
I wish feature engineering (i.e. creating new columns from old ones) could be more efficient and convenient in pandas.
Mainly, common ways of adding features to dataframes in pandas include
- .assign statements (which are hard to debug and contain many hard-to-read lambda expressions), or
- df['new_column'] = ... repeatedly in some add_features function. This is better for debugging purposes but also hard to read and inconvenient, as the user always has to type quotes and the word df.

In R's mutate function, the series are accessible directly from the scope, which makes code much more readable (debugging in R is something else to discuss).

Feature Description
We could easily add this functionality by providing a context manager (perhaps pd.mutate, to follow R's naming here) which temporarily moves all columns of a dataframe into the caller's locals, allows the caller to create new pd.Series inside the block, and then (upon the context manager's exit) forms all those new pd.Series (or the modified old ones) into a data frame again.

This makes feature engineering much more convenient, efficient and likely also more debuggable than using chained .assign statements (in the debugger, one could directly access all the pd.Series in that scope).
A minimal example implementation could look like the following:
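As one way to make the idea concrete, here is a minimal sketch written for illustration. It is not a pandas API and not necessarily what the issue author had in mind. The class name `mutate`, the rule that only `pd.Series` and scalar values become columns, and the restriction to identifier-like column labels are all assumptions. It relies on `sys._getframe`, a CPython detail, and it only works when the `with` block runs at module level (a script or a notebook cell); inside a function, CPython does not let injected names be read as plain variables.

```python
import sys

import pandas as pd
from pandas.api.types import is_scalar


class mutate:
    """Hypothetical sketch of the proposed context manager (not part of pandas)."""

    def __init__(self, df: pd.DataFrame) -> None:
        self.df = df

    def __enter__(self) -> "mutate":
        # Namespace of the code running the `with` statement. At module level
        # this is the real globals dict, so names written here are usable.
        self._ns = sys._getframe(1).f_locals
        self._before = dict(self._ns)  # shallow snapshot, used for cleanup
        # Expose every column whose label is a valid identifier as a variable.
        self._injected = {
            col: self.df[col]
            for col in self.df.columns
            if isinstance(col, str) and col.isidentifier()
        }
        self._ns.update(self._injected)
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        ns = self._ns
        created = []
        for name, value in list(ns.items()):
            is_new = name not in self._before and name not in self._injected
            is_rebound = name in self._injected and value is not self._injected[name]
            # Assumption: only Series and scalars become columns.
            is_column_like = isinstance(value, pd.Series) or is_scalar(value)
            if (is_new or is_rebound) and is_column_like:
                created.append(name)
                if exc_type is None:  # skip the write-back if the block raised
                    self.df[name] = value  # scalars are broadcast by pandas
        # Remove injected and collected names, restoring any shadowed values.
        for name in [*self._injected, *created]:
            if name in self._before:
                ns[name] = self._before[name]
            else:
                ns.pop(name, None)
        return False  # never swallow exceptions


df = pd.DataFrame({"price": [10.0, 20.0, 30.0], "qty": [1, 2, 3]})

with mutate(df):
    revenue = price * qty  # existing columns are available as plain names
    source = "web"         # a scalar becomes a constant column
    price = price * 2      # rebinding an existing column overwrites it
```

After the block, `df` has the columns `price`, `qty`, `revenue` and `source`, and the temporary names have been removed from the module namespace again. Other objects created inside the block (functions, lists, and so on) are left alone under the assumption above.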
The drawback of this feature is that we are fiddling with the caller's locals, which is not the most elegant.
However, I believe that feature engineering like this is much better to debug and makes the code more readable (than using chained .assigns or repeatedly calling df['new_feature'] = 2 * df['old_feature'] ** 2).
Therefore I think this feature would make life easier and pandas more useful (and users faster) in data science tasks.
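For comparison, the two existing styles named above, chained `.assign` calls and repeated `df[...] = ...` assignment, can be sketched as follows. The column names `old_feature`, `new_feature` and `scaled` and the sample data are assumptions chosen for illustration; only standard pandas calls are used.

```python
import pandas as pd

df = pd.DataFrame({"old_feature": [1, 2, 3]})

# Style 1: chained .assign with lambdas. Each lambda receives the
# intermediate frame, so later steps can use columns added earlier.
chained = df.assign(
    new_feature=lambda d: 2 * d["old_feature"] ** 2,
).assign(
    scaled=lambda d: d["new_feature"] / d["new_feature"].max(),
)

# Style 2: repeated item assignment on a copy of the frame.
direct = df.copy()
direct["new_feature"] = 2 * direct["old_feature"] ** 2
direct["scaled"] = direct["new_feature"] / direct["new_feature"].max()
```

Both variants produce the same frame; they differ in how often `df`, quotes and `lambda` have to be typed, which is the readability cost the proposal is aimed at.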
Alternative Solutions
One might want to handle the locals better here to make the usage of this feature less error-prone.
Perhaps one would want to cache previous locals and then only have the dataframe's columns as the locals in the caller's scope.
This would make debugging even cleaner, because if a user sets a breakpoint in such a with pd.mutate statement, then that user sees all the columns in the scope's locals clearly instead of having to inspect the dataframe's column values in the debugger.

Additional Context
No response