-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PDEP-13: Deprecate the apply method on Series and DataFrame and make the agg and transform methods operate on series data #54747
Conversation
Thanks for writing this up, much appreciated If
(for reference, the Polars API is likely to move in the above direction) |
This PDEP proposes that: | ||
|
||
- The current complex current behavior of `Series.apply` will be deprecated in Pandas 2.2. | ||
- Single callables given to the `.apply` methods of `Series` will in Pandas 3.0 always be called on the whole `Series`, so `series.apply(func)` will become similar to `func(series)`, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
4 months is way too short of a deprecation period for a change that will most likely affect the majority of our users which requires them to modify their code in many different places. There is no easy way to opt into the new behaviour.
Edit: This should go DeprecationWarning -> FutureWarning. This will have a huge impact on downstream libraries as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit surprised about this. I would expect users who upgrade major versions to run their code through the latest minor version before upgrading and if they get a warning, to fix them before upgrading. Same for libraries, that they would check and fix warnings before upgrading support for major versions.
Only implementing this in v4.0 seems way too slow IMO, if we decide this is a good idea in itself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@phofl are you sure about
There is no easy way to opt into the new behaviour.
? I thought the way to do that was by passing by_row=True
?
If we want this, then the usual deprecation cycle sounds fine tbh, can't wait until 4.0 to resolve such glaring inconsistencies
As for whether we want this...I'm holding on to the idea that there might an alternative which also resolves the inconsistencies. Trying to write something up
One this I don't fully understand is whether |
Some motivation for that was given here
|
I addition to the point made by @MarcoGorelli, the way I've seen it so far is that we need to have a method that operates on elements of the series (that will be Of course |
I have not thought much about this. A few things I would think about here:
More generally, naming of of this type of methods is of course very fundamental to pandas, and a different, though related, subject than this PDEP. So maybe take that in a separate issue/PDEP, and if that is a serious path forward, I can withdraw this PDEP. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the main issue here is that the current behavior of Series.apply()
is dependent on whether the passed function is an aggregation-like function (e.g., sum
, abs
, min
, etc.) or a true ufunc that operates on a scalar (e.g., abs
, np.sqrt
, len
, np.sin
, etc.)
Your write-up is mixing up these use cases, and I think that is the main source of confusion for users.
It seems to me that we should have people no longer use Series.apply
at all, and have them use Series.agg()
if the function is aggregation-like, and use Series.map()
or Series.transform()
if the function is a true ufunc.
|
||
Currently, giving a input to `Series.apply` is treated differently depending on the type of the input: | ||
|
||
* if the input is a numpy `ufunc`, `series.apply(func)` is equivalent to `func(series)`, i.e. similar to `series.pipe(func)`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* if the input is a numpy `ufunc`, `series.apply(func)` is equivalent to `func(series)`, i.e. similar to `series.pipe(func)`. | |
* if the input is a numpy `ufunc` or strings such as `"sum"`, `"min"`, etc.,, `series.apply(func)` is equivalent to `func(series)`, i.e. similar to `series.pipe(func)`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Strings are a bit different than that, and series.apply(string)
is more equivalent to getattr(series, string)()
. I've added a line about strings.
|
||
* if the input is a numpy `ufunc`, `series.apply(func)` is equivalent to `func(series)`, i.e. similar to `series.pipe(func)`. | ||
* if the input is a callable, but not a numpy `ufunc`, `series.apply(func)` is similar to `Series([func(val) for val in series], index=series.index)`, i.e. similar to `series.map(func)` | ||
* if the input is a list-like or dict-like, `series.apply(func)` is equivalent to `series.agg(func)` (which is subtly different than `series.apply`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you mean "list-like or dict-like" of callables, including numpy ufunc
callables". You don't include something like s.apply([1,2,3])
, which is list-like but won't work. So maybe be more precise here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've made it more precise, my intention was not to exclude strings. e.g. s.apply(["sqrt", np.sqrt])
is also possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, the described behavior is the actual behavior, so e.g. a list of ints will be passed through and will end up calling each int and fail. So the current text is accurate.
|
||
* if the input is a callable, `df.apply(func)` always calls each columns in the DataFrame, so is similar to `func(col) for _, col in | ||
df.items()` + wrapping functionality | ||
* if the input is a list-like or dict-like, `df.apply` call each item in the list/dict and wraps the result as needed. So for example if the input is a list, `df.apply(func_list)` is equivalent to `[df.apply(func) for func in func_list]` + wrapping functionality |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably should add what happens with DataFrame.apply(func_or_list, axis=1)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added something about this.
|
||
- The current complex current behavior of `Series.apply` will be deprecated in Pandas 2.2. | ||
- Single callables given to the `.apply` methods of `Series` will in Pandas 3.0 always be called on the whole `Series`, so `series.apply(func)` will become similar to `func(series)`, | ||
- Lists or dicts of callables given to the `Series.apply` will in Pandas 3.0 always call `Series.apply` on each element of the list/dict |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should be more precise here. I think you mean something like "Lists or dicts of callables given to Series.apply
will apply each callable to the Series
, so that the result of s.apply(list_of_callables)
is Series([func[s] for s in list_of_callables)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That won't expand list or dicts to Dataframes, e.g. ser.apply(["sqrt", np.sqrt])
is a dataframe.
I've made a new formulation to clarify this.
dtype: int64 | ||
``` | ||
|
||
Users would expect these two to give the same result. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ummm, not so sure about that. What about the case of np.abs()
or just abs
? In that case Series.agg(abs)
gets applied element-wise!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those will give the same results, so the text is accurate?
I agree we cannot be guaranteed that ser.agg(func)
will actually aggregate, but that is IMO a different discussion and maybe part of the discussion that has been opened about changing the API for this class of methods.
dtype: float64 | ||
``` | ||
|
||
These two should give same-shaped output for consistency. Using `Series.transform` instead of `Series.apply`, it returns a `DataFrame` in both cases and I think the dictlike example above should return a `DataFrame` similar to the listlike example. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with you here that this inconsistency is hard to understand.
|
||
## Deprecation process | ||
|
||
To change the behavior to the current behavior will have to be deprecated. This can be done by adding a `by_row` parameter to `Series.apply`, which means, when `by_rows=False`, that `Series.apply` will not operate elementwise but Series-wise. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The question is what would be the default behavior if by_row
is not specified?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That will emit a deprecation warning in v2.0. I've made text change to clarify this.
I think this is a good idea. I also considered going further and eliminating |
Yes, I think that is what @bashtage is suggesting, and I think it is worth considering. |
If we keep |
First, thanks again to you and @rhshadrach for your work here. There's indeed a tonne of inconsistencies which need resolving, and I think everybody is on board with that. If I've understood correct, the main issues you want to address are:
One solution (which you're proposing) would be to change My issue with this is that, as far as I can tell, users are already only thinking of From my Kaggle scrapes*, I only see:
I couldn't find a single case of If we're going to make breaking changes, I'd suggest that they be made in the direction of keeping
Then:
To respond to your questions:
I don't see why, no.
The name tells you that. So, Regarding the deprecation process, the advantage of moving to a new function name is that you tell people to just use that. For example:
and users can just change their code once. Conversely, the PDEP's suggested Sorry if I've missed anything fundamental, and thanks for your patience pushing this through. I really appreciate it, and agree that these inconsistencies really need addressing |
@MarcoGorelli i am not sure why you are trying to break with the pandas api the map_* might be ok if things were all new but they r not stick with the axis= +1 in just deprecating and removing apply in favor of transform, agg and map which can accomplish everything |
OK yes, I would also be on board with this, +1. Then there's be no need to break API or behaviour |
Me too! |
But isn't what you are proposing for Is there a case that your proposal for If we deprecate/remove |
yes exactly - the point is you can just use transform for a same size return shape, and raise if that is not the case apply was originally meant to be a one sized fits all but has a lot of warts because of this flexibility |
+1. I agree that deprecation/removal is the best course here. |
Ok, I can see a consensus is coming along, which is great and I agree apply/agg/transform would be very similar after this PDEP. A thought: Are we talking about deprecating only >>> ser = pd.Series([1, 2, 3,])
>>> func1 = lambda ser: pd.Series([ser[0] * ser[1]**2, ser[2]])
>>> func2 = lambda ser: pd.Series([ser[0] * ser[1]**3, ser[2]])
>>> ser.apply([func1, func2])
<lambda> <lambda>
0 4 8
1 3 3 (i.e. the result is not a transformation nor an aggregation). My guess is it will be be done by converting to a DataFrame and back like |
I thought we were talking about both
Easy, simple, predictable. My favourite things
I'd be OK with it not being a core feature. You can just call the function directly if you really need it: In [48]: >>> ser = pd.Series([1, 2, 3,])
...: >>> func1 = lambda ser: pd.Series([ser[0] * ser[1]**2, ser[2]])
...: >>> func2 = lambda ser: pd.Series([ser[0] * ser[1]**3, ser[2]])
In [49]: pd.concat([func1(ser), func2(ser)], axis=1, keys=['func1', 'func2'])
Out[49]:
func1 func2
0 4 8
1 3 3 |
I'd be fine with deprecating apply in both Series and DataFrame. My issue above was with having an BTW: We should probably deprecate groupby.apply also, if we deprecate |
groupby.apply is a bit different, I think it operates on the entire subdataframe corresponding to the group? Not sure there's a replacement, so I think it'd be fine to keep it (or in any case, to keep it to a separate discussion - sorting out |
I'm not sure that we can deprecate In [53]: df = pd.DataFrame({"x":[1,2,3], "y":[10,20,30]})
In [54]: func = lambda s: s["y"] if s["x"]%2 == 1 else s["x"]
In [55]: df.apply(func, axis=1)
Out[55]:
0 10
1 2
2 30
dtype: int64 I made up an odd function, but the ability to apply a function to each row of a |
can't you do that with DataFrame.agg: In [8]: In [53]: df = pd.DataFrame({"x":[1,2,3], "y":[10,20,30]})
...:
...: In [54]: func = lambda s: s["y"] if s["x"]%2 == 1 else s["x"]
In [9]: df.agg(func, axis=1)
Out[9]:
0 10
1 2
2 30
dtype: int64 ? |
Wasn't aware of that! Sounds like |
@rhshadrach , do you have an opinion, I know you've looked at the groupby apply/agg/transform methods? I've looked into it a bit and to me it does look quite complex, so I'm not sure if groupby.apply does things differently than groupby.agg and groupby.transform. |
Just some info: In order to deprecate apply, we will have to add a keyword parameter to Example: >>> df = pd.DataFrame({"x":range(100_000)})
>>> %timeit df.transform(lambda x: x + 1). # operates on the series, fast
142 µs ± 589 ns per loop
>>> %timeit df["x"].transform(lambda x: x + 1). # operates on elements, slow
15.5 ms ± 110 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) So the behavioral differences are also present here, like for apply. |
|
||
1. When users want to operate on single elements in a `Series` or `DataFrame`, they should use `Series.map` and `DataFrame.map` respectively. | ||
2. When users want to aggregate a `Series`, columns/rows of a `DataFrame` or groups in `groupby` objects, they should use `Series.agg`, `DataFrame.agg` and `groupby.agg` respectively. | ||
3. When users want to transform a `Series`, columns/rows of a `DataFrame` or groups in `groupby` objects, they should use `Series.transform`, `DataFrame.transform` and `groupby.transform` respectively. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add a 4th When users want to neither aggregate or transform a ..., they should ...
Some thoughts
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aside from my comments, should replace all uses of "&" with "and". I don't think I caught them all
|
||
1. When users want to operate on single elements in a `Series` or `DataFrame`, they should use `Series.map` and `DataFrame.map` respectively. | ||
2. When users want to aggregate a `Series`, columns/rows of a `DataFrame` or groups in `groupby` objects, they should use `Series.agg`, `DataFrame.agg` and `groupby.agg` respectively. | ||
3. When users want to transform a `Series`, columns/rows of a `DataFrame` or groups in `groupby` objects, they should use `Series.transform`, `DataFrame.transform` and `groupby.transform` respectively. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you need to have the case when users want to have an operation on a row of a DataFrame
, they should use DataFrame.agg(func, axis=1)
. Today a common use case is using DataFrame.apply(func, axis=1)
. Since func
can do anything, it isn't entirely clear that agg
with axis=1
will supply each row of the series to the function, meaning that the corresponding function doesn't need to "aggregate".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DataFrame.apply(func, axis=1)
is just DataFrame.T.apply(func).T
, so any issue with axis=1
will also be present with axis=0
. So I don't see a special issue with axis=1
, except maybe some idioms that are more common with axis=1
(which may be a valid concern, though).
Can you give some example code snippets showing some possible problems?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Dr-Irv I've added an example here https://github.com/pandas-dev/pandas/pull/54747/files#r1339884642
Inclined to agree that .agg
feels really unnatural here
2. When users want to aggregate a `Series`, columns/rows of a `DataFrame` or groups in `groupby` objects, they should use `Series.agg`, `DataFrame.agg` and `groupby.agg` respectively. | ||
3. When users want to transform a `Series`, columns/rows of a `DataFrame` or groups in `groupby` objects, they should use `Series.transform`, `DataFrame.transform` and `groupby.transform` respectively. | ||
|
||
The use of `Series.apply` & `DataFrame.apply` will after the proposed change in almost all cases be replaced by `map`, `agg` or `transform`. In the very few cases where `Series.apply` & `DataFrame.apply` cannot be substituted by `map`, `agg` or `transform`, it is proposed that it will be accepted that users will have to find alternative ways to apply the functions, i.e. typically apply the functions manually and possibly concatenating the results. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It may be useful to have some examples of those "very few cases"
|
||
Below is an overview of the current behavior in table form when giving callables to `agg`, `transform` & `apply`. As an example on how to read the tables, when a non-ufunc callable is given to `Series.agg`, `Series.agg` will first try to apply the callable to each element in the series, and if that fails, will fall back to call the series using the callable. | ||
|
||
(The description may not be 100 % accurate because of various special cases in the current implementation, but will give a good understanding of the current behavior). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should you add the string representations of functions to the tables?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problems that I try to solve only concern callables and I try to keep the PDEP brief. I can coment somewhere that the proposal does not affect supplying strings to the methods.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But even the behavior with strings is confusing. Here's some examples to illustrate:
>>> s = pd.Series([1,2,3])
>>> s.apply("sum")
6
>>> s.apply(np.sum)
0 1
1 2
2 3
dtype: int32
>>> s.apply("mean")
2.0
>>> s.apply(np.mean)
0 1.0
1 2.0
2 3.0
dtype: float64
>>> s.apply(abs)
0 1
1 2
2 3
dtype: int64
>>> s.apply(np.abs)
0 1
1 2
2 3
dtype: int64
>>> s.apply("abs")
0 1
1 2
2 3
dtype: int64
|
||
In addition to the above effects of the current implementation of `agg`/`transform` & `apply`, see [#52140](https://github.com/pandas-dev/pandas/issues/52140) for more examples of the unexpected effects of how `apply` is implemented. | ||
|
||
It can also be noted that `Series.apply` & `DataFrame.apply` could almost always be replaced with calls to `agg`, `transform` & `map`, if `agg` & `transform` were to always operate on series data. For some examples, see the table below for alternatives using `apply(func)`: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It can also be noted that `Series.apply` & `DataFrame.apply` could almost always be replaced with calls to `agg`, `transform` & `map`, if `agg` & `transform` were to always operate on series data. For some examples, see the table below for alternatives using `apply(func)`: | |
It can also be noted that `Series.apply` & `DataFrame.apply` could almost always be replaced with calls to `agg`, `transform` or `map`, if `agg` and `transform` were to always operate on series data. For some examples, see the table below for alternatives using `apply(func)`: |
| lambda x: x + 1 | .transform | .transform | | ||
| [lambda x: x.sum()] | .agg | .agg | | ||
|
||
So, for example, `ser.apply(lambda x: str(x))` can be replaced with `ser.map(lambda x: str(x))` while `df.apply([lambda x: x.sum()])` can be replaced with `df.agg([lambda x: x.sum()])`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be really worthwhile to include some axis=1
examples for DataFrame.apply()
|
||
With the above in mind, it is proposed that in the future: | ||
|
||
It is proposed that `apply`, `transform` and `agg` in the future will work as follows: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is proposed that `apply`, `transform` and `agg` in the future will work as follows: | |
It is proposed that `apply`, `transform` and `agg` will work as follows: |
(You have "in the future" in 2 consecutive sentences)
2. When users want to aggregate a `Series`, columns/rows of a `DataFrame` or groups in `groupby` objects, they should use `Series.agg`, `DataFrame.agg` and `groupby.agg` respectively. | ||
3. When users want to transform a `Series`, columns/rows of a `DataFrame` or groups in `groupby` objects, they should use `Series.transform`, `DataFrame.transform` and `groupby.transform` respectively. | ||
|
||
The use of `Series.apply` & `DataFrame.apply` will after the proposed change in almost all cases be replaced by `map`, `agg` or `transform`. In the very few cases where `Series.apply` & `DataFrame.apply` cannot be substituted by `map`, `agg` or `transform`, it is proposed that it will be accepted that users will have to find alternative ways to apply the functions, i.e. typically apply the functions manually and possibly concatenating the results. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this text seems to be repeated from the abstract. So shorten the abstract - leave the details here
|
||
The use of `Series.apply` & `DataFrame.apply` will after the proposed change in almost all cases be replaced by `map`, `agg` or `transform`. In the very few cases where `Series.apply` & `DataFrame.apply` cannot be substituted by `map`, `agg` or `transform`, it is proposed that it will be accepted that users will have to find alternative ways to apply the functions, i.e. typically apply the functions manually and possibly concatenating the results. | ||
|
||
It can be noted that `groupby.agg`, `groupby.transform` & `groupby.apply` are not proposed changed in this PDEP, because `groupby.agg`, `groupby.transform` already behave as desired and `groupby.apply` behaves differently than `Series.apply` & `DataFrame.apply`. Likewise, the behavior when given ufuncs will remain unchanged, because the behavior is already as intended in all cases. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It can be noted that `groupby.agg`, `groupby.transform` & `groupby.apply` are not proposed changed in this PDEP, because `groupby.agg`, `groupby.transform` already behave as desired and `groupby.apply` behaves differently than `Series.apply` & `DataFrame.apply`. Likewise, the behavior when given ufuncs will remain unchanged, because the behavior is already as intended in all cases. | |
It can be noted that the behavior of `groupby.agg`, `groupby.transform` and `groupby.apply` are not proposed to be changed in this PDEP, because `groupby.agg`, `groupby.transform` already behave as desired and `groupby.apply` behaves differently than `Series.apply` & `DataFrame.apply`. Likewise, the behavior when given ufuncs will remain unchanged, because the behavior is already as intended in all cases. |
To change the current behavior, it will have to be deprecated. This will be done by in v2.2: | ||
|
||
1. Deprecate `Series.apply` & `DataFrame.apply`. | ||
2. Add a `series_ops_only` with type `bool | lib.NoDefault` parameter to `agg` & `transform` methods of `Series` & `DataFrame`. When `series_ops_only` is set to False, `agg` & `transform` will behave as they do currently. When set to True, `agg` & `transform` will never operate on elements, but always on Series. When set to `no_default`, `agg` & `transform` will behave as `series_ops_only=False`, but will emit a FutureWarning the current behavior will be reoved in the future. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2. Add a `series_ops_only` with type `bool | lib.NoDefault` parameter to `agg` & `transform` methods of `Series` & `DataFrame`. When `series_ops_only` is set to False, `agg` & `transform` will behave as they do currently. When set to True, `agg` & `transform` will never operate on elements, but always on Series. When set to `no_default`, `agg` & `transform` will behave as `series_ops_only=False`, but will emit a FutureWarning the current behavior will be reoved in the future. | |
2. Add a `series_ops_only` argument with type `bool | lib.NoDefault` parameter to `agg` and `transform` methods of `Series` and `DataFrame` with a default value of `lib.NoDefault`. When `series_ops_only` is set to `False`, `agg` and `transform` will behave as they do currently. When set to `True`, `agg` and `transform` will never operate on elements, but always on Series. When set to `no_default`, `agg` and `transform` will behave as `series_ops_only=False`, but will emit a `FutureWarning` the current behavior will be removed in the future. |
I've looked into this again and have an issue: We don't have these keyword params from
So deprecation of |
I've changed the deprecaton process according to this. |
I've changed according to comments. Comments on future use of |
|
||
The above changes means that the future behavior, when users want to apply arbitrary callables in pandas, can be described as follows: | ||
|
||
1. When users want to operate on single elements in a `Series` or `DataFrame`, they should use `Series.map` and `DataFrame.map` respectively. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we sure this is accurate? For the use cases I've seen of DataFrame.apply
, I think DataFrame.agg
would probably be more likely as a replacement
Currently, say I do
In [43]: df = pd.DataFrame({'a': ['quetzal', 'quetzal', 'baboon'], 'b': ['panda', 'chinchilla', 'elk']})
In [44]: df
Out[44]:
a b
0 quetzal panda
1 quetzal chinchilla
2 baboon elk
In [45]: def func(row):
...: return f"{row['a']} or {row['b']}?"
...:
In [46]: df.apply(func, axis=1)
Out[46]:
0 quetzal or panda?
1 quetzal or chinchilla?
2 baboon or elk?
dtype: object
Then if I use DataFrame.map
, it'll operate over all elements of the dataframe 1-by-1. Whereas if I use .agg
, then I'll preserve what I'm currently doing:
In [47]: df.agg(func, axis=1)
Out[47]:
0 quetzal or panda?
1 quetzal or chinchilla?
2 baboon or elk?
dtype: object
This feels very unnatural though - func
here doesn't look like an aggregation at all, so I wouldn't have reached for .agg
Could DataFrame.map
get an axis
keyword as well, so that DataFrame.map(func, axis=1)
preserves the current behaviour of DataFrame.apply(func, axis=1)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't this a transform? But users don't get to control what gets passed to the UDF they supply - is it columm by column, row by row, or the entire frame. Perhaps they should.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
transform would raise here:
In [53]: df.transform(func, axis=1)
---------------------------------------------------------------------------
ValueError: Function did not transform
And this underscores just how confusing this is
is it columm by column, row by row, or the entire frame. Perhaps they should.
yes, exactly, it could be:
axis=None
: element by element (current behaviour)axis=1
: row by rowaxis=0
: element by element
Then, df.apply(lambda row: ..., axis=1)
can just become df.map(lambda row: ..., axis=1)
This is a very common way of using apply
, e.g.
titanic_df['Person'] = titanic_df[['Age','Sex']].apply(get_person,axis=1)
from https://www.kaggle.com/code/omarelgabry/a-journey-through-titanic
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@MarcoGorelli , your example is an aggregation though and func
does operate on rows, not elements, so everything works as intended in your agg
example.
However, I guess agg
linguistically(?) is best understood as a numerical reduction, not a text combination operation, so using the name agg
may be a bit unintuitive. My original proposal would have kept apply
for situations like this, because the problem I wanted to solve was apply
giving subtly different results for Series.apply
and DataFrame.apply
(including for axis=1
).
Your example could be an argument for keeping apply
, but drop the element-by-element behavior of Series.apply
, i.e. revert to the original proposal of this PDEP.
I'm -1 on making map
operate on rows/columns. It's a strength of the method that it always gives same-shaped results and it would be confusing for the result changing shape based on if an axis parameter is given or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
true, it's a horizontal aggregation: it combines multiple columns into 1
I'm just thinking about the upgrade process
I've only ever seen people use DataFrame.apply
to create a single column, and if the upgrade process is going to be:
- Series.apply(func) => Series.map(func)
- DataFrame.apply(func, axis=1) => DataFrame.agg(func, axis=1)
then that could be confusing, and certainly harder than just grepping for .apply
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @MarcoGorelli about the upgrade process. I find changing Dataframe.apply(func, axis=1)
to DataFrame.agg(func, axis=1)
to be non-intuitive.
But maybe we should consider introducing a method DataFrame.project(func)
that is the replacement for DataFrame.apply(func, axis=1)
. Idea being that you "project" a function onto each row of the DF.
Or should we make DataFrame.transform(func, axis=1)
work by transforming each row?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wait let's not add yet another method to the API 😅
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(just another voice here to support the notion that requiring agg
for this use case feels very unnatural)
Yeah I see where you are coming from. My idea is to alleviate this by doing good doc updates. Could e.g. Have some common examples showing how to do the change. Do you have some ideas? Btw, I’m not opposed to keeping or deprecating apply (it’s very similar to agg/transform after all), but must admit I’m being a bit worn down by the long process and would really like a decision soonish… |
Sure, good docs + an update guide would be good, but it does become more complicated than just "use map instead" Would adding |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that the DataFrame.apply(func, axis=1)
case has to be resolved. I think that @MarcoGorelli and I are concerned about that particular case.
Also, I think the behavior with functions specified as strings should be addressed, given the inconsistent behavior that I have noted in one of my comments.
|
||
1. the `agg` and `transform` methods of `Series`, `DataFrame` and `groupby` will always operate series-wise and never element-wise | ||
2. `Series.apply` and `DataFrame.apply` will be deprecated. | ||
3. The current behavior when supplying string to the methods will not be changed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
3. The current behavior when supplying string to the methods will not be changed. | |
3. The current behaviors when supplying strings for the function arguments to the methods will not be changed. |
|
||
The above changes means that the future behavior, when users want to apply arbitrary callables in pandas, can be described as follows: | ||
|
||
1. When users want to operate on single elements in a `Series` or `DataFrame`, they should use `Series.map` and `DataFrame.map` respectively. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @MarcoGorelli about the upgrade process. I find changing Dataframe.apply(func, axis=1)
to DataFrame.agg(func, axis=1)
to be non-intuitive.
But maybe we should consider introducing a method DataFrame.project(func)
that is the replacement for DataFrame.apply(func, axis=1)
. Idea being that you "project" a function onto each row of the DF.
Or should we make DataFrame.transform(func, axis=1)
work by transforming each row?
|
||
Below is an overview of the current behavior in table form when giving callables to `agg`, `transform` & `apply`. As an example on how to read the tables, when a non-ufunc callable is given to `Series.agg`, `Series.agg` will first try to apply the callable to each element in the series, and if that fails, will fall back to call the series using the callable. | ||
|
||
(The description may not be 100 % accurate because of various special cases in the current implementation, but will give a good understanding of the current behavior). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But even the behavior with strings is confusing. Here's some examples to illustrate:
>>> s = pd.Series([1,2,3])
>>> s.apply("sum")
6
>>> s.apply(np.sum)
0 1
1 2
2 3
dtype: int32
>>> s.apply("mean")
2.0
>>> s.apply(np.mean)
0 1.0
1 2.0
2 3.0
dtype: float64
>>> s.apply(abs)
0 1
1 2
2 3
dtype: int64
>>> s.apply(np.abs)
0 1
1 2
2 3
dtype: int64
>>> s.apply("abs")
0 1
1 2
2 3
dtype: int64
|
||
The use of `Series.apply` and `DataFrame.apply` will after the proposed change in almost all cases be replaced by `map`, `agg` or `transform`. In the very few cases where `Series.apply` and `DataFrame.apply` cannot be substituted by `map`, `agg` or `transform`, it is proposed that it will be accepted that users will have to find alternative ways to apply the functions, i.e. typically apply the functions manually and possibly concatenating the results. | ||
|
||
It can be noted that the behavior of `groupby.agg`, `groupby.transform` and `groupby.apply` are not proposed changed in this PDEP, because `groupby.agg`, `groupby.transform` already behave as desired and `groupby.apply` behaves differently than `Series.apply` and `DataFrame.apply`. Likewise, the behavior when given ufuncs (e.g. `np.sqrt`) and string input (e.g. `"sqrt"`) will remain unchanged, because the behavior is already as intended in all cases. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It can be noted that the behavior of `groupby.agg`, `groupby.transform` and `groupby.apply` are not proposed changed in this PDEP, because `groupby.agg`, `groupby.transform` already behave as desired and `groupby.apply` behaves differently than `Series.apply` and `DataFrame.apply`. Likewise, the behavior when given ufuncs (e.g. `np.sqrt`) and string input (e.g. `"sqrt"`) will remain unchanged, because the behavior is already as intended in all cases. | |
It can be noted that the behavior of `groupby.agg`, `groupby.transform` and `groupby.apply` are not proposed to be changed in this PDEP, because `groupby.agg`, `groupby.transform` already behave as desired and `groupby.apply` behaves differently than `Series.apply` and `DataFrame.apply`. Likewise, the behavior when given ufuncs (e.g. `np.sqrt`) and string input (e.g. `"sqrt"`) will remain unchanged, because the behavior is already as intended in all cases. |
My issue is that I would be more positive to keep |
hmm OK, point taken. there'd be a long deprecation process, and plenty of time to teach/warn whether You're right that this has been going for some time, so maybe we could call a vote? EDITas Patrick correctly pointed out, this requires a ping of pandas-core and some time for people to catch up - we've done that now, so let's start the process by which we can call a vote in the future |
This needs a ping of @pandas-dev/pandas-core before we can even consider calling a vote |
Thanks for the ping, Patrick. On a procedural note, in the future it might be best to ping the core team earlier (and notify the pandas-dev mailing list), or at least if you think the PDEP is ready for general discussion. (looking at the discussion from just the past few days, it seems there are still significant open questions even among the proponents of the PDEP, so as someone who still needs to read up on the whole discussion this also seems a bit strange that it would be ready for a vote? But I first now do that, read the actual PDEP and discussion ;)) |
I personally agree with @topper-123 here. One of the nice things about I understand the rationale of wanting to deprecate |
|
||
## Motivation | ||
|
||
The current behavior of `apply`, `agg` and `transform` is very complex and therefore difficult to understand for non-expert users. The difficulty is especially that the methods sometimes apply callables on elements of series/dataframes, sometimes on Series or columns/rows of Dataframes and sometimes try element-wise operation and if that fails, falls back to series-wise operations. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find the motivation to be pretty strong from a developer perspective but I think it is lacking from an end user perspective. To play devil's advocate...why should a non-expert user care or try to understand this distinction?
While there are definitely warts here part of me thinks we've had them for X number of years and gotten by without too much end user complaint about it. I think this would be the largest deprecation we've done on the project as long as I've been involved, so I'm a little wary of causing churn instead of just trying to "softly" guide users towards the more explicit map / agg / transform options
| | Series | DataFrame | groupby | | ||
|:-----------------------------------|:---------------------------------|:---------------------------------|:----------| | ||
| ufunc or list/dict of ufuncs | series | series | series | | ||
| other callables (non ufunc) | Try elements, fallback to series | series | series | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The "try elements" is already deprecated, right?
In [19]: df = pd.DataFrame({"A": range(3)})
In [21]: df["A"].agg(lambda x: np.sum(x))
<ipython-input-21-2e8408eae29f>:1: FutureWarning: using <function <lambda> at 0x7f26752e5ee0> in Series.agg cannot aggregate and has been deprecated. Use Series.transform to keep behavior unchanged.
df["A"].agg(lambda x: np.sum(x))
Out[21]:
0 0
1 1
2 2
Name: A, dtype: int64
|
||
| | Series | DataFrame | groupby | | ||
|:-----------------------------------|:---------------------------------|:---------------------------------|:----------| | ||
| ufunc or list/dict of ufuncs | series | series | series | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Below, it is mentioned that agg
(and others) will always operate on a Series, which might indirectly apply that this line in the table is correct.
But a ufunc
never aggregates, so I assume we don't want this, and the end goal should be that agg
always reduces, and so never accepts a ufunc
?
| | Series | DataFrame | groupby | | ||
|:-----------------------------------|:---------|:------------|:----------| | ||
| ufunc or list/dict of ufuncs | series | series | series | | ||
| other callables (non ufunc) | elements | series | series | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| other callables (non ufunc) | elements | series | series | | |
| other callables (non ufunc) | elements | series | dataframe | |
?
(and this is actually one of the other inconsistencies in the whole groupby side of the story, that we don't have a clear way to choose whether you want to apply your function to each column in the group or on the group as a whole, i.e. as a dataframe)
|:-----------------------------------|:---------|:------------|:----------| | ||
| ufunc or list/dict of ufuncs | series | series | series | | ||
| other callables (non ufunc) | elements | series | series | | ||
| list/dict of callables (non-ufunc) | Try elements, fallback to series | series | series | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| list/dict of callables (non-ufunc) | Try elements, fallback to series | series | series | | |
| list/dict of callables (non-ufunc) | Try elements, fallback to series | series | - | |
I am not sure groupby.apply accepts a list or dict?
For example
In [42]: df = pd.DataFrame({"A": range(3)})
In [43]: df.groupby([0, 1, 0]).apply([lambda x: x])
...
TypeError: unhashable type: 'list'
|
||
1. when given numpy ufuncs, callables given to `agg`/`transform`/`apply` operate on series data | ||
2. when used on groupby objects, callables given to `agg`/`transform`/`apply` operate on series data | ||
3. else, in some case it will try element-wise operation and fall back to series-wise operations if that fails, in some case will operate on series data and in some cases on element data. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sidenote: to be honest, this "in some cases on element data" is IMO by far the main use case of Series.apply
, and in this case (being passed a single function), it always operates element by element, consistently and unambiguously.
Some half-baked thoughts:
|
Let's try to quantify how plausible this is: I grepped
There's a tool which promises to parallelise Searching "what does pandas.Series.apply do" on Google, these are my top results (other than pandas official docs):
Asking chat gpt, I get "In Pandas, the apply() method in the context of a Pandas Series is used to apply a function along the axis of the Series. It can be particularly useful when you want to apply a custom or lambda function to each element of the Series." So yes, I think this is overwhelmingly the most common use case |
I also find Will's #54747 (comment) compelling, though complexities of maintenance ultimately bleeds into user complexities. I would guess most users treat Throwing out another idea for discussion what if we introduce a |
I agree.
We have to distinguish functions(UDF's) that work element-by-element versus functions that aggregate. See examples I wrote here: #54747 (comment) How about we start with
Then for a later release, we deprecate the For the next release, if I think if we did it this way, then most existing code out there would continue to work as it does today, modulo some of the odd behavior that occurs when |
This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this. |
Looks like discussion here has sufficiently stalled without clarity on how to move foward. Going to close for now to clear the queue but feel free to open back up if you want to reengage here |
There has been some demand for a PDEP regarding changes to
Series.apply
proposed in #52140, for example #52140 (comment).I've made a proposal in this PR.
@MarcoGorelli , if you want to add/change to the proposal I'm fine with that, especially regarding styles/formats that are used for PDEPs and I've missed.
EDIT 17/09-2023: I've made a version 2 of the PDEP.