PDEP-13: Deprecate the apply method on Series and DataFrame and make the agg and transform methods operate on series data #54747

topper-123 · 2023-08-25T06:54:34Z

There has been some demand for a PDEP regarding changes to Series.apply proposed in #52140, for example #52140 (comment).

I've made a proposal in this PR.

@MarcoGorelli , if you want to add/change to the proposal I'm fine with that, especially regarding styles/formats that are used for PDEPs and I've missed.

EDIT 17/09-2023: I've made a version 2 of the PDEP.

MarcoGorelli · 2023-08-25T10:13:24Z

Thanks for writing this up, much appreciated

If Series.apply and DataFrame.apply will both operate on the entire column, then maybe the method could be renamed to something more descriptive?
So:

Series.apply -> Series.map_column
DataFrame.apply -> DataFrame.map_columns
Series.map -> Series.map_rows
DataFrame.apply(..., axis=1) -> DataFrame.map_rows
deprecate axis argument in DataFrame.apply
?

(for reference, the Polars API is likely to move in the above direction)

phofl · 2023-08-25T11:43:17Z

web/pandas/pdeps/0013-standardize-apply.md

+This PDEP proposes that:
+
+- The current complex current behavior of `Series.apply` will be deprecated in Pandas 2.2.
+- Single callables given to the `.apply` methods of `Series` will in Pandas 3.0 always be called on the whole `Series`, so `series.apply(func)` will become similar to `func(series)`,


4 months is way too short of a deprecation period for a change that will most likely affect the majority of our users which requires them to modify their code in many different places. There is no easy way to opt into the new behaviour.

Edit: This should go DeprecationWarning -> FutureWarning. This will have a huge impact on downstream libraries as well.

I'm a bit surprised about this. I would expect users who upgrade major versions to run their code through the latest minor version before upgrading and if they get a warning, to fix them before upgrading. Same for libraries, that they would check and fix warnings before upgrading support for major versions.

Only implementing this in v4.0 seems way too slow IMO, if we decide this is a good idea in itself.

@phofl are you sure about

There is no easy way to opt into the new behaviour.

? I thought the way to do that was by passing by_row=True?

If we want this, then the usual deprecation cycle sounds fine tbh, can't wait until 4.0 to resolve such glaring inconsistencies

As for whether we want this...I'm holding on to the idea that there might an alternative which also resolves the inconsistencies. Trying to write something up

bashtage · 2023-08-25T15:15:45Z

One this I don't fully understand is whether Series.apply is necessary at all, at least in the case where this PDEP is accepted. Is there a good reason not to just deprecate it with removal in 3 rather than changing behavior?

MarcoGorelli · 2023-08-25T15:45:37Z

Some motivation for that was given here

Series.apply will do many other things than apply a single callable, for example take strings, or lists and dicts, e.g. do Series.apply([np.sum, "mean"])

topper-123 · 2023-08-26T07:34:25Z

One this I don't fully understand is whether Series.apply is necessary at all, at least in the case where this PDEP is accepted.

I addition to the point made by @MarcoGorelli, the way I've seen it so far is that we need to have a method that operates on elements of the series (that will be Series.map) and a method that operates on the whole Series (Series.apply), because the characteristics of working element-wise are different than working on the whole series (e.g. performance (points 3 in the PDEP) and different results from the operation).

Of course Series.apply, Series.agg and Series.transform will be quite similar, so are you discussing using only Series.agg and Series.transform and dropping Series.apply?

topper-123 · 2023-08-26T07:58:09Z

if Series.apply and DataFrame.apply will both operate on the entire column, then maybe the method could be renamed to something more descriptive?

I have not thought much about this. A few things I would think about here:

I assume this means we drop .agg & .transform also?
we should differentiate between operations that are element-wise and series-wise, how does this do that?

More generally, naming of of this type of methods is of course very fundamental to pandas, and a different, though related, subject than this PDEP. So maybe take that in a separate issue/PDEP, and if that is a serious path forward, I can withdraw this PDEP.

Dr-Irv

I think the main issue here is that the current behavior of Series.apply() is dependent on whether the passed function is an aggregation-like function (e.g., sum, abs, min, etc.) or a true ufunc that operates on a scalar (e.g., abs, np.sqrt, len, np.sin, etc.)

Your write-up is mixing up these use cases, and I think that is the main source of confusion for users.

It seems to me that we should have people no longer use Series.apply at all, and have them use Series.agg() if the function is aggregation-like, and use Series.map() or Series.transform() if the function is a true ufunc.

web/pandas/pdeps/0013-standardize-apply.md

Dr-Irv · 2023-08-26T14:22:21Z

web/pandas/pdeps/0013-standardize-apply.md

+
+Currently, giving a input to `Series.apply` is treated differently depending on the type of the input:
+
+* if the input is a numpy `ufunc`, `series.apply(func)` is equivalent to `func(series)`, i.e. similar to `series.pipe(func)`.


Suggested change

* if the input is a numpy `ufunc`, `series.apply(func)` is equivalent to `func(series)`, i.e. similar to `series.pipe(func)`.

* if the input is a numpy `ufunc` or strings such as `"sum"`, `"min"`, etc.,, `series.apply(func)` is equivalent to `func(series)`, i.e. similar to `series.pipe(func)`.

Strings are a bit different than that, and series.apply(string) is more equivalent to getattr(series, string)(). I've added a line about strings.

Dr-Irv · 2023-08-26T14:23:45Z

web/pandas/pdeps/0013-standardize-apply.md

+
+* if the input is a numpy `ufunc`, `series.apply(func)` is equivalent to `func(series)`, i.e. similar to `series.pipe(func)`.
+* if the input is a callable, but not a numpy `ufunc`, `series.apply(func)` is similar to `Series([func(val) for val in series], index=series.index)`, i.e. similar to `series.map(func)`
+* if the input is a list-like or dict-like, `series.apply(func)` is equivalent to `series.agg(func)` (which is subtly different than `series.apply`)


I think you mean "list-like or dict-like" of callables, including numpy ufunc callables". You don't include something like s.apply([1,2,3]), which is list-like but won't work. So maybe be more precise here?

I've made it more precise, my intention was not to exclude strings. e.g. s.apply(["sqrt", np.sqrt]) is also possible.

Also, the described behavior is the actual behavior, so e.g. a list of ints will be passed through and will end up calling each int and fail. So the current text is accurate.

Dr-Irv · 2023-08-26T14:24:30Z

web/pandas/pdeps/0013-standardize-apply.md

+
+* if the input is a callable, `df.apply(func)` always calls each columns in the DataFrame, so is similar to `func(col) for _, col in
+df.items()` + wrapping functionality
+* if the input is a list-like or dict-like, `df.apply` call each item in the list/dict and wraps the result as needed. So for example if the input is a list, `df.apply(func_list)` is equivalent to `[df.apply(func) for func in func_list]` + wrapping functionality


Probably should add what happens with DataFrame.apply(func_or_list, axis=1)

I've added something about this.

Dr-Irv · 2023-08-26T14:28:09Z

web/pandas/pdeps/0013-standardize-apply.md

+
+- The current complex current behavior of `Series.apply` will be deprecated in Pandas 2.2.
+- Single callables given to the `.apply` methods of `Series` will in Pandas 3.0 always be called on the whole `Series`, so `series.apply(func)` will become similar to `func(series)`,
+- Lists or dicts of callables given to the `Series.apply` will in Pandas 3.0 always call `Series.apply` on each element of the list/dict


You should be more precise here. I think you mean something like "Lists or dicts of callables given to Series.apply will apply each callable to the Series, so that the result of s.apply(list_of_callables) is Series([func[s] for s in list_of_callables)

That won't expand list or dicts to Dataframes, e.g. ser.apply(["sqrt", np.sqrt]) is a dataframe.

I've made a new formulation to clarify this.

web/pandas/pdeps/0013-standardize-apply.md

Dr-Irv · 2023-08-26T15:02:37Z

web/pandas/pdeps/0013-standardize-apply.md

+dtype: int64
+```
+
+Users would expect these two to give the same result.


Ummm, not so sure about that. What about the case of np.abs() or just abs ? In that case Series.agg(abs) gets applied element-wise!

Those will give the same results, so the text is accurate?

I agree we cannot be guaranteed that ser.agg(func) will actually aggregate, but that is IMO a different discussion and maybe part of the discussion that has been opened about changing the API for this class of methods.

web/pandas/pdeps/0013-standardize-apply.md

Dr-Irv · 2023-08-26T15:03:43Z

web/pandas/pdeps/0013-standardize-apply.md

+dtype: float64
+```
+
+These two should give same-shaped output for consistency. Using `Series.transform` instead of `Series.apply`, it returns a `DataFrame` in both cases and I think the dictlike example above should return a `DataFrame` similar to the listlike example.


I agree with you here that this inconsistency is hard to understand.

Dr-Irv · 2023-08-26T15:06:19Z

web/pandas/pdeps/0013-standardize-apply.md

+
+## Deprecation process
+
+To change the behavior to the current behavior will have to be deprecated. This can be done by adding a `by_row` parameter to `Series.apply`, which means, when `by_rows=False`, that `Series.apply` will not operate elementwise but Series-wise.


The question is what would be the default behavior if by_row is not specified?

That will emit a deprecation warning in v2.0. I've made text change to clarify this.

Dr-Irv · 2023-08-26T15:14:30Z

One this I don't fully understand is whether Series.apply is necessary at all, at least in the case where this PDEP is accepted. Is there a good reason not to just deprecate it with removal in 3 rather than changing behavior?

I think this is a good idea.

I also considered going further and eliminating apply() throughout the API, pointing people to the proper methods of agg() or transform() and maybe consider eliminating map, since it is the same as transform() (I think). But if we were to remove DataFrame.apply(), I don't see that we have an alternative that does what DataFrame.apply(func, axis=1) does, so maybe the best thing is to just get rid of Series.apply().

Dr-Irv · 2023-08-26T15:16:25Z

Of course Series.apply, Series.agg and Series.transform will be quite similar, so are you discussing using only Series.agg and Series.transform and dropping Series.apply?

Yes, I think that is what @bashtage is suggesting, and I think it is worth considering.

topper-123 · 2023-08-27T09:31:34Z

maybe consider eliminating map, since it is the same as transform() (I think).

map will be different than transform because map will operate on each element, while transform will operate on the whole Series. So transform will be much faster than map, while map will be more flexible than transform. So we should def keep map IMO.

But if we were to remove DataFrame.apply(), I don't see that we have an alternative that does what DataFrame.apply(func, axis=1) does, so maybe the best thing is to just get rid of Series.apply().

If we keep DataFrame.apply, we should IMO have an equivalent Series.apply.

MarcoGorelli · 2023-08-27T16:15:01Z

First, thanks again to you and @rhshadrach for your work here. There's indeed a tonne of inconsistencies which need resolving, and I think everybody is on board with that.
The question is - in which direction to resolve them?

If I've understood correct, the main issues you want to address are:

inconsistency between df[['a']].apply(func) and df['a'].apply(func)
inconsistency between df['a'].apply('sum') and df['a'].apply(lambda x: sum(x))
inconsistency between df['a'].agg(np.sum) and df['a'].apply(np.sum)

One solution (which you're proposing) would be to change Series.apply to work series-wise, as opposed to element-wise.

My issue with this is that, as far as I can tell, users are already only thinking of apply as an element-wise operation.

From my Kaggle scrapes*, I only see:

DataFrame.apply with axis=1 (i.e. row-wise). E.g.:
- a-journey-through-titanic
- summary-functions-and-maps.ipynb
Series.apply intended to be used element-wise. E.g.:
- a-statistical-analysis-ml-workflow-of-titanic.ipynb
- a-study-on-regression-applied-to-the-ames-dataset.ipynb
- basic-eda-cleaning-and-glove.ipynb
- bert-for-humans-tutorial-baseline.ipynb
- customer-segmentation.ipynb
- ... (too many to list, this is super common)

I couldn't find a single case of DataFrame.apply with axis=0, nor of any kind of apply with a string or a list, nor any case of Series.agg or DataFrame.agg. TBH, before reading the linked issue, I didn't even know most of these were allowed.

If we're going to make breaking changes, I'd suggest that they be made in the direction of keeping apply work element-wise. That's why I'm suggesting:

DataFrame.apply(..., axis=0) -> DataFrame.map_columns
DataFrame.apply(..., axis=1) -> DataFrame.map_rows
Series.apply(callable) -> Series.map_rows
Series.apply("sum") -> Series.map_column

Then:

df[['a']].map_rows(func) would be consistent with df['a'].map_rows(func)
df['a'].map_columns('sum') and df['a'].map_column(lambda x: sum(x)) would be consistent
df['a'].agg(np.sum) and df['a'].map_column(np.sum) would be consistent

To respond to your questions:

I assume this means we drop .agg & .transform also?

I don't see why, no. DataFrame.agg could stay, but the docs would be changed to say "If a function, must either work when passed a DataFrame or when passed to DataFrame.map_columns", and for Series.agg the docs would mention Series.map_column

we should differentiate between operations that are element-wise and series-wise, how does this do that?

The name tells you that. So, map_rows means that the callable works on each row, and map_columns means it happens on each column.

Regarding the deprecation process, the advantage of moving to a new function name is that you tell people to just use that. For example:

>>> df['a'].apply(func)
FutureWarning: `Series.apply` is deprecated. Please use either `Series.map_rows`, if you want to operate element-wise, or `Series.map_column`, if you want to operate on the whole column.

and users can just change their code once.

Conversely, the PDEP's suggested by_rows argument would mean that users need to start using a new argument to avoid a warning, and then remove it because it's deprecated (i.e. change their code twice).

Sorry if I've missed anything fundamental, and thanks for your patience pushing this through. I really appreciate it, and agree that these inconsistencies really need addressing
Happy to have a call if you think that'd help reach consensus sooner

jreback · 2023-08-27T16:25:15Z

@MarcoGorelli i am not sure why you are trying to break with the pandas api

the map_* might be ok if things were all new but they r not

stick with the axis=

+1 in just deprecating and removing apply in favor of transform, agg and map which can accomplish everything

MarcoGorelli · 2023-08-27T16:36:31Z

+1 in just deprecating and removing apply in favor of transform, agg and map which can accomplish everything

OK yes, I would also be on board with this, +1. Then there's be no need to break API or behaviour

Dr-Irv · 2023-08-27T16:39:58Z

+1 in just deprecating and removing apply in favor of transform, agg and map which can accomplish everything

OK yes, I would also be on board with this, +1. Then there's be no need to break API or behaviour

Me too!

Dr-Irv · 2023-08-27T16:42:15Z

But if we were to remove DataFrame.apply(), I don't see that we have an alternative that does what DataFrame.apply(func, axis=1) does, so maybe the best thing is to just get rid of Series.apply().

If we keep DataFrame.apply, we should IMO have an equivalent Series.apply.

But isn't what you are proposing for Series.apply() equivalent to what can be done with Series.agg() ?

Is there a case that your proposal for Series.apply() does something different than what Series.agg() does?

If we deprecate/remove Series.apply() all together, we just point people to Series.agg() and Series.map() as replacements.

jreback · 2023-08-27T17:01:16Z

If we deprecate/remove Series.apply() all together, we just point people to Series.agg() and Series.map() as replacements.

yes exactly - the point is you can just use transform for a same size return shape,
agg for a reduction

and raise if that is not the case

apply was originally meant to be a one sized fits all but has a lot of warts because of this flexibility

bashtage · 2023-08-29T07:56:30Z

+1 in just deprecating and removing apply in favor of transform, agg and map which can accomplish everything

+1.

I agree that deprecation/removal is the best course here. Series.apply has too many magic behaviors, and the alternatives provide a better API where the intent of the call is unambiguous.

topper-123 · 2023-08-29T08:53:28Z

Ok, I can see a consensus is coming along, which is great and I agree apply/agg/transform would be very similar after this PDEP.

A thought:

Are we talking about deprecating only Series.apply i.e. not apply on DataFrame and groupby objects? How would we do the below without Series.apply (after the changes in this PDEP)?:

>>> ser = pd.Series([1, 2, 3,])
>>> func1 = lambda ser: pd.Series([ser[0] * ser[1]**2, ser[2]])
>>> func2 = lambda ser: pd.Series([ser[0] * ser[1]**3, ser[2]])
>>> ser.apply([func1, func2])
   <lambda>  <lambda>
0         4         8
1         3         3

(i.e. the result is not a transformation nor an aggregation).

My guess is it will be be done by converting to a DataFrame and back like ser.to_frame().apply([func1, func2]).iloc[:, 0]. so the question is if we're ok with that instead of having a Series.apply method. It won't be complex to maintain Series.apply after the PDEP is implemented, because we need to have Series functionality internally anyway in order to do columns/row operations in DataFrame.apply.

MarcoGorelli · 2023-08-29T09:28:10Z

Are we talking about deprecating only Series.apply i.e. not apply on DataFrame and groupby objects?

I thought we were talking about both Series.apply and DataFrame.apply, and to point people towards:

map for elementwise
transform and agg for operating on the whole column

Easy, simple, predictable. My favourite things

func1 = lambda ser: pd.Series([ser[0] * ser[1]**2, ser[2]])

I'd be OK with it not being a core feature. You can just call the function directly if you really need it:

In [48]: >>> ser = pd.Series([1, 2, 3,])
    ...: >>> func1 = lambda ser: pd.Series([ser[0] * ser[1]**2, ser[2]])
    ...: >>> func2 = lambda ser: pd.Series([ser[0] * ser[1]**3, ser[2]])

In [49]: pd.concat([func1(ser), func2(ser)], axis=1, keys=['func1', 'func2'])
Out[49]:
   func1  func2
0      4      8
1      3      3

topper-123 · 2023-08-29T10:04:51Z

I'd be fine with deprecating apply in both Series and DataFrame. My issue above was with having an DataFrame.apply method that clearly works on Series, but not having the corresponding series functionality. I agree that using pd.concat is fine in the example I showed.

BTW: We should probably deprecate groupby.apply also, if we deprecate Series|DataFrame.apply?

MarcoGorelli · 2023-08-29T10:42:12Z

groupby.apply is a bit different, I think it operates on the entire subdataframe corresponding to the group? Not sure there's a replacement, so I think it'd be fine to keep it (or in any case, to keep it to a separate discussion - sorting out Series.apply and DataFrame.apply would already be a win 🏆 )

Dr-Irv · 2023-08-29T12:08:59Z

I'd be fine with deprecating apply in both Series and DataFrame. My issue above was with having an DataFrame.apply method that clearly works on Series, but not having the corresponding series functionality. I agree that using pd.concat is fine in the example I showed.

I'm not sure that we can deprecate DataFrame.apply(), because of this kind of use case:

In [53]: df = pd.DataFrame({"x":[1,2,3], "y":[10,20,30]})

In [54]: func = lambda s: s["y"] if s["x"]%2 == 1 else s["x"]

In [55]: df.apply(func, axis=1)
Out[55]:
0    10
1     2
2    30
dtype: int64

I made up an odd function, but the ability to apply a function to each row of a DataFrame is a common usage in code from my team.

MarcoGorelli · 2023-08-29T12:15:29Z

can't you do that with DataFrame.agg:

In [8]: In [53]: df = pd.DataFrame({"x":[1,2,3], "y":[10,20,30]})
   ...:
   ...: In [54]: func = lambda s: s["y"] if s["x"]%2 == 1 else s["x"]

In [9]: df.agg(func, axis=1)
Out[9]:
0    10
1     2
2    30
dtype: int64

?

Dr-Irv · 2023-08-29T13:34:27Z

can't you do that with DataFrame.agg:

Wasn't aware of that! Sounds like DataFrame.apply() can disappear as well!

topper-123 · 2023-08-30T09:03:07Z

groupby.apply is a bit different, I think it operates on the entire subdataframe corresponding to the group? Not sure there's a replacement, so I think it'd be fine to keep it (or in any case, to keep it to a separate discussion - sorting out Series.apply and DataFrame.apply would already be a win 🏆 )

@rhshadrach , do you have an opinion, I know you've looked at the groupby apply/agg/transform methods? I've looked into it a bit and to me it does look quite complex, so I'm not sure if groupby.apply does things differently than groupby.agg and groupby.transform.

topper-123 · 2023-08-30T09:19:09Z

Just some info: In order to deprecate apply, we will have to add a keyword parameter to Series.transform, because currently we fall back to use Series.apply there (see Apply.transform_str_or_callablein pandas/core/apply.py).

Example:

>>> df = pd.DataFrame({"x":range(100_000)})
>>> %timeit df.transform(lambda x: x + 1). # operates on the series, fast
142 µs ± 589 ns per loop
>>> %timeit df["x"].transform(lambda x: x + 1). # operates on elements, slow
15.5 ms ± 110 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

So the behavioral differences are also present here, like for apply.

mroeschke · 2023-09-18T17:26:23Z

web/pandas/pdeps/0013-standardize-apply.md

+
+1. When users want to operate on single elements in a `Series` or `DataFrame`, they should use `Series.map` and `DataFrame.map` respectively.
+2. When users want to aggregate a `Series`, columns/rows of a `DataFrame` or groups in `groupby` objects, they should use `Series.agg`, `DataFrame.agg` and `groupby.agg` respectively.
+3. When users want to transform a `Series`, columns/rows of a `DataFrame` or groups in `groupby` objects, they should use `Series.transform`, `DataFrame.transform` and `groupby.transform` respectively.


Could you add a 4th When users want to neither aggregate or transform a ..., they should ...

mroeschke · 2023-09-18T17:33:31Z

Some thoughts

Could you include the behavior of rolling.apply/rolling.agg for completeness
I would opt for a slower deprecation process of apply given how commonly used it is in the wild

Remove all non-necessary apply usages from the documenation
DeprecationWarning in 2.x
FutureWarning in 3.0
Removal 4.0

Dr-Irv

Aside from my comments, should replace all uses of "&" with "and". I don't think I caught them all

Dr-Irv · 2023-09-18T17:36:25Z

web/pandas/pdeps/0013-standardize-apply.md

+
+1. When users want to operate on single elements in a `Series` or `DataFrame`, they should use `Series.map` and `DataFrame.map` respectively.
+2. When users want to aggregate a `Series`, columns/rows of a `DataFrame` or groups in `groupby` objects, they should use `Series.agg`, `DataFrame.agg` and `groupby.agg` respectively.
+3. When users want to transform a `Series`, columns/rows of a `DataFrame` or groups in `groupby` objects, they should use `Series.transform`, `DataFrame.transform` and `groupby.transform` respectively.


I think you need to have the case when users want to have an operation on a row of a DataFrame, they should use DataFrame.agg(func, axis=1) . Today a common use case is using DataFrame.apply(func, axis=1). Since func can do anything, it isn't entirely clear that agg with axis=1 will supply each row of the series to the function, meaning that the corresponding function doesn't need to "aggregate".

DataFrame.apply(func, axis=1) is just DataFrame.T.apply(func).T, so any issue with axis=1 will also be present with axis=0. So I don't see a special issue with axis=1, except maybe some idioms that are more common with axis=1 (which may be a valid concern, though).

Can you give some example code snippets showing some possible problems?

@Dr-Irv I've added an example here https://github.com/pandas-dev/pandas/pull/54747/files#r1339884642

Inclined to agree that .agg feels really unnatural here

web/pandas/pdeps/0013-standardize-apply.md

Dr-Irv · 2023-09-18T17:40:31Z

web/pandas/pdeps/0013-standardize-apply.md

+2. When users want to aggregate a `Series`, columns/rows of a `DataFrame` or groups in `groupby` objects, they should use `Series.agg`, `DataFrame.agg` and `groupby.agg` respectively.
+3. When users want to transform a `Series`, columns/rows of a `DataFrame` or groups in `groupby` objects, they should use `Series.transform`, `DataFrame.transform` and `groupby.transform` respectively.
+
+The use of `Series.apply` &  `DataFrame.apply` will after the proposed change in almost all cases be replaced by `map`, `agg` or `transform`. In the very few cases where `Series.apply` &  `DataFrame.apply` cannot be substituted by `map`, `agg` or `transform`, it is proposed that it will be accepted that users will have to find alternative ways to apply the functions, i.e. typically apply the functions manually and possibly concatenating the results.


It may be useful to have some examples of those "very few cases"

Dr-Irv · 2023-09-18T17:42:14Z

web/pandas/pdeps/0013-standardize-apply.md

+
+Below is an overview of the current behavior in table form when giving callables to `agg`, `transform` & `apply`. As an example on how to read the tables, when a non-ufunc callable is given to `Series.agg`, `Series.agg` will first try to apply the callable to each element in the series, and if that fails, will fall back to call the series using the callable.
+
+(The description may not be 100 % accurate because of various special cases in the current implementation, but will give a good understanding of the current behavior).


should you add the string representations of functions to the tables?

The problems that I try to solve only concern callables and I try to keep the PDEP brief. I can coment somewhere that the proposal does not affect supplying strings to the methods.

But even the behavior with strings is confusing. Here's some examples to illustrate:

>>> s = pd.Series([1,2,3]) >>> s.apply("sum") 6 >>> s.apply(np.sum) 0 1 1 2 2 3 dtype: int32 >>> s.apply("mean") 2.0 >>> s.apply(np.mean) 0 1.0 1 2.0 2 3.0 dtype: float64 >>> s.apply(abs) 0 1 1 2 2 3 dtype: int64 >>> s.apply(np.abs) 0 1 1 2 2 3 dtype: int64 >>> s.apply("abs") 0 1 1 2 2 3 dtype: int64

Dr-Irv · 2023-09-18T17:44:23Z

web/pandas/pdeps/0013-standardize-apply.md

+
+In addition to the above effects of the current implementation of `agg`/`transform` & `apply`, see [#52140](https://github.com/pandas-dev/pandas/issues/52140) for more examples of the unexpected effects of how `apply` is implemented.
+
+It can also be noted that `Series.apply` & `DataFrame.apply` could almost always be replaced with calls to `agg`, `transform` & `map`, if `agg` & `transform` were to always operate on series data. For some examples, see the table below for alternatives using `apply(func)`:


Suggested change

It can also be noted that `Series.apply` & `DataFrame.apply` could almost always be replaced with calls to `agg`, `transform` & `map`, if `agg` & `transform` were to always operate on series data. For some examples, see the table below for alternatives using `apply(func)`:

It can also be noted that `Series.apply` & `DataFrame.apply` could almost always be replaced with calls to `agg`, `transform` or `map`, if `agg` and `transform` were to always operate on series data. For some examples, see the table below for alternatives using `apply(func)`:

Dr-Irv · 2023-09-18T17:45:17Z

web/pandas/pdeps/0013-standardize-apply.md

+| lambda x: x + 1     | .transform | .transform  |
+| [lambda x: x.sum()] | .agg       | .agg        |
+
+So, for example, `ser.apply(lambda x: str(x))` can be replaced with `ser.map(lambda x: str(x))` while `df.apply([lambda x: x.sum()])` can be replaced with `df.agg([lambda x: x.sum()])`.


Would be really worthwhile to include some axis=1 examples for DataFrame.apply()

Dr-Irv · 2023-09-18T17:45:55Z

web/pandas/pdeps/0013-standardize-apply.md

+
+With the above in mind, it is proposed that in the future:
+
+It is proposed that `apply`, `transform` and `agg` in the future will work as follows:


Suggested change

It is proposed that `apply`, `transform` and `agg` in the future will work as follows:

It is proposed that `apply`, `transform` and `agg` will work as follows:

(You have "in the future" in 2 consecutive sentences)

Dr-Irv · 2023-09-18T17:47:26Z

web/pandas/pdeps/0013-standardize-apply.md

+2. When users want to aggregate a `Series`, columns/rows of a `DataFrame` or groups in `groupby` objects, they should use `Series.agg`, `DataFrame.agg` and `groupby.agg` respectively.
+3. When users want to transform a `Series`, columns/rows of a `DataFrame` or groups in `groupby` objects, they should use `Series.transform`, `DataFrame.transform` and `groupby.transform` respectively.
+
+The use of `Series.apply` &  `DataFrame.apply` will after the proposed change in almost all cases be replaced by `map`, `agg` or `transform`. In the very few cases where `Series.apply` &  `DataFrame.apply` cannot be substituted by `map`, `agg` or `transform`, it is proposed that it will be accepted that users will have to find alternative ways to apply the functions, i.e. typically apply the functions manually and possibly concatenating the results.


this text seems to be repeated from the abstract. So shorten the abstract - leave the details here

Dr-Irv · 2023-09-18T17:48:03Z

web/pandas/pdeps/0013-standardize-apply.md

+
+The use of `Series.apply` &  `DataFrame.apply` will after the proposed change in almost all cases be replaced by `map`, `agg` or `transform`. In the very few cases where `Series.apply` &  `DataFrame.apply` cannot be substituted by `map`, `agg` or `transform`, it is proposed that it will be accepted that users will have to find alternative ways to apply the functions, i.e. typically apply the functions manually and possibly concatenating the results.
+
+It can be noted that `groupby.agg`, `groupby.transform` & `groupby.apply` are not proposed changed in this PDEP, because `groupby.agg`, `groupby.transform` already behave as desired and `groupby.apply` behaves differently than `Series.apply` & `DataFrame.apply`. Likewise, the behavior when given ufuncs will remain unchanged, because the behavior is already as intended in all cases.


Suggested change

It can be noted that `groupby.agg`, `groupby.transform` & `groupby.apply` are not proposed changed in this PDEP, because `groupby.agg`, `groupby.transform` already behave as desired and `groupby.apply` behaves differently than `Series.apply` & `DataFrame.apply`. Likewise, the behavior when given ufuncs will remain unchanged, because the behavior is already as intended in all cases.

It can be noted that the behavior of `groupby.agg`, `groupby.transform` and `groupby.apply` are not proposed to be changed in this PDEP, because `groupby.agg`, `groupby.transform` already behave as desired and `groupby.apply` behaves differently than `Series.apply` & `DataFrame.apply`. Likewise, the behavior when given ufuncs will remain unchanged, because the behavior is already as intended in all cases.

Dr-Irv · 2023-09-18T17:50:44Z

web/pandas/pdeps/0013-standardize-apply.md

+To change the current behavior, it will have to be deprecated. This will be done by in v2.2:
+
+1. Deprecate `Series.apply` & `DataFrame.apply`.
+2. Add a `series_ops_only` with type `bool | lib.NoDefault` parameter to `agg` & `transform` methods of `Series` & `DataFrame`. When `series_ops_only` is set to False, `agg` & `transform` will behave as they do currently. When set to True, `agg` & `transform` will never operate on elements, but always on Series. When set to `no_default`, `agg` & `transform` will behave as `series_ops_only=False`, but will emit a FutureWarning the current behavior will be reoved in the future.


Suggested change

2. Add a `series_ops_only` with type `bool | lib.NoDefault` parameter to `agg` & `transform` methods of `Series` & `DataFrame`. When `series_ops_only` is set to False, `agg` & `transform` will behave as they do currently. When set to True, `agg` & `transform` will never operate on elements, but always on Series. When set to `no_default`, `agg` & `transform` will behave as `series_ops_only=False`, but will emit a FutureWarning the current behavior will be reoved in the future.

2. Add a `series_ops_only` argument with type `bool | lib.NoDefault` parameter to `agg` and `transform` methods of `Series` and `DataFrame` with a default value of `lib.NoDefault`. When `series_ops_only` is set to `False`, `agg` and `transform` will behave as they do currently. When set to `True`, `agg` and `transform` will never operate on elements, but always on Series. When set to `no_default`, `agg` and `transform` will behave as `series_ops_only=False`, but will emit a `FutureWarning` the current behavior will be removed in the future.

topper-123 · 2023-09-20T06:25:39Z

I've looked into this again and have an issue: We don't have these keyword params from DataFrame.apply in agg or transform:

raw
result_type
engine (new in v2.2)
engine_kwargs (new in v2.2)

So deprecation of DataFrame.apply with implementing these in at least DataFrame.agg & DataFrame.transform will lose functionality. So I think these should be added to DataFrame.agg & DataFrame.transform?

topper-123 · 2023-09-20T06:33:07Z

* Remove all non-necessary apply usages from the documenation

* DeprecationWarning in 2.x

* FutureWarning in 3.0

* Removal 4.0

I've changed the deprecaton process according to this.

topper-123 · 2023-09-20T07:41:57Z

I've changed according to comments. Comments on future use of raw, result_type, engine and engine_kwargs parameters?

MarcoGorelli · 2023-09-28T10:20:06Z

web/pandas/pdeps/0013-standardize-apply.md

+
+The above changes means that the future behavior, when users want to apply arbitrary callables in pandas, can be described as follows:
+
+1. When users want to operate on single elements in a `Series` or `DataFrame`, they should use `Series.map` and `DataFrame.map` respectively.


Are we sure this is accurate? For the use cases I've seen of DataFrame.apply, I think DataFrame.agg would probably be more likely as a replacement

Currently, say I do

In [43]: df = pd.DataFrame({'a': ['quetzal', 'quetzal', 'baboon'], 'b': ['panda', 'chinchilla', 'elk']}) In [44]: df Out[44]: a b 0 quetzal panda 1 quetzal chinchilla 2 baboon elk In [45]: def func(row): ...: return f"{row['a']} or {row['b']}?" ...: In [46]: df.apply(func, axis=1) Out[46]: 0 quetzal or panda? 1 quetzal or chinchilla? 2 baboon or elk? dtype: object

Then if I use DataFrame.map, it'll operate over all elements of the dataframe 1-by-1. Whereas if I use .agg, then I'll preserve what I'm currently doing:

In [47]: df.agg(func, axis=1) Out[47]: 0 quetzal or panda? 1 quetzal or chinchilla? 2 baboon or elk? dtype: object

This feels very unnatural though - func here doesn't look like an aggregation at all, so I wouldn't have reached for .agg

Could DataFrame.map get an axis keyword as well, so that DataFrame.map(func, axis=1) preserves the current behaviour of DataFrame.apply(func, axis=1)?

Isn't this a transform? But users don't get to control what gets passed to the UDF they supply - is it columm by column, row by row, or the entire frame. Perhaps they should.

transform would raise here:

In [53]: df.transform(func, axis=1) --------------------------------------------------------------------------- ValueError: Function did not transform

And this underscores just how confusing this is

is it columm by column, row by row, or the entire frame. Perhaps they should.

yes, exactly, it could be:

axis=None: element by element (current behaviour)

axis=1: row by row

axis=0: element by element

Then, df.apply(lambda row: ..., axis=1) can just become df.map(lambda row: ..., axis=1)

This is a very common way of using apply, e.g.

titanic_df['Person'] = titanic_df[['Age','Sex']].apply(get_person,axis=1)

from https://www.kaggle.com/code/omarelgabry/a-journey-through-titanic

@MarcoGorelli , your example is an aggregation though and func does operate on rows, not elements, so everything works as intended in your agg example.

However, I guess agg linguistically(?) is best understood as a numerical reduction, not a text combination operation, so using the name agg may be a bit unintuitive. My original proposal would have kept apply for situations like this, because the problem I wanted to solve was apply giving subtly different results for Series.apply and DataFrame.apply (including for axis=1).

Your example could be an argument for keeping apply, but drop the element-by-element behavior of Series.apply, i.e. revert to the original proposal of this PDEP.

I'm -1 on making map operate on rows/columns. It's a strength of the method that it always gives same-shaped results and it would be confusing for the result changing shape based on if an axis parameter is given or not.

true, it's a horizontal aggregation: it combines multiple columns into 1

I'm just thinking about the upgrade process

I've only ever seen people use DataFrame.apply to create a single column, and if the upgrade process is going to be:

Series.apply(func) => Series.map(func)

DataFrame.apply(func, axis=1) => DataFrame.agg(func, axis=1)

then that could be confusing, and certainly harder than just grepping for .apply

I agree with @MarcoGorelli about the upgrade process. I find changing Dataframe.apply(func, axis=1) to DataFrame.agg(func, axis=1) to be non-intuitive.

But maybe we should consider introducing a method DataFrame.project(func) that is the replacement for DataFrame.apply(func, axis=1) . Idea being that you "project" a function onto each row of the DF.

Or should we make DataFrame.transform(func, axis=1) work by transforming each row?

wait let's not add yet another method to the API 😅

(just another voice here to support the notion that requiring agg for this use case feels very unnatural)

topper-123 · 2023-09-29T09:50:07Z

Yeah I see where you are coming from. My idea is to alleviate this by doing good doc updates. Could e.g. Have some common examples showing how to do the change. Do you have some ideas?

Btw, I’m not opposed to keeping or deprecating apply (it’s very similar to agg/transform after all), but must admit I’m being a bit worn down by the long process and would really like a decision soonish…

MarcoGorelli · 2023-09-29T11:22:11Z

Sure, good docs + an update guide would be good, but it does become more complicated than just "use map instead"

Would adding axis to map be a blocker for you?

Dr-Irv

I think that the DataFrame.apply(func, axis=1) case has to be resolved. I think that @MarcoGorelli and I are concerned about that particular case.

Also, I think the behavior with functions specified as strings should be addressed, given the inconsistent behavior that I have noted in one of my comments.

Dr-Irv · 2023-09-29T14:51:18Z

web/pandas/pdeps/0013-standardize-apply.md

+
+1. the `agg` and `transform` methods of `Series`, `DataFrame` and `groupby` will always operate series-wise and never element-wise
+2. `Series.apply` and `DataFrame.apply` will be deprecated.
+3. The current behavior when supplying string to the methods will not be changed.


Suggested change

3. The current behavior when supplying string to the methods will not be changed.

3. The current behaviors when supplying strings for the function arguments to the methods will not be changed.

Dr-Irv · 2023-09-29T14:59:43Z

web/pandas/pdeps/0013-standardize-apply.md

+
+The above changes means that the future behavior, when users want to apply arbitrary callables in pandas, can be described as follows:
+
+1. When users want to operate on single elements in a `Series` or `DataFrame`, they should use `Series.map` and `DataFrame.map` respectively.


I agree with @MarcoGorelli about the upgrade process. I find changing Dataframe.apply(func, axis=1) to DataFrame.agg(func, axis=1) to be non-intuitive.

But maybe we should consider introducing a method DataFrame.project(func) that is the replacement for DataFrame.apply(func, axis=1) . Idea being that you "project" a function onto each row of the DF.

Or should we make DataFrame.transform(func, axis=1) work by transforming each row?

Dr-Irv · 2023-09-29T15:06:15Z

web/pandas/pdeps/0013-standardize-apply.md

+
+Below is an overview of the current behavior in table form when giving callables to `agg`, `transform` & `apply`. As an example on how to read the tables, when a non-ufunc callable is given to `Series.agg`, `Series.agg` will first try to apply the callable to each element in the series, and if that fails, will fall back to call the series using the callable.
+
+(The description may not be 100 % accurate because of various special cases in the current implementation, but will give a good understanding of the current behavior).


But even the behavior with strings is confusing. Here's some examples to illustrate:

>>> s = pd.Series([1,2,3]) >>> s.apply("sum") 6 >>> s.apply(np.sum) 0 1 1 2 2 3 dtype: int32 >>> s.apply("mean") 2.0 >>> s.apply(np.mean) 0 1.0 1 2.0 2 3.0 dtype: float64 >>> s.apply(abs) 0 1 1 2 2 3 dtype: int64 >>> s.apply(np.abs) 0 1 1 2 2 3 dtype: int64 >>> s.apply("abs") 0 1 1 2 2 3 dtype: int64

Dr-Irv · 2023-09-29T15:09:05Z

web/pandas/pdeps/0013-standardize-apply.md

+
+The use of `Series.apply` and  `DataFrame.apply` will after the proposed change in almost all cases be replaced by `map`, `agg` or `transform`. In the very few cases where `Series.apply` and  `DataFrame.apply` cannot be substituted by `map`, `agg` or `transform`, it is proposed that it will be accepted that users will have to find alternative ways to apply the functions, i.e. typically apply the functions manually and possibly concatenating the results.
+
+It can be noted that the behavior of `groupby.agg`, `groupby.transform` and `groupby.apply` are not proposed changed in this PDEP, because `groupby.agg`, `groupby.transform` already behave as desired and `groupby.apply` behaves differently than `Series.apply` and `DataFrame.apply`. Likewise, the behavior when given ufuncs (e.g. `np.sqrt`) and string input (e.g. `"sqrt"`) will remain unchanged, because the behavior is already as intended in all cases.


Suggested change

It can be noted that the behavior of `groupby.agg`, `groupby.transform` and `groupby.apply` are not proposed changed in this PDEP, because `groupby.agg`, `groupby.transform` already behave as desired and `groupby.apply` behaves differently than `Series.apply` and `DataFrame.apply`. Likewise, the behavior when given ufuncs (e.g. `np.sqrt`) and string input (e.g. `"sqrt"`) will remain unchanged, because the behavior is already as intended in all cases.

It can be noted that the behavior of `groupby.agg`, `groupby.transform` and `groupby.apply` are not proposed to be changed in this PDEP, because `groupby.agg`, `groupby.transform` already behave as desired and `groupby.apply` behaves differently than `Series.apply` and `DataFrame.apply`. Likewise, the behavior when given ufuncs (e.g. `np.sqrt`) and string input (e.g. `"sqrt"`) will remain unchanged, because the behavior is already as intended in all cases.

topper-123 · 2023-10-01T08:51:15Z

Would adding axis to map be a blocker for you?

My issue is that axis=1will be a different operation type than axis=0, i.e. one would work element by element and the other row by row and yield subtly different results, and we'd get similar problems as described in #52140.

I would be more positive to keep apply, so it can be used for row/column operations that don't fit into agg/transform, but then we'd then have to fix the issue with Series.apply somehow. For example in your example I'd expect df.iloc[0].apply(func) to give "quetzal or panda?", but currently it fails.

MarcoGorelli · 2023-10-01T10:37:07Z

hmm OK, point taken. there'd be a long deprecation process, and plenty of time to teach/warn whether df.agg(..., axis=1) or df.transform(..., axis=1) is the right replacement. We could write an update guide, and link to that in the deprecation warnings, and give plenty of "how to update your code" examples

You're right that this has been going for some time, so maybe we could call a vote?

EDIT

as Patrick correctly pointed out, this requires a ping of pandas-core and some time for people to catch up - we've done that now, so let's start the process by which we can call a vote in the future

phofl · 2023-10-01T10:49:41Z

This needs a ping of @pandas-dev/pandas-core before we can even consider calling a vote

jorisvandenbossche · 2023-10-02T13:19:01Z

Thanks for the ping, Patrick. On a procedural note, in the future it might be best to ping the core team earlier (and notify the pandas-dev mailing list), or at least if you think the PDEP is ready for general discussion.

(looking at the discussion from just the past few days, it seems there are still significant open questions even among the proponents of the PDEP, so as someone who still needs to read up on the whole discussion this also seems a bit strange that it would be ready for a vote? But I first now do that, read the actual PDEP and discussion ;))

jorisvandenbossche · 2023-10-02T13:29:42Z

Would adding axis to map be a blocker for you?

My issue is that axis=1will be a different operation type than axis=0, i.e. one would work element by element and the other row by row and yield subtly different results, and we'd get similar problems as described in #52140.

I personally agree with @topper-123 here. One of the nice things about map currently is that its scope is very clear: it simply maps your function or mapping to each individual element of the Series (or DataFrame), and that's it.
Expanding that to allow row-wise operations (and then also column-wise operations?) would be a step back for map IMO.

I understand the rationale of wanting to deprecate Series.apply (I haven't yet made up my mind whether I find those reasons worth the trouble actually doing that, but at least I see the reasons), but I don't really understand the decision to also deprecate DataFrame.apply. This method already works on a series, and so doesn't have the problem of Series.apply of requiring a different kind of UDF (eg DataFrame.apply and groupby.apply are already quite consistent with each other?).

WillAyd · 2023-10-02T14:21:33Z

web/pandas/pdeps/0013-standardize-apply.md

+
+## Motivation
+
+The current behavior of `apply`, `agg` and `transform` is very complex and therefore difficult to understand for non-expert users. The difficulty is especially that the methods sometimes apply callables on elements of series/dataframes, sometimes on Series or columns/rows of Dataframes and sometimes try element-wise operation and if that fails, falls back to series-wise operations.


I find the motivation to be pretty strong from a developer perspective but I think it is lacking from an end user perspective. To play devil's advocate...why should a non-expert user care or try to understand this distinction?

While there are definitely warts here part of me thinks we've had them for X number of years and gotten by without too much end user complaint about it. I think this would be the largest deprecation we've done on the project as long as I've been involved, so I'm a little wary of causing churn instead of just trying to "softly" guide users towards the more explicit map / agg / transform options

jorisvandenbossche · 2023-10-02T13:35:15Z

web/pandas/pdeps/0013-standardize-apply.md

+|                                    | Series                           | DataFrame                        | groupby   |
+|:-----------------------------------|:---------------------------------|:---------------------------------|:----------|
+| ufunc or list/dict of ufuncs       | series                           | series                           | series    |
+| other callables (non ufunc)        | Try elements, fallback to series | series                           | series    |


The "try elements" is already deprecated, right?

In [19]: df = pd.DataFrame({"A": range(3)}) In [21]: df["A"].agg(lambda x: np.sum(x)) <ipython-input-21-2e8408eae29f>:1: FutureWarning: using <function <lambda> at 0x7f26752e5ee0> in Series.agg cannot aggregate and has been deprecated. Use Series.transform to keep behavior unchanged. df["A"].agg(lambda x: np.sum(x)) Out[21]: 0 0 1 1 2 2 Name: A, dtype: int64

jorisvandenbossche · 2023-10-02T13:37:44Z

web/pandas/pdeps/0013-standardize-apply.md

+
+|                                    | Series                           | DataFrame                        | groupby   |
+|:-----------------------------------|:---------------------------------|:---------------------------------|:----------|
+| ufunc or list/dict of ufuncs       | series                           | series                           | series    |


Below, it is mentioned that agg (and others) will always operate on a Series, which might indirectly apply that this line in the table is correct.
But a ufunc never aggregates, so I assume we don't want this, and the end goal should be that agg always reduces, and so never accepts a ufunc ?

jorisvandenbossche · 2023-10-02T13:42:59Z

web/pandas/pdeps/0013-standardize-apply.md

+|                                    | Series   | DataFrame   | groupby   |
+|:-----------------------------------|:---------|:------------|:----------|
+| ufunc or list/dict of ufuncs       | series   | series      | series    |
+| other callables (non ufunc)        | elements | series      | series    |


jorisvandenbossche · 2023-10-02T13:45:12Z

web/pandas/pdeps/0013-standardize-apply.md

+|:-----------------------------------|:---------|:------------|:----------|
+| ufunc or list/dict of ufuncs       | series   | series      | series    |
+| other callables (non ufunc)        | elements | series      | series    |
+| list/dict of callables (non-ufunc) | Try elements, fallback to series | series    | series    |


Suggested change

| list/dict of callables (non-ufunc) | Try elements, fallback to series | series | series |

| list/dict of callables (non-ufunc) | Try elements, fallback to series | series | - |

I am not sure groupby.apply accepts a list or dict?

For example

In [42]: df = pd.DataFrame({"A": range(3)}) In [43]: df.groupby([0, 1, 0]).apply([lambda x: x]) ... TypeError: unhashable type: 'list'

jorisvandenbossche · 2023-10-02T14:40:09Z

web/pandas/pdeps/0013-standardize-apply.md

+
+1. when given numpy ufuncs, callables given to `agg`/`transform`/`apply` operate on series data
+2. when used on groupby objects, callables given to `agg`/`transform`/`apply` operate on series data
+3. else, in some case it will try element-wise operation and fall back to series-wise operations if that fails, in some case will operate on series data and in some cases on element data.


Sidenote: to be honest, this "in some cases on element data" is IMO by far the main use case of Series.apply, and in this case (being passed a single function), it always operates element by element, consistently and unambiguously.

jbrockmendel · 2023-10-02T15:12:54Z

Some half-baked thoughts:

Several comments in this thread about Series.apply being used to operate element-wise being the most common usage (and hence a painful deprecation). I find this plausible but have no strong prior here. If it is the case, what if we did a 180 and deprecated all not-that usages of Series.apply? That would not resolve the problem of Series.apply being inconsistent with EverythingElse.apply, but would at least make Series.apply internally consistent. It would make Series.map completely redundant though. (Not advocating, just spitballing)
My kneejerk reaction to adding more methods or keywords is negative. Especially axis=1, if it turns out that doing so is basically just an alias for df.T.whatever().T.
Will's comment about this being more compelling for devs than for users seems reasonable. For devs or type-conscious users we could introduce a lint check (or maybe something like the ast thing @MarcoGorelli discussed at the sprint) to discourage bad uses of apply?
The big motivator that I haven't seen in this thread was @rhshadrach's discussion of untangling the way that apply/agg/transform/whatever call each other is hard-to-maintain (and therefore likely buggy) code. I don't see anything about that in this thread, so may be remembering that wrong?

MarcoGorelli · 2023-10-02T15:32:23Z

Series.apply being used to operate element-wise being the most common usage (and hence a painful deprecation). I find this plausible but have no strong prior here

Let's try to quantify how plausible this is:

I grepped \.apply( on my scrapes of the top 100 most upvoted Kaggle notebooks, and found:

134 cases (some within the same notebook) of using apply (either Series.apply or DataFrame.apply(..., axis=1)) element-wise
1 case of using Series.apply on the Series (d['sell_price'].apply(np.log1p))

There's a tool which promises to parallelise apply, and their examples shows an element-wise operation: https://github.com/nalepae/pandarallel

Searching "what does pandas.Series.apply do" on Google, these are my top results (other than pandas official docs):

"It simplifies applying a function on each element in a pandas Series" https://www.dataquest.io/blog/tutorial-how-to-use-the-apply-method-in-pandas/
"the apply() function is used to invoke a python function on values of Series" https://www.w3resource.com/pandas/series/series-apply.php
"apply() function is used to execute a function for each element in a Series" https://sparkbyexamples.com/pandas/pandas-series-apply-function-usage/

Asking chat gpt, I get "In Pandas, the apply() method in the context of a Pandas Series is used to apply a function along the axis of the Series. It can be particularly useful when you want to apply a custom or lambda function to each element of the Series."

So yes, I think this is overwhelmingly the most common use case

mroeschke · 2023-10-13T17:45:49Z

I also find Will's #54747 (comment) compelling, though complexities of maintenance ultimately bleeds into user complexities. I would guess most users treat apply as "the way you pass pandas objects into your UDF and get back a pandas object".

Throwing out another idea for discussion what if we introduce a how=None|"element-wise"|"row/column-wise" keyword in apply to more explicitly dictate what is passed into the UDF(s). I think this would allow eventual straightening of the internals (element-wise just calls map), users get to keep using apply, and through keyword deprecations slowly remove the complex falling back behavior.

Dr-Irv · 2023-10-13T19:11:11Z

I also find Will's #54747 (comment) compelling, though complexities of maintenance ultimately bleeds into user complexities. I would guess most users treat apply as "the way you pass pandas objects into your UDF and get back a pandas object".

I agree.

Throwing out another idea for discussion what if we introduce a how=None|"element-wise"|"row/column-wise" keyword in apply to more explicitly dictate what is passed into the UDF(s). I think this would allow eventual straightening of the internals (element-wise just calls map), users get to keep using apply, and through keyword deprecations slowly remove the complex falling back behavior.

We have to distinguish functions(UDF's) that work element-by-element versus functions that aggregate. See examples I wrote here: #54747 (comment)

How about we start with how= None | "by-element" | "by-series", with how=None being the default. Then, for the next release:

Series.apply(func, how="by-element") is same as Series.map(func)
Series.apply(func, how="by-series") is same as func(Series.values)
DataFrame.apply(func, how="by-element") is same as DataFrame.map(func) (and axis argument is ignored)
DataFrame.apply(func, how="by-series", axis=0) is same as applying func to each column of the DataFrame
DataFrame.apply(func, how="by-series", axis=1) is same as applying func to each row of the DataFrame
Series.apply(func, how=None) is same as Series.apply(func) today (complex behavior)
DataFrame.apply(func, how=None) is same as DataFrame.apply(func) today (complex behavior)

Then for a later release, we deprecate the None argument, and change the default behavior, so that for Series the default is by-element and for DataFrame the default is by-series.

For the next release, if None creates a behavior that is different than the future defaults of by-element for Series and by-series for DataFrame, we issue a warning that the user needs to be more explicit.

I think if we did it this way, then most existing code out there would continue to work as it does today, modulo some of the odd behavior that occurs when func is a UDF, vs. an aggregation-type function, vs. a function specified as a string. But that odd behavior creates a warning.

github-actions · 2023-11-13T00:05:41Z

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

mroeschke · 2024-04-23T17:55:42Z

Looks like discussion here has sufficiently stalled without clarity on how to move foward. Going to close for now to clear the queue but feel free to open back up if you want to reengage here

topper-123 requested a review from datapythonista as a code owner August 25, 2023 06:54

topper-123 requested a review from rhshadrach August 25, 2023 07:46

phofl reviewed Aug 25, 2023

View reviewed changes

MarcoGorelli mentioned this pull request Aug 25, 2023

API: make the func in Series.apply always operate on the Series #52140

Open

mroeschke added the PDEP pandas enhancement proposal label Aug 25, 2023

Dr-Irv reviewed Aug 26, 2023

View reviewed changes

update

9d1ac2f

topper-123 force-pushed the master branch from ba3b17c to 9d1ac2f Compare September 17, 2023 21:50

mroeschke reviewed Sep 18, 2023

View reviewed changes

Dr-Irv requested changes Sep 18, 2023

View reviewed changes

update for comments

42eae36

topper-123 changed the title ~~PDEP-13: Deprecate the apply method on Series & DataFrame and make the agg and transform methods operate on series data~~ PDEP-13: Deprecate the apply method on Series and DataFrame and make the agg and transform methods operate on series data Sep 20, 2023

small fix

138e4c6

MarcoGorelli reviewed Sep 28, 2023

View reviewed changes

Dr-Irv reviewed Sep 29, 2023

View reviewed changes

WillAyd reviewed Oct 2, 2023

View reviewed changes

jorisvandenbossche reviewed Oct 2, 2023

View reviewed changes

github-actions bot added the Stale label Nov 13, 2023

mroeschke closed this Apr 23, 2024


		Currently, giving a input to `Series.apply` is treated differently depending on the type of the input:

		* if the input is a numpy `ufunc`, `series.apply(func)` is equivalent to `func(series)`, i.e. similar to `series.pipe(func)`.

	* if the input is a numpy `ufunc`, `series.apply(func)` is equivalent to `func(series)`, i.e. similar to `series.pipe(func)`.
	* if the input is a numpy `ufunc` or strings such as `"sum"`, `"min"`, etc.,, `series.apply(func)` is equivalent to `func(series)`, i.e. similar to `series.pipe(func)`.


		## Deprecation process

		To change the behavior to the current behavior will have to be deprecated. This can be done by adding a `by_row` parameter to `Series.apply`, which means, when `by_rows=False`, that `Series.apply` will not operate elementwise but Series-wise.


		Below is an overview of the current behavior in table form when giving callables to `agg`, `transform` & `apply`. As an example on how to read the tables, when a non-ufunc callable is given to `Series.agg`, `Series.agg` will first try to apply the callable to each element in the series, and if that fails, will fall back to call the series using the callable.

		(The description may not be 100 % accurate because of various special cases in the current implementation, but will give a good understanding of the current behavior).


		In addition to the above effects of the current implementation of `agg`/`transform` & `apply`, see [#52140](https://github.com/pandas-dev/pandas/issues/52140) for more examples of the unexpected effects of how `apply` is implemented.

		It can also be noted that `Series.apply` & `DataFrame.apply` could almost always be replaced with calls to `agg`, `transform` & `map`, if `agg` & `transform` were to always operate on series data. For some examples, see the table below for alternatives using `apply(func)`:


		With the above in mind, it is proposed that in the future:

		It is proposed that `apply`, `transform` and `agg` in the future will work as follows:


		The use of `Series.apply` & `DataFrame.apply` will after the proposed change in almost all cases be replaced by `map`, `agg` or `transform`. In the very few cases where `Series.apply` & `DataFrame.apply` cannot be substituted by `map`, `agg` or `transform`, it is proposed that it will be accepted that users will have to find alternative ways to apply the functions, i.e. typically apply the functions manually and possibly concatenating the results.

		It can be noted that `groupby.agg`, `groupby.transform` & `groupby.apply` are not proposed changed in this PDEP, because `groupby.agg`, `groupby.transform` already behave as desired and `groupby.apply` behaves differently than `Series.apply` & `DataFrame.apply`. Likewise, the behavior when given ufuncs will remain unchanged, because the behavior is already as intended in all cases.


		The above changes means that the future behavior, when users want to apply arbitrary callables in pandas, can be described as follows:

		1. When users want to operate on single elements in a `Series` or `DataFrame`, they should use `Series.map` and `DataFrame.map` respectively.

	3. The current behavior when supplying string to the methods will not be changed.
	3. The current behaviors when supplying strings for the function arguments to the methods will not be changed.


		The use of `Series.apply` and `DataFrame.apply` will after the proposed change in almost all cases be replaced by `map`, `agg` or `transform`. In the very few cases where `Series.apply` and `DataFrame.apply` cannot be substituted by `map`, `agg` or `transform`, it is proposed that it will be accepted that users will have to find alternative ways to apply the functions, i.e. typically apply the functions manually and possibly concatenating the results.

		It can be noted that the behavior of `groupby.agg`, `groupby.transform` and `groupby.apply` are not proposed changed in this PDEP, because `groupby.agg`, `groupby.transform` already behave as desired and `groupby.apply` behaves differently than `Series.apply` and `DataFrame.apply`. Likewise, the behavior when given ufuncs (e.g. `np.sqrt`) and string input (e.g. `"sqrt"`) will remain unchanged, because the behavior is already as intended in all cases.


		## Motivation

		The current behavior of `apply`, `agg` and `transform` is very complex and therefore difficult to understand for non-expert users. The difficulty is especially that the methods sometimes apply callables on elements of series/dataframes, sometimes on Series or columns/rows of Dataframes and sometimes try element-wise operation and if that fails, falls back to series-wise operations.

	\| other callables (non ufunc) \| elements \| series \| series \|
	\| other callables (non ufunc) \| elements \| series \| dataframe \|

	\| list/dict of callables (non-ufunc) \| Try elements, fallback to series \| series \| series \|
	\| list/dict of callables (non-ufunc) \| Try elements, fallback to series \| series \| - \|

PDEP-13: Deprecate the apply method on Series and DataFrame and make the agg and transform methods operate on series data #54747

PDEP-13: Deprecate the apply method on Series and DataFrame and make the agg and transform methods operate on series data #54747

Conversation

topper-123 commented Aug 25, 2023 • edited Loading

MarcoGorelli commented Aug 25, 2023 • edited Loading

phofl Aug 25, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

MarcoGorelli Aug 27, 2023 • edited Loading

Choose a reason for hiding this comment

bashtage commented Aug 25, 2023

MarcoGorelli commented Aug 25, 2023

topper-123 commented Aug 26, 2023

topper-123 commented Aug 26, 2023 • edited Loading

Dr-Irv left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

topper-123 Aug 27, 2023 • edited Loading

Choose a reason for hiding this comment

Dr-Irv commented Aug 26, 2023 • edited Loading

Dr-Irv commented Aug 26, 2023 • edited Loading

topper-123 commented Aug 27, 2023

MarcoGorelli commented Aug 27, 2023 • edited Loading

jreback commented Aug 27, 2023

MarcoGorelli commented Aug 27, 2023

Dr-Irv commented Aug 27, 2023

Dr-Irv commented Aug 27, 2023

jreback commented Aug 27, 2023

bashtage commented Aug 29, 2023

topper-123 commented Aug 29, 2023 • edited Loading

MarcoGorelli commented Aug 29, 2023 • edited Loading

topper-123 commented Aug 29, 2023

MarcoGorelli commented Aug 29, 2023

Dr-Irv commented Aug 29, 2023

MarcoGorelli commented Aug 29, 2023

Dr-Irv commented Aug 29, 2023

topper-123 commented Aug 30, 2023

topper-123 commented Aug 30, 2023 • edited Loading

mroeschke Sep 18, 2023 • edited Loading

Choose a reason for hiding this comment

mroeschke commented Sep 18, 2023

Dr-Irv left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

topper-123 commented Sep 20, 2023

topper-123 commented Sep 20, 2023 • edited Loading

topper-123 commented Sep 20, 2023

MarcoGorelli Sep 28, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

MarcoGorelli Sep 28, 2023 • edited Loading

Choose a reason for hiding this comment

topper-123 Sep 28, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

topper-123 commented Aug 25, 2023 •

edited

Loading

MarcoGorelli commented Aug 25, 2023 •

edited

Loading

phofl Aug 25, 2023 •

edited

Loading

MarcoGorelli Aug 27, 2023 •

edited

Loading

topper-123 commented Aug 26, 2023 •

edited

Loading

Dr-Irv left a comment •

edited

Loading

topper-123 Aug 27, 2023 •

edited

Loading

Dr-Irv commented Aug 26, 2023 •

edited

Loading

Dr-Irv commented Aug 26, 2023 •

edited

Loading

MarcoGorelli commented Aug 27, 2023 •

edited

Loading

topper-123 commented Aug 29, 2023 •

edited

Loading

MarcoGorelli commented Aug 29, 2023 •

edited

Loading

topper-123 commented Aug 30, 2023 •

edited

Loading

mroeschke Sep 18, 2023 •

edited

Loading

topper-123 commented Sep 20, 2023 •

edited

Loading

MarcoGorelli Sep 28, 2023 •

edited

Loading

MarcoGorelli Sep 28, 2023 •

edited

Loading

topper-123 Sep 28, 2023 •

edited

Loading

topper-123 commented Sep 29, 2023 •

edited

Loading

MarcoGorelli commented Oct 1, 2023 •

edited

Loading

WillAyd Oct 2, 2023 •

edited

Loading