Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PDEP-13: Deprecate the apply method on Series and DataFrame and make the agg and transform methods operate on series data #54747

Closed
wants to merge 13 commits into from

Conversation

topper-123
Copy link
Contributor

@topper-123 topper-123 commented Aug 25, 2023

There has been some demand for a PDEP regarding changes to Series.apply proposed in #52140, for example #52140 (comment).

I've made a proposal in this PR.

@MarcoGorelli , if you want to add/change to the proposal I'm fine with that, especially regarding styles/formats that are used for PDEPs and I've missed.

EDIT 17/09-2023: I've made a version 2 of the PDEP.

@MarcoGorelli
Copy link
Member

MarcoGorelli commented Aug 25, 2023

Thanks for writing this up, much appreciated

If Series.apply and DataFrame.apply will both operate on the entire column, then maybe the method could be renamed to something more descriptive?
So:

  • Series.apply -> Series.map_column
  • DataFrame.apply -> DataFrame.map_columns
  • Series.map -> Series.map_rows
  • DataFrame.apply(..., axis=1) -> DataFrame.map_rows
  • deprecate axis argument in DataFrame.apply
    ?

(for reference, the Polars API is likely to move in the above direction)

This PDEP proposes that:

- The current complex current behavior of `Series.apply` will be deprecated in Pandas 2.2.
- Single callables given to the `.apply` methods of `Series` will in Pandas 3.0 always be called on the whole `Series`, so `series.apply(func)` will become similar to `func(series)`,
Copy link
Member

@phofl phofl Aug 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

4 months is way too short of a deprecation period for a change that will most likely affect the majority of our users which requires them to modify their code in many different places. There is no easy way to opt into the new behaviour.

Edit: This should go DeprecationWarning -> FutureWarning. This will have a huge impact on downstream libraries as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit surprised about this. I would expect users who upgrade major versions to run their code through the latest minor version before upgrading and if they get a warning, to fix them before upgrading. Same for libraries, that they would check and fix warnings before upgrading support for major versions.

Only implementing this in v4.0 seems way too slow IMO, if we decide this is a good idea in itself.

Copy link
Member

@MarcoGorelli MarcoGorelli Aug 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@phofl are you sure about

There is no easy way to opt into the new behaviour.

? I thought the way to do that was by passing by_row=True?


If we want this, then the usual deprecation cycle sounds fine tbh, can't wait until 4.0 to resolve such glaring inconsistencies

As for whether we want this...I'm holding on to the idea that there might an alternative which also resolves the inconsistencies. Trying to write something up

@bashtage
Copy link
Contributor

One this I don't fully understand is whether Series.apply is necessary at all, at least in the case where this PDEP is accepted. Is there a good reason not to just deprecate it with removal in 3 rather than changing behavior?

@MarcoGorelli
Copy link
Member

Some motivation for that was given here

Series.apply will do many other things than apply a single callable, for example take strings, or lists and dicts, e.g. do Series.apply([np.sum, "mean"])

@mroeschke mroeschke added the PDEP pandas enhancement proposal label Aug 25, 2023
@topper-123
Copy link
Contributor Author

One this I don't fully understand is whether Series.apply is necessary at all, at least in the case where this PDEP is accepted.

I addition to the point made by @MarcoGorelli, the way I've seen it so far is that we need to have a method that operates on elements of the series (that will be Series.map) and a method that operates on the whole Series (Series.apply), because the characteristics of working element-wise are different than working on the whole series (e.g. performance (points 3 in the PDEP) and different results from the operation).

Of course Series.apply, Series.agg and Series.transform will be quite similar, so are you discussing using only Series.agg and Series.transform and dropping Series.apply?

@topper-123
Copy link
Contributor Author

topper-123 commented Aug 26, 2023

if Series.apply and DataFrame.apply will both operate on the entire column, then maybe the method could be renamed to something more descriptive?

I have not thought much about this. A few things I would think about here:

  • I assume this means we drop .agg & .transform also?
  • we should differentiate between operations that are element-wise and series-wise, how does this do that?

More generally, naming of of this type of methods is of course very fundamental to pandas, and a different, though related, subject than this PDEP. So maybe take that in a separate issue/PDEP, and if that is a serious path forward, I can withdraw this PDEP.

Copy link
Contributor

@Dr-Irv Dr-Irv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the main issue here is that the current behavior of Series.apply() is dependent on whether the passed function is an aggregation-like function (e.g., sum, abs, min, etc.) or a true ufunc that operates on a scalar (e.g., abs, np.sqrt, len, np.sin, etc.)

Your write-up is mixing up these use cases, and I think that is the main source of confusion for users.

It seems to me that we should have people no longer use Series.apply at all, and have them use Series.agg() if the function is aggregation-like, and use Series.map() or Series.transform() if the function is a true ufunc.

web/pandas/pdeps/0013-standardize-apply.md Outdated Show resolved Hide resolved

Currently, giving a input to `Series.apply` is treated differently depending on the type of the input:

* if the input is a numpy `ufunc`, `series.apply(func)` is equivalent to `func(series)`, i.e. similar to `series.pipe(func)`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* if the input is a numpy `ufunc`, `series.apply(func)` is equivalent to `func(series)`, i.e. similar to `series.pipe(func)`.
* if the input is a numpy `ufunc` or strings such as `"sum"`, `"min"`, etc.,, `series.apply(func)` is equivalent to `func(series)`, i.e. similar to `series.pipe(func)`.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Strings are a bit different than that, and series.apply(string) is more equivalent to getattr(series, string)(). I've added a line about strings.


* if the input is a numpy `ufunc`, `series.apply(func)` is equivalent to `func(series)`, i.e. similar to `series.pipe(func)`.
* if the input is a callable, but not a numpy `ufunc`, `series.apply(func)` is similar to `Series([func(val) for val in series], index=series.index)`, i.e. similar to `series.map(func)`
* if the input is a list-like or dict-like, `series.apply(func)` is equivalent to `series.agg(func)` (which is subtly different than `series.apply`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you mean "list-like or dict-like" of callables, including numpy ufunc callables". You don't include something like s.apply([1,2,3]), which is list-like but won't work. So maybe be more precise here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've made it more precise, my intention was not to exclude strings. e.g. s.apply(["sqrt", np.sqrt]) is also possible.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, the described behavior is the actual behavior, so e.g. a list of ints will be passed through and will end up calling each int and fail. So the current text is accurate.


* if the input is a callable, `df.apply(func)` always calls each columns in the DataFrame, so is similar to `func(col) for _, col in
df.items()` + wrapping functionality
* if the input is a list-like or dict-like, `df.apply` call each item in the list/dict and wraps the result as needed. So for example if the input is a list, `df.apply(func_list)` is equivalent to `[df.apply(func) for func in func_list]` + wrapping functionality
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably should add what happens with DataFrame.apply(func_or_list, axis=1)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added something about this.


- The current complex current behavior of `Series.apply` will be deprecated in Pandas 2.2.
- Single callables given to the `.apply` methods of `Series` will in Pandas 3.0 always be called on the whole `Series`, so `series.apply(func)` will become similar to `func(series)`,
- Lists or dicts of callables given to the `Series.apply` will in Pandas 3.0 always call `Series.apply` on each element of the list/dict
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should be more precise here. I think you mean something like "Lists or dicts of callables given to Series.apply will apply each callable to the Series, so that the result of s.apply(list_of_callables) is Series([func[s] for s in list_of_callables)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That won't expand list or dicts to Dataframes, e.g. ser.apply(["sqrt", np.sqrt]) is a dataframe.

I've made a new formulation to clarify this.

web/pandas/pdeps/0013-standardize-apply.md Outdated Show resolved Hide resolved
dtype: int64
```

Users would expect these two to give the same result.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ummm, not so sure about that. What about the case of np.abs() or just abs ? In that case Series.agg(abs) gets applied element-wise!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those will give the same results, so the text is accurate?

I agree we cannot be guaranteed that ser.agg(func) will actually aggregate, but that is IMO a different discussion and maybe part of the discussion that has been opened about changing the API for this class of methods.

web/pandas/pdeps/0013-standardize-apply.md Outdated Show resolved Hide resolved
dtype: float64
```

These two should give same-shaped output for consistency. Using `Series.transform` instead of `Series.apply`, it returns a `DataFrame` in both cases and I think the dictlike example above should return a `DataFrame` similar to the listlike example.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with you here that this inconsistency is hard to understand.


## Deprecation process

To change the behavior to the current behavior will have to be deprecated. This can be done by adding a `by_row` parameter to `Series.apply`, which means, when `by_rows=False`, that `Series.apply` will not operate elementwise but Series-wise.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The question is what would be the default behavior if by_row is not specified?

Copy link
Contributor Author

@topper-123 topper-123 Aug 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That will emit a deprecation warning in v2.0. I've made text change to clarify this.

@Dr-Irv
Copy link
Contributor

Dr-Irv commented Aug 26, 2023

One this I don't fully understand is whether Series.apply is necessary at all, at least in the case where this PDEP is accepted. Is there a good reason not to just deprecate it with removal in 3 rather than changing behavior?

I think this is a good idea.

I also considered going further and eliminating apply() throughout the API, pointing people to the proper methods of agg() or transform() and maybe consider eliminating map, since it is the same as transform() (I think). But if we were to remove DataFrame.apply(), I don't see that we have an alternative that does what DataFrame.apply(func, axis=1) does, so maybe the best thing is to just get rid of Series.apply().

@Dr-Irv
Copy link
Contributor

Dr-Irv commented Aug 26, 2023

Of course Series.apply, Series.agg and Series.transform will be quite similar, so are you discussing using only Series.agg and Series.transform and dropping Series.apply?

Yes, I think that is what @bashtage is suggesting, and I think it is worth considering.

@topper-123
Copy link
Contributor Author

maybe consider eliminating map, since it is the same as transform() (I think).

map will be different than transform because map will operate on each element, while transform will operate on the whole Series. So transform will be much faster than map, while map will be more flexible than transform. So we should def keep map IMO.

But if we were to remove DataFrame.apply(), I don't see that we have an alternative that does what DataFrame.apply(func, axis=1) does, so maybe the best thing is to just get rid of Series.apply().

If we keep DataFrame.apply, we should IMO have an equivalent Series.apply.

@MarcoGorelli
Copy link
Member

MarcoGorelli commented Aug 27, 2023

First, thanks again to you and @rhshadrach for your work here. There's indeed a tonne of inconsistencies which need resolving, and I think everybody is on board with that.
The question is - in which direction to resolve them?

If I've understood correct, the main issues you want to address are:

  • inconsistency between df[['a']].apply(func) and df['a'].apply(func)
  • inconsistency between df['a'].apply('sum') and df['a'].apply(lambda x: sum(x))
  • inconsistency between df['a'].agg(np.sum) and df['a'].apply(np.sum)

One solution (which you're proposing) would be to change Series.apply to work series-wise, as opposed to element-wise.

My issue with this is that, as far as I can tell, users are already only thinking of apply as an element-wise operation.

From my Kaggle scrapes*, I only see:

  • DataFrame.apply with axis=1 (i.e. row-wise). E.g.:
    • a-journey-through-titanic
    • summary-functions-and-maps.ipynb
  • Series.apply intended to be used element-wise. E.g.:
    • a-statistical-analysis-ml-workflow-of-titanic.ipynb
    • a-study-on-regression-applied-to-the-ames-dataset.ipynb
    • basic-eda-cleaning-and-glove.ipynb
    • bert-for-humans-tutorial-baseline.ipynb
    • customer-segmentation.ipynb
    • ... (too many to list, this is super common)

I couldn't find a single case of DataFrame.apply with axis=0, nor of any kind of apply with a string or a list, nor any case of Series.agg or DataFrame.agg. TBH, before reading the linked issue, I didn't even know most of these were allowed.

If we're going to make breaking changes, I'd suggest that they be made in the direction of keeping apply work element-wise. That's why I'm suggesting:

  • DataFrame.apply(..., axis=0) -> DataFrame.map_columns
  • DataFrame.apply(..., axis=1) -> DataFrame.map_rows
  • Series.apply(callable) -> Series.map_rows
  • Series.apply("sum") -> Series.map_column

Then:

  • df[['a']].map_rows(func) would be consistent with df['a'].map_rows(func)
  • df['a'].map_columns('sum') and df['a'].map_column(lambda x: sum(x)) would be consistent
  • df['a'].agg(np.sum) and df['a'].map_column(np.sum) would be consistent

To respond to your questions:

I assume this means we drop .agg & .transform also?

I don't see why, no. DataFrame.agg could stay, but the docs would be changed to say "If a function, must either work when passed a DataFrame or when passed to DataFrame.map_columns", and for Series.agg the docs would mention Series.map_column

we should differentiate between operations that are element-wise and series-wise, how does this do that?

The name tells you that. So, map_rows means that the callable works on each row, and map_columns means it happens on each column.


Regarding the deprecation process, the advantage of moving to a new function name is that you tell people to just use that. For example:

>>> df['a'].apply(func)
FutureWarning: `Series.apply` is deprecated. Please use either `Series.map_rows`, if you want to operate element-wise, or `Series.map_column`, if you want to operate on the whole column.

and users can just change their code once.

Conversely, the PDEP's suggested by_rows argument would mean that users need to start using a new argument to avoid a warning, and then remove it because it's deprecated (i.e. change their code twice).


Sorry if I've missed anything fundamental, and thanks for your patience pushing this through. I really appreciate it, and agree that these inconsistencies really need addressing
Happy to have a call if you think that'd help reach consensus sooner

@jreback
Copy link
Contributor

jreback commented Aug 27, 2023

@MarcoGorelli i am not sure why you are trying to break with the pandas api

the map_* might be ok if things were all new but they r not

stick with the axis=

+1 in just deprecating and removing apply in favor of transform, agg and map which can accomplish everything

@MarcoGorelli
Copy link
Member

+1 in just deprecating and removing apply in favor of transform, agg and map which can accomplish everything

OK yes, I would also be on board with this, +1. Then there's be no need to break API or behaviour

@Dr-Irv
Copy link
Contributor

Dr-Irv commented Aug 27, 2023

+1 in just deprecating and removing apply in favor of transform, agg and map which can accomplish everything

OK yes, I would also be on board with this, +1. Then there's be no need to break API or behaviour

Me too!

@Dr-Irv
Copy link
Contributor

Dr-Irv commented Aug 27, 2023

But if we were to remove DataFrame.apply(), I don't see that we have an alternative that does what DataFrame.apply(func, axis=1) does, so maybe the best thing is to just get rid of Series.apply().

If we keep DataFrame.apply, we should IMO have an equivalent Series.apply.

But isn't what you are proposing for Series.apply() equivalent to what can be done with Series.agg() ?

Is there a case that your proposal for Series.apply() does something different than what Series.agg() does?

If we deprecate/remove Series.apply() all together, we just point people to Series.agg() and Series.map() as replacements.

@jreback
Copy link
Contributor

jreback commented Aug 27, 2023

If we deprecate/remove Series.apply() all together, we just point people to Series.agg() and Series.map() as replacements.

yes exactly - the point is you can just use transform for a same size return shape,
agg for a reduction

and raise if that is not the case

apply was originally meant to be a one sized fits all but has a lot of warts because of this flexibility

@bashtage
Copy link
Contributor

+1 in just deprecating and removing apply in favor of transform, agg and map which can accomplish everything

+1.

I agree that deprecation/removal is the best course here. Series.apply has too many magic behaviors, and the alternatives provide a better API where the intent of the call is unambiguous.

@topper-123
Copy link
Contributor Author

topper-123 commented Aug 29, 2023

Ok, I can see a consensus is coming along, which is great and I agree apply/agg/transform would be very similar after this PDEP.

A thought:

Are we talking about deprecating only Series.apply i.e. not apply on DataFrame and groupby objects? How would we do the below without Series.apply (after the changes in this PDEP)?:

>>> ser = pd.Series([1, 2, 3,])
>>> func1 = lambda ser: pd.Series([ser[0] * ser[1]**2, ser[2]])
>>> func2 = lambda ser: pd.Series([ser[0] * ser[1]**3, ser[2]])
>>> ser.apply([func1, func2])
   <lambda>  <lambda>
0         4         8
1         3         3

(i.e. the result is not a transformation nor an aggregation).

My guess is it will be be done by converting to a DataFrame and back like ser.to_frame().apply([func1, func2]).iloc[:, 0]. so the question is if we're ok with that instead of having a Series.apply method. It won't be complex to maintain Series.apply after the PDEP is implemented, because we need to have Series functionality internally anyway in order to do columns/row operations in DataFrame.apply.

@MarcoGorelli
Copy link
Member

MarcoGorelli commented Aug 29, 2023

Are we talking about deprecating only Series.apply i.e. not apply on DataFrame and groupby objects?

I thought we were talking about both Series.apply and DataFrame.apply, and to point people towards:

  • map for elementwise
  • transform and agg for operating on the whole column

Easy, simple, predictable. My favourite things

func1 = lambda ser: pd.Series([ser[0] * ser[1]**2, ser[2]])

I'd be OK with it not being a core feature. You can just call the function directly if you really need it:

In [48]: >>> ser = pd.Series([1, 2, 3,])
    ...: >>> func1 = lambda ser: pd.Series([ser[0] * ser[1]**2, ser[2]])
    ...: >>> func2 = lambda ser: pd.Series([ser[0] * ser[1]**3, ser[2]])

In [49]: pd.concat([func1(ser), func2(ser)], axis=1, keys=['func1', 'func2'])
Out[49]:
   func1  func2
0      4      8
1      3      3

@topper-123
Copy link
Contributor Author

I'd be fine with deprecating apply in both Series and DataFrame. My issue above was with having an DataFrame.apply method that clearly works on Series, but not having the corresponding series functionality. I agree that using pd.concat is fine in the example I showed.

BTW: We should probably deprecate groupby.apply also, if we deprecate Series|DataFrame.apply?

@MarcoGorelli
Copy link
Member

groupby.apply is a bit different, I think it operates on the entire subdataframe corresponding to the group? Not sure there's a replacement, so I think it'd be fine to keep it (or in any case, to keep it to a separate discussion - sorting out Series.apply and DataFrame.apply would already be a win 🏆 )

@Dr-Irv
Copy link
Contributor

Dr-Irv commented Aug 29, 2023

I'd be fine with deprecating apply in both Series and DataFrame. My issue above was with having an DataFrame.apply method that clearly works on Series, but not having the corresponding series functionality. I agree that using pd.concat is fine in the example I showed.

I'm not sure that we can deprecate DataFrame.apply(), because of this kind of use case:

In [53]: df = pd.DataFrame({"x":[1,2,3], "y":[10,20,30]})

In [54]: func = lambda s: s["y"] if s["x"]%2 == 1 else s["x"]

In [55]: df.apply(func, axis=1)
Out[55]:
0    10
1     2
2    30
dtype: int64

I made up an odd function, but the ability to apply a function to each row of a DataFrame is a common usage in code from my team.

@MarcoGorelli
Copy link
Member

can't you do that with DataFrame.agg:

In [8]: In [53]: df = pd.DataFrame({"x":[1,2,3], "y":[10,20,30]})
   ...:
   ...: In [54]: func = lambda s: s["y"] if s["x"]%2 == 1 else s["x"]

In [9]: df.agg(func, axis=1)
Out[9]:
0    10
1     2
2    30
dtype: int64

?

@Dr-Irv
Copy link
Contributor

Dr-Irv commented Aug 29, 2023

can't you do that with DataFrame.agg:

Wasn't aware of that! Sounds like DataFrame.apply() can disappear as well!

@topper-123
Copy link
Contributor Author

groupby.apply is a bit different, I think it operates on the entire subdataframe corresponding to the group? Not sure there's a replacement, so I think it'd be fine to keep it (or in any case, to keep it to a separate discussion - sorting out Series.apply and DataFrame.apply would already be a win 🏆 )

@rhshadrach , do you have an opinion, I know you've looked at the groupby apply/agg/transform methods? I've looked into it a bit and to me it does look quite complex, so I'm not sure if groupby.apply does things differently than groupby.agg and groupby.transform.

@topper-123
Copy link
Contributor Author

topper-123 commented Aug 30, 2023

Just some info: In order to deprecate apply, we will have to add a keyword parameter to Series.transform, because currently we fall back to use Series.apply there (see Apply.transform_str_or_callablein pandas/core/apply.py).

Example:

>>> df = pd.DataFrame({"x":range(100_000)})
>>> %timeit df.transform(lambda x: x + 1). # operates on the series, fast
142 µs ± 589 ns per loop
>>> %timeit df["x"].transform(lambda x: x + 1). # operates on elements, slow
15.5 ms ± 110 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

So the behavioral differences are also present here, like for apply.


1. When users want to operate on single elements in a `Series` or `DataFrame`, they should use `Series.map` and `DataFrame.map` respectively.
2. When users want to aggregate a `Series`, columns/rows of a `DataFrame` or groups in `groupby` objects, they should use `Series.agg`, `DataFrame.agg` and `groupby.agg` respectively.
3. When users want to transform a `Series`, columns/rows of a `DataFrame` or groups in `groupby` objects, they should use `Series.transform`, `DataFrame.transform` and `groupby.transform` respectively.
Copy link
Member

@mroeschke mroeschke Sep 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a 4th When users want to neither aggregate or transform a ..., they should ...

@mroeschke
Copy link
Member

Some thoughts

  1. Could you include the behavior of rolling.apply/rolling.agg for completeness
  2. I would opt for a slower deprecation process of apply given how commonly used it is in the wild
  • Remove all non-necessary apply usages from the documenation
  • DeprecationWarning in 2.x
  • FutureWarning in 3.0
  • Removal 4.0

Copy link
Contributor

@Dr-Irv Dr-Irv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aside from my comments, should replace all uses of "&" with "and". I don't think I caught them all


1. When users want to operate on single elements in a `Series` or `DataFrame`, they should use `Series.map` and `DataFrame.map` respectively.
2. When users want to aggregate a `Series`, columns/rows of a `DataFrame` or groups in `groupby` objects, they should use `Series.agg`, `DataFrame.agg` and `groupby.agg` respectively.
3. When users want to transform a `Series`, columns/rows of a `DataFrame` or groups in `groupby` objects, they should use `Series.transform`, `DataFrame.transform` and `groupby.transform` respectively.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you need to have the case when users want to have an operation on a row of a DataFrame, they should use DataFrame.agg(func, axis=1) . Today a common use case is using DataFrame.apply(func, axis=1). Since func can do anything, it isn't entirely clear that agg with axis=1 will supply each row of the series to the function, meaning that the corresponding function doesn't need to "aggregate".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DataFrame.apply(func, axis=1) is just DataFrame.T.apply(func).T, so any issue with axis=1 will also be present with axis=0. So I don't see a special issue with axis=1, except maybe some idioms that are more common with axis=1 (which may be a valid concern, though).

Can you give some example code snippets showing some possible problems?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Dr-Irv I've added an example here https://github.com/pandas-dev/pandas/pull/54747/files#r1339884642

Inclined to agree that .agg feels really unnatural here

web/pandas/pdeps/0013-standardize-apply.md Show resolved Hide resolved
2. When users want to aggregate a `Series`, columns/rows of a `DataFrame` or groups in `groupby` objects, they should use `Series.agg`, `DataFrame.agg` and `groupby.agg` respectively.
3. When users want to transform a `Series`, columns/rows of a `DataFrame` or groups in `groupby` objects, they should use `Series.transform`, `DataFrame.transform` and `groupby.transform` respectively.

The use of `Series.apply` & `DataFrame.apply` will after the proposed change in almost all cases be replaced by `map`, `agg` or `transform`. In the very few cases where `Series.apply` & `DataFrame.apply` cannot be substituted by `map`, `agg` or `transform`, it is proposed that it will be accepted that users will have to find alternative ways to apply the functions, i.e. typically apply the functions manually and possibly concatenating the results.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may be useful to have some examples of those "very few cases"


Below is an overview of the current behavior in table form when giving callables to `agg`, `transform` & `apply`. As an example on how to read the tables, when a non-ufunc callable is given to `Series.agg`, `Series.agg` will first try to apply the callable to each element in the series, and if that fails, will fall back to call the series using the callable.

(The description may not be 100 % accurate because of various special cases in the current implementation, but will give a good understanding of the current behavior).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should you add the string representations of functions to the tables?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problems that I try to solve only concern callables and I try to keep the PDEP brief. I can coment somewhere that the proposal does not affect supplying strings to the methods.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But even the behavior with strings is confusing. Here's some examples to illustrate:

>>> s = pd.Series([1,2,3])
>>> s.apply("sum")
6
>>> s.apply(np.sum)
0    1
1    2
2    3
dtype: int32
>>> s.apply("mean")
2.0
>>> s.apply(np.mean)
0    1.0
1    2.0
2    3.0
dtype: float64
>>> s.apply(abs)
0    1
1    2
2    3
dtype: int64
>>> s.apply(np.abs)
0    1
1    2
2    3
dtype: int64
>>> s.apply("abs")
0    1
1    2
2    3
dtype: int64


In addition to the above effects of the current implementation of `agg`/`transform` & `apply`, see [#52140](https://github.com/pandas-dev/pandas/issues/52140) for more examples of the unexpected effects of how `apply` is implemented.

It can also be noted that `Series.apply` & `DataFrame.apply` could almost always be replaced with calls to `agg`, `transform` & `map`, if `agg` & `transform` were to always operate on series data. For some examples, see the table below for alternatives using `apply(func)`:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
It can also be noted that `Series.apply` & `DataFrame.apply` could almost always be replaced with calls to `agg`, `transform` & `map`, if `agg` & `transform` were to always operate on series data. For some examples, see the table below for alternatives using `apply(func)`:
It can also be noted that `Series.apply` & `DataFrame.apply` could almost always be replaced with calls to `agg`, `transform` or `map`, if `agg` and `transform` were to always operate on series data. For some examples, see the table below for alternatives using `apply(func)`:

| lambda x: x + 1 | .transform | .transform |
| [lambda x: x.sum()] | .agg | .agg |

So, for example, `ser.apply(lambda x: str(x))` can be replaced with `ser.map(lambda x: str(x))` while `df.apply([lambda x: x.sum()])` can be replaced with `df.agg([lambda x: x.sum()])`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be really worthwhile to include some axis=1 examples for DataFrame.apply()


With the above in mind, it is proposed that in the future:

It is proposed that `apply`, `transform` and `agg` in the future will work as follows:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
It is proposed that `apply`, `transform` and `agg` in the future will work as follows:
It is proposed that `apply`, `transform` and `agg` will work as follows:

(You have "in the future" in 2 consecutive sentences)

2. When users want to aggregate a `Series`, columns/rows of a `DataFrame` or groups in `groupby` objects, they should use `Series.agg`, `DataFrame.agg` and `groupby.agg` respectively.
3. When users want to transform a `Series`, columns/rows of a `DataFrame` or groups in `groupby` objects, they should use `Series.transform`, `DataFrame.transform` and `groupby.transform` respectively.

The use of `Series.apply` & `DataFrame.apply` will after the proposed change in almost all cases be replaced by `map`, `agg` or `transform`. In the very few cases where `Series.apply` & `DataFrame.apply` cannot be substituted by `map`, `agg` or `transform`, it is proposed that it will be accepted that users will have to find alternative ways to apply the functions, i.e. typically apply the functions manually and possibly concatenating the results.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this text seems to be repeated from the abstract. So shorten the abstract - leave the details here


The use of `Series.apply` & `DataFrame.apply` will after the proposed change in almost all cases be replaced by `map`, `agg` or `transform`. In the very few cases where `Series.apply` & `DataFrame.apply` cannot be substituted by `map`, `agg` or `transform`, it is proposed that it will be accepted that users will have to find alternative ways to apply the functions, i.e. typically apply the functions manually and possibly concatenating the results.

It can be noted that `groupby.agg`, `groupby.transform` & `groupby.apply` are not proposed changed in this PDEP, because `groupby.agg`, `groupby.transform` already behave as desired and `groupby.apply` behaves differently than `Series.apply` & `DataFrame.apply`. Likewise, the behavior when given ufuncs will remain unchanged, because the behavior is already as intended in all cases.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
It can be noted that `groupby.agg`, `groupby.transform` & `groupby.apply` are not proposed changed in this PDEP, because `groupby.agg`, `groupby.transform` already behave as desired and `groupby.apply` behaves differently than `Series.apply` & `DataFrame.apply`. Likewise, the behavior when given ufuncs will remain unchanged, because the behavior is already as intended in all cases.
It can be noted that the behavior of `groupby.agg`, `groupby.transform` and `groupby.apply` are not proposed to be changed in this PDEP, because `groupby.agg`, `groupby.transform` already behave as desired and `groupby.apply` behaves differently than `Series.apply` & `DataFrame.apply`. Likewise, the behavior when given ufuncs will remain unchanged, because the behavior is already as intended in all cases.

To change the current behavior, it will have to be deprecated. This will be done by in v2.2:

1. Deprecate `Series.apply` & `DataFrame.apply`.
2. Add a `series_ops_only` with type `bool | lib.NoDefault` parameter to `agg` & `transform` methods of `Series` & `DataFrame`. When `series_ops_only` is set to False, `agg` & `transform` will behave as they do currently. When set to True, `agg` & `transform` will never operate on elements, but always on Series. When set to `no_default`, `agg` & `transform` will behave as `series_ops_only=False`, but will emit a FutureWarning the current behavior will be reoved in the future.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
2. Add a `series_ops_only` with type `bool | lib.NoDefault` parameter to `agg` & `transform` methods of `Series` & `DataFrame`. When `series_ops_only` is set to False, `agg` & `transform` will behave as they do currently. When set to True, `agg` & `transform` will never operate on elements, but always on Series. When set to `no_default`, `agg` & `transform` will behave as `series_ops_only=False`, but will emit a FutureWarning the current behavior will be reoved in the future.
2. Add a `series_ops_only` argument with type `bool | lib.NoDefault` parameter to `agg` and `transform` methods of `Series` and `DataFrame` with a default value of `lib.NoDefault`. When `series_ops_only` is set to `False`, `agg` and `transform` will behave as they do currently. When set to `True`, `agg` and `transform` will never operate on elements, but always on Series. When set to `no_default`, `agg` and `transform` will behave as `series_ops_only=False`, but will emit a `FutureWarning` the current behavior will be removed in the future.

@topper-123
Copy link
Contributor Author

I've looked into this again and have an issue: We don't have these keyword params from DataFrame.apply in agg or transform:

  • raw
  • result_type
  • engine (new in v2.2)
  • engine_kwargs (new in v2.2)

So deprecation of DataFrame.apply with implementing these in at least DataFrame.agg & DataFrame.transform will lose functionality. So I think these should be added to DataFrame.agg & DataFrame.transform?

@topper-123
Copy link
Contributor Author

topper-123 commented Sep 20, 2023

* Remove all non-necessary apply usages from the documenation

* DeprecationWarning in 2.x

* FutureWarning in 3.0

* Removal 4.0

I've changed the deprecaton process according to this.

@topper-123 topper-123 changed the title PDEP-13: Deprecate the apply method on Series & DataFrame and make the agg and transform methods operate on series data PDEP-13: Deprecate the apply method on Series and DataFrame and make the agg and transform methods operate on series data Sep 20, 2023
@topper-123
Copy link
Contributor Author

I've changed according to comments. Comments on future use of raw, result_type, engine and engine_kwargs parameters?


The above changes means that the future behavior, when users want to apply arbitrary callables in pandas, can be described as follows:

1. When users want to operate on single elements in a `Series` or `DataFrame`, they should use `Series.map` and `DataFrame.map` respectively.
Copy link
Member

@MarcoGorelli MarcoGorelli Sep 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we sure this is accurate? For the use cases I've seen of DataFrame.apply, I think DataFrame.agg would probably be more likely as a replacement

Currently, say I do

In [43]: df = pd.DataFrame({'a': ['quetzal', 'quetzal', 'baboon'], 'b': ['panda', 'chinchilla', 'elk']})

In [44]: df
Out[44]:
         a           b
0  quetzal       panda
1  quetzal  chinchilla
2   baboon         elk

In [45]: def func(row):
    ...:     return f"{row['a']} or {row['b']}?"
    ...:

In [46]: df.apply(func, axis=1)
Out[46]:
0         quetzal or panda?
1    quetzal or chinchilla?
2            baboon or elk?
dtype: object

Then if I use DataFrame.map, it'll operate over all elements of the dataframe 1-by-1. Whereas if I use .agg, then I'll preserve what I'm currently doing:

In [47]: df.agg(func, axis=1)
Out[47]:
0         quetzal or panda?
1    quetzal or chinchilla?
2            baboon or elk?
dtype: object

This feels very unnatural though - func here doesn't look like an aggregation at all, so I wouldn't have reached for .agg

Could DataFrame.map get an axis keyword as well, so that DataFrame.map(func, axis=1) preserves the current behaviour of DataFrame.apply(func, axis=1)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this a transform? But users don't get to control what gets passed to the UDF they supply - is it columm by column, row by row, or the entire frame. Perhaps they should.

Copy link
Member

@MarcoGorelli MarcoGorelli Sep 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

transform would raise here:

In [53]: df.transform(func, axis=1)
---------------------------------------------------------------------------

ValueError: Function did not transform

And this underscores just how confusing this is

is it columm by column, row by row, or the entire frame. Perhaps they should.

yes, exactly, it could be:

  • axis=None: element by element (current behaviour)
  • axis=1: row by row
  • axis=0: element by element

Then, df.apply(lambda row: ..., axis=1) can just become df.map(lambda row: ..., axis=1)

This is a very common way of using apply, e.g.

titanic_df['Person'] = titanic_df[['Age','Sex']].apply(get_person,axis=1)

from https://www.kaggle.com/code/omarelgabry/a-journey-through-titanic

Copy link
Contributor Author

@topper-123 topper-123 Sep 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MarcoGorelli , your example is an aggregation though and func does operate on rows, not elements, so everything works as intended in your agg example.

However, I guess agg linguistically(?) is best understood as a numerical reduction, not a text combination operation, so using the name agg may be a bit unintuitive. My original proposal would have kept apply for situations like this, because the problem I wanted to solve was apply giving subtly different results for Series.apply and DataFrame.apply (including for axis=1).

Your example could be an argument for keeping apply, but drop the element-by-element behavior of Series.apply, i.e. revert to the original proposal of this PDEP.

I'm -1 on making map operate on rows/columns. It's a strength of the method that it always gives same-shaped results and it would be confusing for the result changing shape based on if an axis parameter is given or not.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

true, it's a horizontal aggregation: it combines multiple columns into 1

I'm just thinking about the upgrade process

I've only ever seen people use DataFrame.apply to create a single column, and if the upgrade process is going to be:

  • Series.apply(func) => Series.map(func)
  • DataFrame.apply(func, axis=1) => DataFrame.agg(func, axis=1)

then that could be confusing, and certainly harder than just grepping for .apply

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @MarcoGorelli about the upgrade process. I find changing Dataframe.apply(func, axis=1) to DataFrame.agg(func, axis=1) to be non-intuitive.

But maybe we should consider introducing a method DataFrame.project(func) that is the replacement for DataFrame.apply(func, axis=1) . Idea being that you "project" a function onto each row of the DF.

Or should we make DataFrame.transform(func, axis=1) work by transforming each row?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wait let's not add yet another method to the API 😅

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(just another voice here to support the notion that requiring agg for this use case feels very unnatural)

@topper-123
Copy link
Contributor Author

topper-123 commented Sep 29, 2023

Yeah I see where you are coming from. My idea is to alleviate this by doing good doc updates. Could e.g. Have some common examples showing how to do the change. Do you have some ideas?

Btw, I’m not opposed to keeping or deprecating apply (it’s very similar to agg/transform after all), but must admit I’m being a bit worn down by the long process and would really like a decision soonish…

@MarcoGorelli
Copy link
Member

Sure, good docs + an update guide would be good, but it does become more complicated than just "use map instead"

Would adding axis to map be a blocker for you?

Copy link
Contributor

@Dr-Irv Dr-Irv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that the DataFrame.apply(func, axis=1) case has to be resolved. I think that @MarcoGorelli and I are concerned about that particular case.

Also, I think the behavior with functions specified as strings should be addressed, given the inconsistent behavior that I have noted in one of my comments.


1. the `agg` and `transform` methods of `Series`, `DataFrame` and `groupby` will always operate series-wise and never element-wise
2. `Series.apply` and `DataFrame.apply` will be deprecated.
3. The current behavior when supplying string to the methods will not be changed.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
3. The current behavior when supplying string to the methods will not be changed.
3. The current behaviors when supplying strings for the function arguments to the methods will not be changed.


The above changes means that the future behavior, when users want to apply arbitrary callables in pandas, can be described as follows:

1. When users want to operate on single elements in a `Series` or `DataFrame`, they should use `Series.map` and `DataFrame.map` respectively.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @MarcoGorelli about the upgrade process. I find changing Dataframe.apply(func, axis=1) to DataFrame.agg(func, axis=1) to be non-intuitive.

But maybe we should consider introducing a method DataFrame.project(func) that is the replacement for DataFrame.apply(func, axis=1) . Idea being that you "project" a function onto each row of the DF.

Or should we make DataFrame.transform(func, axis=1) work by transforming each row?


Below is an overview of the current behavior in table form when giving callables to `agg`, `transform` & `apply`. As an example on how to read the tables, when a non-ufunc callable is given to `Series.agg`, `Series.agg` will first try to apply the callable to each element in the series, and if that fails, will fall back to call the series using the callable.

(The description may not be 100 % accurate because of various special cases in the current implementation, but will give a good understanding of the current behavior).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But even the behavior with strings is confusing. Here's some examples to illustrate:

>>> s = pd.Series([1,2,3])
>>> s.apply("sum")
6
>>> s.apply(np.sum)
0    1
1    2
2    3
dtype: int32
>>> s.apply("mean")
2.0
>>> s.apply(np.mean)
0    1.0
1    2.0
2    3.0
dtype: float64
>>> s.apply(abs)
0    1
1    2
2    3
dtype: int64
>>> s.apply(np.abs)
0    1
1    2
2    3
dtype: int64
>>> s.apply("abs")
0    1
1    2
2    3
dtype: int64


The use of `Series.apply` and `DataFrame.apply` will after the proposed change in almost all cases be replaced by `map`, `agg` or `transform`. In the very few cases where `Series.apply` and `DataFrame.apply` cannot be substituted by `map`, `agg` or `transform`, it is proposed that it will be accepted that users will have to find alternative ways to apply the functions, i.e. typically apply the functions manually and possibly concatenating the results.

It can be noted that the behavior of `groupby.agg`, `groupby.transform` and `groupby.apply` are not proposed changed in this PDEP, because `groupby.agg`, `groupby.transform` already behave as desired and `groupby.apply` behaves differently than `Series.apply` and `DataFrame.apply`. Likewise, the behavior when given ufuncs (e.g. `np.sqrt`) and string input (e.g. `"sqrt"`) will remain unchanged, because the behavior is already as intended in all cases.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
It can be noted that the behavior of `groupby.agg`, `groupby.transform` and `groupby.apply` are not proposed changed in this PDEP, because `groupby.agg`, `groupby.transform` already behave as desired and `groupby.apply` behaves differently than `Series.apply` and `DataFrame.apply`. Likewise, the behavior when given ufuncs (e.g. `np.sqrt`) and string input (e.g. `"sqrt"`) will remain unchanged, because the behavior is already as intended in all cases.
It can be noted that the behavior of `groupby.agg`, `groupby.transform` and `groupby.apply` are not proposed to be changed in this PDEP, because `groupby.agg`, `groupby.transform` already behave as desired and `groupby.apply` behaves differently than `Series.apply` and `DataFrame.apply`. Likewise, the behavior when given ufuncs (e.g. `np.sqrt`) and string input (e.g. `"sqrt"`) will remain unchanged, because the behavior is already as intended in all cases.

@topper-123
Copy link
Contributor Author

Would adding axis to map be a blocker for you?

My issue is that axis=1will be a different operation type than axis=0, i.e. one would work element by element and the other row by row and yield subtly different results, and we'd get similar problems as described in #52140.

I would be more positive to keep apply, so it can be used for row/column operations that don't fit into agg/transform, but then we'd then have to fix the issue with Series.apply somehow. For example in your example I'd expect df.iloc[0].apply(func) to give "quetzal or panda?", but currently it fails.

@MarcoGorelli
Copy link
Member

MarcoGorelli commented Oct 1, 2023

hmm OK, point taken. there'd be a long deprecation process, and plenty of time to teach/warn whether df.agg(..., axis=1) or df.transform(..., axis=1) is the right replacement. We could write an update guide, and link to that in the deprecation warnings, and give plenty of "how to update your code" examples

You're right that this has been going for some time, so maybe we could call a vote?

EDIT

as Patrick correctly pointed out, this requires a ping of pandas-core and some time for people to catch up - we've done that now, so let's start the process by which we can call a vote in the future

@phofl
Copy link
Member

phofl commented Oct 1, 2023

This needs a ping of @pandas-dev/pandas-core before we can even consider calling a vote

@jorisvandenbossche
Copy link
Member

Thanks for the ping, Patrick. On a procedural note, in the future it might be best to ping the core team earlier (and notify the pandas-dev mailing list), or at least if you think the PDEP is ready for general discussion.

(looking at the discussion from just the past few days, it seems there are still significant open questions even among the proponents of the PDEP, so as someone who still needs to read up on the whole discussion this also seems a bit strange that it would be ready for a vote? But I first now do that, read the actual PDEP and discussion ;))

@jorisvandenbossche
Copy link
Member

Would adding axis to map be a blocker for you?

My issue is that axis=1will be a different operation type than axis=0, i.e. one would work element by element and the other row by row and yield subtly different results, and we'd get similar problems as described in #52140.

I personally agree with @topper-123 here. One of the nice things about map currently is that its scope is very clear: it simply maps your function or mapping to each individual element of the Series (or DataFrame), and that's it.
Expanding that to allow row-wise operations (and then also column-wise operations?) would be a step back for map IMO.

I understand the rationale of wanting to deprecate Series.apply (I haven't yet made up my mind whether I find those reasons worth the trouble actually doing that, but at least I see the reasons), but I don't really understand the decision to also deprecate DataFrame.apply. This method already works on a series, and so doesn't have the problem of Series.apply of requiring a different kind of UDF (eg DataFrame.apply and groupby.apply are already quite consistent with each other?).


## Motivation

The current behavior of `apply`, `agg` and `transform` is very complex and therefore difficult to understand for non-expert users. The difficulty is especially that the methods sometimes apply callables on elements of series/dataframes, sometimes on Series or columns/rows of Dataframes and sometimes try element-wise operation and if that fails, falls back to series-wise operations.
Copy link
Member

@WillAyd WillAyd Oct 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find the motivation to be pretty strong from a developer perspective but I think it is lacking from an end user perspective. To play devil's advocate...why should a non-expert user care or try to understand this distinction?

While there are definitely warts here part of me thinks we've had them for X number of years and gotten by without too much end user complaint about it. I think this would be the largest deprecation we've done on the project as long as I've been involved, so I'm a little wary of causing churn instead of just trying to "softly" guide users towards the more explicit map / agg / transform options

| | Series | DataFrame | groupby |
|:-----------------------------------|:---------------------------------|:---------------------------------|:----------|
| ufunc or list/dict of ufuncs | series | series | series |
| other callables (non ufunc) | Try elements, fallback to series | series | series |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "try elements" is already deprecated, right?

In [19]: df = pd.DataFrame({"A": range(3)})

In [21]: df["A"].agg(lambda x: np.sum(x))
<ipython-input-21-2e8408eae29f>:1: FutureWarning: using <function <lambda> at 0x7f26752e5ee0> in Series.agg cannot aggregate and has been deprecated. Use Series.transform to keep behavior unchanged.
  df["A"].agg(lambda x: np.sum(x))
Out[21]: 
0    0
1    1
2    2
Name: A, dtype: int64


| | Series | DataFrame | groupby |
|:-----------------------------------|:---------------------------------|:---------------------------------|:----------|
| ufunc or list/dict of ufuncs | series | series | series |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Below, it is mentioned that agg (and others) will always operate on a Series, which might indirectly apply that this line in the table is correct.
But a ufunc never aggregates, so I assume we don't want this, and the end goal should be that agg always reduces, and so never accepts a ufunc ?

| | Series | DataFrame | groupby |
|:-----------------------------------|:---------|:------------|:----------|
| ufunc or list/dict of ufuncs | series | series | series |
| other callables (non ufunc) | elements | series | series |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
| other callables (non ufunc) | elements | series | series |
| other callables (non ufunc) | elements | series | dataframe |

?

(and this is actually one of the other inconsistencies in the whole groupby side of the story, that we don't have a clear way to choose whether you want to apply your function to each column in the group or on the group as a whole, i.e. as a dataframe)

|:-----------------------------------|:---------|:------------|:----------|
| ufunc or list/dict of ufuncs | series | series | series |
| other callables (non ufunc) | elements | series | series |
| list/dict of callables (non-ufunc) | Try elements, fallback to series | series | series |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
| list/dict of callables (non-ufunc) | Try elements, fallback to series | series | series |
| list/dict of callables (non-ufunc) | Try elements, fallback to series | series | - |

I am not sure groupby.apply accepts a list or dict?

For example

In [42]: df = pd.DataFrame({"A": range(3)})

In [43]: df.groupby([0, 1, 0]).apply([lambda x: x])
...
TypeError: unhashable type: 'list'


1. when given numpy ufuncs, callables given to `agg`/`transform`/`apply` operate on series data
2. when used on groupby objects, callables given to `agg`/`transform`/`apply` operate on series data
3. else, in some case it will try element-wise operation and fall back to series-wise operations if that fails, in some case will operate on series data and in some cases on element data.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sidenote: to be honest, this "in some cases on element data" is IMO by far the main use case of Series.apply, and in this case (being passed a single function), it always operates element by element, consistently and unambiguously.

@jbrockmendel
Copy link
Member

Some half-baked thoughts:

  1. Several comments in this thread about Series.apply being used to operate element-wise being the most common usage (and hence a painful deprecation). I find this plausible but have no strong prior here. If it is the case, what if we did a 180 and deprecated all not-that usages of Series.apply? That would not resolve the problem of Series.apply being inconsistent with EverythingElse.apply, but would at least make Series.apply internally consistent. It would make Series.map completely redundant though. (Not advocating, just spitballing)

  2. My kneejerk reaction to adding more methods or keywords is negative. Especially axis=1, if it turns out that doing so is basically just an alias for df.T.whatever().T.

  3. Will's comment about this being more compelling for devs than for users seems reasonable. For devs or type-conscious users we could introduce a lint check (or maybe something like the ast thing @MarcoGorelli discussed at the sprint) to discourage bad uses of apply?

  4. The big motivator that I haven't seen in this thread was @rhshadrach's discussion of untangling the way that apply/agg/transform/whatever call each other is hard-to-maintain (and therefore likely buggy) code. I don't see anything about that in this thread, so may be remembering that wrong?

@MarcoGorelli
Copy link
Member

Series.apply being used to operate element-wise being the most common usage (and hence a painful deprecation). I find this plausible but have no strong prior here

Let's try to quantify how plausible this is:

I grepped \.apply( on my scrapes of the top 100 most upvoted Kaggle notebooks, and found:

  • 134 cases (some within the same notebook) of using apply (either Series.apply or DataFrame.apply(..., axis=1)) element-wise
  • 1 case of using Series.apply on the Series (d['sell_price'].apply(np.log1p))

There's a tool which promises to parallelise apply, and their examples shows an element-wise operation: https://github.com/nalepae/pandarallel

Searching "what does pandas.Series.apply do" on Google, these are my top results (other than pandas official docs):

Asking chat gpt, I get "In Pandas, the apply() method in the context of a Pandas Series is used to apply a function along the axis of the Series. It can be particularly useful when you want to apply a custom or lambda function to each element of the Series."

So yes, I think this is overwhelmingly the most common use case

@mroeschke
Copy link
Member

I also find Will's #54747 (comment) compelling, though complexities of maintenance ultimately bleeds into user complexities. I would guess most users treat apply as "the way you pass pandas objects into your UDF and get back a pandas object".

Throwing out another idea for discussion what if we introduce a how=None|"element-wise"|"row/column-wise" keyword in apply to more explicitly dictate what is passed into the UDF(s). I think this would allow eventual straightening of the internals (element-wise just calls map), users get to keep using apply, and through keyword deprecations slowly remove the complex falling back behavior.

@Dr-Irv
Copy link
Contributor

Dr-Irv commented Oct 13, 2023

I also find Will's #54747 (comment) compelling, though complexities of maintenance ultimately bleeds into user complexities. I would guess most users treat apply as "the way you pass pandas objects into your UDF and get back a pandas object".

I agree.

Throwing out another idea for discussion what if we introduce a how=None|"element-wise"|"row/column-wise" keyword in apply to more explicitly dictate what is passed into the UDF(s). I think this would allow eventual straightening of the internals (element-wise just calls map), users get to keep using apply, and through keyword deprecations slowly remove the complex falling back behavior.

We have to distinguish functions(UDF's) that work element-by-element versus functions that aggregate. See examples I wrote here: #54747 (comment)

How about we start with how= None | "by-element" | "by-series", with how=None being the default. Then, for the next release:

  1. Series.apply(func, how="by-element") is same as Series.map(func)
  2. Series.apply(func, how="by-series") is same as func(Series.values)
  3. DataFrame.apply(func, how="by-element") is same as DataFrame.map(func) (and axis argument is ignored)
  4. DataFrame.apply(func, how="by-series", axis=0) is same as applying func to each column of the DataFrame
  5. DataFrame.apply(func, how="by-series", axis=1) is same as applying func to each row of the DataFrame
  6. Series.apply(func, how=None) is same as Series.apply(func) today (complex behavior)
  7. DataFrame.apply(func, how=None) is same as DataFrame.apply(func) today (complex behavior)

Then for a later release, we deprecate the None argument, and change the default behavior, so that for Series the default is by-element and for DataFrame the default is by-series.

For the next release, if None creates a behavior that is different than the future defaults of by-element for Series and by-series for DataFrame, we issue a warning that the user needs to be more explicit.

I think if we did it this way, then most existing code out there would continue to work as it does today, modulo some of the odd behavior that occurs when func is a UDF, vs. an aggregation-type function, vs. a function specified as a string. But that odd behavior creates a warning.

Copy link
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Nov 13, 2023
@mroeschke
Copy link
Member

Looks like discussion here has sufficiently stalled without clarity on how to move foward. Going to close for now to clear the queue but feel free to open back up if you want to reengage here

@mroeschke mroeschke closed this Apr 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
PDEP pandas enhancement proposal Stale
Projects
None yet
Development

Successfully merging this pull request may close these issues.