Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Support win_type parameter with groupby.rolling #30559

Open
ghost opened this issue Dec 30, 2019 · 9 comments
Open

BUG: Support win_type parameter with groupby.rolling #30559

ghost opened this issue Dec 30, 2019 · 9 comments
Labels
Enhancement Window rolling, ewma, expanding

Comments

@ghost
Copy link

ghost commented Dec 30, 2019

It appears that rolling aggregations on groupby objects do not behave as expected. It looks like the win_type parameter is ignored. Here's a minimal example:

Code Sample

In this example, a rolling mean is calculated with "uniform" weights and also with "blackman" weights. In each case, the mean is calculated on the ungrouped data and the grouped data, which should give the same result as there is only one group.

import numpy as np
import pandas as pd

df = pd.DataFrame({'groups': ['g']*10,
                   'data': np.sin(np.arange(10))})

groups = df[['data', 'groups']].groupby('groups')

# Rolling mean with uniform weights
df['uniform'] = df['data'].rolling(4, win_type=None).mean()
df['grp_uniform'] = groups.rolling(4, win_type=None).mean().values

# Rolling mean with a e.g. blackman weights
df['blackman'] = df['data'].rolling(4, win_type='blackman').mean()
df['grp_blackman'] = groups.rolling(4, win_type='blackman').mean().values

# The results should be equal, whether on grouped or ungrouped data
assert df['uniform'].equals(df['grp_uniform'])     # This passes
assert df['blackman'].equals(df['grp_blackman'])   # This fails
print(df)

Problem description

The last assert statement fails: the rolling aggregation on the groupby object does not work as expected.

In that last calculation, it looks like the win_type='blackman' parameter was simply ignored, and the calculation was done with uniform weights.

Expected Output

Here is the resulting dataframe from the calculations above. The last two columns ought to be identical (but are not):

  groups      data   uniform  grp_uniform  blackman  grp_blackman
0      g  0.000000       NaN          NaN       NaN           NaN
1      g  0.841471       NaN          NaN       NaN           NaN
2      g  0.909297       NaN          NaN       NaN           NaN
3      g  0.141120  0.472972     0.472972  0.875384      0.472972
4      g -0.756802  0.283771     0.283771  0.525209      0.283771
5      g -0.958924 -0.166327    -0.166327 -0.307841     -0.166327
6      g -0.279415 -0.463506    -0.463506 -0.857863     -0.463506
7      g  0.656987 -0.334539    -0.334539 -0.619170     -0.334539
8      g  0.989358  0.102001     0.102001  0.188786      0.102001
9      g  0.412118  0.444762     0.444762  0.823172      0.444762

This is possibly related to similar issues #26597 and #26462.

Here's my best effort at diagnosing what is going on under the hood... it looks like the pandas.core.window.RollingGroupby class inherits the mean() method from the Rolling class, and hence completely ignores the win_type parameter.

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.3.final.0
python-bits : 64
OS : Windows
OS-release : 10
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 9, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None

pandas : 0.25.1
numpy : 1.16.4
pytz : 2018.9
dateutil : 2.8.0
pip : 19.0.3
setuptools : 40.8.0
Cython : 0.29.6
pytest : 4.3.1
hypothesis : None
sphinx : 1.8.5
blosc : None
feather : None
xlsxwriter : 1.1.5
lxml.etree : 4.3.2
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.10
IPython : 7.4.0
pandas_datareader: None
bs4 : 4.7.1
bottleneck : 1.2.1
fastparquet : None
gcsfs : None
lxml.etree : 4.3.2
matplotlib : 3.0.3
numexpr : 2.6.9
odfpy : None
openpyxl : 2.6.1
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : 1.2.1
sqlalchemy : 1.3.1
tables : 3.5.1
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.1.5

@TomAugspurger
Copy link
Contributor

Thanks for the report. This does look buggy. Further investigation into why this is happening / a PR to fix it would be most welcome!

@TomAugspurger TomAugspurger added the Window rolling, ewma, expanding label Jan 3, 2020
@TomAugspurger TomAugspurger added this to the Contributions Welcome milestone Jan 3, 2020
@ghost
Copy link
Author

ghost commented Jan 4, 2020

Hi @TomAugspurger ,

Thanks for the response! I've tried to find out what's happening with some debugging, and I think I've found the cause. I haven't been able to make a fix myself yet as it requires some refactoring of classes that requires some discussion, but perhaps I'll be able to do it with some pointers from yourself or any others more experienced with the code base than myself.

It looks like there are three classes of interest, all in core/window/rolling.py:

  • _Window has most window-related functionality, but not including weights
  • Window inherits from _Window and adds in functionality for weighted rolling means
  • RollingGroupby inherits from _Window and WindowGroupbyMixin but does not inherit from Window. So, it misses out the required weighted rolling mean functionality.

Here is the weighted rolling mean method which is implemented in the Window class:

def mean(self, *args, **kwargs):
nv.validate_window_func("mean", args, kwargs)
window_func = self._get_roll_func("roll_weighted_mean")
window_func = self._get_weighted_roll_func(window_func, _use_window)
return self._apply(
window_func, center=self.center, is_weighted=True, name="mean", **kwargs
)

I tried copying this method to the RollingGroupby class, but encountered an issue: it looks like the _apply method is overridden by WindowGroupbyMixin, and so the snippet above does not work. Here is WindowGroupbyMixing._apply():

def _apply(
self,
func: Callable,
center: bool,
require_min_periods: int = 0,
floor: int = 1,
is_weighted: bool = False,
name: Optional[str] = None,
use_numba_cache: bool = False,
**kwargs,
):
"""
Dispatch to apply; we are stripping all of the _apply kwargs and
performing the original function call on the grouped object.
"""
kwargs.pop("floor", None)
# TODO: can we de-duplicate with _dispatch?
def f(x, name=name, *args):
x = self._shallow_copy(x)
if isinstance(name, str):
return getattr(x, name)(*args, **kwargs)
return x.apply(name, *args, **kwargs)
return self._groupby.apply(f)

Any advice or ideas on the philosophy behind these classes and where everything should go would be greatly appreciated!

@ghost
Copy link
Author

ghost commented Jan 4, 2020

It looks like @mroeschke and @jbrockmendel are the authors of the WindowGroupbyMixin class, perhaps they could advise on how we can fix this issue? Any tips greatly appreciated, eager to help if I can.

@mroeschke
Copy link
Member

Thanks for investigating @Connossor. win_type is essentially not supported with groupby rolling right now.

As you noticed the Window class essentially exclusively contains the weighted aggregation code while the Rolling class contains the non weighted aggregation code (and the mixin only using the Rolling class).

Soon I'm going to be experimenting on unifying these two (which would fix this issue), but it will unlikely be fixed in the next release unfortunately.

@mroeschke mroeschke changed the title BUG: Rolling aggregations on grouped data ignore win_type parameter BUG: Support win_type parameter with groupby.rolling Oct 5, 2020
@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
@johndmccann93
Copy link

Any ETA on a solution for this or, in the meantime, a workaround?

@lisrael1
Copy link

issue still persists at version 2.0.3.
the rectangle window is very noisy, the output of the window is not smooth.
the win_type support is needed...
thanks

as a workaround, gpt 3.5 wrote me this code:

import pandas as pd
import numpy as np

# Create a sample DataFrame
df = pd.DataFrame({
    'category': ['A', 'A', 'A', 'B', 'B', 'B'],
    'cost': [1, 2, 3, 4, 5, 6]
})

# Define the rolling window size and win_type
window_size = 10
win_type = 'hamming'

# Iterate over each unique category
for category in df['category'].unique():
    # Filter the DataFrame for the current category
    category_df = df[df['category'] == category]
    
    # Calculate the rolling mean for the current category
    rolling_mean = category_df['cost'].rolling(window_size, win_type=win_type).mean()
    
    # Update the DataFrame with the rolling mean values
    df.loc[df['category'] == category, 'cost_roll'] = rolling_mean

print(df)

instead of this:

df['cost_roll'] = df.groupby('category', as_index=False).cost.rolling(10, win_type='hamming').mean().cost

@SaundersR1999
Copy link

For loops are usable for smaller/less complex use cases but performance could really suffer if win_type is needed in cases where groupby would be used to group by two or more columns. Shame this has not been fixed yet.

@jbrockmendel
Copy link
Member

Shame this has not been fixed yet.

You're welcome to submit a PR. Comments like this are counter-productive.

@francoispierrepaty
Copy link

A workaround is:

df.groupby("GROUP").apply(
        lambda group: group["VALUE"]
        .rolling(window=window, min_periods=min_periods, center=center, win_type='hamming')
        .mean()
    )

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement Window rolling, ewma, expanding
Projects
None yet
Development

No branches or pull requests

7 participants