BUG: Undefined behavior when rolling on datetime in a multi index of a groupby. #56530

kirby88 · 2023-12-17T11:35:34Z

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

https://www.online-python.com/Hj96OPatUx

Issue Description

I have a dataframe that consist of a 4 columns: 2 columns that are part of the multi index, one column for time and one for a value.
Now, I want to group by first 2 columns, do the average of the value per timestamp, and then do a rolling average of the last 2 seconds.

If you run the following code:

import pandas as pd

group2 = (["0"]*9)+(["1%"]*9)
ts = ((["100"]*3)+(["101"]*3)+(["102"]*3))*2
value = range(18)

df = pd.DataFrame({
    "group1":["AAA"]*18,
    "group2": group2,
    "ts":ts,
    "value": value
})
df['timestamp'] = pd.to_datetime(df['ts'], unit='s')

groupby = df.groupby(["group1", "group2", "timestamp"])['value'].agg('mean')
result = groupby.rolling(window="2s", on=groupby.index.levels[2], min_periods=1, closed="left").mean()
# Works as expected
print(result)

It will print

group1  group2  timestamp          
AAA     0       1970-01-01 00:01:40     NaN
                1970-01-01 00:01:41     1.0
                1970-01-01 00:01:42     2.5
        1%      1970-01-01 00:01:40     NaN
                1970-01-01 00:01:41    10.0
                1970-01-01 00:01:42    11.5
Name: value, dtype: float64

Which is expected, there are 2 groups (AAA,0) and (AAA, 1%), each of them has 3 timestamps, and we compute the average of the average over the last 2s.

But now if I add only one timestamp, and run the exact same code, the results are totally messed up

group2 = (["0"]*13)+(["1%"]*13)
ts = ((["100"]*3)+(["101"]*3)+(["102"]*3+(["103"]*4)))*2
value = range(26)
df = pd.DataFrame({
    "group1":["AAA"]*26,
    "group2": group2,
    "ts":ts,
    "value": value
})
df['timestamp'] = pd.to_datetime(df['ts'], unit='s')

groupby = df.groupby(["group1", "group2", "timestamp"])['value'].agg('mean')
result = groupby.rolling(window="2s", on=groupby.index.levels[2], min_periods=1, closed="left").mean()

print(result)

group1  group2  timestamp          
AAA     0       1970-01-01 00:01:40      NaN
                1970-01-01 00:01:41      NaN
                1970-01-01 00:01:42      NaN
                1970-01-01 00:01:43      NaN
        1%      1970-01-01 00:01:40    5.625
                1970-01-01 00:01:41    7.300
                1970-01-01 00:01:42      NaN
                1970-01-01 00:01:43      NaN
Name: value, dtype: float64

Expected Behavior

Instead, I would have expected the following results:

group1  group2  timestamp          
AAA     0       1970-01-01 00:01:40      NaN
                1970-01-01 00:01:41      1.0
                1970-01-01 00:01:42      2.5
                1970-01-01 00:01:43      5.5
        1%      1970-01-01 00:01:40    Nan
                1970-01-01 00:01:41    14.0
                1970-01-01 00:01:42      15.5
                1970-01-01 00:01:43      18.5
Name: value, dtype: float64

Installed Versions

commit : a671b5a
python : 3.12.0.final.0
python-bits : 64
OS : Darwin
OS-release : 23.1.0
Version : Darwin Kernel Version 23.1.0: Mon Oct 9 21:27:24 PDT 2023; root:xnu-10002.41.9~6/RELEASE_ARM64_T6000
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8

pandas : 2.1.4
numpy : 1.26.2
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 68.2.2
pip : 23.2.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.9.3
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.18.0
pandas_datareader : None
bs4 : 4.12.2
bottleneck : None
dataframe-api-compat: None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.8.2
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.11.4
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None

The text was updated successfully, but these errors were encountered:

adityach007 · 2023-12-17T16:23:54Z

The issue you're facing seems to be related to the behavior of rolling operations with' 2s' windows in a multi-indexed DataFrame based on timestamps. Specifically, the behavior seems to change when additional timestamps are added, leading to unexpected results.

From the your code, when more timestamps are added, the rolling average seems to be affected, resulting in unexpected NaN values instead of the expected rolling average over the last 2 seconds for each group.

import pandas as pd

Your DataFrame creation code here...

df['timestamp'] = pd.to_datetime(df['ts'], unit='s')

Ensure the DataFrame is sorted by the timestamp

df = df.sort_values(['group1', 'group2', 'timestamp'])

Perform the groupby operation and compute rolling mean

groupby = df.groupby(["group1", "group2"])['value']
result = groupby.rolling(window="2s", on='timestamp').mean().reset_index(level=[0, 1])

print(result)

This adjustment might help in achieving the expected behavior of computing the rolling average over the last 2 seconds for each group, considering the timestamps within the DataFrame.

kirby88 · 2023-12-17T19:20:34Z

@adityach007 Not only your answer doesn't compile (timestamp is not part of the group by columns), but it doesn't do what I asked (I want to do an rolling aggregation of an aggregation by timestamp).

rhshadrach · 2023-12-18T16:27:11Z

@kirby88 - a small modification to @adityach007's suggestion makes it work.

groupby = df.groupby(["group1", "group2", "timestamp"], as_index=False)['value'].agg('mean')
print(
    groupby.groupby(["group1", "group2"])
    .rolling(window="2s", on="timestamp", min_periods=1, closed="left")
    .mean()
)

Thanks @adityach007!

Regarding your original code, you are rolling on groupby.index.levels[2]. That is not the same as groupby.index.get_level_values(2). But then trying to use this, you have issues with sorting. That is why you should use .groupby(...).rolling(...) as @adityach007 has suggsted.

kirby88 · 2023-12-19T17:15:59Z

Understood. Thank you for having a look and providing a solution.

kirby88 added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Dec 17, 2023

rhshadrach added Window rolling, ewma, expanding Closing Candidate May be closeable, needs more eyeballs and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Dec 18, 2023

rhshadrach closed this as completed Dec 19, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: Undefined behavior when rolling on datetime in a multi index of a groupby. #56530

BUG: Undefined behavior when rolling on datetime in a multi index of a groupby. #56530

kirby88 commented Dec 17, 2023 •

edited

Loading

adityach007 commented Dec 17, 2023

kirby88 commented Dec 17, 2023

rhshadrach commented Dec 18, 2023

kirby88 commented Dec 19, 2023

BUG: Undefined behavior when rolling on datetime in a multi index of a groupby. #56530

BUG: Undefined behavior when rolling on datetime in a multi index of a groupby. #56530

Comments

kirby88 commented Dec 17, 2023 • edited Loading

Pandas version checks

Reproducible Example

Issue Description

Expected Behavior

Installed Versions

adityach007 commented Dec 17, 2023

Your DataFrame creation code here...

Ensure the DataFrame is sorted by the timestamp

Perform the groupby operation and compute rolling mean

kirby88 commented Dec 17, 2023

rhshadrach commented Dec 18, 2023

kirby88 commented Dec 19, 2023

kirby88 commented Dec 17, 2023 •

edited

Loading