Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Undefined behavior when rolling on datetime in a multi index of a groupby. #56530

Closed
3 tasks done
kirby88 opened this issue Dec 17, 2023 · 4 comments
Closed
3 tasks done
Labels
Bug Closing Candidate May be closeable, needs more eyeballs Window rolling, ewma, expanding

Comments

@kirby88
Copy link

kirby88 commented Dec 17, 2023

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

https://www.online-python.com/Hj96OPatUx

Issue Description

I have a dataframe that consist of a 4 columns: 2 columns that are part of the multi index, one column for time and one for a value.
Now, I want to group by first 2 columns, do the average of the value per timestamp, and then do a rolling average of the last 2 seconds.

If you run the following code:

import pandas as pd

group2 = (["0"]*9)+(["1%"]*9)
ts = ((["100"]*3)+(["101"]*3)+(["102"]*3))*2
value = range(18)

df = pd.DataFrame({
    "group1":["AAA"]*18,
    "group2": group2,
    "ts":ts,
    "value": value
})
df['timestamp'] = pd.to_datetime(df['ts'], unit='s')

groupby = df.groupby(["group1", "group2", "timestamp"])['value'].agg('mean')
result = groupby.rolling(window="2s", on=groupby.index.levels[2], min_periods=1, closed="left").mean()
# Works as expected
print(result)

It will print

group1  group2  timestamp          
AAA     0       1970-01-01 00:01:40     NaN
                1970-01-01 00:01:41     1.0
                1970-01-01 00:01:42     2.5
        1%      1970-01-01 00:01:40     NaN
                1970-01-01 00:01:41    10.0
                1970-01-01 00:01:42    11.5
Name: value, dtype: float64

Which is expected, there are 2 groups (AAA,0) and (AAA, 1%), each of them has 3 timestamps, and we compute the average of the average over the last 2s.

But now if I add only one timestamp, and run the exact same code, the results are totally messed up

group2 = (["0"]*13)+(["1%"]*13)
ts = ((["100"]*3)+(["101"]*3)+(["102"]*3+(["103"]*4)))*2
value = range(26)
df = pd.DataFrame({
    "group1":["AAA"]*26,
    "group2": group2,
    "ts":ts,
    "value": value
})
df['timestamp'] = pd.to_datetime(df['ts'], unit='s')

groupby = df.groupby(["group1", "group2", "timestamp"])['value'].agg('mean')
result = groupby.rolling(window="2s", on=groupby.index.levels[2], min_periods=1, closed="left").mean()

print(result)
group1  group2  timestamp          
AAA     0       1970-01-01 00:01:40      NaN
                1970-01-01 00:01:41      NaN
                1970-01-01 00:01:42      NaN
                1970-01-01 00:01:43      NaN
        1%      1970-01-01 00:01:40    5.625
                1970-01-01 00:01:41    7.300
                1970-01-01 00:01:42      NaN
                1970-01-01 00:01:43      NaN
Name: value, dtype: float64

Expected Behavior

Instead, I would have expected the following results:

group1  group2  timestamp          
AAA     0       1970-01-01 00:01:40      NaN
                1970-01-01 00:01:41      1.0
                1970-01-01 00:01:42      2.5
                1970-01-01 00:01:43      5.5
        1%      1970-01-01 00:01:40    Nan
                1970-01-01 00:01:41    14.0
                1970-01-01 00:01:42      15.5
                1970-01-01 00:01:43      18.5
Name: value, dtype: float64

Installed Versions

commit : a671b5a
python : 3.12.0.final.0
python-bits : 64
OS : Darwin
OS-release : 23.1.0
Version : Darwin Kernel Version 23.1.0: Mon Oct 9 21:27:24 PDT 2023; root:xnu-10002.41.9~6/RELEASE_ARM64_T6000
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8

pandas : 2.1.4
numpy : 1.26.2
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 68.2.2
pip : 23.2.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.9.3
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.18.0
pandas_datareader : None
bs4 : 4.12.2
bottleneck : None
dataframe-api-compat: None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.8.2
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.11.4
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None

@kirby88 kirby88 added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Dec 17, 2023
@adityach007
Copy link

The issue you're facing seems to be related to the behavior of rolling operations with' 2s' windows in a multi-indexed DataFrame based on timestamps. Specifically, the behavior seems to change when additional timestamps are added, leading to unexpected results.

From the your code, when more timestamps are added, the rolling average seems to be affected, resulting in unexpected NaN values instead of the expected rolling average over the last 2 seconds for each group.

import pandas as pd

Your DataFrame creation code here...

df['timestamp'] = pd.to_datetime(df['ts'], unit='s')

Ensure the DataFrame is sorted by the timestamp

df = df.sort_values(['group1', 'group2', 'timestamp'])

Perform the groupby operation and compute rolling mean

groupby = df.groupby(["group1", "group2"])['value']
result = groupby.rolling(window="2s", on='timestamp').mean().reset_index(level=[0, 1])

print(result)

This adjustment might help in achieving the expected behavior of computing the rolling average over the last 2 seconds for each group, considering the timestamps within the DataFrame.

@kirby88
Copy link
Author

kirby88 commented Dec 17, 2023

@adityach007 Not only your answer doesn't compile (timestamp is not part of the group by columns), but it doesn't do what I asked (I want to do an rolling aggregation of an aggregation by timestamp).

@rhshadrach
Copy link
Member

@kirby88 - a small modification to @adityach007's suggestion makes it work.

groupby = df.groupby(["group1", "group2", "timestamp"], as_index=False)['value'].agg('mean')
print(
    groupby.groupby(["group1", "group2"])
    .rolling(window="2s", on="timestamp", min_periods=1, closed="left")
    .mean()
)

Thanks @adityach007!

Regarding your original code, you are rolling on groupby.index.levels[2]. That is not the same as groupby.index.get_level_values(2). But then trying to use this, you have issues with sorting. That is why you should use .groupby(...).rolling(...) as @adityach007 has suggsted.

@rhshadrach rhshadrach added Window rolling, ewma, expanding Closing Candidate May be closeable, needs more eyeballs and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Dec 18, 2023
@kirby88
Copy link
Author

kirby88 commented Dec 19, 2023

Understood. Thank you for having a look and providing a solution.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Closing Candidate May be closeable, needs more eyeballs Window rolling, ewma, expanding
Projects
None yet
Development

No branches or pull requests

3 participants