You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
https://www.online-python.com/Hj96OPatUx
Issue Description
I have a dataframe that consist of a 4 columns: 2 columns that are part of the multi index, one column for time and one for a value.
Now, I want to group by first 2 columns, do the average of the value per timestamp, and then do a rolling average of the last 2 seconds.
If you run the following code:
importpandasaspdgroup2= (["0"]*9)+(["1%"]*9)
ts= ((["100"]*3)+(["101"]*3)+(["102"]*3))*2value=range(18)
df=pd.DataFrame({
"group1":["AAA"]*18,
"group2": group2,
"ts":ts,
"value": value
})
df['timestamp'] =pd.to_datetime(df['ts'], unit='s')
groupby=df.groupby(["group1", "group2", "timestamp"])['value'].agg('mean')
result=groupby.rolling(window="2s", on=groupby.index.levels[2], min_periods=1, closed="left").mean()
# Works as expectedprint(result)
Which is expected, there are 2 groups (AAA,0) and (AAA, 1%), each of them has 3 timestamps, and we compute the average of the average over the last 2s.
But now if I add only one timestamp, and run the exact same code, the results are totally messed up
The issue you're facing seems to be related to the behavior of rolling operations with' 2s' windows in a multi-indexed DataFrame based on timestamps. Specifically, the behavior seems to change when additional timestamps are added, leading to unexpected results.
From the your code, when more timestamps are added, the rolling average seems to be affected, resulting in unexpected NaN values instead of the expected rolling average over the last 2 seconds for each group.
Perform the groupby operation and compute rolling mean
groupby = df.groupby(["group1", "group2"])['value']
result = groupby.rolling(window="2s", on='timestamp').mean().reset_index(level=[0, 1])
print(result)
This adjustment might help in achieving the expected behavior of computing the rolling average over the last 2 seconds for each group, considering the timestamps within the DataFrame.
@adityach007 Not only your answer doesn't compile (timestamp is not part of the group by columns), but it doesn't do what I asked (I want to do an rolling aggregation of an aggregation by timestamp).
Regarding your original code, you are rolling on groupby.index.levels[2]. That is not the same as groupby.index.get_level_values(2). But then trying to use this, you have issues with sorting. That is why you should use .groupby(...).rolling(...) as @adityach007 has suggsted.
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Issue Description
I have a dataframe that consist of a 4 columns: 2 columns that are part of the multi index, one column for time and one for a value.
Now, I want to group by first 2 columns, do the average of the value per timestamp, and then do a rolling average of the last 2 seconds.
If you run the following code:
It will print
Which is expected, there are 2 groups (AAA,0) and (AAA, 1%), each of them has 3 timestamps, and we compute the average of the average over the last 2s.
But now if I add only one timestamp, and run the exact same code, the results are totally messed up
Expected Behavior
Instead, I would have expected the following results:
Installed Versions
commit : a671b5a
python : 3.12.0.final.0
python-bits : 64
OS : Darwin
OS-release : 23.1.0
Version : Darwin Kernel Version 23.1.0: Mon Oct 9 21:27:24 PDT 2023; root:xnu-10002.41.9~6/RELEASE_ARM64_T6000
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8
pandas : 2.1.4
numpy : 1.26.2
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 68.2.2
pip : 23.2.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.9.3
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.18.0
pandas_datareader : None
bs4 : 4.12.2
bottleneck : None
dataframe-api-compat: None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.8.2
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.11.4
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None
The text was updated successfully, but these errors were encountered: