Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add Series|Expr.rolling_var and Series|Expr.rolling_std #1451

Open
wants to merge 11 commits into
base: main
Choose a base branch
from

Conversation

FBruzzesi
Copy link
Member

@FBruzzesi FBruzzesi commented Nov 27, 2024

What type of PR is this? (check all applicable)

  • πŸ’Ύ Refactor
  • ✨ Feature
  • πŸ› Bug Fix
  • πŸ”§ Optimization
  • πŸ“ Documentation
  • βœ… Test
  • 🐳 Other

Related issues

Checklist

  • Code follows style guide (ruff)
  • Tests added
  • Documented the changes

If you have comments or can explain your changes, please do so below

Opening as draft, I still need to figure out how to deal with null/nan's to match polars behavior.

Edit: Scary size diff, yet 25%+ is just the copying of methods to the stable api, and another good 10% are docstrings examples

@github-actions github-actions bot added the enhancement New feature or request label Nov 27, 2024
@@ -446,3 +446,39 @@ def _parse_time_format(arr: pa.Array) -> str:
if pc.all(matches.is_valid()).as_py():
return time_fmt
return ""


def pad_series(
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possibly a better name to be found

@FBruzzesi
Copy link
Member Author

FBruzzesi commented Nov 28, 2024

Two main comments:

  • Checking performances: on 1M floats, pyarrow is ~2x slower than pandas πŸ€”
  • Polars changed behavior in v1 by returning nan (rightfully so) instead of 0.0 for windows with not enough finite elements - for now I am raising a NotImplementedError, similarly to ewm_mean

s = pd.Series(values)
window_size = random.randint(2, len(s)) # noqa: S311
min_periods = random.randint(2, window_size) # noqa: S311
ddof = random.randint(0, min_periods - 1) # noqa: S311
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, it took me a bit, but I figured out that the different behavior appear when ddof > min_periods because the denominator can be negative or zero.

Pandas seems to return NaN indistinctly for windows that do not have enough non null elements and these cases, while polars returns inf (and same in pyarrow since we can control that)

@FBruzzesi FBruzzesi marked this pull request as ready for review November 30, 2024 09:44
Comment on lines +3533 to +3544
>>> agnostic_rolling_std(df_pl)
shape: (4, 2)
β”Œβ”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ a ┆ b β”‚
β”‚ --- ┆ --- β”‚
β”‚ f64 ┆ f64 β”‚
β•žβ•β•β•β•β•β•β•ͺ══════════║
β”‚ 1.0 ┆ 0.0 β”‚
β”‚ 2.0 ┆ 0.707107 β”‚
β”‚ null ┆ 0.707107 β”‚
β”‚ 4.0 ┆ 1.414214 β”‚
β””β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

πŸ€” is this a bug in Polars?

In [37]: pl.Series([1,2,None,4]).rolling_std(3, min_periods=1)
Out[37]:
shape: (4,)
Series: '' [f64]
[
        0.0
        0.707107
        0.707107
        1.414214
]

In [38]: pl.Series([1,2,1,4]).rolling_std(3, min_periods=1)
Out[38]:
shape: (4,)
Series: '' [f64]
[
        null
        0.707107
        0.57735
        1.527525
]

I think the first element of the result should be null, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh true! It definitly doesn't look right! Worth reporting upstream I imagine

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants