[SPARK-46858][PYTHON][PS][BUILD] Upgrade Pandas to 2.2.0 #44881
Conversation
python/pyspark/pandas/frame.py (outdated):

```python
elif isinstance(var_name, str):
elif is_list_like(var_name):
    raise ValueError(f"{var_name=} must be a scalar.")
else:
```
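The ordering of the checks matters because Python strings are themselves iterable. A minimal self-contained sketch of the validation (using a simplified stand-in for pandas' `is_list_like`; the helper name is illustrative, not the actual frame.py code):

```python
def is_list_like(obj):
    # Simplified stand-in for pandas.api.types.is_list_like: any iterable
    # that is not a plain string counts as list-like here.
    return hasattr(obj, "__iter__") and not isinstance(obj, str)

def validate_var_name(var_name):
    # Hypothetical helper: strings pass as scalars, other list-likes are
    # rejected. The str branch must come first, since strings are iterable.
    if isinstance(var_name, str):
        return var_name
    elif is_list_like(var_name):
        raise ValueError(f"{var_name=} must be a scalar.")
    return var_name
```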
Fixed from: pandas-dev/pandas#55948
Unfortunately, Pandas behavior seems to have changed again. 😞
```
AssertionError: Series are different

Series values are different (33.33333 %)
[index]: [0, 1, 2]
[left]:  [0, -1, NaN]
[right]: [0, -1, None]

During handling of the above exception, another exception occurred:
```
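The mismatch comes down to how `NaN` and `None` compare. A small pure-Python illustration (independent of the pandas test utilities):

```python
import math

left = [0, -1, float("nan")]
right = [0, -1, None]

def values_equal(a, b):
    # NaN is a float that never compares equal to anything, including
    # itself; None is a distinct singleton object. For comparison
    # purposes here, two NaNs are treated as equal.
    if isinstance(a, float) and math.isnan(a):
        return isinstance(b, float) and math.isnan(b)
    return a == b

# Only the last element differs: 1 of 3 values, i.e. the 33.33333 %
# reported by the failing assertion above.
mismatches = [i for i, (a, b) in enumerate(zip(left, right)) if not values_equal(a, b)]
```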
Could you check the failures?
Yeah, Pandas 2.2.0 fixes many bugs, which brings a couple of behavior changes 😢 Let me fix them. Thanks for confirming!
```python
def _calculate_bins(self, data, bins):
    return bins
```
Pandas recently pushed a couple of commits refactoring the internal plotting structure, such as pandas-dev/pandas#55850 and pandas-dev/pandas#55872, so we should also inherit a couple of internal methods to follow the latest Pandas behavior.
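A rough sketch of that inheritance pattern (the class names and the base-class body here are illustrative, not the actual pandas internals):

```python
class PandasHistPlot:
    # Illustrative stand-in for a pandas internal plotting class whose
    # hooks were moved around during the refactoring.
    def _calculate_bins(self, data, bins):
        # pandas would derive bin edges from the raw data here.
        return bins

class SparkHistPlot(PandasHistPlot):
    # pandas-on-Spark precomputes bins on the Spark side, so the
    # inherited hook is overridden to pass the precomputed bins through.
    def _calculate_bins(self, data, bins):
        return bins
```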
```diff
- new_objs.append(obj.to_frame(DEFAULT_SERIES_NAME))
+ if not ignore_index and not should_return_series:
+     new_objs.append(obj.to_frame())
+ else:
+     new_objs.append(obj.to_frame(DEFAULT_SERIES_NAME))
```
Related to pandas-dev/pandas#15047
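A minimal sketch of the branching above, reduced to the column-name choice it controls (the helper is hypothetical, and `DEFAULT_SERIES_NAME` is shown here with a placeholder value rather than the real pyspark.pandas constant):

```python
DEFAULT_SERIES_NAME = "0"  # placeholder; the real constant lives in pyspark.pandas

def column_name_for_concat(series_name, ignore_index, should_return_series):
    # Keep the user's Series name only when the concatenated result is a
    # DataFrame and the index is preserved (obj.to_frame()); otherwise
    # fall back to the default placeholder name, as in
    # obj.to_frame(DEFAULT_SERIES_NAME).
    if not ignore_index and not should_return_series:
        return series_name
    return DEFAULT_SERIES_NAME
```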
I believe this PR now addresses all of the Pandas 2.2.0 behavior changes. cc @HyukjinKwon @dongjoon-hyun FYI
I have two questions.
- Is the change of `python/pyspark/pandas/resample.py` safe?
- What happens when the users decide to use old Pandas (<= 2.2.0)?
It breaks the previous behavior, so if we plan another minor release (Spark 3.6.0), this should not be included.
Using deprecated aliases (…)
We should not bring any breaking change. Let me address them. Thanks, @dongjoon-hyun, for double-checking.
Oh, wait. I just remembered that we usually follow the Pandas behavior and separately mention the breaking changes in the release notes.
So maybe we should add a release note instead of reverting the breaking changes here? @dongjoon-hyun @HyukjinKwon
Just updated resample to work with old Pandas as well. I think we can just make it a deprecation for now to avoid breaking existing pipelines. (Also updated the release note.)
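Pandas 2.2.0 deprecates some offset aliases in favor of new spellings (e.g. "M" becomes "ME"). A shim along these lines keeps old user code working while warning about the deprecation; the function name is illustrative and the mapping below shows only a few entries, not the complete pandas alias table:

```python
import warnings

# Illustrative subset of the pandas 2.2 alias renames.
_ALIAS_MAP = {"M": "ME", "Q": "QE", "Y": "YE"}

def normalize_rule(rule: str) -> str:
    # Accept a deprecated alias, warn, and translate it to the new
    # spelling; pass every other rule through unchanged.
    if rule in _ALIAS_MAP:
        warnings.warn(
            f"Frequency alias {rule!r} is deprecated; use {_ALIAS_MAP[rule]!r} instead.",
            FutureWarning,
        )
        return _ALIAS_MAP[rule]
    return rule
```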
Thank you so much, @itholic.
Merged to master. Thank you again, @itholic and @HyukjinKwon.
Great work @itholic, thank you :)
Thank you so much all for the review!
What changes were proposed in this pull request?
This PR proposes to upgrade Pandas to 2.2.0.
See What's new in 2.2.0 (January 19, 2024)
Why are the changes needed?
Pandas 2.2.0 is released, and we should support the latest Pandas.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
The existing CI should pass.
Was this patch authored or co-authored using generative AI tooling?
No.