Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-46858][PYTHON][PS][BUILD] Upgrade Pandas to 2.2.0 #44881

Closed
wants to merge 31 commits into from

Conversation

itholic
Copy link
Contributor

@itholic itholic commented Jan 25, 2024

What changes were proposed in this pull request?

This PR proposes to upgrade Pandas to 2.2.0.

See What's new in 2.2.0 (January 19, 2024)

Why are the changes needed?

Pandas 2.2.0 is released, and we should support the latest Pandas.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

The existing CI should pass

Was this patch authored or co-authored using generative AI tooling?

No.

dev/infra/Dockerfile Outdated Show resolved Hide resolved
@itholic itholic changed the title [SPARK-46858][PYTHON][PS][BUILD] Upgrade Pandas to 2.2.0 [WIP][SPARK-46858][PYTHON][PS][BUILD] Upgrade Pandas to 2.2.0 Jan 26, 2024
@itholic itholic marked this pull request as draft January 26, 2024 01:12
dev/infra/Dockerfile Outdated Show resolved Hide resolved
Comment on lines 10610 to 10612
elif isinstance(var_name, str):
elif is_list_like(var_name):
raise ValueError(f"{var_name=} must be a scalar.")
else:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@itholic itholic changed the title [WIP][SPARK-46858][PYTHON][PS][BUILD] Upgrade Pandas to 2.2.0 [SPARK-46858][PYTHON][PS][BUILD] Upgrade Pandas to 2.2.0 Feb 13, 2024
@itholic itholic marked this pull request as ready for review February 13, 2024 10:42
Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, Pandas seems to change again. 😞

AssertionError: Series are different

Series values are different (33.33333 %)
[index]: [0, 1, 2]
[left]:  [0, -1, NaN]
[right]: [0, -1, None]

During handling of the above exception, another exception occurred:

Could you check the failures?

@itholic
Copy link
Contributor Author

itholic commented Feb 14, 2024

Yeah, Pandas fixes many bugs from Pandas 2.2.0 that brings couple of behavior changes 😢

Let me fix them. Thanks for the confirm!

Comment on lines +370 to +371
def _calculate_bins(self, data, bins):
return bins
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pandas recently pushed couple of commits for refactoring the internal plotting structure such as pandas-dev/pandas#55850 or pandas-dev/pandas#55872, so we also should inherits couple of internal methods to follow the latest Pandas behavior.

Comment on lines -2557 to +2560
new_objs.append(obj.to_frame(DEFAULT_SERIES_NAME))
if not ignore_index and not should_return_series:
new_objs.append(obj.to_frame())
else:
new_objs.append(obj.to_frame(DEFAULT_SERIES_NAME))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@itholic itholic marked this pull request as ready for review February 20, 2024 06:11
@itholic
Copy link
Contributor Author

itholic commented Feb 20, 2024

I believe now this PR completed to address all of Pandas 2.2.0 behavior. cc @HyukjinKwon @dongjoon-hyun FYI

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have two questions.

  • Is the change of python/pyspark/pandas/resample.py safe?
  • What happens when the users decide to use old Pandas (<= 2.2.0)?

@itholic
Copy link
Contributor Author

itholic commented Feb 20, 2024

  • Is the change of python/pyspark/pandas/resample.py safe?

It breaks the previous behavior, so if we plan to release other minor release (Spark 3.6.0) this should not be included.

  • What happens when the users decide to use old Pandas (<= 2.2.0)?

Using deprecated aliases (Y, M, H, T, S) wouldn't work.

@itholic
Copy link
Contributor Author

itholic commented Feb 20, 2024

We should not bring any breaking change. Let me address them.

Thanks, @dongjoon-hyun for double checking.

@itholic
Copy link
Contributor Author

itholic commented Feb 20, 2024

Oh, wait.

I just remembered that we just follow the Pandas behavior and separately mention the breaking changes into release note.

- In Spark 4.0, it is recommended to use Pandas version 2.0.0 or above with PySpark for optimal compatibility.
- In Spark 4.0, the minimum supported version for Pandas has been raised from 1.0.5 to 1.4.4 in PySpark.
...
- In Spark 4.0, when applying astype to a decimal type object, the existing missing value is changed to True instead of False from Pandas API on Spark.
- In Spark 4.0, pyspark.testing.assertPandasOnSparkEqual has been removed from Pandas API on Spark, use pyspark.pandas.testing.assert_frame_equal instead.

So maybe we should add a release note instead of reverting the breaking changes here? @dongjoon-hyun @HyukjinKwon

@github-actions github-actions bot added the DOCS label Feb 20, 2024
@itholic
Copy link
Contributor Author

itholic commented Feb 20, 2024

Just updated to resample work in old Pandas as well.

I think we can just make it as deprecate for now to avoid breaking the existing pipeline. (Also updated the release note)

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much, @itholic .

@dongjoon-hyun
Copy link
Member

Merged to master.

Thank you again, @itholic and @HyukjinKwon .

@bjornjorgensen
Copy link
Contributor

Great work @itholic Thank you :)

@itholic
Copy link
Contributor Author

itholic commented Feb 21, 2024

Thank you so much all for the review!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants