[SPARK-46858][PYTHON][PS][BUILD] Upgrade Pandas to 2.2.0 #44881

itholic · 2024-01-25T08:57:13Z

What changes were proposed in this pull request?

This PR proposes to upgrade Pandas to 2.2.0.

See What's new in 2.2.0 (January 19, 2024)

Why are the changes needed?

Pandas 2.2.0 is released, and we should support the latest Pandas.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

The existing CI should pass.

Was this patch authored or co-authored using generative AI tooling?

No.

dev/infra/Dockerfile

dev/infra/Dockerfile

python/pyspark/pandas/namespace.py

itholic · 2024-02-13T10:41:23Z

python/pyspark/pandas/frame.py

-        elif isinstance(var_name, str):
+        elif is_list_like(var_name):
+            raise ValueError(f"{var_name=} must be a scalar.")
+        else:


This follows the fix in pandas-dev/pandas#55948.
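
The branch in the diff above can be tried outside of Spark. The following is a minimal sketch, not the actual `frame.py` code: `is_list_like_sketch` is a simplified standard-library stand-in for the `is_list_like` helper the diff calls, and `normalize_var_name` with its `default` parameter is a hypothetical name introduced only for illustration.

```python
from collections.abc import Iterable


def is_list_like_sketch(obj):
    # Simplified stand-in (assumption): any iterable except str/bytes
    # counts as list-like.
    return isinstance(obj, Iterable) and not isinstance(obj, (str, bytes))


def normalize_var_name(var_name, default="variable"):
    # Hypothetical helper mirroring the branch structure in the diff.
    if var_name is None:
        return [default]
    elif is_list_like_sketch(var_name):
        # Same message format as the added line in the diff.
        raise ValueError(f"{var_name=} must be a scalar.")
    else:
        # Any scalar is accepted now, not only str.
        return [var_name]


print(normalize_var_name("key"))
print(normalize_var_name(0))
try:
    normalize_var_name(["a", "b"])
except ValueError as exc:
    print(exc)
```

Under this sketch, the old `isinstance(var_name, str)` check would have let a non-string scalar such as `0` fall through, while the new check rejects list-like values explicitly and accepts every scalar.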

dongjoon-hyun

Unfortunately, Pandas seems to have changed again. 😞

AssertionError: Series are different

Series values are different (33.33333 %)
[index]: [0, 1, 2]
[left]:  [0, -1, NaN]
[right]: [0, -1, None]

During handling of the above exception, another exception occurred:

Could you check the failures?
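
The mismatch in the traceback above comes down to `NaN` and `None` not comparing equal. The following is a minimal sketch in plain Python, assuming a strict element-by-element comparison; it is not the pandas testing implementation, and `is_missing` and `mismatch_percent` are hypothetical helpers.

```python
import math


def is_missing(value):
    # Treat both None and float NaN as "missing".
    return value is None or (isinstance(value, float) and math.isnan(value))


def mismatch_percent(left, right, missing_equal=False):
    # Percentage of positions whose values differ.
    if len(left) != len(right):
        raise ValueError("inputs must have the same length")
    mismatches = 0
    for lhs, rhs in zip(left, right):
        if missing_equal and is_missing(lhs) and is_missing(rhs):
            continue  # NaN and None are considered the same missing marker
        if lhs != rhs:
            mismatches += 1
    return round(100.0 * mismatches / len(left), 5)


left = [0, -1, float("nan")]
right = [0, -1, None]
print(mismatch_percent(left, right))                      # 33.33333
print(mismatch_percent(left, right, missing_equal=True))  # 0.0
```

One differing position out of three gives the 33.33333 % reported in the assertion message.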

itholic · 2024-02-14T01:20:13Z

Yeah, Pandas 2.2.0 fixes many bugs, which brings a couple of behavior changes 😢

Let me fix them. Thanks for confirming!

itholic · 2024-02-16T08:19:53Z

python/pyspark/pandas/plot/matplotlib.py

+    def _calculate_bins(self, data, bins):
+        return bins


Pandas recently pushed a couple of commits that refactor its internal plotting structure, such as pandas-dev/pandas#55850 and pandas-dev/pandas#55872, so we should also override a couple of internal methods to follow the latest Pandas behavior.
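
The pattern behind the `_calculate_bins` change above is a subclass overriding a hook that the parent class calls. The following is a minimal sketch of that pattern only: both classes are hypothetical stand-ins, and nothing here reproduces the real pandas or PySpark plotting classes beyond the `_calculate_bins(self, data, bins)` signature shown in the diff.

```python
class UpstreamHistPlotSketch:
    # Hypothetical stand-in for a parent class that computes bins eagerly.
    def __init__(self, data, bins=10):
        self.data = data
        self.bins = self._calculate_bins(data, bins)

    def _calculate_bins(self, data, bins):
        # Eagerly turn a bin count into evenly spaced bin edges.
        low, high = min(data), max(data)
        width = (high - low) / bins
        return [low + i * width for i in range(bins + 1)]


class DistributedHistPlotSketch(UpstreamHistPlotSketch):
    # Hypothetical subclass that defers the bin computation.
    def _calculate_bins(self, data, bins):
        # Return the bins argument untouched, as in the diff.
        return bins


print(UpstreamHistPlotSketch([0, 1, 2, 3, 4], bins=2).bins)     # [0.0, 2.0, 4.0]
print(DistributedHistPlotSketch([0, 1, 2, 3, 4], bins=2).bins)  # 2
```

Returning `bins` unchanged keeps the parent from doing the computation itself, which is presumably the point when the data does not live in local memory.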

itholic · 2024-02-20T03:06:19Z

python/pyspark/pandas/namespace.py

-            new_objs.append(obj.to_frame(DEFAULT_SERIES_NAME))
+            if not ignore_index and not should_return_series:
+                new_objs.append(obj.to_frame())
+            else:
+                new_objs.append(obj.to_frame(DEFAULT_SERIES_NAME))


Related to pandas-dev/pandas#15047
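
The branch added in the diff above decides which column label a Series gets before concatenation. The following is a minimal sketch of that decision as a plain function; `column_label_for_concat` is a hypothetical name, and the value assigned to `DEFAULT_SERIES_NAME` here is an assumption made for illustration.

```python
DEFAULT_SERIES_NAME = 0  # assumed placeholder value, for illustration only


def column_label_for_concat(series_name, ignore_index, should_return_series):
    # Mirrors the condition in the diff.
    if not ignore_index and not should_return_series:
        # Corresponds to obj.to_frame(): keep the Series' own name.
        return series_name
    # Corresponds to obj.to_frame(DEFAULT_SERIES_NAME): force the placeholder.
    return DEFAULT_SERIES_NAME


print(column_label_for_concat("x", ignore_index=False, should_return_series=False))
print(column_label_for_concat("x", ignore_index=True, should_return_series=False))
```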

itholic · 2024-02-20T06:12:41Z

I believe this PR now addresses all of the Pandas 2.2.0 behavior changes. cc @HyukjinKwon @dongjoon-hyun FYI

python/pyspark/pandas/series.py

dongjoon-hyun

I have two questions.

Is the change of python/pyspark/pandas/resample.py safe?
What happens when the users decide to use old Pandas (<= 2.2.0)?

itholic · 2024-02-20T06:38:29Z

Is the change of python/pyspark/pandas/resample.py safe?

It breaks the previous behavior, so if we plan to ship another minor release (Spark 3.6.0), this should not be included in it.

What happens when the users decide to use old Pandas (<= 2.2.0)?

Using deprecated aliases (Y, M, H, T, S) wouldn't work.
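
One way to keep a resample rule working across Pandas versions is to translate the alias according to the installed version. The following is a minimal sketch under stated assumptions, not the code in `python/pyspark/pandas/resample.py`: the replacement spellings (`YE`, `ME`, `h`, `min`, `s`) are the ones Pandas 2.2.0 introduced for the deprecated aliases listed above, the helper names are hypothetical, and whether an older Pandas accepts a given spelling should be checked against that version.

```python
import re

# Deprecated alias -> spelling introduced in Pandas 2.2.0.
NEW_FROM_DEPRECATED = {"Y": "YE", "M": "ME", "H": "h", "T": "min", "S": "s"}
DEPRECATED_FROM_NEW = {new: old for old, new in NEW_FROM_DEPRECATED.items()}

# Optional integer multiplier followed by an alphabetic alias, e.g. "3M".
_RULE = re.compile(r"^(\d*)([A-Za-z]+)$")


def parse_major_minor(version):
    # "2.2.0" -> (2, 2); only the first two components are used.
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def normalize_rule(rule, pandas_version):
    # Hypothetical helper: rewrite the alias for the given Pandas version.
    match = _RULE.match(rule)
    if match is None:
        return rule  # leave anything more complex untouched
    multiplier, alias = match.groups()
    if parse_major_minor(pandas_version) >= (2, 2):
        alias = NEW_FROM_DEPRECATED.get(alias, alias)
    else:
        alias = DEPRECATED_FROM_NEW.get(alias, alias)
    return multiplier + alias


print(normalize_rule("3M", "2.2.0"))   # 3ME
print(normalize_rule("ME", "2.1.4"))   # M
print(normalize_rule("min", "2.2.0"))  # min
```

The mapping is case-sensitive on purpose, because `M` and `min`, or `S` and `s`, mean different things depending on the direction of the translation.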

itholic · 2024-02-20T06:54:41Z

We should not introduce any breaking changes. Let me address them.

Thanks, @dongjoon-hyun, for double-checking.

itholic · 2024-02-20T07:10:21Z

Oh, wait.

I just remembered that we simply follow the Pandas behavior and mention the breaking changes separately in the release notes, for example:

- In Spark 4.0, it is recommended to use Pandas version 2.0.0 or above with PySpark for optimal compatibility.
- In Spark 4.0, the minimum supported version for Pandas has been raised from 1.0.5 to 1.4.4 in PySpark.
...
- In Spark 4.0, when applying astype to a decimal type object, the existing missing value is changed to True instead of False from Pandas API on Spark.
- In Spark 4.0, pyspark.testing.assertPandasOnSparkEqual has been removed from Pandas API on Spark, use pyspark.pandas.testing.assert_frame_equal instead.

So maybe we should add a release note instead of reverting the breaking changes here? @dongjoon-hyun @HyukjinKwon

itholic · 2024-02-20T07:23:48Z

Just updated resample to work with old Pandas as well.

I think we can just mark it as deprecated for now to avoid breaking existing pipelines. (I also updated the release note.)
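
Marking something as deprecated rather than removing it usually means still accepting the old input while emitting a warning. The following is a minimal sketch of that approach using the standard `warnings` module; `accept_rule` is a hypothetical function, and the alias table repeats the deprecated aliases mentioned earlier in this thread with their Pandas 2.2.0 replacements.

```python
import warnings

# Deprecated alias -> replacement spelling in Pandas 2.2.0.
DEPRECATED_ALIASES = {"Y": "YE", "M": "ME", "H": "h", "T": "min", "S": "s"}


def accept_rule(alias):
    # Hypothetical helper: keep accepting a deprecated alias, but warn.
    replacement = DEPRECATED_ALIASES.get(alias)
    if replacement is not None:
        warnings.warn(
            f"'{alias}' is deprecated; use '{replacement}' instead.",
            FutureWarning,
            stacklevel=2,
        )
    return alias  # the caller still receives a usable rule


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    accept_rule("M")
    print(len(caught), caught[0].category.__name__)
```

Existing pipelines keep running, and users see which spelling to migrate to before the old one stops working.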

dongjoon-hyun

Thank you so much, @itholic .

dongjoon-hyun · 2024-02-20T15:48:00Z

Merged to master.

Thank you again, @itholic and @HyukjinKwon .

bjornjorgensen · 2024-02-20T20:54:22Z

Great work, @itholic. Thank you :)

itholic · 2024-02-21T00:57:40Z

Thank you all so much for the review!

[SPARK-46858][PYTHON][PS][INFRA] Upgrade Pandas to 2.2.0

9ae857a

github-actions bot added the BUILD, PYTHON, and PANDAS API ON SPARK labels Jan 25, 2024

zhengruifeng reviewed Jan 25, 2024

dev/infra/Dockerfile (resolved)

zhengruifeng reviewed Jan 25, 2024

dev/infra/Dockerfile (outdated, resolved)

itholic changed the title ~~[SPARK-46858][PYTHON][PS][BUILD] Upgrade Pandas to 2.2.0~~ [WIP][SPARK-46858][PYTHON][PS][BUILD] Upgrade Pandas to 2.2.0 Jan 26, 2024

itholic marked this pull request as draft January 26, 2024 01:12

itholic added 3 commits January 29, 2024 10:41

Merge branch 'master' of https://github.com/apache/spark into pandas_…

5caa678

…2.2.0

pin version

e9a6445

fix series default name issue

edb3d9a

itholic commented Jan 29, 2024

dev/infra/Dockerfile (outdated, resolved)

upperbound for PyPy3

5440381

dongjoon-hyun reviewed Jan 30, 2024

python/pyspark/pandas/namespace.py (outdated, resolved)

itholic added 5 commits February 13, 2024 18:59

Merge branch 'master' of https://github.com/apache/spark into pandas_…

3e66505

…2.2.0

Fix melt

8643ebd

Fix test util related changes

a8237b4

Fix more test utils

836dcfe

Fix resample test

d3c5f57

github-actions bot added the SQL and CONNECT labels Feb 13, 2024

itholic commented Feb 13, 2024

itholic changed the title ~~[WIP][SPARK-46858][PYTHON][PS][BUILD] Upgrade Pandas to 2.2.0~~ [SPARK-46858][PYTHON][PS][BUILD] Upgrade Pandas to 2.2.0 Feb 13, 2024

itholic marked this pull request as ready for review February 13, 2024 10:42

dongjoon-hyun reviewed Feb 13, 2024

itholic added 4 commits February 14, 2024 11:21

Rule code mapping

66f69a2

Fix booleanops tests

9d4e8a1

use proper rule code

37300e8

Fix unsupported cases

ea57fdb

itholic added 5 commits February 16, 2024 09:20

ResampleSeriesTests

f235780

Fix ReverseTests

ad67735

Merge branch 'master' of https://github.com/apache/spark into pandas_…

4e6c77a

…2.2.0

revert unrelated changes

26b7bd6

Fix plotting

4c84b2a

itholic commented Feb 16, 2024

itholic added 4 commits February 19, 2024 12:22

Fix BoxPlot

fbbaf88

Fix concat bug in Pandas

0ca4aa6

Fix DataFrame hist plot

7536263

Merge branch 'master' of https://github.com/apache/spark into pandas_…

b07e608

…2.2.0

itholic commented Feb 20, 2024

itholic marked this pull request as ready for review February 20, 2024 06:11

dongjoon-hyun reviewed Feb 20, 2024

python/pyspark/pandas/series.py (resolved)

dongjoon-hyun reviewed Feb 20, 2024

Add release note

acd7b7f

github-actions bot added the DOCS label Feb 20, 2024

itholic added 2 commits February 20, 2024 16:16

has -> have

d560825

Make resample work in old pandas as well

6de7931

HyukjinKwon approved these changes Feb 20, 2024

dongjoon-hyun approved these changes Feb 20, 2024

dongjoon-hyun closed this in 8e82887 Feb 20, 2024
