Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handling conversion of empty categorical with dtype_backend='pyarrow' #59935

Conversation

veljanin
Copy link

@veljanin veljanin commented Oct 2, 2024

Handling the case where conversion of empty categorical series with specified dtype_backend='pyarrow' in convert_dtypes function results in 1) an error (when there are no defined categories) or 2) in null[pyarrow] dtype (when there are some defined categories).

…pe_backend results in error. Since conversion of non-empty categorical returns categorical of 'numpy_nullable' dtype_backend, now, instead of raising an error, we ensure empty categorical is returned as well.
Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! Will also need a note in doc/source/whatsnew/v3.0.0 in the Categorical subsection of Bug fixes.

Comment on lines 1153 to 1156
if isna(input_array).all() and hasattr(input_array, 'categories'):
inferred_dtype = input_array.dtype
else:
inferred_dtype = ArrowDtype(pa_type)
Copy link
Member

@rhshadrach rhshadrach Oct 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of this, can you add the check not isinstance(input_array.dtype, CategoricalDtype) to the if statement that starts on L1142. Then in the case to categorical dtypes, to_pyarrow_type returns None on L1151 and we do not modify inferred_dtype.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point!

df = pd.DataFrame(
{
"A": pd.Series(pd.Categorical([None] * 5)),
"B": pd.Series(pd.Categorical([None] * 5, categories=["B1", "B2"])),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to wrap in pd.Series here; supplying pd.Cateogrical is sufficient in the constructor.

@@ -297,3 +297,16 @@ def test_convert_dtypes_pyarrow_null(self):
result = ser.convert_dtypes(dtype_backend="pyarrow")
expected = pd.Series([None, None], dtype=pd.ArrowDtype(pa.null()))
tm.assert_series_equal(result, expected)

def test_convert_empty_categorical_to_pyarrow(self):
ser = pd.Series(pd.Series(pd.Categorical([None] * 5)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar - You can remove one pd.Series here and can you add # GH#59934 as the first line of this test.

Comment on lines 311 to 312
ser2 = pd.Series(pd.Series(pd.Categorical([None] * 5, categories=["S1", "S2"])))
assert (ser2.cat.categories == ["S1", "S2"]).all(), "Series categories are not empty"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this testing?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here we need to make sure that the set categories in empty categorical are propagated after conversion... i've adapted this in new pr

@rhshadrach rhshadrach added Bug Categorical Categorical Data Type labels Oct 2, 2024
@veljanin
Copy link
Author

veljanin commented Oct 3, 2024

Here's the new PR:
#59952

@rhshadrach
Copy link
Member

@veljanin - please do not open a new PR, just push commits to this PR branch.

@veljanin
Copy link
Author

veljanin commented Oct 4, 2024

All changes should be in place now...

@@ -544,7 +544,7 @@ Bug fixes

Categorical
^^^^^^^^^^^
-
- Bug in :func:`convert_dtypes` with ``dtype_backend='pyarrow'`` parameter where empty categorical series raise error or get converted to null[pyarrow] (:issue:`59934`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- Bug in :func:`convert_dtypes` with ``dtype_backend='pyarrow'`` parameter where empty categorical series raise error or get converted to null[pyarrow] (:issue:`59934`)
- Bug in :meth:`Series.convert_dtypes` with ``dtype_backend='pyarrow'`` parameter where empty categorical series raise error or get converted to null[pyarrow] (:issue:`59934`)

@@ -1151,6 +1152,7 @@ def convert_dtypes(
pa_type = to_pyarrow_type(base_dtype)
if pa_type is not None:
inferred_dtype = ArrowDtype(pa_type)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you revert the addition of this newline

@rhshadrach
Copy link
Member

@veljanin - I think changes look good, can you merge main and resolve the conflict.

Copy link
Contributor

@yuanx749 yuanx749 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think skipping is needed to pass the tests.

@@ -297,3 +297,21 @@ def test_convert_dtypes_pyarrow_null(self):
result = ser.convert_dtypes(dtype_backend="pyarrow")
expected = pd.Series([None, None], dtype=pd.ArrowDtype(pa.null()))
tm.assert_series_equal(result, expected)

def test_convert_empty_categorical_to_pyarrow(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def test_convert_empty_categorical_to_pyarrow(self):
def test_convert_empty_categorical_to_pyarrow(self):
pytest.importorskip("pyarrow")

@veljanin
Copy link
Author

just added skip_if_no decorator to both tests

…ty_categorical' into fixing_pyarrow_conversion_of_empty_categorical
@rhshadrach
Copy link
Member

@veljanin - the CI failures are related and need to be addressed.

@@ -73,9 +73,30 @@ if [[ -z "$CHECK" || "$CHECK" == "docstrings" ]]; then
-i "pandas.Period.freq GL08" \
-i "pandas.Period.ordinal GL08" \
-i "pandas.RangeIndex.from_range PR01,SA01" \
-i "pandas.Series.cat.add_categories PR01,PR02" \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like merge conflict-resolution went the wrong way here. I think these should likely all be removed.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

applied the changes, should be OK now

Comment on lines +53 to +58
assert converted.A.dtype == "category", "Dtype in column A is not 'category'"
assert converted.B.dtype == "category", "Dtype in column B is not 'category'"
assert converted.A.cat.categories.empty, "Categories in column A are not empty"
assert converted.B.cat.categories.isin(
["B1", "B2"]
).all(), "Categories in column B doesn't contain adequate categories"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are all covered in tm.assert_frame_equal above and can be removed.

Comment on lines +310 to +311
assert converted1.dtype == "category", "Series dtype is not 'category'"
assert converted1.cat.categories.empty, "Series categories are not empty"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar, can remove.

Comment on lines +315 to +317
assert converted2.cat.categories.isin(
["S1", "S2"]
).all(), "Categories in ser2 doesn't contain adequate categories"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar, can remove.

Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should also get a note in the whatsnew for v3.0.0

@rhshadrach rhshadrach added this to the 3.0 milestone Nov 7, 2024
Copy link
Contributor

github-actions bot commented Dec 8, 2024

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Dec 8, 2024
@rhshadrach
Copy link
Member

Closing as stale. @veljanin - if you'd like to continue, address the above comments and we're happy to reopen!

@rhshadrach rhshadrach closed this Dec 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Categorical Categorical Data Type Stale
Projects
None yet
3 participants