-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Handling conversion of empty categorical with dtype_backend='pyarrow' #59935
Handling conversion of empty categorical with dtype_backend='pyarrow' #59935
Conversation
…pe_backend results in error. Since conversion of non-empty categorical returns categorical of 'numpy_nullable' dtype_backend, now, instead of raising an error, we ensure empty categorical is returned as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR! Will also need a note in doc/source/whatsnew/v3.0.0
in the Categorical
subsection of Bug fixes
.
pandas/core/dtypes/cast.py
Outdated
if isna(input_array).all() and hasattr(input_array, 'categories'): | ||
inferred_dtype = input_array.dtype | ||
else: | ||
inferred_dtype = ArrowDtype(pa_type) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of this, can you add the check not isinstance(input_array.dtype, CategoricalDtype)
to the if statement that starts on L1142. Then in the case to categorical dtypes, to_pyarrow_type
returns None
on L1151 and we do not modify inferred_dtype
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point!
df = pd.DataFrame( | ||
{ | ||
"A": pd.Series(pd.Categorical([None] * 5)), | ||
"B": pd.Series(pd.Categorical([None] * 5, categories=["B1", "B2"])), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to wrap in pd.Series
here; supplying pd.Cateogrical
is sufficient in the constructor.
@@ -297,3 +297,16 @@ def test_convert_dtypes_pyarrow_null(self): | |||
result = ser.convert_dtypes(dtype_backend="pyarrow") | |||
expected = pd.Series([None, None], dtype=pd.ArrowDtype(pa.null())) | |||
tm.assert_series_equal(result, expected) | |||
|
|||
def test_convert_empty_categorical_to_pyarrow(self): | |||
ser = pd.Series(pd.Series(pd.Categorical([None] * 5))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar - You can remove one pd.Series
here and can you add # GH#59934
as the first line of this test.
ser2 = pd.Series(pd.Series(pd.Categorical([None] * 5, categories=["S1", "S2"]))) | ||
assert (ser2.cat.categories == ["S1", "S2"]).all(), "Series categories are not empty" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is this testing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here we need to make sure that the set categories in empty categorical are propagated after conversion... i've adapted this in new pr
Here's the new PR: |
@veljanin - please do not open a new PR, just push commits to this PR branch. |
All changes should be in place now... |
doc/source/whatsnew/v3.0.0.rst
Outdated
@@ -544,7 +544,7 @@ Bug fixes | |||
|
|||
Categorical | |||
^^^^^^^^^^^ | |||
- | |||
- Bug in :func:`convert_dtypes` with ``dtype_backend='pyarrow'`` parameter where empty categorical series raise error or get converted to null[pyarrow] (:issue:`59934`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Bug in :func:`convert_dtypes` with ``dtype_backend='pyarrow'`` parameter where empty categorical series raise error or get converted to null[pyarrow] (:issue:`59934`) | |
- Bug in :meth:`Series.convert_dtypes` with ``dtype_backend='pyarrow'`` parameter where empty categorical series raise error or get converted to null[pyarrow] (:issue:`59934`) |
pandas/core/dtypes/cast.py
Outdated
@@ -1151,6 +1152,7 @@ def convert_dtypes( | |||
pa_type = to_pyarrow_type(base_dtype) | |||
if pa_type is not None: | |||
inferred_dtype = ArrowDtype(pa_type) | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you revert the addition of this newline
@veljanin - I think changes look good, can you merge main and resolve the conflict. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think skipping is needed to pass the tests.
@@ -297,3 +297,21 @@ def test_convert_dtypes_pyarrow_null(self): | |||
result = ser.convert_dtypes(dtype_backend="pyarrow") | |||
expected = pd.Series([None, None], dtype=pd.ArrowDtype(pa.null())) | |||
tm.assert_series_equal(result, expected) | |||
|
|||
def test_convert_empty_categorical_to_pyarrow(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def test_convert_empty_categorical_to_pyarrow(self): | |
def test_convert_empty_categorical_to_pyarrow(self): | |
pytest.importorskip("pyarrow") |
just added skip_if_no decorator to both tests |
…ty_categorical' into fixing_pyarrow_conversion_of_empty_categorical
@veljanin - the CI failures are related and need to be addressed. |
…ersion_of_empty_categorical
ci/code_checks.sh
Outdated
@@ -73,9 +73,30 @@ if [[ -z "$CHECK" || "$CHECK" == "docstrings" ]]; then | |||
-i "pandas.Period.freq GL08" \ | |||
-i "pandas.Period.ordinal GL08" \ | |||
-i "pandas.RangeIndex.from_range PR01,SA01" \ | |||
-i "pandas.Series.cat.add_categories PR01,PR02" \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like merge conflict-resolution went the wrong way here. I think these should likely all be removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
applied the changes, should be OK now
assert converted.A.dtype == "category", "Dtype in column A is not 'category'" | ||
assert converted.B.dtype == "category", "Dtype in column B is not 'category'" | ||
assert converted.A.cat.categories.empty, "Categories in column A are not empty" | ||
assert converted.B.cat.categories.isin( | ||
["B1", "B2"] | ||
).all(), "Categories in column B doesn't contain adequate categories" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are all covered in tm.assert_frame_equal
above and can be removed.
assert converted1.dtype == "category", "Series dtype is not 'category'" | ||
assert converted1.cat.categories.empty, "Series categories are not empty" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar, can remove.
assert converted2.cat.categories.isin( | ||
["S1", "S2"] | ||
).all(), "Categories in ser2 doesn't contain adequate categories" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar, can remove.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should also get a note in the whatsnew for v3.0.0
This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this. |
Closing as stale. @veljanin - if you'd like to continue, address the above comments and we're happy to reopen! |
Handling the case where conversion of empty categorical series with specified dtype_backend='pyarrow' in convert_dtypes function results in 1) an error (when there are no defined categories) or 2) in null[pyarrow] dtype (when there are some defined categories).