Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

String dtype (2.3.x): avoid downcasting object to string in fillna/where/interpolate #60183

Conversation

jorisvandenbossche
Copy link
Member

Noticed while backporting some test updates in #60180.

For 2.x, we have a deprecation warning about "downcasting" in methods like fillna and interpolate. For example, when starting from object dtype

>>> ser = pd.Series([1, 2, None], dtype=object)
>>> ser.fillna(3)
FutureWarning: Downcasting object dtype arrays on .fillna, .ffill, .bfill is deprecated and will change
in a future version. Call result.infer_objects(copy=False) instead. To opt-in to the future behavior,
set `pd.set_option('future.no_silent_downcasting', True)`
0    1
1    2
2    3
dtype: int64

However, because of how this is implemented, when you enable the future string dtype and still have an object dtype column, this will be inferred as string dtype and hence again trigger that warning:

>>> pd.options.future.infer_string = True
>>> ser = pd.Series(["a", "b", None], dtype=object)
>>> ser.fillna("c")
FutureWarning: Downcasting object dtype arrays on .fillna, .ffill, .bfill is deprecated and will change
in a future version. Call result.infer_objects(copy=False) instead. To opt-in to the future behavior,
set `pd.set_option('future.no_silent_downcasting', True)`
0    a
1    b
2    c
dtype: str

So this triggers a warning for something that will no longer happen in 3.0, but when testing with pd.options.future.infer_string = True, we are essentially already testing behaviour for 3.0, so raising this warning results in a warning the user won't actually see in pandas 3.0.

So I was thinking that we could explicitly not dowcast to string dtype in those methods, since for the new dtype we don't need to keep backwards compatibility with old downcasting behaviour.

This PR is directly targetting the 2.3.x branch, given that all this downcasting behaviour is already removed on main.

xref #54792

@jorisvandenbossche jorisvandenbossche marked this pull request as ready for review November 5, 2024 13:00
@@ -2741,7 +2742,11 @@ def maybe_convert_objects(ndarray[object] objects,
seen.object_ = True

elif seen.str_:
if using_string_dtype() and is_string_array(objects, skipna=True):
if (
convert_string
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In what cases would we have convert_string be False and using_string_dtype be True? Not saying this implementation is wrong, just hard to grok at first glance

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This combination of convert_string=False with using_string_dtype() being True is exactly what I am using inside the replace/fillna implementation, to pass down to maybe_convert_objects that I don't want to cast object to string if I know that we started with object dtype, and that this dtype should be preserved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But I fully agree it is all a bit confusing, and the implementation is not very clear (I mostly got to the current state just by getting all our tests to pass, but it is difficult to assess whether it is complete. However, given that this is about "not raising a useless warning", I think it is not the worst thing if we would have missed a case where we would still raise the warning)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah OK thanks that is helpful. As a nit, maybe preserve_object_dtype would be a better keyword, but not something I think is a blocker. This all needs a good cleanup past once we get through the 2.3 push anyway

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a nit, maybe preserve_object_dtype would be a better keyword

It's only specifically not converting object to string here (all other inferred types are still used), so preserve_object_dtype would be a bit too generic I think

Copy link
Contributor

github-actions bot commented Dec 6, 2024

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Dec 6, 2024
@jorisvandenbossche
Copy link
Member Author

Going to merge this since it blocks several other backport PRs

@jorisvandenbossche jorisvandenbossche merged commit 3362822 into pandas-dev:2.3.x Dec 17, 2024
59 of 62 checks passed
@jorisvandenbossche jorisvandenbossche deleted the string-dtype-2.3.x-downcast-string branch December 17, 2024 07:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Downcasting Strings String extension data type and string data
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants