String dtype: overview of breaking behaviour changes #59328
Comments
I am willing to work on this task. |
@harshmarke there is not (yet) something directly actionable in this issue to work on. For now, this issue is just meant to keep track of and discuss changes that we will need to document at some later point. |
During the dev call on 8/28, @jbrockmendel brought up this issue. @mroeschke, @jbrockmendel and I were slightly in favor of keeping the current behavior (allowing |
I am also in favor of allowing |
One data point is that we have been disallowing this for the nullable StringDtype and ArrowDtype(string) for quite a while, and (as far as I am aware / could find) no one raised an issue about this. On the other hand, we explicitly do allow addition between two string operands (like |
I have personally had use cases where I wanted to summarize sequence data, e.g.
import pandas as pd
df = pd.DataFrame(
{
"group": ["A", "B", "A", "A", "B", "C"],
"location": ["0", "1", "3", "4", "2", "5"],
}
)
result = df.assign(location=df["location"] + ", ").groupby("group")["location"].sum().str[:-2]
print(result)
# group
# A 0, 3, 4
# B 1, 2
# C 5
# Name: location, dtype: object |
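For comparison, here is a minimal sketch of the same per-group summary written without `sum()`, by passing `", ".join` to `groupby(...).agg`. It reuses the column names from the example above. It is only one possible spelling and not the dedicated reduction method discussed in this thread.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "group": ["A", "B", "A", "A", "B", "C"],
        "location": ["0", "1", "3", "4", "2", "5"],
    }
)

# Join the strings of each group directly, so no trailing separator
# has to be stripped afterwards.
joined = df.groupby("group")["location"].agg(", ".join)
print(joined.to_dict())
```

This form avoids relying on `+` and `sum()` for strings altogether.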
Given the above feedback, let's add a Sidenote: I think there might be room for a "specialized" string concatenation (reduction) method (reduction variant of |
I also updated the top post to add a section about |
I like the idea of having a dedicated function for this; I think most users don't expect |
Why do you think this? It seems to me |
Python is not consistent in how this is handled. Using the built-in
>>> "a" + "b" + "c"
'abc'
>>> sum(["a", "b", "c"])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unsupported operand type(s) for +: 'int' and 'str'
I agree that |
The reason Python does this is for performance: the generic implementation of |
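A small sketch of how plain Python steers string concatenation away from `sum()`, using only the standard library. Passing an empty string as the start value removes the `int` + `str` mismatch shown above, but `sum()` still refuses and points to `str.join` instead.

```python
parts = ["a", "b", "c"]

# Even with a string start value, the built-in sum() rejects strings.
try:
    sum(parts, "")
except TypeError as exc:
    message = str(exc)

# The spelling that Python recommends for concatenating many strings.
joined = "".join(parts)
print(message)
print(joined)
```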
following #60296 this can now be removed? |
There may be some value in adding the breaking changes to the documentation (instead of tracking them here) so that we can link to them from the 2.3 release notes |
Good point, that indeed now works again (I would personally find that a good change, but it's something we could definitely do through a deprecation cycle, so no need to change it now)
Yes, see the first sentence of the top post, this issue is gathering the changes with the goal of documenting them |
yep. the comment was to update the OP not to deprecate/change anything
so the milestone on this issue should be 2.3 and not 3.0? |
for missing values, We ignore missing values in so we will have the following change
>>> pd.options.future.infer_string = False
>>> pd.Series(["a", "b", None]).all()
True
>>>
>>> all(pd.Series(["a", "b", None]))
False
>>>
>>> pd.Series(["a", "b", None]).all(skipna=False)
False
>>>
>>> pd.options.future.infer_string = True
>>> pd.Series(["a", "b", None]).all()
True
>>>
>>> all(pd.Series(["a", "b", None]))
True
>>>
>>> pd.Series(["a", "b", None]).all(skipna=False)
True
>>>
is this too nuanced to include in the breaking changes docs? or would be included in the |
We could mention it indeed there, because this is not actually related to our |
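A hedged sketch of one plausible explanation for the difference shown above, using plain Python lists only. The assumption (not confirmed in this thread) is that the change comes from the missing value sentinel: iterating an object-dtype Series yields `None`, while the new string dtype yields `NaN`, and the two differ in truthiness.

```python
# Assumption: these lists stand in for what iteration yields under object
# dtype (None) and under the new string dtype (NaN).
with_none = ["a", "b", None]
with_nan = ["a", "b", float("nan")]

none_is_truthy = bool(None)          # None is falsy
nan_is_truthy = bool(float("nan"))   # NaN is a non-zero float, so it is truthy

all_with_none = all(with_none)
all_with_nan = all(with_nan)
print(all_with_none, all_with_nan)
```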
Sorry, that was a bit too optimistic, because also with And also before, |
In context of the new default string dtype in 3.0 (#54792 / PDEP-14), currently enabled with pd.options.future.infer_string = True, there are a bunch of breaking changes that we will have to document.
In preparation for documenting them, I want to use this issue to list all the behaviour changes that we are aware of (or run into) / potentially need to discuss if we actually want those changes.
First, there are a few obvious breaking changes that are also mentioned in the PDEP (and that are the main goals of the change):
str dtype, instead of using object dtype.
ser.dtype == object) assuming object dtype, will break
NaN, and for example no longer None (we still accept None as input, but it will be converted to NaN)
But additionally, there are some other less obvious changes or secondary consequences (or changes we already had a long time with the existing opt-in string dtype but will now be relevant for all).
Starting to list some of them here (and please add comments with other examples if you think of more).
astype(str) preserving missing values (no longer converting NaN to a string "nan")
This is a long standing "bug" (or at least generally agreed undesirable behaviour), as discussed in #25353.
Currently something like pd.Series(["foo", np.nan]).astype(str) would essentially convert every element to a string, including the missing values:
Generally we expect missing values to propagate in astype(). And as a result of making str an alias for the new default string dtype (#59685), this will now follow a different code path and make use of the general StringDtype construction, which does preserve missing values;
Because
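To make the two behaviours concrete without depending on which pandas version is installed, here is a minimal sketch. It uses an element-wise `str()` call as a stand-in for the old `astype(str)` result, and the existing opt-in `"string"` dtype as a stand-in for a conversion that preserves missing values. Both stand-ins are assumptions made for illustration. The result of `astype(str)` itself depends on the pandas version and on the `infer_string` option, so it is not shown.

```python
import numpy as np
import pandas as pd

ser = pd.Series(["foo", np.nan], dtype=object)

# Stand-in for the old behaviour: every element goes through str(),
# so the missing value becomes the literal string "nan".
elementwise = [str(value) for value in ser]

# Stand-in for the new behaviour: converting to a string dtype keeps
# the missing value missing.
preserved = ser.astype("string")

print(elementwise)
print(preserved.isna().tolist())
```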
Mixed dtype operations
Any working code that previously relied on the object dtype allowing mixed types can be affected when the initial data is now inferred as string dtype. Because the string dtype is now strict about only allowing strings, certain workflows will no longer work (unless users explicitly keep using object dtype).
For example, setitem with a non string:
The same happens if you try to fill a column of strings and missing values with a non-string:
Update: the above is kept working with upcasting to object dtype (see #60296)
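A minimal sketch of the explicit object-dtype route for mixed data, which works the same way regardless of the `infer_string` option. The values are made up for illustration.

```python
import pandas as pd

# Request object dtype explicitly so that mixed types remain allowed.
ser = pd.Series(["a", "b", None], dtype=object)

# Setting a non-string value keeps the Series as object dtype.
ser[0] = 1

# Filling the missing value with a non-string also stays in object dtype.
filled = ser.fillna(0)

print(ser.dtype, filled.tolist())
```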
Numeric aggregations
With object dtype strings, we do allow sum and prod in certain cases:
Based on the discussion below, we decided to keep sum() working (#59853 is adding that functionality to string dtype), but prod() is fine to start raising.
Note: due to pyarrow implementation limitation, the sum is limited to 2GB result, see https://github.com/pandas-dev/pandas/pull/59853/files#r1794090618 (given this is about the size of a single Python string, that seems very unlikely to happen)
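As an illustration of the object-dtype behaviour that is being kept for `sum()`, here is a small sketch with made-up values. It only covers the explicit object-dtype case. `prod()` is left out because its outcome depends on the data and on the pandas version.

```python
import pandas as pd

ser = pd.Series(["a", "b", "c"], dtype=object)

# With object dtype, sum() adds the elements with +, which for strings
# means concatenation.
total = ser.sum()
print(total)
```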
For any() / all() (which does work for object dtype, but didn't for the already existing StringDtype), we decided to keep this working for now for the new default string dtype, see #51939, #54591
Invalid unicode input
Users that want to keep the previous behaviour can explicitly specify dtype=object to keep working with object dtype.