CoW - don't try to update underlying values of Series/column inplace for inplace operator #55745
Conversation
I am OK with this change; not doing it inplace is more efficient anyway if we are operating on the whole object.

I don't like the inconsistency with subsets though, thoughts?

I think that if you see the inplace operator as a kind of syntactic sugar:

```python
# df["col"] += 1
df["col"] = df["col"] + 1

# df.iloc[0, 0] += 1
df.iloc[0, 0] = df.iloc[0, 0] + 1
```

then it makes sense that the one is just replacing the column while the other one is inplace. That analogy doesn't fully hold for …
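The syntactic-sugar view above can be checked directly: for a full column, the inplace operator and the explicit out-of-place form produce the same result (a minimal sketch; the frame and column name are illustrative):

```python
import pandas as pd

df1 = pd.DataFrame({"col": [1, 2, 3]})
df2 = df1.copy()

# Inplace operator on a full column...
df1["col"] += 1
# ...produces the same values as the explicit out-of-place form.
df2["col"] = df2["col"] + 1

pd.testing.assert_frame_equal(df1, df2)
print(df1["col"].tolist())
```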
```python
assert np.shares_memory(get_array(ser), data)
tm.assert_numpy_array_equal(data, get_array(ser))
if using_copy_on_write:
    # changed to NOT update inplace because there is no benefit (actual
```
GH ref pointing back to this PR/discussion
no strong opinion here. seems like inplace arithmetic is something we want to discourage going forward, since it'll have surprising-unless-you-carefully-read-the-docs behavior?

@jbrockmendel what is the "surprising" behaviour in your idea? The fact that it is not "inplace" in the sense that a numpy array does it inplace, in the same memory location? Also for a dataframe (…
Related to
At the moment, an inplace operator like `ser += 1` / `df["col"] += 1` is actually first computed out of place, and then, if the dtypes match, we update the original series' values inplace with the result's values; that is essentially what happens under the hood. Except that when CoW is enabled, we first trigger a copy of the underlying values if needed (if the series has references). This happens here (`mgr.setitem_inplace()` will copy the block if needed):

pandas/pandas/core/generic.py, lines 12390 to 12403 in 984d755
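In simplified form, the current logic can be sketched as follows. This is a standalone toy class, not the actual pandas internals; `TinySeries` and its attributes are made up for illustration:

```python
import numpy as np

class TinySeries:
    """Toy stand-in for a pandas Series, for illustration only."""

    def __init__(self, values):
        self._values = np.asarray(values)

    def __iadd__(self, other):
        # Step 1: compute the result out of place.
        result = self._values + other
        if result.dtype == self._values.dtype:
            # Step 2: dtypes match, so write the result back into the
            # original buffer (this is the step where, under CoW, pandas
            # may first have to copy the block).
            self._values[:] = result
        else:
            # Dtypes changed (e.g. int + 0.5): swap in the new array.
            self._values = result
        return self

arr = np.array([1, 2, 3])
ser = TinySeries(arr)
ser += 1
# The backing array was updated inplace because the dtype did not change.
print(arr.tolist())
```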
This current implementation has two downsides:

1. `setitem_inplace` can trigger a copy with CoW, and this copy is actually unnecessary, since we can also simply replace the values instead of first copying and then updating that copy inplace (this makes `df["a"] += 1` currently less efficient than `df["a"] = df["a"] + 1`, because the latter does not trigger an additional copy).
2. The only way this matters to the user is when they rely on having the underlying values and expect them to be updated (e.g. `ser = pd.Series(arr, copy=False); ser += 1` and expecting `arr` to be updated as well). We don't want users to rely on this (since with CoW it can be unreliable anyway), so the more consistent behaviour would be to simply never update the original array inplace.

From a user perspective, it doesn't matter whether we update the underlying values inplace (`ser._values[:] = result._values`) or swap them out (`ser._values = result._values`). And there is no efficiency reason to do it inplace, because we already calculated the result in a separate variable anyway, and we are replacing all of the data anyway.

So I think we can simply decide that inplace operations on a full series or column only operate inplace on the pandas object, and will never update the underlying values.
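The two variants differ only in whether the original buffer is reused, which plain NumPy makes easy to see (a minimal illustration of the `_values[:] = ...` vs `_values = ...` distinction):

```python
import numpy as np

# Variant 1: update the existing buffer (like ser._values[:] = result._values).
original = np.array([1, 2, 3])
result = original + 1          # computed out of place either way
buf = original
buf[:] = result                # writes into the original buffer
print(np.shares_memory(buf, original))   # same buffer

# Variant 2: swap out the array (like ser._values = result._values).
original2 = np.array([1, 2, 3])
result2 = original2 + 1
buf2 = result2                 # rebinding; original2's buffer is untouched
print(np.shares_memory(buf2, original2))

# Either way the visible values are identical.
print(buf.tolist(), buf2.tolist())
```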
(Note that this doesn't cover inplace operators with subsets, like `df.loc[0, 0] += 1` or `ser[0] += 1`.)

xref #48998