Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CoW - don't try to update underlying values of Series/column inplace for inplace operator #55745

Merged
merged 3 commits into from
Nov 17, 2023

Conversation

jorisvandenbossche
Copy link
Member

@jorisvandenbossche jorisvandenbossche commented Oct 28, 2023

Related to

At the moment, an inplace operator, like ser += 1 / df["col"] += 1, is actually first compute out of place, and then if the dtypes match, we update the original series' values inplace with the result's values. Essentially we do the following under the hood:

# ser += 1
result = ser + 1
ser._values[:] = result._values

Except that when CoW is enabled, we first trigger a copy of the underlying values if needed (if the series has references). This happens here (mgr.setitem_inplace() will copy the block if needed):

pandas/pandas/core/generic.py

Lines 12390 to 12403 in 984d755

def _inplace_method(self, other, op) -> Self:
"""
Wrap arithmetic method to operate inplace.
"""
result = op(self, other)
if self.ndim == 1 and result._indexed_same(self) and result.dtype == self.dtype:
# GH#36498 this inplace op can _actually_ be inplace.
# Item "ArrayManager" of "Union[ArrayManager, SingleArrayManager,
# BlockManager, SingleBlockManager]" has no attribute "setitem_inplace"
self._mgr.setitem_inplace( # type: ignore[union-attr]
slice(None), result._values
)
return self

This current implementation has two downsides:

  • The setitem_inplace can trigger a copy with CoW, and this copy is actually unnecessary since we can also simply replace the values instead of first copying and then updating this copy inplace (this makes df["a"] += 1 currently less efficient than df["a"] = df["a"] + 1, because the latter does not trigger an additional copy).
  • We sometimes update the values inplace, and sometimes not. Without CoW, at least we know exactly when (only for series and when the result has the exact same dtype), but with CoW it depends on whether the series has a reference or not.
    The only way that this matters to the user is when they rely on having the underlying values and expect them to be updated (eg ser = pd.Series(arr, copy=False); ser += 1 and expecting arr was also updated). We don't want user to rely on this (since with CoW it can be unreliable anyway), and so the more consistent behaviour would be to simply never update the original array inplace.

From a user perspective, it doesn't matter whether we update the underlying values inplace (ser._values[:] = result._values) or swap them out (like ser._values = result._values). And there is no efficiency reason to do it inplace, because we already calculated the result in a separate variable anyway, and we are replacing the all data anyway.

So I think we can simply decide that inplace operations on a full series or column only operate inplace on the pandas object, and will never update the underlying values.

(note that this doesn't cover inplace operators with subsets, like df.loc[0, 0] += 1 or ser[0] += 1)

xref #48998

Copy link
Member

@phofl phofl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am ok with this change, not doing it inplace is more efficient anyway if we are operating on the whole object.

I don't like the inconsistency with subsets though, thoughts?

@jorisvandenbossche
Copy link
Member Author

I don't like the inconsistency with subsets though, thoughts?

I think that if you see the inplace operator as kind of syntactic sugar:

# df["col"] += 1
df["col"] = df["col"] + 1
# df.iloc[0, 0] += 1
df.iloc[0, 0] = df.iloc[0, 0] + 1

then it makes sense that the one is just replacing the column while the other one is inplace.

That analogy doesn't fully hold for ser += 1, though (although maybe ser = ser + 1 comparison still makes sense, that would also replace your full values)

assert np.shares_memory(get_array(ser), data)
tm.assert_numpy_array_equal(data, get_array(ser))
if using_copy_on_write:
# changed to NOT update inplace because there is no benefit (actual
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GH ref pointing back to this PR/discussion

@jbrockmendel
Copy link
Member

no strong opinion here. seems like inplace arithmetic is something we want to discourage going forward since it'll have surprising-unless-you-carefully-read-the-docs behavior?

@jorisvandenbossche
Copy link
Member Author

@jbrockmendel what is the "suprising" behaviour in your idea? The fact that it is not "inplace" in the sense like a numpy array does it inplace in the same memory place?
Because that is something that has never been the case for pandas, this PR is not changing that (also before this PR, we were doing the actual computation "out of place", and then just writing the resulting values to the original data). And that is also something that you generally don't care about as a user (certainly with CoW, you should never rely on how the underlying values are treated, as they can always get copied when you do some mutation. But even before CoW there was no guarantee about it, eg when the dtype changed)

Also for a dataframe (df["col"] += 1) is still "inplace" if you consider it on the dataframe level: it does update the dataframe inplace.

@jorisvandenbossche jorisvandenbossche marked this pull request as ready for review November 17, 2023 10:30
@jorisvandenbossche jorisvandenbossche added this to the 2.2 milestone Nov 17, 2023
@phofl phofl merged commit ef2c61a into pandas-dev:main Nov 17, 2023
40 checks passed
@phofl
Copy link
Member

phofl commented Nov 17, 2023

thx @jorisvandenbossche

@jorisvandenbossche jorisvandenbossche deleted the cow-column-inplace-op branch November 27, 2023 11:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants