CoW - don't try to update underlying values of Series/column inplace for inplace operator #55745

jorisvandenbossche · 2023-10-28T14:48:24Z

Related to

At the moment, an inplace operator, like ser += 1 / df["col"] += 1, is actually first compute out of place, and then if the dtypes match, we update the original series' values inplace with the result's values. Essentially we do the following under the hood:

# ser += 1
result = ser + 1
ser._values[:] = result._values

Except that when CoW is enabled, we first trigger a copy of the underlying values if needed (if the series has references). This happens here (mgr.setitem_inplace() will copy the block if needed):

pandas/pandas/core/generic.py

Lines 12390 to 12403 in 984d755

    
               def _inplace_method(self, other, op) -> Self: 
        
                   """ 
        
                   Wrap arithmetic method to operate inplace. 
        
                   """ 
        
                   result = op(self, other) 
        
                   if self.ndim == 1 and result._indexed_same(self) and result.dtype == self.dtype: 
        
                       # GH#36498 this inplace op can _actually_ be inplace. 
        
                       # Item "ArrayManager" of "Union[ArrayManager, SingleArrayManager, 
        
                       # BlockManager, SingleBlockManager]" has no attribute "setitem_inplace" 
        
                       self._mgr.setitem_inplace(  # type: ignore[union-attr] 
        
                           slice(None), result._values 
        
                       ) 
        
                       return self

This current implementation has two downsides:

The setitem_inplace can trigger a copy with CoW, and this copy is actually unnecessary since we can also simply replace the values instead of first copying and then updating this copy inplace (this makes df["a"] += 1 currently less efficient than df["a"] = df["a"] + 1, because the latter does not trigger an additional copy).
We sometimes update the values inplace, and sometimes not. Without CoW, at least we know exactly when (only for series and when the result has the exact same dtype), but with CoW it depends on whether the series has a reference or not.
The only way that this matters to the user is when they rely on having the underlying values and expect them to be updated (eg ser = pd.Series(arr, copy=False); ser += 1 and expecting arr was also updated). We don't want user to rely on this (since with CoW it can be unreliable anyway), and so the more consistent behaviour would be to simply never update the original array inplace.

From a user perspective, it doesn't matter whether we update the underlying values inplace (ser._values[:] = result._values) or swap them out (like ser._values = result._values). And there is no efficiency reason to do it inplace, because we already calculated the result in a separate variable anyway, and we are replacing the all data anyway.

So I think we can simply decide that inplace operations on a full series or column only operate inplace on the pandas object, and will never update the underlying values.

(note that this doesn't cover inplace operators with subsets, like df.loc[0, 0] += 1 or ser[0] += 1)

xref #48998

…for inplace operator

phofl

I am ok with this change, not doing it inplace is more efficient anyway if we are operating on the whole object.

I don't like the inconsistency with subsets though, thoughts?

jorisvandenbossche · 2023-10-31T14:17:49Z

I don't like the inconsistency with subsets though, thoughts?

I think that if you see the inplace operator as kind of syntactic sugar:

# df["col"] += 1
df["col"] = df["col"] + 1
# df.iloc[0, 0] += 1
df.iloc[0, 0] = df.iloc[0, 0] + 1

then it makes sense that the one is just replacing the column while the other one is inplace.

That analogy doesn't fully hold for ser += 1, though (although maybe ser = ser + 1 comparison still makes sense, that would also replace your full values)

jbrockmendel · 2023-10-31T23:03:42Z

pandas/tests/copy_view/test_methods.py

-    assert np.shares_memory(get_array(ser), data)
-    tm.assert_numpy_array_equal(data, get_array(ser))
+    if using_copy_on_write:
+        # changed to NOT update inplace because there is no benefit (actual


GH ref pointing back to this PR/discussion

jbrockmendel · 2023-10-31T23:04:54Z

no strong opinion here. seems like inplace arithmetic is something we want to discourage going forward since it'll have surprising-unless-you-carefully-read-the-docs behavior?

jorisvandenbossche · 2023-11-01T02:21:58Z

@jbrockmendel what is the "suprising" behaviour in your idea? The fact that it is not "inplace" in the sense like a numpy array does it inplace in the same memory place?
Because that is something that has never been the case for pandas, this PR is not changing that (also before this PR, we were doing the actual computation "out of place", and then just writing the resulting values to the original data). And that is also something that you generally don't care about as a user (certainly with CoW, you should never rely on how the underlying values are treated, as they can always get copied when you do some mutation. But even before CoW there was no guarantee about it, eg when the dtype changed)

Also for a dataframe (df["col"] += 1) is still "inplace" if you consider it on the dataframe level: it does update the dataframe inplace.

phofl · 2023-11-17T20:17:08Z

thx @jorisvandenbossche

CoW - don't try to update underlying values of Series/column inplace …

34eee1e

…for inplace operator

jorisvandenbossche added the Copy / view semantics label Oct 28, 2023

jorisvandenbossche requested review from phofl and jbrockmendel October 28, 2023 14:49

phofl reviewed Oct 30, 2023

View reviewed changes

jbrockmendel reviewed Oct 31, 2023

View reviewed changes

jorisvandenbossche added 2 commits November 17, 2023 11:29

Merge remote-tracking branch 'upstream/main' into cow-column-inplace-op

4e499cb

add gh ref

bfb0ad0

jorisvandenbossche marked this pull request as ready for review November 17, 2023 10:30

jorisvandenbossche added this to the 2.2 milestone Nov 17, 2023

phofl approved these changes Nov 17, 2023

View reviewed changes

phofl merged commit ef2c61a into pandas-dev:main Nov 17, 2023
40 checks passed

jorisvandenbossche deleted the cow-column-inplace-op branch November 27, 2023 11:37

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CoW - don't try to update underlying values of Series/column inplace for inplace operator #55745

CoW - don't try to update underlying values of Series/column inplace for inplace operator #55745

jorisvandenbossche commented Oct 28, 2023 •

edited

Loading

phofl left a comment

jorisvandenbossche commented Oct 31, 2023

jbrockmendel Oct 31, 2023

jbrockmendel commented Oct 31, 2023

jorisvandenbossche commented Nov 1, 2023

phofl commented Nov 17, 2023

	def _inplace_method(self, other, op) -> Self:
	"""
	Wrap arithmetic method to operate inplace.
	"""
	result = op(self, other)

	if self.ndim == 1 and result._indexed_same(self) and result.dtype == self.dtype:
	# GH#36498 this inplace op can _actually_ be inplace.
	# Item "ArrayManager" of "Union[ArrayManager, SingleArrayManager,
	# BlockManager, SingleBlockManager]" has no attribute "setitem_inplace"
	self._mgr.setitem_inplace( # type: ignore[union-attr]
	slice(None), result._values
	)
	return self

CoW - don't try to update underlying values of Series/column inplace for inplace operator #55745

CoW - don't try to update underlying values of Series/column inplace for inplace operator #55745

Conversation

jorisvandenbossche commented Oct 28, 2023 • edited Loading

phofl left a comment

Choose a reason for hiding this comment

jorisvandenbossche commented Oct 31, 2023

jbrockmendel Oct 31, 2023

Choose a reason for hiding this comment

jbrockmendel commented Oct 31, 2023

jorisvandenbossche commented Nov 1, 2023

phofl commented Nov 17, 2023

jorisvandenbossche commented Oct 28, 2023 •

edited

Loading