-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Views, Copies, and the SettingWithCopyWarning Issue #10954
Comments
@nickeubank you realize that it is simple enough to just change the default to raise on the error
|
@jreback Yes, and I've done that in my own code. But I think the issue of inconsistency of behavior remains, and not all users (especially newbies and non-programmers) will be aware of that / know they should be aware of that. |
@nickeubank I think you are missing a lot of the point here.
simplest and best answer is just to change the default to raise. The warning is pretty good and has very very few (if any) false positives (the thought was that we don't want to show exceptions when it might not be a problem, and that is why it is was a warning to begin with). Writing better docs helps a little, but most people simply don't read them. The reason you need views is to avoid copying everything each time. One of the main purposes of numpy is to essentially share memory when you can, to avoid the copies.
so you want to copy this every time I want to use it? If its small it doesn't matter. But if this is a non-trivial size you will eat up memory amazingly fast. and the point of pandas will be gone. You might as well just read in your data anew each time from a csv file for each operation. |
@jreback I've floated that as Solution 3, and I think that's better than nothing, but I think my concern is that it's going to result in lots of people submitting requests for explanations for why code sometimes fails. Re: performance: my preferred solution is Solution 1, which I don't think eliminates views, it just moves them away from the user. But that's my view, and why I want to solicit input from others! [Edit: Modified to response to some of jreback's additional comments 11:26 pst] |
+1 on setting the default to raise... better clean errors (and easy solutions) than silently wrong analysis results... Also, I think the docs should encourage the usage of |
Thanks for writing this up! I agree with most of the concerns. I wonder if solution 1 (copy on write for views) is technologically feasible. That seems like the best of the alternatives to me, and it's also what R and MATLAB do. It's not always desirable, but using a tool like NumPy directly is already necessary for high performance code. I'm also not a fan of a new indexer, and I think making making SettingWithCopy warning an error would be a mistake, because chained indexing (with indexing appearing on different liens) can sometimes be the clearest way to write code. |
@shoyer well, pandas already DOES copy-on-write. That's exactly what
The setting has no idea that it is in fact part of a chain. So it if you make a copy at this point, you are setting on the copy, if you don't you STILL maybe setting a copy from the previous operation. This is the entire problem. IF pandas had lazy evaluation then this would be a no-brainer, but it is eager and not possible to resolve this. copy-on-write will not fix the issue at all. Always returning a copy would fix this, but is way too expensive to be a soln. The soln is either or both of:
|
@jreback I think we have some misunderstanding about what "copy-on-write" means? I am referring to the array being indexed (e.g., If we had copy-on-write, I do understand the issues with Python syntax and lazy evaluation. I don't think we need to always make copies with indexing -- that would indeed be way too expensive in general. But instead of how we currently mark objects so we can later issue the SettingWithCopy warning, we could simply make a copy of the object being indexed at that point and then proceed with the indexing operation. The only difference is that we would only need to mark objects created with a view rather than all indexing results. |
How do you actually know when to do this? This is the problem. You don't know the future operation. |
Perhaps I'm missing something? I'm pretty sure the only future operations where we would need to make copies are:
There's also the issue of setting with |
I think we might be talking past each other a little bit here – I think I can clarify matters in a couple hours when I finish a set of meetings... Sorry for the delay. |
@jreback and I were discussing this on gitter. One interesting case is supporting chained indexing via a series. This currently works (via views) and doesn't even raise a SettingWithCopy warning:
This is convenient, and it's actually possible to guarantee that it works 100% of the time with views. I expect it's also widely relied upon. So a possible rule is:
|
however cannot guarantee that a single column is actually a view if it's say object dtype. it may be possible but I am not sure. IF I would allow mom consolidation (eg columns map directly to an individual block); then I think this is possible. |
Thanks for wrestling with this @jreback and @shoyer . Just to be clear on what what's going on (and to make sure I follow correctly), do the following code snippets correctly correctly characterize what we're discussing? Code:
Current Behavior
New (suggested) behaviorMy solution 1, and what I think @shoyer has been suggesting with all slices behaving "as-if" they are copies:
However, noting that in the "new" behavior, slice 1 may have generated a view behind the scenes. Chained-Slicing on a Single LineUnder this new regime, of course, we would still see a failure (but now always see a failure) from:
Since this would result in the creation of a new I'm ok if not -- as long as it always fails, is any easy idiom to tell users to avoid, and fails due to a simple principle (slicing always returns something that behaves like a copy, which means this fails for the same reason that
fails. "as-if" CopiesAnd also just to be clear, @jreback , you asked previously if I was suggesting that
would always generate a copy. Hopefully this is a little clear now, but my suggestion would be "No", |
In my experience, my students and I often run into this issue and it is one of the few (very) rough edges in pandas. In most usage cases of pandas it "just does the right thing", but with the issue being raised here that is not the case. If I understand the situation, I think the most problematic part is that the behavior depends on state that is hidden from the user. I think that hidden state needs to be removed or shown to the user before they make these API calls. |
@nickeubank so I implemented the above here:
Basically the setting on copy machinery already tracked this, so just a matter of actually doing something. Chained indexing should I could actually make work, but might be some more hoops / complexity, so an exception might just be better (as I am doing now). |
@jreback Thanks so much! I really think this is a great improvement! |
@jreback Using garbage collection to check for chained indexing is a nice trick! It does seem fragile, though. In particular, I wonder if there are strange cases (e.g., unit tests?) that could run into this inadvertently... |
if u have a better trick that can differentiate these cases - all ears |
@jreback The alternative would be to not issue a warning or error for chained indexing at all (similar to NumPy), now that it is entirely predictable whether DataFrame indexing returns a copy or a view. |
I'm agnostic. A warning might be nice, but the behavior no longer feels pathological or unexpected, so I'm less worried. @jreback @shoyer You two had a conversation about still using views when someone slices a full column -- I'm open to that (since it will be consistent), but I would prefer we didn't. If even a slice of a column behaves "as-if" it were a copy, then users will never be required to think about or even understand views. I think that has the potentially to really improve the accessibility of |
@nickeubank The reason why I think we should use views when selecting a single column is that we encourage users to think of DataFrames as "a dict of Series objects". Python always uses views for dictionary elements, so it's surprising if modifying one of these series does not change the original dataframe. I agree that copies are more intuitive (especially to R and MATLAB users), but views are an essential part of how Python works as a programming language. |
@shoyer I can respect that if that's how we go. Just want to put out that alternative for discussion. |
Just to add my two cents: the warning or exception also do not always really hold true. I have plenty of situations where this something like this generates a warning when starting with a hypothetical DataFrame org_df (just an example piece of code):
but it still delivers a correct result even with the warning (we actually implemented tests to continue checking for this). From memory management perspective we also usually like the smaller views... |
@CarstVaartjes Yes, that's a pretty common situation and one of the big reasons why I dislike SettingWithCopy warning. With the proposal to use copy-on-write, this would now work exactly the same, except without warnings. Note that indexing like |
@CarstVaartjes this was you are expecting?
|
@jreback yes exactly! |
@shoyer @jreback If we're agreed on the goal of having slices behave "as-if" copies, then I think there's a second situation we need to cover. @jreback commit ensures changes to a slice won't affect the original dataset, but we also need to ensure that changes to the original dataset won't propagate forward to slices. In other words, we need to ensure the following:
And not:
(Note that this behavior is actually currently inconsistent for the same reasons I don't know a lot about internals, but in this example does |
@nickeubank your last is not possible, we don't refcount things. Just getting the original copy-on-write to work is quite non-trivial. |
@jreback This is really pretty analogous to the |
I agree handling this case is important for copy-on-write to work as expected, but indeed I'm not sure how we could make that work without doing our own reference counting. This may be why copy-on-write is often a built in language feature. Hmm... On Wed, Sep 2, 2015 at 9:31 AM, Nick Eubank [email protected]
|
Perhaps this is a dumb question, but is there a reason that frames can't just keep a list of their "offspring" as an attribute and have subsetting functions ( |
Yes, this sort of thing could be done with weak references, though that On Wed, Sep 2, 2015 at 9:39 AM, Nick Eubank [email protected]
|
@shoyer well, I am already using weakrefs to track this at the moment, BUT what @nickeubank is proposing is significantly more complicated. |
Really cool! I also agree with most of the concerns raised in the original issue. This is really one of the difficult parts of pandas (that I sometimes even don't try to explain, together with the details of the |
I believe these are exactly the cases where users are currently setting SettingWithCopy warning. So in some sense, we are already warning about it... |
@shoyer True, but in such a case, this warning was actually a kind of false positive (as the original frame was adapted since it was a view and not a copy (like the warning says), and the warnings tries to warn for when that does not happen, no?). So people could have learnt to ignore it ... |
@shoyer @jorisvandenbossche Maybe I'm wrong, but I don't think that we currently warn when someone modified the ORIGINAL (parent) dataframe about the fact that effects on slices (which may or may not be views) are unpredictable, only when the changes are made to the SLICE (child). i.e. we currently warn against inconsistent backwards propagation of changes, but not inconsistent forward propagation. |
Hi everyone, has there been any progress on this issue? I feel like it's enough of a pain for me to start reading the source code and try to find a fix, but this Issue thread is more than 4 years old now, so I don't know if this discussion moved somewhere else or if the situation is gonna remain as it is. |
Sadly, I failed in my effort to fix it. I just couldn't reliably track all the relevant refs when objects were passed to constructors. My sense is this is gonna live on until the massive refactoring that's been on the board for years to move to arrow based data structures. |
Is this at all relevant to the accepted answer here? It seems the answer is old and something has changed. But seeing your comment above it appears it has not changed? |
@jorisvandenbossche is this closeable now that #46958 is merged? Copy-on-write is the planned default behavior in the future and will make these concerns moot, correct? |
It will be! what's the release timeline for this? Not seeing in linked issue... |
I believe copy-on-write is intended to become the default in pandas 2.0 but I could be mistaken there. I am not aware of a target date for the 2.0 release but I think it will be the next non-patch release. |
Indeed, with the changes in #46958 / #48998 (and formally accepted as PDEP-7: https://pandas.pydata.org/pdeps/0007-copy-on-write.html), the SettingWithCopyWarning will finally disappear. This is set to become the default (and only) behaviour in pandas 3.0, but you can already enable this future behaviour right now with:
|
As
pandas
approaches its 1.0 release, I would like to raise a concern about one aspect of thepandas
architecture that I think is a threat to its widespread adoption: howpandas
works with copies and views when setting values (what I will refer to here as theSettingWithCopyWarning
issue).The summary of my concern is the following:
Taking each of these in turn:
(1)
SettingWithCopyWarning
is a threat to data integrityThe fact that assignment operations do different things depending on whether the target is a view or a copy has already been recognized as a threat to the predictability of
pandas
. Indeed, the reason a warning was added is because users were consistently asking whypandas
was doing un-anticipated things whenSettingWithCopyWarning
came into play.(2) It is unreasonable to expect the average user to avoid a
SettingWithCopyWarning
issue, as doing so requires keeping track of the plethora of factors that determine what generates a copy and what generates a view.Figuring out when a function will return a copy and when it will return a view in
pandas
is not simple. Indeed, thepandas
documentation doesn't even try to explain when each will occur (link http://pandas-docs.github.io/pandas-docs-travis/indexing.html?highlight=views#indexing-view-versus-copy):(2a) Views made sense in
numpy
, but not inpandas
Views entered the pandas lexicon via
numpy
. But the reason they were so useful innumpy
is that they were predictable becausenumpy
arrays are always single-typed. Inpandas
, no such consistent, predictable behavior exists.(2b) Chain-indexing is a much more subtle problem than suggested in the
pandas
docsAt first glance, the
pandas
docs suggest that theSettingWithCopyWarning
is easily avoided by avoiding chain-indexing or using.loc
. This, I fear, is misleading for two reasos. First, the canonical example of chain indexing in the docs (dfmi['one']['second'] = value
) seems to suggest that one can avoid chain indexing by just not falling into the trap of this kind of double slicing. The problem, however, is that these slices need not appear near one another. I know I've had trouble with code of the form:Moreover, using
.loc
only solves this problem if one notices the chained indexing and attempts to fix it in one place. Just consistently using.loc[]
(for example, in both the first and second problematic slicings above) would not solve the problem.(3) Given (1) and (2), data integrity in
pandas
relies on users noticing a non-exception warning in the flow of their output.This seems really problematic. If a users is printing values as they go along (which CS developers may not do, but interactive casual users often do to monitor the progress of their code), these warnings are easy to miss. And that seems very dangerous.
(4) Even aside from the threat to data integrity, this behavior is unpythonic, and likely to frustrate alienate lots of potential users of
pandas
I suspect I come to
pandas
from a different perspective than many developers. I am an economist and political scientist who has gotten deeper and deeper into computer science over the past several years for instrumental purposes. As a result, I think I have a pretty good sense of how applied users approach something likepandas
, and I can just see this peculiarity ofpandas
driving this class of users batty. I've taken a year of computer science course work and am one of the most technically trained social scientists I know, and it drives me batty.It's also unpythonic -- the behavior of basic operators (like
=
) should not depend on the type of columns in aDataFrame
. Python 3 changed the behavior of the/
operator because it was felt the behavior of/
should not do depend on whether you were working withfloat
s orint
s. Since whether functions return a view or copy is in large part (but not exclusively) a function of whether aDataFrame
is single or multi-typed (which occurs when some columns arefloats
and some areints
), we have the same problem -- the operation of a basic operation (=
) depends on data types.In other words, if one of the aims of
pandas
is to essentially surplantR
among applied data scientists, then I think this is a major threat to achieving that goal.(5) I think solutions can be found that would have only limited effects on performance for the majority of users
pandas
uses views because they're so damn fast, so I understand the reluctance to drop them, but I think there are ways to minimize the performance hit. Obviously more talented developers with a better understanding of the innards ofpandas
may have better suggestions, but hopefully this can get the ball rolling.pandas
is a brilliant tool, and a huge improvement on everything else out there. I am eager to see it becomes the standard not only among python users, but among data analysts more broadly. Hopefully, by addressing this issue, we can help make this happen.With that in mind, I would like to suggest the need for two things:
pandas
internals to take this on alone, and this is likely too big an undertaking for any one individual anyone, so a team is likely to be necessary.The text was updated successfully, but these errors were encountered: