DEPR: make_block #56422
Conversation
@@ -431,6 +431,7 @@ Other Deprecations
^^^^^^^^^^^^^^^^^^
- Changed :meth:`Timedelta.resolution_string` to return ``h``, ``min``, ``s``, ``ms``, ``us``, and ``ns`` instead of ``H``, ``T``, ``S``, ``L``, ``U``, and ``N``, for compatibility with respective deprecations in frequency aliases (:issue:`52536`)
- Deprecated :func:`pandas.api.types.is_interval` and :func:`pandas.api.types.is_period`, use ``isinstance(obj, pd.Interval)`` and ``isinstance(obj, pd.Period)`` instead (:issue:`55264`)
- Deprecated :func:`pd.core.internals.api.make_block`, use public APIs instead (:issue:`40226`)
Curious, is there a public API to create a block?
No. The point is to wean downstream packages off of our internals
Thanks @jbrockmendel
Sorry, I think this needs to be reverted. As I mentioned in the referenced issue #40226, pyarrow is using this. And as Brock mentioned above, there is no alternative for this. We can't deprecate this if we don't provide an alternative (and I think
Eventually we need to wean pyarrow off of both blocks and managers. Let's see if we can find a viable (i.e. perf hit not-too-big) alternative using public APIs. An example to use as a benchmark:
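The benchmark snippet itself did not survive this transcript. As a hedged reconstruction of the kind of comparison being discussed, here is the public `pd.concat` route for building a frame from preformed 2D arrays, one per dtype (array shapes and column names are illustrative assumptions, not the original code):

```python
import timeit

import numpy as np
import pandas as pd

# Two preformed 2D arrays, one per dtype, shaped (n_block_cols, n_rows)
# roughly the way an Arrow->pandas conversion would hold them.
ints = np.arange(8).reshape(2, 4)
floats = np.zeros((2, 4))

def via_concat():
    # Public-API route: one single-dtype frame per array,
    # concatenated along the columns.
    return pd.concat(
        [
            pd.DataFrame(ints.T, columns=["a", "b"]),
            pd.DataFrame(floats.T, columns=["c", "d"]),
        ],
        axis=1,
    )

df = via_concat()
print(df.shape)
print(timeit.timeit(via_concat, number=100))
```

The overhead of this route is per-call and independent of the data size, which is why the discussion below treats it as a constant (O(1)) cost.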
Notes:
The pd.concat version looks entirely viable to me. Yes, it is 6x slower, but since it is O(1) we're just talking about 110µs, which is just not that big of a deal. A compromise would be to implement pd.DataFrame._from_2d_arrays_just_for_pyarrow, which at least would not touch internals.
How do these scale in terms of columns, and what if we have multiple dtypes?
Once CoW is enabled the reindex is zero-copy
No, I was already passing
Good to know (cc @phofl sounds like something is wrong with CoW+take). Until that is fixed, DataFrame._getitem_nocopy exists for pretty much this purpose.
Wasn't the original reason for caring about fragmentation that we used to do silent consolidation, which we no longer do?
We no longer do silent consolidation throughout operations, but our constructors generally still give you consolidated results (and so does the from-Arrow constructor right now), because AFAIK that is still the optimal layout in the general case. And there is also not much point in first creating blocks if you then slice them up into many pieces.
I have a hard time taking seriously a concern about fragmented dataframes being inefficient. You don't like any of the alternatives or compromises I've suggested. Can you suggest some?
This reverts commit b0ffccd.
For me the compromise is that we only expose a small core subset of the internals for advanced users (a single entry point to create blocks, and to create a manager from those blocks), and consider all other internals as fully private. So for example, we limit access to the actual Block classes and their class constructors (i.e. I merged your PR #55139 for that, and I also made corresponding updates to pyarrow to follow that), but we keep a single factory function to create a block (i.e. revert this PR -> #56481). That is also what you described in the top post of the referenced issue #40226 (and
This confuses me a little, and I have a couple of questions here. I am assuming that your columns within a specific dtype are already in the correct order? E.g. the first int column comes before the second, then the third, and so on, and the next set of columns, of dtype float, is again ordered correctly within the float dtype. If so, this should already be zero-copy with CoW, because your block placements should always be basically range(0, nr_of_columns), which CoW will convert to a slice. Example:
The int columns are all ordered correctly, same for the float columns; we just switch orders between the dtypes, which we can do zero-copy.
Non-CoW will still copy.
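The example referred to above appears to have been lost from this transcript. A sketch of the pattern being described, with illustrative column names (on pandas 2.x, CoW is opt-in via `pd.set_option("mode.copy_on_write", True)`; from 3.0 it is always on):

```python
import numpy as np
import pandas as pd

# Interleaved target order across two dtypes (illustrative names).
target_cols = ["i0", "f0", "i1", "f1"]

ints = pd.DataFrame(np.arange(10).reshape(5, 2), columns=["i0", "i1"])
floats = pd.DataFrame(np.ones((5, 2)), columns=["f0", "f1"])

# Within each dtype the columns are already ordered, so after the
# concat each block's placement is a range, which CoW stores as a
# slice. The reindex then only reorders column labels: under CoW no
# data needs to be copied.
df = pd.concat([ints, floats], axis=1).reindex(columns=target_cols)
print(list(df.columns))  # -> ['i0', 'f0', 'i1', 'f1']
```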
OK, I see what happened! I tested this before (with a slightly updated example from Brock's post to interleave the dtypes) and timed it with the default mode (around 10ms instead of 20µs); based on the profile this was because of the Block.take_nd in the reindex step. Then I enabled CoW and it also gave around 10ms, but I didn't check the profile (I just assumed it had the same cause, also because I was already passing But redoing that example now, the reason it was also slower with CoW is because I was missing a After correcting that, I indeed get a zero-copy construction with the correct column order. The overhead in this case is around 400µs vs 20µs for me. Adapted example:
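The adapted snippet isn't preserved here either. As a hedged sketch, one way to check whether such a construction is zero-copy is to test whether the resulting columns still alias the input buffers with `np.shares_memory` (the shapes and the `copy=False` arguments are assumptions; whether the buffers end up shared depends on the pandas version and on whether CoW is enabled):

```python
import numpy as np
import pandas as pd

int_arr = np.arange(10, dtype="int64").reshape(2, 5)
float_arr = np.ones((2, 5))

# copy=False asks the constructors not to copy the input buffers.
ints = pd.DataFrame(int_arr.T, columns=["i0", "i1"], copy=False)
floats = pd.DataFrame(float_arr.T, columns=["f0", "f1"], copy=False)

df = pd.concat([ints, floats], axis=1).reindex(
    columns=["i0", "f0", "i1", "f1"]
)

# If the whole chain was zero-copy, the int columns still alias
# int_arr; prints False when some step made a defensive copy.
print(np.shares_memory(df["i0"].to_numpy(), int_arr))
```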
I haven't had the time this week to further test it. But in any case, these are relevant alternatives for after 3.0, so I am planning to merge #56481 for now.
The alternatives being relevant for after 3.0 does not change the fact that the deprecation needs to happen before 3.0. Please do not self-merge a controversial PR.
…On Thu, Dec 21, 2023 at 1:56 PM Joris Van den Bossche <***@***.***> wrote:
I haven't had the time this week to further test it. But in any case, these are relevant alternatives for after 3.0, so I am planning to merge #56481 for now
Can you then explain why this deprecation needs to happen now, before 3.0? It's not deprecating anything that we want to change; it's just deprecating access to something we prefer people not to use (but that access is already private anyway, and as far as we know, pyarrow (

PyArrow cannot avoid this deprecation warning right now without a performance hit, so IMO it simply shouldn't be deprecated right now.

And I want to point out that it is this PR that is controversial, merged within 2 days after opening it, and not in line with what was discussed on the issue it references (#40226). So I won't self-merge, but I would appreciate if someone would merge the revert (#56481), so we can take the time to discuss this properly, without the pressure of it already being merged and an upcoming release.
Not getting it in for 2.x means waiting another year before actually enforcing it.
That is going to be the case regardless of when the deprecation actually occurs. There is zero evidence of pyarrow being willing to change their usage without a deprecation. Also, to add to the list of alternatives above: just use pd.ArrowDtype, which makes way more sense anyway.
There is no way that this could be enforced in 3.0 anyhow, as it would break all released versions of pyarrow (a dependency we want to make required for 3.0 ..). So a change like this that breaks a (required) dependency has to be spread over more than a year anyway, I think.
No, that is not correct. The main alternative being discussed above (
I don't think that makes sense at the moment. PyArrow's conversion to pandas follows pandas' own choices in defaults, and for the rest it is not opinionated. The fact that this conversion lives in PyArrow is (I think; this is from before my time) mostly historical and for packaging reasons: it was easier when pyarrow was young and evolving, and a dependency in pandas on Arrow C++ would be annoying given Python's state of packaging. But you could perfectly argue that this code rather belongs in pandas, and at that point we would also just be using the APIs that pyarrow now uses.
The idea is to have this be a DeprecationWarning in 2.x, then a FutureWarning in 3.x, then enforced in 4.0.
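To illustrate the practical difference between the two categories (a generic Python sketch, not pandas code): by default, Python filters out DeprecationWarning for code not triggered from `__main__`, so only library authors running tests tend to see it, whereas FutureWarning always reaches end users.

```python
import warnings

def api_call(category):
    # Stand-in for a deprecated library function.
    warnings.warn("make_block is deprecated", category, stacklevel=2)

with warnings.catch_warnings(record=True) as caught:
    warnings.resetwarnings()
    # Python's default for non-__main__ code: ignore DeprecationWarning.
    warnings.simplefilter("ignore", DeprecationWarning)
    api_call(DeprecationWarning)  # silenced for end users
    api_call(FutureWarning)       # always visible

print([w.category.__name__ for w in caught])  # -> ['FutureWarning']
```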
Can this be a DeprecationWarning in e.g. 3.0, a FutureWarning in 3.1, and then enforced in 4.0?
See also my comment at #56481 (comment) about the timeline. It's difficult to talk exactly about version numbers, because we don't know the exact timeline for those releases. If pandas 4.0 would happen a year after 3.0 (like 3.0 is now planned a year after 2.0), then I think 4.0 would be too early for enforcing it. I also don't think the timing of enforcement is that important for pandas. It's not like a user-facing API where we want to change the behaviour and want to get the better behaviour as fast as reasonably possible.
But to answer your actual question: even if we could already enforce it in 4.0, I think it is indeed perfectly fine to only have the DeprecationWarning in 3.0 (instead of now in 2.2) and change from DeprecationWarning to FutureWarning somewhere between 3.0 and 4.0.
…") (#56814) Backport PR #56481: Revert "DEPR: make_block (#56422)" Co-authored-by: Joris Van den Bossche <[email protected]>
This reverts commit b0ffccd.
xref #40226