DEPR: make_block #56422

jbrockmendel · 2023-12-09T02:46:51Z

mroeschke · 2023-12-09T19:29:42Z

doc/source/whatsnew/v2.2.0.rst

@@ -431,6 +431,7 @@ Other Deprecations
 ^^^^^^^^^^^^^^^^^^
 - Changed :meth:`Timedelta.resolution_string` to return ``h``, ``min``, ``s``, ``ms``, ``us``, and ``ns`` instead of ``H``, ``T``, ``S``, ``L``, ``U``, and ``N``, for compatibility with respective deprecations in frequency aliases (:issue:`52536`)
 - Deprecated :func:`pandas.api.types.is_interval` and :func:`pandas.api.types.is_period`, use ``isinstance(obj, pd.Interval)`` and ``isinstance(obj, pd.Period)`` instead (:issue:`55264`)
+- Deprecated :func:`pd.core.internals.api.make_block`, use public APIs instead (:issue:`40226`)


Curious, is there a public API to create a block?

No. The point is to ween downstream packages off of our internals

mroeschke · 2023-12-11T17:43:48Z

Thanks @jbrockmendel

jorisvandenbossche · 2023-12-11T20:21:58Z

Sorry, I think this needs to be reverted. As I mentioned in the referenced issue #40226, pyarrow is using this. And as Brock mentioned above, there is no alternative for this. We can't deprecate this if we don't provide an alternative (and I think make_block actually already is the alternative to directly using the Block constructors, avoiding more usage of internal details).

jbrockmendel · 2023-12-11T22:47:00Z

Eventually we need to ween pyarrow off of both blocks and managers. Let's see if we can find a viable (i.e. perf hit not-too-big) alternative using public APIs.

An example to use as a benchmark:

import pandas as pd
import numpy as np
from pandas.core.internals.api import make_block
from pandas.core.internals import create_block_manager_from_blocks

ncols = 10
nrows = 10**5

arr1 = np.random.randn(nrows*ncols).reshape(nrows, ncols)
arr2 = arr1.astype(np.int64)

index = pd.Index(range(nrows))
cols = pd.Index(range(ncols * 2))
axes = [cols, index]

def v1():
    locs1 = np.arange(arr1.shape[1])
    blk1 = make_block(arr1.T, placement=locs1)
    locs2 = np.arange(arr1.shape[1], arr1.shape[1] + arr2.shape[1])
    blk2 = make_block(arr2.T, placement=locs2)
    mgr = create_block_manager_from_blocks([blk1, blk2], axes, verify_integrity=False)
    return pd.DataFrame._from_mgr(mgr, axes=axes[::-1])

def v2():
    left = pd.DataFrame(arr1, columns=cols[:ncols])
    right = pd.DataFrame(arr2, columns=cols[ncols:])
    return pd.concat([left, right], axis=1, copy=False)

def v3():
    arrs = list(arr1.T) + list(arr2.T)
    return pd.DataFrame._from_arrays(arrs, columns=cols, index=index, verify_integrity=False)

# check that these are all equivalent
df1 = v1()
df2 = v2()
df3 = v3()
assert df1.equals(df2)
assert df1.equals(df3)

%timeit v1()
23.2 µs ± 460 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

%timeit v2()
136 µs ± 1.46 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

%timeit v3()
371 µs ± 7.94 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

%timeit v4()
198 µs ± 1.71 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

Notes:

I did this timing with the deprecation in this PR commented-out; I'd expect v1 to be slightly slower on main
v3 silently consolidates
Except for v3, these numbers are all unchanged(ish) if we scale nrows

The pd.concat version looks entirely viable to me. Yes it is 6x slower, but since it is O(1) we're just talking about 110µs which is just not that big of a deal.

A compromise would be to implement pd.DataFrame._from_2d_arrays_just_for_pyarrow which at least would not touch internals.

jreback · 2023-12-11T23:42:06Z

how do these scale in terms of columns

and what if we have multiple dtypes?

jorisvandenbossche · 2023-12-12T08:29:16Z

The v2 with concat will need a reindex of the columns to be fully equivalent (in real world, the columns with different dtypes are interleaved). I don't directly know how you can currently achieve that without a copy with public APIs (adding a .reindex(columns=cols, copy=False) makes this a lot slower, because it does an actual take on the block values. I don't know how to convince it that it can do a slice with the public reindex interface)

jbrockmendel · 2023-12-12T16:48:04Z

Once CoW is enabled the reindex is zero-copy

jorisvandenbossche · 2023-12-12T16:54:20Z

No, I was already passing copy=False, it's our implementation that needs to figure out it doesn't need to do a take

jbrockmendel · 2023-12-12T17:05:20Z

Good to know (cc @phofl sounds like something is wrong with CoW+take). Until that is fixed, DataFrame._getitem_nocopy exists for pretty much this purpose.

jorisvandenbossche · 2023-12-12T17:11:05Z

_getitem_nocopy will give a fragmented DataFrame, though (at least the current implementation)

jbrockmendel · 2023-12-12T17:18:23Z

wasn't the original reason for caring about fragmentation that we used to do silent consolidation, which we no longer do?

jorisvandenbossche · 2023-12-12T22:00:25Z

We no longer do silent consolidation throughout operations, but our constructors generally still give you consolidated results (and so does the from arrow constructor right now), because AFAIK that is still the most optimal layout in the general case.

And there is also not much point in first creating blocks if you then slice them up into many pieces

jbrockmendel · 2023-12-13T01:03:42Z

I have a hard time taking seriously a concern about fragmented dataframes being inefficient.

You don't like any of the alternatives or compromises I've suggested. Can you suggest some?

This reverts commit b0ffccd.

jorisvandenbossche · 2023-12-13T13:16:40Z

For me the compromise is that we only expose a small core subset (single entry point to create blocks, and create a manager from those blocks) of the internals for advanced users, and consider all other internals as fully private. So for example, we limit access to the actual Block classes and their class constructors (i.e. I merged your PR #55139 for that, and I also made corresponding updates to pyarrow to follow that), but we keep a single factory function to create a block (i.e. revert this PR -> #56481)

That is also what you described in the top post of the referenced issue #40226 (and make_block, what you are deprecating here, was exactly added for that purpose).

phofl · 2023-12-14T22:09:24Z

The v2 with concat will need a reindex of the columns to be fully equivalent (in real world, the columns with different dtypes are interleaved). I don't directly know how you can currently achieve that without a copy with public APIs (adding a .reindex(columns=cols, copy=False) makes this a lot slower, because it does an actual take on the block values. I don't know how to convince it that it can do a slice with the public reindex interface)

This confuses me a little. I have a couple of questions here:

I am assuming that your columns within a specific dtype are already in the correct order? E.g. the first int column comes before the second int column and then the third and so on, then the next set of columns is from dtype float and they are again ordered correctly within the float dtype?

This should already be zero copy with CoW, because your block placements should always be basically range(0, nr_of_columns), which CoW will convert to a slice.

Example:

df = pd.DataFrame({"a": [1, 2, 3], "d": 2, "e": 3, "b": 1.5, "c": 2.5})
result = df.reindex(columns=["a", "b", "c", "d", "e"], copy=False)

The int columns are all ordered correctly, same for the float columns, we just switch orders between the dtypes, which we can do zero copy

np.shares_memory(get_array(df, "a"), get_array(result, "a"))

Non CoW will still copy

jorisvandenbossche · 2023-12-15T07:58:17Z

OK, I see what happened! I tested this before (with a slightly updated example from Brocks post to interleave the dtypes), and timed with the default mode (around 10ms instead of 20µs), and based on the profile this was because of the Block.take_nd in the reindex step. Then I enabled CoW and it also gave around 10ms, but didn't check the profile (just assumed it had the same cause, also because I was already passing copy=False to reindex assuming we would honor that regardless of CoW).

But redoing that example now, the reason it was also slower with CoW is because I was missing a copy=False in the DataFrame(arr) constructor (where we changed the default to True when CoW is enabled for ndarray input), and apparently that gave more or less the same slowdown as the take in the non-CoW mode ..

With correcting that, I indeed get a zero-copy construction with the correct column order. The overhead in this case is around 400µs vs 20µs for me.
This (concat index and reindexing) overhead grows with the number of columns, and I will also try to check next week the impact of more dtypes, to see if the overhead stays acceptable.

Adapted example:

In [1]: import pandas as pd
   ...: import numpy as np
   ...: from pandas.core.internals.api import make_block
   ...: from pandas.core.internals import create_block_manager_from_blocks
   ...: 
   ...: ncols = 10
   ...: nrows = 10**5
   ...: 
   ...: arr1 = np.random.randn(nrows*ncols).reshape(nrows, ncols)
   ...: arr2 = arr1.astype(np.int64)
   ...: 
   ...: index = pd.Index(range(nrows))
   ...: cols = pd.Index(range(ncols * 2))
   ...: axes = [cols, index]
<ipython-input-1-5c31cca232e4>:4: DeprecationWarning: create_block_manager_from_blocks is deprecated and will be removed in a future version. Use public APIs instead.
  from pandas.core.internals import create_block_manager_from_blocks

In [2]: def v1():
   ...:     locs1 = np.arange(arr1.shape[1]*2, step=2)
   ...:     blk1 = make_block(arr1.T, placement=locs1)
   ...:     locs2 = np.arange(1, arr1.shape[1]*2, step=2)
   ...:     blk2 = make_block(arr2.T, placement=locs2)
   ...:     mgr = create_block_manager_from_blocks([blk1, blk2], axes, verify_integrity=False)
   ...:     return pd.DataFrame._from_mgr(mgr, axes=axes[::-1])
   ...: 

In [3]: def v2():
   ...:     left = pd.DataFrame(arr1, columns=cols[0:ncols*2:2], copy=False)
   ...:     right = pd.DataFrame(arr2, columns=cols[1:ncols*2:2], copy=False)
   ...:     return pd.concat([left, right], axis=1, copy=False).reindex(columns=cols, copy=False)
   ...: 

In [4]: %timeit v1()
24.6 µs ± 399 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [5]: %timeit v2()
16.1 ms ± 1.76 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [6]: df1 = v1()

In [7]: df2 = v2()

In [8]: pd.testing.assert_frame_equal(df1, df2)

In [9]: pd.options.mode.copy_on_write = True

In [10]: %timeit v2()
463 µs ± 8.93 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

jorisvandenbossche · 2023-12-21T21:56:11Z

I haven't had the time this week to further test it. But in any case, this are relevant alternatives for after 3.0, so I am planning to merge #56481 for now

jbrockmendel · 2023-12-22T01:39:02Z

The alternatives being relevant for after 3.0 does not change the fact that the deprecation needs to happen before 3.0. Please do not self-merge a controversial PR.

…

On Thu, Dec 21, 2023 at 1:56 PM Joris Van den Bossche < ***@***.***> wrote: I haven't had the time this week to further test it. But in any case, this are relevant alternatives for after 3.0, so I am planning to merge #56481 <#56481> for now — Reply to this email directly, view it on GitHub <#56422 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AB5UM6CT6BVZVBE2AVHAP2LYKSWAPAVCNFSM6AAAAABANOXAIWVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQNRWHE3TSNJSHA> . You are receiving this because you were mentioned.Message ID: ***@***.***>

jorisvandenbossche · 2023-12-22T07:47:43Z

Can you then explain why this deprecation needs to happen now before 3.0? It's not deprecating anything that we want to change, it's just deprecating access to something we prefer people not to use (but that access is already private anyway, and as far as we know, pyarrow (~~and mars, based on a github code search~~ (it was a different function)) is the only public users of this).
My understanding is that there is nothing time sensitive about this deprecation (we wouldn't be able to already remove it in 3.0 anyway, even if deprecated now for 2.2, as that would be way too fast)

PyArrow cannot avoid this deprecation warning right now without a performance hit, so IMO it simply shouldn't be deprecated right now.

And I want to point out that it is this PR that is controversial, merged within 2 days after opening it, and not in line what was discussed on the issue it references (#40226). So I won't self-merge, but I would appreciate if someone would merge the revert (#56481), so we can can take the time discuss this properly, without the pressure of it already being merged and an upcoming release.

jbrockmendel · 2023-12-22T19:10:43Z

Can you then explain why this deprecation needs to happen now before 3.0?

Not getting it in for 2.x means waiting another year before actually enforcing it.

PyArrow cannot avoid this deprecation warning right now without a performance hit, so IMO it simply shouldn't be deprecated right now.

That is going to be the case regardless of when the deprecation actually occurs. There is zero evidence of pyarrow being willing to change their usage without a deprecation.

Also to add to the list of alternatives above: just use pd.ArrowDtype which makes way more sense anyway.

jorisvandenbossche · 2024-01-08T19:35:03Z

Can you then explain why this deprecation needs to happen now before 3.0?

Not getting it in for 2.x means waiting another year before actually enforcing it.

There is no way that this could be enforced in 3.0 anyhow, as it would break all released versions of pyarrow (a dependency we want to make required for 3.0 ..). So a change like this that breaks a (required) dependency has to be spread over more than a year anyaway, I think.

PyArrow cannot avoid this deprecation warning right now without a performance hit, so IMO it simply shouldn't be deprecated right now.

That is going to be the case regardless of when the deprecation actually occurs.

No, that is not correct. The main alternative being discussed above (concat + reindex) will be possible starting in the future (3.0, when CoW is enabled), but not right now.

Also to add to the list of alternatives above: just use pd.ArrowDtype which makes way more sense anyway.

I don't think that make sense at the moment. PyArrow's conversion to pandas follows pandas' own choices in defaults, and for the rest it is not opinionated.

The fact that this conversion lives in PyArrow is (I think, this is from before my time) mostly historical and for packaging reasons. Because it was easier when pyarrow was young and evolving, and because a dependency in pandas on Arrow C++ would be annoying with Python's state of packaging. But you could perfectly argue that this code rather belongs in pandas, and at that point we would also just be using the APIs that pyarrow now uses.

jbrockmendel · 2024-01-08T20:46:06Z

There is no way that this could be enforced in 3.0 anyhow

The idea is to have this be a DeprecationWarning in 2.x, then a FutureWarning in 3.x, then enforced in 4.0.

rhshadrach · 2024-01-08T21:03:10Z

Can this be a DeprecationWarning in e.g. 3.0, a FutureWarning in 3.1, and then enforced in 4.0?

jorisvandenbossche · 2024-01-08T21:22:44Z

See also my comment at #56481 (comment) about the timeline.

It's difficult to talk exactly about version numbers, because we don't know the exact timeline for those releases. If pandas 4.0 would happen a year after 3.0 (like 3.0 is now planned a year after 2.0), then I think 4.0 would be too early for enforcing it.
But maybe 4.0 happens a half year or year later, and then that would be fine.

I also don't think this time of enforcement is that important for pandas. It's not like a user facing API where we want to change the behaviour, and want to get the better behaviour as fast as reasonably possible.

jorisvandenbossche · 2024-01-08T21:24:49Z

But to answer your actual question: even if we could already enforce it in 4.0, I think it is indeed perfectly fine to only have the DeprecationWarning in 3.0 (instead of now in 2.2) and change from Deprecation to FutureWarning somewhere between 3.0 and 4.0.

This reverts commit b0ffccd.

)"

…") (#56814) Backport PR #56481: Revert "DEPR: make_block (#56422)" Co-authored-by: Joris Van den Bossche <[email protected]>

This reverts commit b0ffccd.

jbrockmendel added 2 commits December 8, 2023 18:46

DEPR: make_block

a644dc0

lint fixup

73d465b

mroeschke reviewed Dec 9, 2023

View reviewed changes

mroeschke added Internals Related to non-user accessible pandas implementation Deprecate Functionality to remove in pandas labels Dec 9, 2023

mroeschke added this to the 2.2 milestone Dec 11, 2023

mroeschke approved these changes Dec 11, 2023

View reviewed changes

mroeschke merged commit b0ffccd into pandas-dev:main Dec 11, 2023
50 checks passed

github-actions bot mentioned this pull request Dec 11, 2023

DEPR: List of deprecations to be removed in 3.0 #50578

Open

jbrockmendel deleted the depr-make_block branch December 11, 2023 17:44

jorisvandenbossche added a commit that referenced this pull request Dec 13, 2023

Revert "DEPR: make_block (#56422)"

8f7e496

This reverts commit b0ffccd.

jorisvandenbossche mentioned this pull request Dec 13, 2023

Revert "DEPR: make_block (#56422)" #56481

Merged

lithomas1 mentioned this pull request Dec 27, 2023

DOC: Add whatsnew for 2.3.0 #56634

Merged

5 tasks

lithomas1 pushed a commit that referenced this pull request Jan 10, 2024

Revert "DEPR: make_block (#56422)" (#56481)

1acf0d6

This reverts commit b0ffccd.

meeseeksmachine pushed a commit to meeseeksmachine/pandas that referenced this pull request Jan 10, 2024

Backport PR pandas-dev#56481: Revert "DEPR: make_block (pandas-dev#56422

fc3d938

)"

meeseeksmachine mentioned this pull request Jan 10, 2024

Backport PR #56481 on branch 2.2.x (Revert "DEPR: make_block (#56422)") #56814

Merged

lithomas1 pushed a commit that referenced this pull request Jan 11, 2024

Backport PR #56481 on branch 2.2.x (Revert "DEPR: make_block (#56422)…

59c0a2d

…") (#56814) Backport PR #56481: Revert "DEPR: make_block (#56422)" Co-authored-by: Joris Van den Bossche <[email protected]>

jorisvandenbossche mentioned this pull request Mar 29, 2024

DEPR: make_block #56815

Closed

pmhatre1 pushed a commit to pmhatre1/pandas-pmhatre1 that referenced this pull request May 7, 2024

Revert "DEPR: make_block (pandas-dev#56422)" (pandas-dev#56481)

776d57d

This reverts commit b0ffccd.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DEPR: make_block #56422

DEPR: make_block #56422

jbrockmendel commented Dec 9, 2023

mroeschke Dec 9, 2023

jbrockmendel Dec 9, 2023

mroeschke commented Dec 11, 2023

jorisvandenbossche commented Dec 11, 2023

jbrockmendel commented Dec 11, 2023

jreback commented Dec 11, 2023

jorisvandenbossche commented Dec 12, 2023

jbrockmendel commented Dec 12, 2023

jorisvandenbossche commented Dec 12, 2023

jbrockmendel commented Dec 12, 2023

jorisvandenbossche commented Dec 12, 2023

jbrockmendel commented Dec 12, 2023

jorisvandenbossche commented Dec 12, 2023 •

edited

Loading

jbrockmendel commented Dec 13, 2023

jorisvandenbossche commented Dec 13, 2023

phofl commented Dec 14, 2023

jorisvandenbossche commented Dec 15, 2023 •

edited

Loading

jorisvandenbossche commented Dec 21, 2023

jbrockmendel commented Dec 22, 2023 via email

jorisvandenbossche commented Dec 22, 2023 •

edited

Loading

jbrockmendel commented Dec 22, 2023

jorisvandenbossche commented Jan 8, 2024

jbrockmendel commented Jan 8, 2024

rhshadrach commented Jan 8, 2024

jorisvandenbossche commented Jan 8, 2024

jorisvandenbossche commented Jan 8, 2024

DEPR: make_block #56422

DEPR: make_block #56422

Conversation

jbrockmendel commented Dec 9, 2023

mroeschke Dec 9, 2023

Choose a reason for hiding this comment

jbrockmendel Dec 9, 2023

Choose a reason for hiding this comment

mroeschke commented Dec 11, 2023

jorisvandenbossche commented Dec 11, 2023

jbrockmendel commented Dec 11, 2023

jreback commented Dec 11, 2023

jorisvandenbossche commented Dec 12, 2023

jbrockmendel commented Dec 12, 2023

jorisvandenbossche commented Dec 12, 2023

jbrockmendel commented Dec 12, 2023

jorisvandenbossche commented Dec 12, 2023

jbrockmendel commented Dec 12, 2023

jorisvandenbossche commented Dec 12, 2023 • edited Loading

jbrockmendel commented Dec 13, 2023

jorisvandenbossche commented Dec 13, 2023

phofl commented Dec 14, 2023

jorisvandenbossche commented Dec 15, 2023 • edited Loading

jorisvandenbossche commented Dec 21, 2023

jbrockmendel commented Dec 22, 2023 via email

jorisvandenbossche commented Dec 22, 2023 • edited Loading

jbrockmendel commented Dec 22, 2023

jorisvandenbossche commented Jan 8, 2024

jbrockmendel commented Jan 8, 2024

rhshadrach commented Jan 8, 2024

jorisvandenbossche commented Jan 8, 2024

jorisvandenbossche commented Jan 8, 2024

jorisvandenbossche commented Dec 12, 2023 •

edited

Loading

jorisvandenbossche commented Dec 15, 2023 •

edited

Loading

jorisvandenbossche commented Dec 22, 2023 •

edited

Loading