Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DEPR: pandas.core for groupby #55429

Closed
wants to merge 3 commits into from

Conversation

rhshadrach
Copy link
Member

@rhshadrach rhshadrach commented Oct 6, 2023

  • closes #xxxx (Replace xxxx with the GitHub issue number)
  • Tests added and passed if fixing a bug or adding a new feature
  • All code checks passed.
  • Added type annotations to new arguments/methods/functions.
  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

Part of #27522.

Just a PoC for right now to see what the CI says. Various parts of this are stolen from numpy/numpy#24634.

Not sure if we want to do this all at once or submodule by submodule. I'm thinking doing it one submodule at a time is easier for review and we can try just e.g. groupby first and see how that goes.

All commits prior to "Automated changes" only need to be done the first time. The "Automated changes" commit is the result of the script below. A few manual changes need to be made after.

WARNING: The script below will completely clean out your git repo by doing a git reset --hard and git clean -xfd

Automated changes
import os
import sys
import subprocess
from pathlib import Path

def run_in_shell(cmd):
    print(cmd)
    response = subprocess.run(cmd, shell=True, capture_output=True)
    stdout_str = response.stdout.decode("utf-8").strip()
    if stdout_str != '':
        print(stdout_str)
    stderr_str = response.stderr.decode("utf-8").strip()
    if stderr_str != '':
        raise ValueError(stderr_str)

base_path = Path("/home/richard/dev/pandas/pandas")
submodule = "groupby"

run_in_shell(f"cd {base_path} && git reset --hard HEAD")
run_in_shell(f"cd {base_path} && git clean -xfd")
os.makedirs(base_path / "_core" / submodule, exist_ok=True)
run_in_shell(f"cd {base_path} && git mv core/{submodule}/* _core/{submodule}/")

old = f'core\.{submodule}'
new = f'_core\.{submodule}'
run_in_shell(f"cd {base_path} && grep -rl '{old}' . --exclude-dir=core/{submodule} | xargs sed -i 's/{old}/{new}/g'")
run_in_shell(f"cd {base_path}/../ci && grep -rl '{old}' . | xargs sed -i 's/{old}/{new}/g'")
run_in_shell(f"cd {base_path}/../doc && grep -rl '{old}' . | xargs sed -i 's/{old}/{new}/g'")
run_in_shell(f"cd {base_path}/../ && sed -i 's/{old}/{new}/g' pyproject.toml")

old = f'from pandas\.core import {submodule}'
new = f'from pandas\._core import {submodule}'
run_in_shell(f"cd {base_path} && grep -rl '{old}' . --exclude-dir=core/{submodule} | xargs sed -i 's/{old}/{new}/g'")

old = f'pandas\/core\/{submodule}'
new = f'pandas\/_core\/{submodule}'
run_in_shell(f"cd {base_path}/../ && sed -i 's/{old}/{new}/g' pyright_reportGeneralTypeIssues.json")

for filename in (base_path / "_core" / submodule).glob("**/*"):
    if not str(filename).endswith(".py"):
        continue
    name = filename.parts[-1][:-len(".py")]
    output_path = str(filename).replace("/_core/", "/core/")

    if name == "__init__":
        contents = f"""from __future__ import annotations
        
from typing import Any

from pandas._core import {submodule} as {submodule}_
from pandas.core.common import _depr_core


def __getattr__(attr_name: str) -> Any:
    attr = getattr({submodule}_, attr_name)
    _depr_core()
    return attr
"""

    else:
        contents = f"""from __future__ import annotations
from pandas._core.{submodule} import {name}
from pandas.core.common import _depr_core

_depr_core()

_globals = globals()

for item in {name}.__dir__():
    _globals[item] = getattr({name}, item)
"""

    with open(output_path, mode="w") as f:
        f.write(contents)

run_in_shell(f"cd {base_path} && git add core/{submodule}/*")

@rhshadrach rhshadrach added API Design Deprecate Functionality to remove in pandas Groupby labels Oct 6, 2023
@jbrockmendel
Copy link
Member

Haven't looked closely, but big picture this seems like something we ought to do eventually and might as well bite the bullet.

@rhshadrach
Copy link
Member Author

Haven't looked closely, but big picture this seems like something we ought to do eventually and might as well bite the bullet.

I think my latest commits may have messed up the diff, and that we can get it down to < 1000 lines when done properly. I'm hoping I can automate a bit more and will be force-pushing quite a bit here. Will ping when ready.

@rhshadrach
Copy link
Member Author

Looks like my previous comment was wrong - I think because we need to add new content to the old location, GitHub's diff isn't registering this as a move. Bummer.

@jbrockmendel
Copy link
Member

kind of orthogonal to this but i want to write it down somewhere: the naming in core.groupby is weird to me bc generic.py is for the non-base classes whereas core.generic is for the base class

@mroeschke
Copy link
Member

Just noting that this is how NumPy is deprecating their core usage: https://github.com/numpy/numpy/blob/main/numpy/core/_utils.py

@rhshadrach
Copy link
Member Author

@mroeschke: happy to make any changes if desired. Because this diff is so large, here is the one I have (in pandas.core.common._depr_core):

warnings.warn(
        "pandas.core is deprecated and has been renamed to "
        "pandas._core. Accessing `_core` directly is discouraged as "
        "members can change without warning.  You should use a public module "
        "instead that exports the attribute in question. If you still would "
        "like to access an attribute from it, please use pandas._core.",
        DeprecationWarning,
        stacklevel=3,
    )

@rhshadrach
Copy link
Member Author

One other open question in my mind: do we do all of core at once, or go (roughly) submodule-by-submodule over time (but all within 2.2)?

@rhshadrach rhshadrach force-pushed the depr_core_groupby branch 2 times, most recently from aaa1061 to deeb7e0 Compare October 17, 2023 22:12
@rhshadrach
Copy link
Member Author

rhshadrach commented Oct 18, 2023

@jbrockmendel @mroeschke I think this is ready for review. Most of the diff comes from the Automated changes commit, produced by the script in the OP.

If we don't do import pandas.core.groupby in pandas/__init__.py, then we wouldn't be able to do e.g. from pandas.core.groupby.groupby import GroupBy. I'm still trying to figure out a better way for this, but maybe it's okay?

@rhshadrach rhshadrach marked this pull request as ready for review October 18, 2023 00:23
@rhshadrach rhshadrach requested a review from phofl October 21, 2023 16:54
@rhshadrach
Copy link
Member Author

Friendly ping @jbrockmendel @mroeschke @phofl

from typing import Any

from pandas._core import groupby as groupby_
from pandas.core.common import _depr_core
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the future, will this raise it's own DeprecationWarning if this is imported from pandas.core?

Copy link
Member Author

@rhshadrach rhshadrach Nov 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once we deprecate pandas.core.common, I think we'd move this method to pandas._core.common. So yes - but we won't be importing it from pandas.core.common ourselves.

- `reduction_kernels`
- `transformation_kernels`
- `groupby_other_methods`
see the comments in pandas/core/groupby/base.py for guidance on
see the comments in pandas/_core.groupby/base.py for guidance on
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
see the comments in pandas/_core.groupby/base.py for guidance on
see the comments in pandas/_core/groupby/base.py for guidance on

Looks like there's one below (and possibly elsewhere)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks - my script for automated changes was doing

old = f'core.{submodule}'
new = f'_core.{submodule}'

whereas it should have been \. in the regex. Will fix.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is now fixed.

from pandas.core.common import _depr_core


def __getattr__(attr_name: str) -> Any:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If dir(pandas.core.groupby) is called, would we get the same result before/after this change?

Copy link
Member Author

@rhshadrach rhshadrach Nov 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No - I'm getting:

['Any', '__builtins__', '__cached__', '__doc__', '__file__', '__getattr__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '_depr_core', 'annotations', 'groupby_']

I don't immediately see a manageable way to have the dir results not change, will think on this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the NumPy deprecation (link in OP), I think they have the same issue.

@rhshadrach
Copy link
Member Author

rhshadrach commented Nov 28, 2023

@mroeschke - friendly ping. Docs build only failed because of the SciPy URL issue.

At this point, I'd think about holding off on this until 2.2 is out. I would like to start on this early in that release cycle though, so good to get any further issues ironed out.

My preference would be to do just one submodule and wait a week or two with it in main before moving on, then do the rest of core all in one go. But I'd be okay with doing the entire thing piece-meal in a series of PRs if others prefer.

@jbrockmendel
Copy link
Member

My preference would be to do just one submodule and wait a week or two with it in main before moving on, then do the rest of core all in one go. But I'd be okay with doing the entire thing piece-meal in a series of PRs if others prefer.

No opinion here, strong default to defer to the person doing the actual work: you.

IIRC dask has a place where they do something like

import pandas as pd

@inherit_docstring_or_something(pd.core.groupby.DataFrameGroupBy)
class DataFrameGroupBy:

that might be affected by this (I think i ran into this when trying to make groupby import lazy)

@@ -85,7 +85,7 @@ Plotting
GroupBy/resample/rolling
^^^^^^^^^^^^^^^^^^^^^^^^

- Fixed regression in :meth:`pands.core.groupby.DataFrameGroupBy.quantile` raising when multiple quantiles are given (:issue:`27526`)
- Fixed regression in :meth:`pands._core.groupby.DataFrameGroupBy.quantile` raising when multiple quantiles are given (:issue:`27526`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could just trim the modules? also typo pands->pandas

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks - I handled most of these in #55555 and #55626, but I think this one escaped my grepping because of the typo. Will do another precursor.

@@ -0,0 +1,304 @@
from __future__ import annotations
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sometimes git understands when a file is just renamed. not sure how that happens. can we make it happen here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think no if we want to do this all in one shot. I believe the issue is that we are renaming but then also recreating the files with different content, and git doesn't recognize this. If we were to split this up into two PRs: (1) rename core -> _core and then (2) add back the core files for backwards compatibility / deprecation, then I think git would recognize it.

Breaking it up into these two steps would certainly make review easier, but we'd need to make sure we carry it all out before the release.

@phofl
Copy link
Member

phofl commented Dec 1, 2023

IIRC dask has a place where they do something like

Yes that's correct, but that doesn't affect only dask, it affects our Sphinx docs as well (they will show private paths). Richard and I chatted about this a while ago and IIRC having the classes accessible in the typing api module would solve that (we can move dasks imports over)

@rhshadrach is this a correct recollection of our chat?

@jbrockmendel
Copy link
Member

id want to get @jorisvandenbossche's OK on this before moving forward

@rhshadrach
Copy link
Member Author

rhshadrach commented Dec 14, 2023

During the dev call today, @jorisvandenbossche suggested just doing the move without the deprecation first. It also seems to me it'd be better to do this as two PRs: one moving core to _core followed by a second that adds the stub files. I will be putting PRs for this approach up in place of this one.

@rhshadrach rhshadrach closed this Dec 14, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Deprecate Functionality to remove in pandas Groupby
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants