ENH: Add conda subpackages corresponding to pip extras #52490

Open
1 of 3 tasks
jamesmyatt opened this issue Apr 6, 2023 · 7 comments
Labels
Enhancement · Needs Discussion (requires discussion from core team before further action)

Comments

@jamesmyatt
Contributor

jamesmyatt commented Apr 6, 2023

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

Since pandas 2.0.0 added pip extras, it would be good to be able to install the same extras with conda. For example:

conda install pandas-performance

should be equivalent to

python -m pip install pandas[performance]

This matches what many other packages already do, such as matplotlib, seaborn, dvc, and black. See, for example: https://github.com/conda-forge/matplotlib-feedstock/blob/main/recipe/meta.yaml, https://dvc.org/doc/install/linux#install-with-conda, https://github.com/conda-forge/black-feedstock/blob/main/recipe/meta.yaml

Feature Description

Use subpackages in https://github.com/conda-forge/pandas-feedstock/

Alternative Solutions

Current situation:

At the start of every project using conda (or when updating the requirements), the user must find the right part of the pandas docs, read it to work out the correct minimum version of optional dependencies they need, map the pypi package names to conda ones and then add those explicitly to their environment.yml file.
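To make that manual step concrete, here is a rough Python sketch of the mapping a user currently does by hand. The version pins are the pandas 2.0 minimums quoted elsewhere in this thread; the grouping into extras is my reading of pyproject.toml, and `PYPI_TO_CONDA`, `to_conda_spec` and `environment_lines` are hypothetical names invented for illustration, not part of pandas or conda.

```python
import re

# Minimum versions as quoted in this thread for pandas 2.0. The grouping into
# extras is an assumption and should be checked against pyproject.toml.
EXTRAS = {
    "performance": ["bottleneck>=1.3.4", "numba>=0.55.2", "numexpr>=2.8.0"],
    "parquet": ["pyarrow>=7.0.0"],
    "plot": ["matplotlib>=3.6.1"],
}

# Hypothetical PyPI -> conda name overrides. Names not listed here are
# assumed to be identical on both sides.
PYPI_TO_CONDA = {"matplotlib": "matplotlib-base"}


def to_conda_spec(requirement):
    """Turn 'matplotlib>=3.6.1' into 'matplotlib-base >=3.6.1'."""
    match = re.fullmatch(r"\s*([A-Za-z0-9._-]+)\s*(.*)", requirement)
    name, constraint = match.group(1).lower(), match.group(2).strip()
    conda_name = PYPI_TO_CONDA.get(name, name)
    return f"{conda_name} {constraint}".strip()


def environment_lines(extras):
    """Build the dependency lines of an environment.yml for the given extras."""
    lines = ["  - pandas >=2.0.0"]
    for extra in extras:
        for requirement in EXTRAS[extra]:
            lines.append(f"  - {to_conda_spec(requirement)}")
    return lines


if __name__ == "__main__":
    print("\n".join(environment_lines(["performance", "parquet", "plot"])))
```

With subpackages, this whole lookup would collapse to naming the subpackage in environment.yml.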

Additional Context

I suggest defining pandas-base (or pandas-core) to match the current pandas package exactly, then a pandas package that depends only on pandas-base but could later be expanded with recommended (not mandatory) dependencies, plus one subpackage for each extra in pyproject.toml other than the development ones and the all-inclusive one.

I can work on this PR for the pandas feedstock if it's welcome.

update: Added alternative solution to describe what currently happens.

@jamesmyatt added the Enhancement and Needs Triage labels on Apr 6, 2023
@youcanbekingagain

(I am a beginner and this is my first issue.) I am not sure exactly what needs to be done. See https://setuptools.pypa.io/en/latest/userguide/dependency_management.html#optional-dependencies
I can make changes in setup.py, such as:
extras_require = { 'plotter': ['matplotlib', 'seaborn', 'plotly'] }
and changes in pyproject.toml:
[project.optional-dependencies]
performance = ["performance"]
I don't understand the different commands for pandas-core and pandas-base that you asked about, since I can see only one project in pyproject.toml. I also don't understand the part about the pandas feedstock.

@DeaMariaLeon added the Needs Discussion label and removed the Needs Triage label on Apr 9, 2023
@lithomas1
Member

Sorry, but I am -1 on this. I think we have way too many groups for this to be sustainable.

I would rather conda actually add proper support for this (xref conda/conda#11053).

I'll leave this issue open, though, if any other people want to give feedback.

@jamesmyatt
Contributor Author

jamesmyatt commented Apr 11, 2023

Thanks for your comment @lithomas1 , but I'm not sure that waiting for conda to implement some significant new functionality is a viable alternative, when there is a well functioning, widely adopted alternative of using conda subpackages. Besides, I don't see that proposal avoiding duplication since you need to manually map pypi packages to conda ones anyway. And there's no single source of truth for the minimum dependency versions anyway since they're also duplicated in the docs: https://github.com/pandas-dev/pandas/blob/main/doc/source/getting_started/install.rst.

Like you, I also don't like the duplication between the pyproject file and the conda recipe, but much more than that I don't like having to go to the pandas docs every time I start a new project and work out the minimum versions of all of the dependencies I want and then write this in my environment.yml file.

  - pandas >=2.0.0
  - bottleneck >=1.3.4
  - numba >=0.55.2
  - numexpr >=2.8.0
  - pyarrow >=7.0.0
  - matplotlib-base >=3.6.1  # Note this is not matplotlib which includes more optional dependencies

when I could just write this instead

  - pandas-performance >=2.0.0
  - pandas-parquet >=2.0.0
  - pandas-plot >=2.0.0

Nor is ignoring conda users a good idea. The pip extras were added in pandas 2.0 for a very good reason: they save a lot of people time and make pandas much more user-friendly. #39164.

A more valuable package manager change would be to allow pip check to check arbitrary install specs, e.g. pip check pandas[performance] rather than just checking that the current environment is consistent, since then the tests in the recipe would be able to check against the right extras. But again, I don't see waiting for another slow-moving project to make changes as a viable strategy.
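For illustration only, here is a rough standard-library sketch of what checking a single extra could look like from a recipe test. It is not how pip check works and it only tests whether each requirement is installed at all, not whether the installed version satisfies the constraint (that would need a version-specifier parser such as the third-party `packaging` library). `requirements_for_extra` and `missing_for_extra` are hypothetical helper names.

```python
import re
from importlib.metadata import PackageNotFoundError, requires, version


def requirements_for_extra(requires_dist, extra):
    """Return the requirement strings whose marker mentions extra == "<extra>"."""
    marker_re = re.compile(
        r"""extra\s*==\s*["']{}["']""".format(re.escape(extra))
    )
    selected = []
    for line in requires_dist:
        requirement, _, marker = line.partition(";")
        if marker_re.search(marker):
            selected.append(requirement.strip())
    return selected


def missing_for_extra(dist_name, extra):
    """Names required by dist_name[extra] that are not installed at all.

    Raises PackageNotFoundError if dist_name itself is not installed.
    Installed versions are NOT compared against the constraints.
    """
    missing = []
    for requirement in requirements_for_extra(requires(dist_name) or [], extra):
        name = re.match(r"[A-Za-z0-9._-]+", requirement).group(0)
        try:
            version(name)
        except PackageNotFoundError:
            missing.append(name)
    return missing
```

In an environment with pandas installed, `missing_for_extra("pandas", "performance")` would list whichever of the performance dependencies are absent.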

@lithomas1
Member

Thanks for the feedback. I don't think my opinion has changed, though.

> Thanks for your comment @lithomas1 , but I'm not sure that waiting for conda to implement some significant new functionality is a viable alternative, when there is a well functioning, widely adopted alternative of using conda subpackages. Besides, I don't see that proposal avoiding duplication since you need to manually map pypi packages to conda ones anyway. And there's no single source of truth for the minimum dependency versions anyway since they're also duplicated in the docs: https://github.com/pandas-dev/pandas/blob/main/doc/source/getting_started/install.rst.
>
> Like you, I also don't like the duplication between the pyproject file and the conda recipe, but much more than that I don't like having to go to the pandas docs every time I start a new project and work out the minimum versions of all of the dependencies I want and then write this in my environment.yml file.

This doesn't solve the issue, but you can try looking through our CI env files (e.g. https://github.com/pandas-dev/pandas/blob/main/ci/deps/actions-38.yaml; they are all under the ci/deps folder). I believe all dependencies there specify a minimum version (except a couple).

>   - pandas >=2.0.0
>   - bottleneck >=1.3.4
>   - numba >=0.55.2
>   - numexpr >=2.8.0
>   - pyarrow >=7.0.0
>   - matplotlib-base >=3.6.1  # Note this is not matplotlib which includes more optional dependencies
>
> when I could just write this instead
>
>   - pandas-performance >=2.0.0
>   - pandas-parquet >=2.0.0
>   - pandas-plot >=2.0.0

My main gripe is that this clutters up the conda-forge channel (xref conda-forge/conda-forge.github.io#1558).

> Nor is ignoring conda users a good idea. The pip extras were added in pandas 2.0 for a very good reason: they save a lot of people time and make pandas much more user-friendly. #39164.

I understand that this is annoying, but I don't think it's reasonable or sustainable to ask every project with conda packages and pip extras to hack around the issue like this.

This is not a pandas-specific problem, but a conda problem, and I would like it fixed in the right place.

@jamesmyatt
Contributor Author

jamesmyatt commented Aug 20, 2024

FWIW, I still think that this is a worthwhile problem to solve and that conda subpackages are the best solution. But I'm not sure there's ever going to be a solution to the problem of duplicated information.

As I understand it, subpackages are the conda version of pip extras, rather than just a workaround. That is, it's no use wishing that you could write "pandas[performance]" as a conda dependency when the proper conda solution would be "pandas-performance". But I might be wrong.

@h-vetinari
Contributor

h-vetinari commented Aug 20, 2024

No, you're right, conda has no concept of extras, only separate outputs. There are efforts to change this, but even if this were to land tomorrow, we wouldn't be able to rely on this for a while yet.

In practice though, matching the extras in the conda-forge feedstock is not any more work (after the initial setup) than keeping the requirements in sync with whatever's specified in pyproject.toml.
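As a rough illustration of that kind of sync check, the sketch below compares the extras declared in pyproject.toml with the run requirements of the matching feedstock outputs. Everything here is hypothetical: `find_drift`, the `pandas-<extra>` naming, the input dictionaries, and the deliberately stale pyarrow pin are assumptions made up for the example, not the actual feedstock contents.

```python
import re

_SPEC = re.compile(r"\s*([A-Za-z0-9._-]+)\s*(.*)")


def split_spec(spec):
    """Split 'numba >=0.55.2' or 'numba>=0.55.2' into ('numba', '>=0.55.2')."""
    name, constraint = _SPEC.fullmatch(spec).groups()
    return name.lower(), constraint.replace(" ", "")


def find_drift(pyproject_extras, recipe_outputs, name_map, base="pandas"):
    """Return {extra: (missing_in_recipe, unexpected_in_recipe)} for mismatches."""
    drift = {}
    for extra, requirements in pyproject_extras.items():
        expected = set()
        for requirement in requirements:
            name, constraint = split_spec(requirement)
            expected.add((name_map.get(name, name), constraint))
        actual = {
            split_spec(spec)
            for spec in recipe_outputs.get(f"{base}-{extra}", [])
        }
        # The pin on the base package itself is not part of the extra.
        actual = {item for item in actual if item[0] != base}
        if expected != actual:
            drift[extra] = (sorted(expected - actual), sorted(actual - expected))
    return drift


# Hypothetical inputs: the pyarrow pin in the recipe is stale on purpose.
PYPROJECT_EXTRAS = {
    "performance": ["bottleneck>=1.3.4", "numba>=0.55.2", "numexpr>=2.8.0"],
    "parquet": ["pyarrow>=7.0.0"],
}
RECIPE_OUTPUTS = {
    "pandas-performance": [
        "pandas >=2.0.0", "bottleneck >=1.3.4", "numba >=0.55.2", "numexpr >=2.8.0",
    ],
    "pandas-parquet": ["pandas >=2.0.0", "pyarrow >=6.0.0"],
}
DRIFT = find_drift(PYPROJECT_EXTRAS, RECIPE_OUTPUTS, name_map={})
```

Here `DRIFT` flags only the parquet extra, which is the sort of signal a feedstock maintainer would want when bumping versions.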

@jamesmyatt
Contributor Author

jamesmyatt commented Aug 20, 2024

> No, you're right, conda has no concept of extras, only separate outputs.

The way I look at it, conda subpackages are the same concept as pip extras, or at least you can implement the same concept as pip extras using conda subpackages. Just because they have different names and superficial syntax doesn't mean they aren't the same concept. That's just my opinion, but I think it matches @xhochy's: conda/ceps#55 (comment).

5 participants