Allow Iterable and Sequence for index and columns parameters #790

Closed
dpoznik opened this issue Oct 5, 2023 · 7 comments

Comments

@dpoznik
Contributor

dpoznik commented Oct 5, 2023

Describe the bug
mypy reports an error when an Iterable or Sequence is passed to the index and columns parameters, even though pandas accepts these at runtime.

To Reproduce

  1. Provide a minimal runnable pandas example that is not properly checked by the stubs.
from typing import Iterable

import pandas as pd

def construct_series(
    data: Iterable[str],
    idx: Iterable[str],
) -> pd.Series:
    s = pd.Series(data, index=idx)
    return s
  2. Indicate which type checker you are using (mypy or pyright).
    mypy
  3. Show the error message received from that type checker while checking your example.
foo.py:12: error: No overload variant of "Series" matches argument types "Iterable[str]", "Iterable[str]"  [call-overload]
foo.py:12: note: Possible overload variants:
foo.py:12: note:     def [S1] Series(data: DatetimeIndex | Sequence[Timestamp | datetime64 | datetime], index: Index | Series[Any] | ndarray[Any, Any] | list[Any] | dict[Any, Any] | range | tuple[Any, ...] | None = ..., dtype: Literal['datetime64[Y]', 'datetime64[M]', 'datetime64[W]', 'datetime64[D]', 'datetime64[h]', 'datetime64[m]', 'datetime64[s]', 'datetime64[ms]', 'datetime64[us]', 'datetime64[μs]', 'datetime64[ns]', 'datetime64[ps]', 'datetime64[fs]', 'datetime64[as]'] = ..., name: Hashable | None = ..., copy: bool = ...) -> TimestampSeries
foo.py:12: note:     def [S1] Series(data: ExtensionArray | ndarray[Any, Any] | dict[str, ndarray[Any, Any]] | list[Any] | tuple[Any, ...] | Index, index: Index | Series[Any] | ndarray[Any, Any] | list[Any] | dict[Any, Any] | range | tuple[Any, ...] | None = ..., *, dtype: Literal['datetime64[Y]', 'datetime64[M]', 'datetime64[W]', 'datetime64[D]', 'datetime64[h]', 'datetime64[m]', 'datetime64[s]', 'datetime64[ms]', 'datetime64[us]', 'datetime64[μs]', 'datetime64[ns]', 'datetime64[ps]', 'datetime64[fs]', 'datetime64[as]'], name: Hashable | None = ..., copy: bool = ...) -> TimestampSeries
foo.py:12: note:     def [S1] Series(data: PeriodIndex, index: Index | Series[Any] | ndarray[Any, Any] | list[Any] | dict[Any, Any] | range | tuple[Any, ...] | None = ..., dtype: PeriodDtype = ..., name: Hashable | None = ..., copy: bool = ...) -> PeriodSeries
foo.py:12: note:     def [S1] Series(data: TimedeltaIndex | Sequence[Timedelta | timedelta64 | timedelta], index: Index | Series[Any] | ndarray[Any, Any] | list[Any] | dict[Any, Any] | range | tuple[Any, ...] | None = ..., dtype: Literal['timedelta64[Y]', 'timedelta64[M]', 'timedelta64[W]', 'timedelta64[D]', 'timedelta64[h]', 'timedelta64[m]', 'timedelta64[s]', 'timedelta64[ms]', 'timedelta64[us]', 'timedelta64[μs]', 'timedelta64[ns]', 'timedelta64[ps]', 'timedelta64[fs]', 'timedelta64[as]'] = ..., name: Hashable | None = ..., copy: bool = ...) -> TimedeltaSeries
foo.py:12: note:     def [S1, _OrderableT] Series(data: IntervalIndex[Interval[_OrderableT]] | Interval[_OrderableT] | Sequence[Interval[_OrderableT]], index: Index | Series[Any] | ndarray[Any, Any] | list[Any] | dict[Any, Any] | range | tuple[Any, ...] | None = ..., dtype: Literal['Interval'] = ..., name: Hashable | None = ..., copy: bool = ...) -> IntervalSeries[_OrderableT]
foo.py:12: note:     def [S1] Series(data: object, dtype: type[S1], index: Index | Series[Any] | ndarray[Any, Any] | list[Any] | dict[Any, Any] | range | tuple[Any, ...] | None = ..., name: Hashable | None = ..., copy: bool = ...) -> Series[S1]
foo.py:12: note:     def [S1] Series(data: Series[S1] | dict[int, S1] | dict[str, S1] = ..., index: Index | Series[Any] | ndarray[Any, Any] | list[Any] | dict[Any, Any] | range | tuple[Any, ...] | None = ..., dtype: ExtensionDtype | str | dtype[generic] | type[str] | type[complex] | type[bool] | type[object] = ..., name: Hashable | None = ..., copy: bool = ...) -> Series[S1]
foo.py:12: note:     def [S1] Series(data: object = ..., index: Index | Series[Any] | ndarray[Any, Any] | list[Any] | dict[Any, Any] | range | tuple[Any, ...] | None = ..., dtype: ExtensionDtype | str | dtype[generic] | type[str] | type[complex] | type[bool] | type[object] = ..., name: Hashable | None = ..., copy: bool = ...) -> Series[Any]
Found 1 error in 1 file (checked 1 source file)

Please complete the following information:

  • OS: macOS
  • OS Version: Ventura 13.6
  • Python version: 3.11.3
  • version of type checker: mypy 1.5.1
  • version of installed pandas-stubs: 2.1.1.230928

Additional context
Would it be appropriate to update index: Axes | None (e.g., here) to something like index: Axes | Iterable | Sequence | None?

@Dr-Irv
Collaborator

Dr-Irv commented Oct 6, 2023

Would it be appropriate to update index: Axes | None (e.g., here) to something like index: Axes | Iterable | Sequence | None?

No, because we don't want to accept plain strings or sets as arguments, both of which would match your use of Iterable[str].

So I think the type of your function is too wide - you are allowing arguments that are not accepted by pandas.
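
To illustrate the point about width, here is a minimal standard-library sketch. The helper name accepted_by_iterable is hypothetical and is only a runtime analogue of what a type checker does with an Iterable annotation; it is not part of pandas or pandas-stubs.

```python
from collections.abc import Iterable


def accepted_by_iterable(value: object) -> bool:
    """Hypothetical helper: runtime analogue of matching an Iterable annotation."""
    return isinstance(value, Iterable)


# Each of these counts as an Iterable of strings, so a parameter annotated
# that widely would admit all three, including the plain string and the set.
print(accepted_by_iterable(["a", "b", "c"]))  # True (list)
print(accepted_by_iterable("abc"))            # True (plain string)
print(accepted_by_iterable({"a", "b", "c"}))  # True (set, unordered)
```

Because a plain string and a set both pass, widening the index annotation to Iterable would stop the type checker from flagging them.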

Side note - this is fixed for pd.DataFrame() with respect to sets, but your report uncovered that we didn't fix all the use cases, so I created a pandas issue here: pandas-dev/pandas#55425

I'm going to close this because I believe the behavior of the stubs is correct here, but if you can provide an example that works with pandas with direct arguments to pandas functions, I'm willing to reopen it.

Dr-Irv closed this as completed Oct 6, 2023
@dpoznik
Contributor Author

dpoznik commented Oct 8, 2023

we don't want to accept plain strings or sets

Ah, of course. I completely agree that Iterable and Sequence are too broad, but I'm curious for your thoughts on KeysView and ValuesView? For example, the following runs fine:

import pandas as pd

def foo(data: list[int], d: dict[str, int]) -> pd.Series:
    s = pd.Series(data, index=d.keys())
    return s

if __name__ == '__main__':
    data = [1, 2, 3]
    d = {str(value): value for value in data}
    print(foo(data, d))

But mypy doesn't like it:

foo.py:5: error: No overload variant of "Series" matches argument types "list[int]", "dict_keys[str, int]"  [call-overload]

Side note - this is fixed for pd.DataFrame() with respect to sets, but your report uncovered that we didn't fix all the use cases, so I created a pandas issue here: pandas-dev/pandas#55425

Ah, great. I'm glad my initial report was at least indirectly helpful :)

@Dr-Irv
Collaborator

Dr-Irv commented Oct 9, 2023

but I'm curious for your thoughts on KeysView and ValuesView?

While they work, the docs say "arraylike" and KeysView and ValuesView are not "arraylike".
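
One possible workaround, sketched here with only the standard library, is to materialize the view into a list before passing it as the index, since list[Any] appears among the accepted index types in the overloads quoted earlier. The helper name labels_from_keys is hypothetical.

```python
from collections.abc import KeysView, Sequence


def labels_from_keys(keys: KeysView[str]) -> list[str]:
    """Hypothetical helper: copy a dict keys view into a plain list of labels."""
    return list(keys)


d = {"1": 1, "2": 2, "3": 3}
labels = labels_from_keys(d.keys())

# A keys view is iterable but not a Sequence (it cannot be indexed),
# whereas the materialized list is.
print(isinstance(d.keys(), Sequence))  # False
print(isinstance(labels, Sequence))    # True
print(labels)                          # ['1', '2', '3']
```

The resulting list could then be passed as index=labels, at the cost of one extra copy of the keys.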

@dpoznik
Contributor Author

dpoznik commented Oct 9, 2023

While they work, the docs say "arraylike" and KeysView and ValuesView are not "arraylike".

Got it. Thanks!

@dpoznik
Contributor Author

dpoznik commented Oct 9, 2023

Since the pandas definition of ArrayLike is private, I'm still not sure how best to annotate a function whose argument will be passed to the index parameter. I tried numpy.typing.ArrayLike:

import numpy.typing as npt
import pandas as pd

def foo(data: list[int], idx: npt.ArrayLike) -> pd.Series:
    s = pd.Series(data, index=idx)
    return s

print(foo([1, 2, 3], ["a", "b", "c"]))

This runs, but mypy gives:

foo.py:5: error: No overload variant of "Series" matches argument types "list[int]", "_SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | int | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes]"  [call-overload]

I could, of course, define a project-specific ArrayLike. Something like the following covers the use cases in my project and keeps mypy happy:

from typing import TypeAlias

import numpy as np
import numpy.typing as npt
import pandas as pd

ArrayLike: TypeAlias = list[str] | tuple[str, ...] | np.ndarray | pd.Index | pd.Series

def foo(
    data: list[int],
    idx: ArrayLike,
) -> pd.Series:
    s = pd.Series(data, index=idx)
    return s

print(foo([1, 2, 3], ["a", "b", "c"]))

Is that what you'd advise? Thanks again!

@Dr-Irv
Collaborator

Dr-Irv commented Oct 9, 2023

Is that what you'd advise? Thanks again!

The following would work:

import pandas as pd
from pandas._typing import Axes

def foo(
    data: list[int],
    idx: Axes,
) -> pd.Series:
    s = pd.Series(data, index=idx)
    return s


print(foo([1, 2, 3], ["a", "b", "c"]))

The Axes type will cover the acceptable types. While it's not documented, it's unlikely to change.

@dpoznik
Contributor Author

dpoznik commented Oct 9, 2023

OK, great. Thanks again for all your help.
