Bump pandas from 1.5.3 to 2.0.3 #422

Open — wants to merge 18 commits into main
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -46,6 +46,7 @@ Inspired from [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
- updating listing file with three v2 sparse model - by @dhrubo-os ([#412](https://github.com/opensearch-project/opensearch-py-ml/pull/412))
- Update model upload history - opensearch-project/opensearch-neural-sparse-encoding-doc-v2-mini (v.1.0.0)(TORCH_SCRIPT) by @dhrubo-os ([#417](https://github.com/opensearch-project/opensearch-py-ml/pull/417))
- Update model upload history - opensearch-project/opensearch-neural-sparse-encoding-v2-distill (v.1.0.0)(TORCH_SCRIPT) by @dhrubo-os ([#419](https://github.com/opensearch-project/opensearch-py-ml/pull/419))
- Bump pandas from 1.5.3 to 2.0.3 by @yerzhaisang ([#422](https://github.com/opensearch-project/opensearch-py-ml/pull/422))

### Fixed
- Fix the wrong final zip file name in model_uploader workflow, now will name it by the upload_prefix alse.([#413](https://github.com/opensearch-project/opensearch-py-ml/pull/413/files))
2 changes: 1 addition & 1 deletion docs/requirements-docs.txt
@@ -1,5 +1,5 @@
opensearch-py>=2
pandas>=1.5,<3
pandas==2.0.3


Reviewer: Can we upgrade to a more recent version? Any reason specifically for 2.0.3?

Contributor Author — @Yerzhaisang, Nov 3, 2024:
Bumping to a later version introduces datatype issues, including an ImportError like this:
ImportError: cannot import name 'is_datetime_or_timedelta_dtype' from 'pandas.core.dtypes.common'
Given that the issue was only to upgrade to a 2.x version, I thought 2.0.3 would be sufficient.

Collaborator:
Sure, let's focus on bumping to 2.0.3 for now and then we can create another issue to upgrade more if needed.

matplotlib>=3.6.0,<4
nbval
sphinx
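For context on the ImportError quoted above: newer pandas releases dropped the private helper `is_datetime_or_timedelta_dtype` from `pandas.core.dtypes.common`. A hedged sketch (not part of this PR) of an equivalent check built only from public `pandas.api.types` functions, which code pinned to newer pandas could use instead:

```python
import pandas as pd

def is_datetime_or_timedelta(values) -> bool:
    """Public-API stand-in for the removed private helper:
    True for datetime64 (naive or tz-aware) and timedelta64 dtypes."""
    return (
        pd.api.types.is_datetime64_any_dtype(values)
        or pd.api.types.is_timedelta64_dtype(values)
    )
```

This avoids importing from `pandas.core.*`, which pandas does not treat as stable API.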
25 changes: 22 additions & 3 deletions opensearch_py_ml/common.py
@@ -55,14 +55,33 @@


def build_pd_series(
data: Dict[str, Any], dtype: Optional["DTypeLike"] = None, **kwargs: Any
data: Dict[str, Any],
dtype: Optional["DTypeLike"] = None,
index_name: Optional[str] = None,
**kwargs: Any,
) -> pd.Series:
"""Builds a pd.Series while squelching the warning
for unspecified dtype on empty series
"""
Builds a pandas Series from a dictionary, optionally setting an index name.

Parameters:
data : Dict[str, Any]
The data to build the Series from, with keys as the index.
dtype : Optional[DTypeLike]
The desired data type of the Series. If not specified, uses EMPTY_SERIES_DTYPE if data is empty.
index_name : Optional[str]
Name to assign to the Series index, similar to `index_name` in `value_counts`.

Returns:
pd.Series
A pandas Series constructed from the given data, with the specified dtype and index name.
"""

dtype = dtype or (EMPTY_SERIES_DTYPE if not data else dtype)
if dtype is not None:
kwargs["dtype"] = dtype
if index_name is not None:
Collaborator: Can we add a comment explaining why we need this?

Contributor Author: Done.

index = pd.Index(data.keys(), name=index_name)
kwargs["index"] = index
return pd.Series(data, **kwargs)


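To illustrate the new `index_name` parameter: the sketch below uses plain pandas (not the library helper) to show how naming the index up front reproduces the pandas 2.x `value_counts`-style shape that `build_pd_series` is now asked to match — original column name on the index, `"count"` on the series:

```python
import pandas as pd

data = {"Logstash Airways": 3331, "JetBeats": 3274}
# Equivalent of build_pd_series(data, index_name="Carrier", name="count"):
# the source column name moves to the index, the series itself is "count"
index = pd.Index(data.keys(), name="Carrier")
s = pd.Series(data, index=index, name="count")
```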
19 changes: 12 additions & 7 deletions opensearch_py_ml/dataframe.py
@@ -47,7 +47,7 @@
from opensearch_py_ml.groupby import DataFrameGroupBy
from opensearch_py_ml.ndframe import NDFrame
from opensearch_py_ml.series import Series
from opensearch_py_ml.utils import is_valid_attr_name
from opensearch_py_ml.utils import is_valid_attr_name, to_list_if_needed

if TYPE_CHECKING:
from opensearchpy import OpenSearch
@@ -424,9 +424,14 @@ def drop(
axis = pd.DataFrame._get_axis_name(axis)
axes = {axis: labels}
elif index is not None or columns is not None:

Reviewer: Kind of confused here — the parent branch checks that one of them is not None, but inside it checks again (lines 431 and 440). Maybe this could be simplified with the `convertToList` wrapper @pyek-bot suggested.

Contributor Author: Fixed.

axes, _ = pd.DataFrame()._construct_axes_from_arguments(
(index, columns), {}
)
axes = {
"index": to_list_if_needed(index),
"columns": (
pd.Index(to_list_if_needed(columns))
if columns is not None
else None
Comment on lines +430 to +432 —

Reviewer: I noticed in the implementation of opensearch_py_ml.utils that when the value is None it already returns None — maybe we don't need a ternary operation here, since that's already handled?

Contributor Author: Hey Brian, we can't remove the ternary operation at this point because calling pd.Index(None) results in a TypeError. Additionally, we need to keep this check since it's required here.

),
}
else:
raise ValueError(
"Need to specify at least one of 'labels', 'index' or 'columns'"
@@ -440,7 +445,7 @@ def drop(
axes["index"] = [axes["index"]]
if errors == "raise":
# Check if axes['index'] values exists in index
count = self._query_compiler._index_matches_count(axes["index"])
count = self._query_compiler._index_matches_count(list(axes["index"]))
if count != len(axes["index"]):
raise ValueError(
f"number of labels {count}!={len(axes['index'])} not contained in axis"
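The reason the ternary guard in `drop()` is kept: pandas refuses `None` as `Index` data, as this small illustration (not PR code) shows:

```python
import pandas as pd

# pd.Index requires a collection of some kind; passing None raises
# TypeError, so drop() only wraps `columns` in an Index when it is not None
try:
    pd.Index(None)
    raised = False
except TypeError:
    raised = True
```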
@@ -1326,7 +1331,7 @@ def to_csv(
compression="infer",
quoting=None,
quotechar='"',
line_terminator=None,
Collaborator: Why are we removing this?

Contributor Author: I restored it as `lineterminator` to align with recent pandas updates, but it's still not actively used elsewhere in the code.

Collaborator: Can we keep it as is — `line_terminator`, not `lineterminator`?

lineterminator=None,
chunksize=None,
tupleize_cols=None,
date_format=None,
@@ -1355,7 +1360,7 @@ def to_csv(
"compression": compression,
"quoting": quoting,
"quotechar": quotechar,
"line_terminator": line_terminator,
"lineterminator": lineterminator,
"chunksize": chunksize,
"date_format": date_format,
"doublequote": doublequote,
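For context on the rename discussed above: pandas deprecated `to_csv`'s `line_terminator` alias in 1.5 and removed it in 2.0, so keeping the old spelling is not an option on 2.0.3. A minimal illustration:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2]})
# pandas >= 2.0 accepts only `lineterminator`; the old
# `line_terminator` keyword raises TypeError there
csv_text = df.to_csv(lineterminator="\n", index=False)
# csv_text == "a\n1\n2\n"
```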
7 changes: 4 additions & 3 deletions opensearch_py_ml/groupby.py
@@ -26,6 +26,7 @@
from typing import TYPE_CHECKING, List, Optional, Union

from opensearch_py_ml.query_compiler import QueryCompiler
from opensearch_py_ml.utils import MEAN_ABSOLUTE_DEVIATION, STANDARD_DEVIATION, VARIANCE

if TYPE_CHECKING:
import pandas as pd # type: ignore
@@ -153,7 +154,7 @@ def var(self, numeric_only: bool = True) -> "pd.DataFrame":
"""
return self._query_compiler.aggs_groupby(
by=self._by,
pd_aggs=["var"],
pd_aggs=[VARIANCE],
dropna=self._dropna,
numeric_only=numeric_only,
)
@@ -206,7 +207,7 @@ def std(self, numeric_only: bool = True) -> "pd.DataFrame":
"""
return self._query_compiler.aggs_groupby(
by=self._by,
pd_aggs=["std"],
pd_aggs=[STANDARD_DEVIATION],
dropna=self._dropna,
numeric_only=numeric_only,
)
@@ -259,7 +260,7 @@ def mad(self, numeric_only: bool = True) -> "pd.DataFrame":
"""
return self._query_compiler.aggs_groupby(
by=self._by,
pd_aggs=["mad"],
pd_aggs=[MEAN_ABSOLUTE_DEVIATION],
dropna=self._dropna,
numeric_only=numeric_only,
)
40 changes: 31 additions & 9 deletions opensearch_py_ml/operations.py
@@ -65,6 +65,7 @@
SizeTask,
TailTask,
)
from opensearch_py_ml.utils import MEAN_ABSOLUTE_DEVIATION, STANDARD_DEVIATION, VARIANCE

if TYPE_CHECKING:
from numpy.typing import DTypeLike
@@ -475,7 +476,7 @@ def _terms_aggs(
except IndexError:
name = None

return build_pd_series(results, name=name)
return build_pd_series(results, index_name=name, name="count")

Reviewer: How come it's using "count" here, but before it was using `name`?

Contributor Author: In pandas 2.0.3, a change in the `value_counts` method resulted in the following behavior: the method now uses "count" as the name for the values column, while the original column name (e.g., "Carrier") is used for the index name. This differs from earlier versions, where the values column inherited the name of the original column. (Screenshots `pd_153` and `pd_203` showed the output under pandas 1.5.3 and 2.0.3.)

Reviewer: Hey @dhrubo-os, what are your thoughts? Thanks for the info, @Yerzhaisang.

Collaborator: @Yerzhaisang could you please share any documentation about the changed behavior of the `value_counts` method from version 1.5.3 to 2.0.3? I think that between pandas 1.5.3 and 2.0.3 the `value_counts()` method has remained consistent in functionality. Please let me know if you think otherwise.

Contributor Author (@Yerzhaisang, Nov 5, 2024): In pandas 1.5.3, the series name is used for the result series. In pandas 2.0.3, the series name is used for the index name, while the result series is named "count".
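The behavior change described in this thread can be checked directly with plain pandas (illustrative values, any column works):

```python
import pandas as pd

s = pd.Series(["a", "a", "b"], name="Carrier")
vc = s.value_counts()
# pandas >= 2.0: the original series name becomes the index name,
# and the result series itself is named "count"
```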


def _hist_aggs(
self, query_compiler: "QueryCompiler", num_bins: int
@@ -620,7 +621,7 @@ def _unpack_metric_aggs(
values.append(field.nan_value)
# Explicit condition for mad to add NaN because it doesn't support bool
elif is_dataframe_agg and numeric_only:
if pd_agg == "mad":
if pd_agg == MEAN_ABSOLUTE_DEVIATION:
values.append(field.nan_value)
continue

@@ -1097,7 +1098,7 @@ def _map_pd_aggs_to_os_aggs(
"""
# pd aggs that will be mapped to os aggs
# that can use 'extended_stats'.
extended_stats_pd_aggs = {"mean", "min", "max", "sum", "var", "std"}
extended_stats_pd_aggs = {
"mean",
"min",
"max",
"sum",
VARIANCE,
STANDARD_DEVIATION,
}
extended_stats_os_aggs = {"avg", "min", "max", "sum"}
extended_stats_calls = 0

@@ -1117,15 +1125,15 @@
os_aggs.append("avg")
elif pd_agg == "sum":
os_aggs.append("sum")
elif pd_agg == "std":
elif pd_agg == STANDARD_DEVIATION:
os_aggs.append(("extended_stats", "std_deviation"))
elif pd_agg == "var":
elif pd_agg == VARIANCE:
os_aggs.append(("extended_stats", "variance"))

# Aggs that aren't 'extended_stats' compatible
elif pd_agg == "nunique":
os_aggs.append("cardinality")
elif pd_agg == "mad":
elif pd_agg == MEAN_ABSOLUTE_DEVIATION:
os_aggs.append("median_absolute_deviation")
elif pd_agg == "median":
os_aggs.append(("percentiles", (50.0,)))
@@ -1205,7 +1213,7 @@ def describe(self, query_compiler: "QueryCompiler") -> pd.DataFrame:

df1 = self.aggs(
query_compiler=query_compiler,
pd_aggs=["count", "mean", "std", "min", "max"],
pd_aggs=["count", "mean", "min", "max", STANDARD_DEVIATION],
numeric_only=True,
)
df2 = self.quantile(
@@ -1219,8 +1227,22 @@
# Convert [.25,.5,.75] to ["25%", "50%", "75%"]
df2 = df2.set_index([["25%", "50%", "75%"]])

return pd.concat([df1, df2]).reindex(
["count", "mean", "std", "min", "25%", "50%", "75%", "max"]
df = pd.concat([df1, df2])

# Note: In recent pandas versions, `describe()` returns a different index order
# for one-column DataFrames compared to multi-column DataFrames.
# We adjust the order manually to ensure consistency.
if df.shape[1] == 1:
# For single-column DataFrames, `describe()` typically outputs:
# ["count", "mean", "std", "min", "25%", "50%", "75%", "max"]
return df.reindex(
["count", "mean", STANDARD_DEVIATION, "min", "25%", "50%", "75%", "max"]
)

# For multi-column DataFrames, `describe()` typically outputs:
# ["count", "mean", "min", "25%", "50%", "75%", "max", "std"]
return df.reindex(
["count", "mean", "min", "25%", "50%", "75%", "max", STANDARD_DEVIATION]
)

def to_pandas(
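The `reindex` calls above pin a canonical row order for the combined frame; a sketch of the same idea on plain pandas (hypothetical column name):

```python
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0]})
desc = df.describe()
# Force one fixed row order regardless of how describe() emitted it
order = ["count", "mean", "std", "min", "25%", "50%", "75%", "max"]
ordered = desc.reindex(order)
```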
7 changes: 4 additions & 3 deletions opensearch_py_ml/query_compiler.py
@@ -45,6 +45,7 @@
from opensearch_py_ml.filter import BooleanFilter, QueryFilter
from opensearch_py_ml.index import Index
from opensearch_py_ml.operations import Operations
from opensearch_py_ml.utils import MEAN_ABSOLUTE_DEVIATION, STANDARD_DEVIATION, VARIANCE

if TYPE_CHECKING:
from opensearchpy import OpenSearch
@@ -587,17 +588,17 @@ def mean(self, numeric_only: Optional[bool] = None) -> pd.Series:

def var(self, numeric_only: Optional[bool] = None) -> pd.Series:
return self._operations._metric_agg_series(
self, ["var"], numeric_only=numeric_only
self, [VARIANCE], numeric_only=numeric_only
)

def std(self, numeric_only: Optional[bool] = None) -> pd.Series:
return self._operations._metric_agg_series(
self, ["std"], numeric_only=numeric_only
self, [STANDARD_DEVIATION], numeric_only=numeric_only
)

def mad(self, numeric_only: Optional[bool] = None) -> pd.Series:
return self._operations._metric_agg_series(
self, ["mad"], numeric_only=numeric_only
self, [MEAN_ABSOLUTE_DEVIATION], numeric_only=numeric_only
)

def median(self, numeric_only: Optional[bool] = None) -> pd.Series:
3 changes: 2 additions & 1 deletion opensearch_py_ml/series.py
@@ -312,11 +312,12 @@ def value_counts(self, os_size: int = 10) -> pd.Series:

>>> df = oml.DataFrame(OPENSEARCH_TEST_CLIENT, 'flights')
>>> df['Carrier'].value_counts()
Carrier
Logstash Airways 3331
JetBeats 3274
Kibana Airlines 3234
ES-Air 3220
Name: Carrier, dtype: int64
Name: count, dtype: int64
Collaborator: This is what we don't want, right? `Carrier` is the column name, and we are changing it to `count` — that's not right.

Contributor Author: Hey @dhrubo, this is actually correct. In pandas 2.0.3, `Carrier` is set as the index name, and `count` is the column name.

"""
if not isinstance(os_size, int):
raise TypeError("os_size must be a positive integer.")
48 changes: 48 additions & 0 deletions opensearch_py_ml/utils.py
@@ -30,9 +30,14 @@
from typing import Any, Callable, Collection, Iterable, List, TypeVar, Union, cast

import pandas as pd # type: ignore
from pandas.core.dtypes.common import is_list_like # type: ignore

RT = TypeVar("RT")

MEAN_ABSOLUTE_DEVIATION = "mad"
VARIANCE = "var"
STANDARD_DEVIATION = "std"


def deprecated_api(
replace_with: str,
@@ -61,6 +66,29 @@ def is_valid_attr_name(s: str) -> bool:
)


def to_list_if_needed(value):
"""
Converts the input to a list if necessary.

If the input is a pandas Index, it converts it to a list.
If the input is not list-like (e.g., a single value), it wraps it in a list.
If the input is None or already list-like, it returns it as is.

Parameters:
value: The input to potentially convert to a list.

Returns:
The input converted to a list if needed, or the original input if no conversion is necessary.
"""
if value is None:
return None
if isinstance(value, pd.Index):
return value.tolist()
if not is_list_like(value):
return [value]
return value
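A quick sanity check of the helper above (self-contained copy; the diff imports `is_list_like` from `pandas.core.dtypes.common`, while this sketch uses the public `pandas.api.types` mirror):

```python
import pandas as pd
from pandas.api.types import is_list_like

def to_list_if_needed(value):
    # None passes through, pd.Index flattens to a list, scalars get wrapped
    if value is None:
        return None
    if isinstance(value, pd.Index):
        return value.tolist()
    if not is_list_like(value):
        return [value]
    return value
```

Note that pandas' `is_list_like` treats strings as scalars, so a bare column name like `"Carrier"` gets wrapped in a list.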


def to_list(x: Union[Collection[Any], pd.Series]) -> List[Any]:
if isinstance(x, ABCCollection):
return list(x)
@@ -77,3 +105,23 @@ def try_sort(iterable: Iterable[str]) -> Iterable[str]:
return sorted(listed)
except TypeError:
return listed


class CustomFunctionDispatcher:
# Define custom functions in a dictionary
customFunctionMap = {
MEAN_ABSOLUTE_DEVIATION: lambda x: (x - x.median()).abs().mean(),
}

@classmethod
def apply_custom_function(cls, func, data):
"""
Apply a custom function if available, else return None.
:param func: Function name as a string
:param data: Data on which function is applied
:return: Result of custom function or None if func not found
"""
custom_func = cls.customFunctionMap.get(func)
if custom_func:
return custom_func(data)
return None
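pandas deprecated `Series.mad()` in 1.5 and removed it in 2.0, which is why the dispatcher above carries its own "mad" entry. Note the lambda averages absolute deviations from the *median* (the removed pandas method measured from the mean); a worked check of the dispatcher's formula:

```python
import pandas as pd

# Mirrors the dispatcher's "mad" lambda: mean absolute deviation
# measured from the median
mad = lambda x: (x - x.median()).abs().mean()

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
result = mad(s)  # median 3.0; |deviations| = 2, 1, 0, 1, 2 -> mean 1.2
```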
2 changes: 1 addition & 1 deletion requirements-dev.txt
@@ -1,7 +1,7 @@
#
# Basic requirements
#
pandas>=1.5.2,<2
pandas==2.0.3
matplotlib>=3.6.2,<4
numpy>=1.24.0,<2
opensearch-py>=2.2.0
2 changes: 1 addition & 1 deletion requirements.txt
@@ -1,7 +1,7 @@
#
# Basic requirements
#
pandas>=1.5.2,<2
pandas==2.0.3
matplotlib>=3.6.2,<4
numpy>=1.24.0,<2
opensearch-py>=2.2.0
2 changes: 1 addition & 1 deletion setup.py
@@ -84,7 +84,7 @@
},
install_requires=[
"opensearch-py>=2",
"pandas>=1.5,<3",
"pandas==2.0.3",
"matplotlib>=3.6.0,<4",
"numpy>=1.24.0,<2",
"deprecated>=1.2.14,<2",