-
Notifications
You must be signed in to change notification settings - Fork 64
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bump pandas from 1.5.3 to 2.0.3 #422
base: main
Are you sure you want to change the base?
Changes from all commits
fb4f0d6
06076d9
3ef1e2c
da14ee5
c68b7f8
0084cce
12bacb7
77ebf1c
45e793a
c4455e0
7135507
d2203be
c55639d
b1f2319
77c56a7
25905b6
feeb0b5
1162608
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,5 +1,5 @@ | ||
opensearch-py>=2 | ||
pandas>=1.5,<3 | ||
pandas==2.0.3 | ||
matplotlib>=3.6.0,<4 | ||
nbval | ||
sphinx | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -55,14 +55,33 @@ | |
|
||
|
||
def build_pd_series( | ||
data: Dict[str, Any], dtype: Optional["DTypeLike"] = None, **kwargs: Any | ||
data: Dict[str, Any], | ||
dtype: Optional["DTypeLike"] = None, | ||
index_name: Optional[str] = None, | ||
**kwargs: Any, | ||
) -> pd.Series: | ||
"""Builds a pd.Series while squelching the warning | ||
for unspecified dtype on empty series | ||
""" | ||
Builds a pandas Series from a dictionary, optionally setting an index name. | ||
|
||
Parameters: | ||
data : Dict[str, Any] | ||
The data to build the Series from, with keys as the index. | ||
dtype : Optional[DTypeLike] | ||
The desired data type of the Series. If not specified, uses EMPTY_SERIES_DTYPE if data is empty. | ||
index_name : Optional[str] | ||
Name to assign to the Series index, similar to `index_name` in `value_counts`. | ||
|
||
Returns: | ||
pd.Series | ||
A pandas Series constructed from the given data, with the specified dtype and index name. | ||
""" | ||
|
||
dtype = dtype or (EMPTY_SERIES_DTYPE if not data else dtype) | ||
if dtype is not None: | ||
kwargs["dtype"] = dtype | ||
if index_name is not None: | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can we add a comment why do we need this? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. done |
||
index = pd.Index(data.keys(), name=index_name) | ||
kwargs["index"] = index | ||
return pd.Series(data, **kwargs) | ||
|
||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -47,7 +47,7 @@ | |
from opensearch_py_ml.groupby import DataFrameGroupBy | ||
from opensearch_py_ml.ndframe import NDFrame | ||
from opensearch_py_ml.series import Series | ||
from opensearch_py_ml.utils import is_valid_attr_name | ||
from opensearch_py_ml.utils import is_valid_attr_name, to_list_if_needed | ||
|
||
if TYPE_CHECKING: | ||
from opensearchpy import OpenSearch | ||
|
@@ -424,9 +424,14 @@ def drop( | |
axis = pd.DataFrame._get_axis_name(axis) | ||
axes = {axis: labels} | ||
elif index is not None or columns is not None: | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Kind of confused here the parent branch is checking that if one of them is not None but inside its checking again There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Fixed |
||
axes, _ = pd.DataFrame()._construct_axes_from_arguments( | ||
(index, columns), {} | ||
) | ||
axes = { | ||
"index": to_list_if_needed(index), | ||
"columns": ( | ||
pd.Index(to_list_if_needed(columns)) | ||
if columns is not None | ||
else None | ||
Comment on lines
+430
to
+432
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I noticed that in the implementation of opensearch_py_ml.utils that when the value is None that you would return None already maybe we dont need a ternary operation here since its already doing that? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Hey Brian, We can’t remove the ternary operation at this point because calling |
||
), | ||
} | ||
else: | ||
raise ValueError( | ||
"Need to specify at least one of 'labels', 'index' or 'columns'" | ||
|
@@ -440,7 +445,7 @@ def drop( | |
axes["index"] = [axes["index"]] | ||
if errors == "raise": | ||
# Check if axes['index'] values exists in index | ||
count = self._query_compiler._index_matches_count(axes["index"]) | ||
count = self._query_compiler._index_matches_count(list(axes["index"])) | ||
if count != len(axes["index"]): | ||
raise ValueError( | ||
f"number of labels {count}!={len(axes['index'])} not contained in axis" | ||
|
@@ -1326,7 +1331,7 @@ def to_csv( | |
compression="infer", | ||
quoting=None, | ||
quotechar='"', | ||
line_terminator=None, | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. why are we removing this? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I restored it as lineterminator to align with recent pandas updates, but it’s still not actively used elsewhere in the code. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. can we keep as it is: |
||
lineterminator=None, | ||
chunksize=None, | ||
tupleize_cols=None, | ||
date_format=None, | ||
|
@@ -1355,7 +1360,7 @@ def to_csv( | |
"compression": compression, | ||
"quoting": quoting, | ||
"quotechar": quotechar, | ||
"line_terminator": line_terminator, | ||
"lineterminator": lineterminator, | ||
"chunksize": chunksize, | ||
"date_format": date_format, | ||
"doublequote": doublequote, | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -65,6 +65,7 @@ | |
SizeTask, | ||
TailTask, | ||
) | ||
from opensearch_py_ml.utils import MEAN_ABSOLUTE_DEVIATION, STANDARD_DEVIATION, VARIANCE | ||
|
||
if TYPE_CHECKING: | ||
from numpy.typing import DTypeLike | ||
|
@@ -475,7 +476,7 @@ def _terms_aggs( | |
except IndexError: | ||
name = None | ||
|
||
return build_pd_series(results, name=name) | ||
return build_pd_series(results, index_name=name, name="count") | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. How come its using "count" here but before it was using name? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. In pandas 2.0.3, a change in the value_counts method resulted in the following behavior: The method now uses "count" as the name for the values column, while the original column name (e.g., "Carrier") is used for the index name. This differs from earlier versions, where the values column would inherit the name of the original column. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Hey @dhrubo-os what are your thoughts? Thanks for the info @Yerzhaisang There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @Yerzhaisang could you please share any documentation about the changing behavior of There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. In pandas 1.5.3, the series name is used for the result series. In pandas 2.0.3, the series name is used for the index name, while the result series name is set to this one. |
||
|
||
def _hist_aggs( | ||
self, query_compiler: "QueryCompiler", num_bins: int | ||
|
@@ -620,7 +621,7 @@ def _unpack_metric_aggs( | |
values.append(field.nan_value) | ||
# Explicit condition for mad to add NaN because it doesn't support bool | ||
elif is_dataframe_agg and numeric_only: | ||
if pd_agg == "mad": | ||
if pd_agg == MEAN_ABSOLUTE_DEVIATION: | ||
values.append(field.nan_value) | ||
continue | ||
|
||
|
@@ -1097,7 +1098,14 @@ def _map_pd_aggs_to_os_aggs( | |
""" | ||
# pd aggs that will be mapped to os aggs | ||
# that can use 'extended_stats'. | ||
extended_stats_pd_aggs = {"mean", "min", "max", "sum", "var", "std"} | ||
extended_stats_pd_aggs = { | ||
"mean", | ||
"min", | ||
"max", | ||
"sum", | ||
VARIANCE, | ||
STANDARD_DEVIATION, | ||
} | ||
extended_stats_os_aggs = {"avg", "min", "max", "sum"} | ||
extended_stats_calls = 0 | ||
|
||
|
@@ -1117,15 +1125,15 @@ def _map_pd_aggs_to_os_aggs( | |
os_aggs.append("avg") | ||
elif pd_agg == "sum": | ||
os_aggs.append("sum") | ||
elif pd_agg == "std": | ||
elif pd_agg == STANDARD_DEVIATION: | ||
os_aggs.append(("extended_stats", "std_deviation")) | ||
elif pd_agg == "var": | ||
elif pd_agg == VARIANCE: | ||
os_aggs.append(("extended_stats", "variance")) | ||
|
||
# Aggs that aren't 'extended_stats' compatible | ||
elif pd_agg == "nunique": | ||
os_aggs.append("cardinality") | ||
elif pd_agg == "mad": | ||
elif pd_agg == MEAN_ABSOLUTE_DEVIATION: | ||
os_aggs.append("median_absolute_deviation") | ||
elif pd_agg == "median": | ||
os_aggs.append(("percentiles", (50.0,))) | ||
|
@@ -1205,7 +1213,7 @@ def describe(self, query_compiler: "QueryCompiler") -> pd.DataFrame: | |
|
||
df1 = self.aggs( | ||
query_compiler=query_compiler, | ||
pd_aggs=["count", "mean", "std", "min", "max"], | ||
pd_aggs=["count", "mean", "min", "max", STANDARD_DEVIATION], | ||
numeric_only=True, | ||
) | ||
df2 = self.quantile( | ||
|
@@ -1219,8 +1227,22 @@ def describe(self, query_compiler: "QueryCompiler") -> pd.DataFrame: | |
# Convert [.25,.5,.75] to ["25%", "50%", "75%"] | ||
df2 = df2.set_index([["25%", "50%", "75%"]]) | ||
|
||
return pd.concat([df1, df2]).reindex( | ||
["count", "mean", "std", "min", "25%", "50%", "75%", "max"] | ||
df = pd.concat([df1, df2]) | ||
|
||
# Note: In recent pandas versions, `describe()` returns a different index order | ||
# for one-column DataFrames compared to multi-column DataFrames. | ||
# We adjust the order manually to ensure consistency. | ||
if df.shape[1] == 1: | ||
# For single-column DataFrames, `describe()` typically outputs: | ||
# ["count", "mean", "std", "min", "25%", "50%", "75%", "max"] | ||
return df.reindex( | ||
["count", "mean", STANDARD_DEVIATION, "min", "25%", "50%", "75%", "max"] | ||
) | ||
|
||
# For multi-column DataFrames, `describe()` typically outputs: | ||
# ["count", "mean", "min", "25%", "50%", "75%", "max", "std"] | ||
return df.reindex( | ||
["count", "mean", "min", "25%", "50%", "75%", "max", STANDARD_DEVIATION] | ||
) | ||
|
||
def to_pandas( | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -312,11 +312,12 @@ def value_counts(self, os_size: int = 10) -> pd.Series: | |
|
||
>>> df = oml.DataFrame(OPENSEARCH_TEST_CLIENT, 'flights') | ||
>>> df['Carrier'].value_counts() | ||
Carrier | ||
Logstash Airways 3331 | ||
JetBeats 3274 | ||
Kibana Airlines 3234 | ||
ES-Air 3220 | ||
Name: Carrier, dtype: int64 | ||
Name: count, dtype: int64 | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is what we don't want, right? Carrier is the column name which we are changing it to There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Hey @dhrubo, this is actually correct. In pandas 2.0.3, Carrier is set as the index name, and count is the column name. |
||
""" | ||
if not isinstance(os_size, int): | ||
raise TypeError("os_size must be a positive integer.") | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we upgrade to a more latest version? any reason specifically for 2.0.3?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bumping to a later version introduces datatype issues, including an ImportError like this:
ImportError: cannot import name 'is_datetime_or_timedelta_dtype' from 'pandas.core.dtypes.common'
Given that the issue was only to upgrade to a 2.x version, I thought 2.0.3 would be sufficient.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, let's focus on bumping to
2.0.3
for now and then we can create another issue to upgrade more if needed.