Bump pandas from 1.5.3 to 2.0.3 #422

Yerzhaisang · 2024-10-29T19:42:54Z

Description

Replaced built-in mad, _construct_axes_from_arguments, quantile, and min methods with custom implementations to ensure consistent behavior.
Removed deprecated arguments: check_less_precise and line_terminator.
Updated tests for the describe method to align with current functionality.
Adapted the value_counts method to match changes in the new pandas version, where the column name is now required for index calls.
Adjusted for differing index order behavior in pandas, accommodating both single-column and multi-column DataFrames within the OpenSearch implementation.

Issues Resolved

Check List

New functionality includes testing.
- All tests pass
New functionality has been documented.
Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Yerzhaisang Taskali <[email protected]>

pyek-bot · 2024-10-31T19:31:37Z

CHANGELOG.md

@@ -46,6 +46,7 @@ Inspired from [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
 - updating listing file with three v2 sparse model - by @dhrubo-os ([#412](https://github.com/opensearch-project/opensearch-py-ml/pull/412))
 - Update model upload history -  opensearch-project/opensearch-neural-sparse-encoding-doc-v2-mini (v.1.0.0)(TORCH_SCRIPT) by @dhrubo-os ([#417](https://github.com/opensearch-project/opensearch-py-ml/pull/417))
 - Update model upload history -  opensearch-project/opensearch-neural-sparse-encoding-v2-distill (v.1.0.0)(TORCH_SCRIPT) by @dhrubo-os ([#419](https://github.com/opensearch-project/opensearch-py-ml/pull/419))
+- Bump pandas from 1.5.3 to 2.0.3 by @yerzhaisang ([#422](https://github.com/opensearch-project/opensearch-py-ml/pull/422))


pyek-bot · 2024-10-31T19:34:47Z

docs/requirements-docs.txt

@@ -1,5 +1,5 @@
 opensearch-py>=2
-pandas>=1.5,<3
+pandas==2.0.3


Can we upgrade to a more recent version? Is there a specific reason for 2.0.3?

Bumping to a later version introduces datatype issues, including an ImportError like this:
ImportError: cannot import name 'is_datetime_or_timedelta_dtype' from 'pandas.core.dtypes.common'
Given that the issue was only to upgrade to a 2.x version, I thought 2.0.3 would be sufficient.
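For a later bump, one option would be a guarded import with a fallback built from the public pandas.api.types helpers. This is only a sketch under that assumption, not what this PR does, and the fallback function merely approximates the private helper named in the ImportError.

```python
import pandas as pd

try:
    # Private helper; importable in pandas 2.0.3, but missing in the later
    # release that produced the ImportError quoted above.
    from pandas.core.dtypes.common import is_datetime_or_timedelta_dtype
except ImportError:
    from pandas.api.types import is_datetime64_dtype, is_timedelta64_dtype

    def is_datetime_or_timedelta_dtype(arr_or_dtype) -> bool:
        """Assumed stand-in: True for datetime64 or timedelta64 dtypes."""
        return is_datetime64_dtype(arr_or_dtype) or is_timedelta64_dtype(arr_or_dtype)


# Small illustrative inputs (not taken from the project's test data).
timestamps = pd.Series(pd.to_datetime(["2024-10-29"]))
durations = pd.Series(pd.to_timedelta(["1 days"]))
numbers = pd.Series([1, 2, 3])
```

With a shim like this, the call sites would not need to change when the private helper disappears.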

Sure, let's focus on bumping to 2.0.3 for now and then we can create another issue to upgrade more if needed.

pyek-bot · 2024-10-31T19:41:12Z

opensearch_py_ml/common.py

 ) -> pd.Series:
    """Builds a pd.Series while squelching the warning
    for unspecified dtype on empty series
    """
    dtype = dtype or (EMPTY_SERIES_DTYPE if not data else dtype)
    if dtype is not None:
        kwargs["dtype"] = dtype
+    if index_name:


Let's keep an explicit check for None: if index_name is not None:

What happens if the index is not found?

If we pass an index name, the result shows the name of the column whose values are being counted, which is how it works in pandas 2.0.3.

If the index is not found, we don’t see the column name, and an assertion error occurs when we compare the built-in value_counts() method with the one in pandas.

Let's keep an explicit check for None: if index_name is not None:

Done
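To illustrate why the explicit None check matters, here is a minimal sketch. The helper build_series is hypothetical and much simpler than the project's build_pd_series; it only shows that a falsy but valid index name such as 0 survives an "is not None" check, whereas a plain truthiness check would drop it.

```python
import pandas as pd


def build_series(data, index_name=None, **kwargs):
    """Hypothetical helper: build a Series and optionally name its index."""
    series = pd.Series(data, **kwargs)
    # "is not None" keeps falsy-but-valid names such as 0 or "".
    if index_name is not None:
        series.index.name = index_name
    return series


counts = build_series(
    {"ES-Air": 3220, "JetBeats": 3274}, index_name="Carrier", name="count"
)
zero_named = build_series({"a": 1}, index_name=0, name="count")
unnamed = build_series({"a": 1}, name="count")
```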

pyek-bot · 2024-10-31T19:43:31Z

opensearch_py_ml/dataframe.py

+            # axes, _ = pd.DataFrame()._construct_axes_from_arguments(
+            #     (index, columns), {}
+            # )


These comments can be removed if they are no longer used.

pyek-bot · 2024-10-31T19:44:36Z

opensearch_py_ml/dataframe.py

+            if index is not None:
+                if isinstance(index, pd.Index):
+                    index = index.tolist()  # Convert Index to list
+                elif not is_list_like(index):
+                    index = [index]  # Convert to list if it's not list-like already
+                axes["index"] = index
+            else:
+                axes["index"] = None
+
+            if columns is not None:
+                if isinstance(columns, pd.Index):
+                    columns = columns.tolist()  # Convert Index to list
+                elif not is_list_like(columns):
+                    columns = [columns]  # Convert to list if it's not list-like already
+                axes["columns"] = columns
+            else:
+                axes["columns"] = None


Can this be wrapped in a method? Then we could do something like this:

axes = { "index": to_list_if_needed(index), "columns": pd.Index(to_list_if_needed(columns)) if columns is not None else None }
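A possible shape for such a helper, as a sketch only: the body below is an assumption based on the branches shown in the diff (None passes through, pd.Index becomes a list, scalars are wrapped), and may differ from what ends up in the project.

```python
import pandas as pd
from pandas.api.types import is_list_like


def to_list_if_needed(value):
    """Assumed behavior: normalize an index/columns argument to a list."""
    if value is None:
        return None
    if isinstance(value, pd.Index):
        return value.tolist()  # Convert Index to list
    if not is_list_like(value):
        return [value]  # Wrap a scalar label such as "Carrier"
    return value


# Example arguments, standing in for the drop() parameters.
index, columns = None, "Carrier"
axes = {
    "index": to_list_if_needed(index),
    "columns": pd.Index(to_list_if_needed(columns)) if columns is not None else None,
}
```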

pyek-bot · 2024-10-31T19:44:48Z

opensearch_py_ml/dataframe.py

+
+            if columns is not None:
+                if isinstance(columns, pd.Index):
+                    columns = columns.tolist()  # Convert Index to list


columns to list

pyek-bot · 2024-10-31T19:45:08Z

opensearch_py_ml/dataframe.py

+                if not is_list_like(columns):
+                    columns = [columns]


Repeated logic? This is handled in lines 443 and 444, right?

brianf-aws · 2024-10-31T20:24:08Z

opensearch_py_ml/dataframe.py

@@ -424,9 +424,36 @@ def drop(
            axis = pd.DataFrame._get_axis_name(axis)
            axes = {axis: labels}
        elif index is not None or columns is not None:


I'm a bit confused here: the parent branch already checks that at least one of them is not None, but the code inside checks again (lines 431 and 440).
Maybe this could be simplified as @pyek-bot suggested, by creating a wrapper that converts to a list if needed.

brianf-aws

Hey, I left some comments. I noticed you repeatedly reimplement the mad function. I would consider this a friction point: if someone new decides to use mad, they have to apply the lambda formula again, which is error-prone. I suggested extracting the repeated formula into a utility class.

What about the built-in mad function causes inconsistent behavior? Is there an example you can provide?

I'm also a bit concerned that you chose to pin strictly to 2.0.3, which forces people to code against only that version. Can we refactor so that a future bump does not require many code changes?

brianf-aws · 2024-10-31T20:28:18Z

opensearch_py_ml/operations.py

@@ -475,7 +475,7 @@ def _terms_aggs(
        except IndexError:
            name = None

-        return build_pd_series(results, name=name)
+        return build_pd_series(results, index_name=name, name="count")


How come it's using "count" here, when before it was using name?

In pandas 2.0.3, a change in the value_counts method resulted in the following behavior:

The method now uses "count" as the name for the values column, while the original column name (e.g., "Carrier") is used for the index name. This differs from earlier versions, where the values column would inherit the name of the original column.

Hey @dhrubo-os what are your thoughts? Thanks for the info @Yerzhaisang

@Yerzhaisang could you please share any documentation about the change in behavior of the value_counts method from version 1.5.3 to 2.0.3? I think the value_counts() method has remained consistent in functionality between pandas 1.5.3 and 2.0.3. Please let me know if you think otherwise.

In pandas 1.5.3, the series name is used for the result series.

In pandas 2.0.3, the series name is used for the index name, while the result series name is set to "count".
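The naming difference described in this thread can be checked with a few lines. The data below is made up for illustration; the expected names are chosen from the installed pandas major version so the snippet holds on both 1.x and 2.x.

```python
import pandas as pd

carriers = pd.Series(["ES-Air", "JetBeats", "ES-Air"], name="Carrier")
counts = carriers.value_counts()

# (result series name, index name) as reported by the installed pandas.
observed = (counts.name, counts.index.name)

pandas_major = int(pd.__version__.split(".")[0])
if pandas_major >= 2:
    expected = ("count", "Carrier")  # 2.x: values named "count", index named after the column
else:
    expected = ("Carrier", None)  # 1.x: values inherit the column name

print(observed)
```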

brianf-aws · 2024-10-31T20:36:11Z

tests/dataframe/test_groupby_pytest.py

@@ -106,10 +106,18 @@ def test_groupby_aggs_mad_var_std(self, pd_agg, dropna):
        pd_flights = self.pd_flights().filter(self.filter_data)
        oml_flights = self.oml_flights().filter(self.filter_data)

-        pd_groupby = getattr(pd_flights.groupby("Cancelled", dropna=dropna), pd_agg)()
+        if pd_agg == "mad":


Is it possible to extract strings like "mad", "var", and "std" into constants? That could reduce friction for newcomers. For example, MEAN_ABSOLUTE_DEVIATION = "mad". Not sure whether that would take a lot of effort, but it's something to think about.

brianf-aws · 2024-10-31T20:38:04Z

tests/dataframe/test_groupby_pytest.py

-        pd_groupby = getattr(pd_flights.groupby("Cancelled", dropna=dropna), pd_agg)()
+        if pd_agg == "mad":
+            pd_groupby = pd_flights.groupby("Cancelled", dropna=dropna).agg(
+                lambda x: (x - x.mean()).abs().mean()


I see this lambda used in other places. Could we possibly have a util class that dispatches these common function names?
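As a rough sketch of the idea, the lambda from the diff could live in one named function and be reused for both grouped and ungrouped data. The function name and the sample frame are assumptions for illustration; the formula is the one from the diff above.

```python
import pandas as pd


def mean_absolute_deviation(values):
    """Mean absolute deviation around the mean, same formula as the lambda in the diff."""
    return (values - values.mean()).abs().mean()


# Made-up sample data; column names borrowed from the flights tests.
flights = pd.DataFrame(
    {
        "DestCountry": ["US", "US", "IT", "IT"],
        "AvgTicketPrice": [1.0, 3.0, 10.0, 20.0],
    }
)

# Grouped: one value per DestCountry and column.
per_country = flights.groupby("DestCountry").agg(mean_absolute_deviation)

# Ungrouped: a single value for the whole column.
overall = mean_absolute_deviation(flights["AvgTicketPrice"])
```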

brianf-aws · 2024-10-31T20:47:51Z

tests/dataframe/test_metrics_pytest.py

-            pd_metric = getattr(pd_flights, func)(
-                **({"numeric_only": True} if func != "mad" else {})
-            )
+            if func == "mad":


I noticed you introduced branching. What if someday we need to reimplement another function from scratch? Instead of creating a new branch every time, can we extract this behavior? Perhaps have a utility class with our own custom implementations, like

customFunctionMap = {"mad": lambda x: (x - x.mean()).abs().mean()}

and then dispatch on it instead of branching for every method we need to reimplement, something like

if func in customFunctionMap:
    result = customFunctionMap[func](data)  # apply the custom function
else:
    result = getattr(data, func)()  # use the already given functionality
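A minimal sketch of that dispatch pattern follows. The class and attribute names mirror the ones that appear later in this PR's diffs, but the bodies here are assumptions, and compute_metric is a hypothetical wrapper added only to show the fallback path.

```python
import pandas as pd


class CustomFunctionDispatcher:
    # Metrics pandas 2.x no longer provides, keyed by their old method name.
    customFunctionMap = {
        "mad": lambda x: (x - x.mean()).abs().mean(),
    }

    @classmethod
    def apply_custom_function(cls, func, data):
        """Apply the custom implementation for func, or return None if there is none."""
        custom = cls.customFunctionMap.get(func)
        return custom(data) if custom is not None else None


def compute_metric(data, func):
    """Hypothetical wrapper: prefer the custom map, otherwise call the pandas method."""
    if func in CustomFunctionDispatcher.customFunctionMap:
        return CustomFunctionDispatcher.apply_custom_function(func, data)
    return getattr(data, func)()


prices = pd.Series([1.0, 2.0, 3.0, 4.0])
```

Adding another reimplemented metric then means adding one dictionary entry rather than another if branch in each test.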


brianf-aws

I only had one piece of feedback you could implement. I think the code structure is good (I'm not a pandas expert, though).

brianf-aws · 2024-11-05T00:35:16Z

opensearch_py_ml/dataframe.py

+                    pd.Index(to_list_if_needed(columns))
+                    if columns is not None
+                    else None


I noticed that the implementation in opensearch_py_ml.utils already returns None when the value is None, so maybe we don't need a ternary operation here?

Hey Brian,

We can’t remove the ternary operation at this point because calling pd.Index(None) results in a TypeError. Additionally, we need to keep this check since it’s required here.
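The TypeError is easy to reproduce. The snippet below is a small sketch; columns_to_index is a hypothetical name used only to isolate the guarded conversion from the diff.

```python
import pandas as pd


def columns_to_index(columns):
    """Guarded conversion: pd.Index rejects None, so None is passed through."""
    return pd.Index(columns) if columns is not None else None


try:
    pd.Index(None)
    raised = False
except TypeError:
    # pandas refuses to build an Index from None.
    raised = True
```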

brianf-aws · 2024-11-05T00:38:09Z

tests/dataframe/test_groupby_pytest.py

+        if pd_agg in CustomFunctionDispatcher.customFunctionMap:
+            pd_groupby = pd_flights.groupby("Cancelled", dropna=dropna).agg(
+                lambda x: CustomFunctionDispatcher.apply_custom_function(pd_agg, x)



brianf-aws · 2024-11-05T00:45:23Z

opensearch_py_ml/operations.py

@@ -475,7 +475,7 @@ def _terms_aggs(
        except IndexError:
            name = None

-        return build_pd_series(results, name=name)
+        return build_pd_series(results, index_name=name, name="count")


Hey @dhrubo-os what are your thoughts? Thanks for the info @Yerzhaisang

dhrubo-os · 2024-11-05T01:24:41Z

docs/requirements-docs.txt

@@ -1,5 +1,5 @@
 opensearch-py>=2
-pandas>=1.5,<3
+pandas==2.0.3


Sure, let's focus on bumping to 2.0.3 for now and then we can create another issue to upgrade more if needed.

dhrubo-os · 2024-11-05T01:25:41Z

opensearch_py_ml/common.py

 ) -> pd.Series:
    """Builds a pd.Series while squelching the warning
    for unspecified dtype on empty series
    """
    dtype = dtype or (EMPTY_SERIES_DTYPE if not data else dtype)
    if dtype is not None:
        kwargs["dtype"] = dtype
+    if index_name is not None:


Can we add a comment explaining why we need this?

dhrubo-os · 2024-11-05T01:27:17Z

opensearch_py_ml/constants.py

+#  specific language governing permissions and limitations
+#  under the License.
+
+MEAN_ABSOLUTE_DEVIATION = "mad"


Rather than creating another file, can we use utils.py?

dhrubo-os · 2024-11-05T01:32:02Z

opensearch_py_ml/dataframe.py

@@ -1326,7 +1331,6 @@ def to_csv(
        compression="infer",
        quoting=None,
        quotechar='"',
-        line_terminator=None,


Why are we removing this?

I restored it as lineterminator to align with recent pandas updates, but it’s still not actively used elsewhere in the code.

Can we keep it as it is, line_terminator rather than lineterminator?
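One way to keep the old public name while still calling pandas 2.x could be to translate the keyword before forwarding it. The helper below is hypothetical and not part of this PR; it only renames the key and leaves everything else untouched.

```python
def normalize_to_csv_kwargs(kwargs):
    """Hypothetical shim: map the legacy line_terminator key to lineterminator."""
    normalized = dict(kwargs)  # Do not mutate the caller's dict
    if "line_terminator" in normalized:
        legacy_value = normalized.pop("line_terminator")
        # If the caller passed both spellings, the new one wins.
        normalized.setdefault("lineterminator", legacy_value)
    return normalized


# The result would then be forwarded to pandas' to_csv, which takes
# lineterminator in pandas 2.x.
forwarded = normalize_to_csv_kwargs({"sep": ";", "line_terminator": "\r\n"})
```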

dhrubo-os · 2024-11-05T01:38:45Z

opensearch_py_ml/operations.py

@@ -475,7 +475,7 @@ def _terms_aggs(
        except IndexError:
            name = None

-        return build_pd_series(results, name=name)
+        return build_pd_series(results, index_name=name, name="count")


@Yerzhaisang could you please share any documentation about the changing behavior of value_counts method from version 1.5.3 to 2.0.3. I think pandas versions 1.5.3 and 2.0.3, the value_counts() method has remained consistent in functionality. Please let me know if you think otherwise.

dhrubo-os · 2024-11-05T01:42:30Z

opensearch_py_ml/series.py

        Logstash Airways    3331
        JetBeats            3274
        Kibana Airlines     3234
        ES-Air              3220
-        Name: Carrier, dtype: int64
+        Name: count, dtype: int64


This is what we don't want, right? Carrier is the column name, and we are changing it to count, which doesn't seem right.

Hey @dhrubo, this is actually correct. In pandas 2.0.3, Carrier is set as the index name, and count is the column name.

dhrubo-os · 2024-11-05T01:44:57Z

tests/dataframe/test_describe_pytest.py

@@ -34,7 +34,7 @@ def test_flights_describe(self):
        pd_flights = self.pd_flights()
        oml_flights = self.oml_flights()

-        pd_describe = pd_flights.describe()
+        pd_describe = pd_flights.describe().drop(["timestamp"], axis=1)


Why are we removing this?

In recent pandas versions, the timestamp column is included in the output of the describe method. I could adapt our built-in method to match pandas, but I think this behavior is kind of a bug.
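As an alternative to dropping the column by name, the pandas side of the test could restrict describe to numeric columns. This is a sketch with made-up data, not the change made in this PR.

```python
import pandas as pd

# Made-up frame with one numeric and one datetime column.
flights = pd.DataFrame(
    {
        "AvgTicketPrice": [100.0, 250.0, 400.0],
        "timestamp": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"]),
    }
)

# Default describe: on pandas 2.x the datetime column is summarized too.
with_default = flights.describe()

# Numeric-only describe: the same columns regardless of pandas version.
numeric_only = flights.select_dtypes(include="number").describe()

pandas_major = int(pd.__version__.split(".")[0])
```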

dhrubo-os · 2024-11-05T01:46:14Z

tests/dataframe/test_groupby_pytest.py

    def test_groupby_aggs_mad_var_std(self, pd_agg, dropna):
        # For these aggs pandas doesn't support numeric_only
        pd_flights = self.pd_flights().filter(self.filter_data)
        oml_flights = self.oml_flights().filter(self.filter_data)

-        pd_groupby = getattr(pd_flights.groupby("Cancelled", dropna=dropna), pd_agg)()
+        if pd_agg in CustomFunctionDispatcher.customFunctionMap:


Why do we need this?

#422 (comment)

dhrubo-os · 2024-11-05T01:48:07Z

tests/dataframe/test_groupby_pytest.py

        oml_groupby = getattr(oml_flights.groupby("Cancelled", dropna=dropna), pd_agg)(
            numeric_only=True
        )
+        pd_groupby = pd_groupby[oml_groupby.columns]


Why do we need this? We shouldn't use any oml resource or information on the pd side, since the goal is to verify that pd functionality matches oml functionality.

dhrubo-os · 2024-11-05T01:54:53Z

tests/dataframe/test_groupby_pytest.py

@@ -224,15 +240,44 @@ def test_groupby_dataframe_mad(self):
        pd_flights = self.pd_flights().filter(self.filter_data + ["DestCountry"])
        oml_flights = self.oml_flights().filter(self.filter_data + ["DestCountry"])

-        pd_mad = pd_flights.groupby("DestCountry").mad()
+        pd_mad = pd_flights.groupby("DestCountry").apply(


Can we get some ideas from this PR: https://github.com/elastic/eland/pull/602/files ?

Also, shouldn't we keep backward compatibility (BWC) in mind?

I believe we don’t need to worry about backward compatibility, as we haven’t modified our built-in methods. We only customized the deprecated pandas methods within the tests.

What I was wondering: currently we pin pandas==2.0.3, but can't we use pandas>=2.0.3 so that we can support other versions of pandas too?

Got it, let me do some research.


Yerzhaisang requested review from dhrubo-os, greaa-aws, ylwu-amzn, b4sjoo, jngz-es and rbhavna as code owners October 29, 2024 19:42

Yerzhaisang added 2 commits October 30, 2024 01:22

Bump pandas from 1.5.3 to 2.0.3

fb4f0d6

Signed-off-by: Yerzhaisang Taskali <[email protected]>

Updated CHANGELOG

06076d9

Signed-off-by: Yerzhaisang Taskali <[email protected]>

Yerzhaisang force-pushed the dev branch from ee1b5b6 to 06076d9 Compare October 29, 2024 20:25

pyek-bot suggested changes Oct 31, 2024


brianf-aws reviewed Oct 31, 2024


brianf-aws suggested changes Oct 31, 2024


Yerzhaisang added 10 commits November 3, 2024 13:21

Updated CHANGELOG

3ef1e2c

Signed-off-by: Yerzhaisang Taskali <[email protected]>

updated built-in method

da14ee5

Signed-off-by: Yerzhaisang Taskali <[email protected]>

removed unused comment

c68b7f8

Signed-off-by: Yerzhaisang Taskali <[email protected]>

Implement to_list_if_needed method for list conversion

0084cce

Signed-off-by: Yerzhaisang Taskali <[email protected]>

Refactor MAD calculation using CustomFunctionDispatcher for improved …

12bacb7

…readability and reusability Signed-off-by: Yerzhaisang Taskali <[email protected]>

Refactor MAD calculation using CustomFunctionDispatcher for improved …

77ebf1c

…readability and reusability Signed-off-by: Yerzhaisang Taskali <[email protected]>

Refactor MAD calculation using CustomFunctionDispatcher for improved …

45e793a

…readability and reusability Signed-off-by: Yerzhaisang Taskali <[email protected]>

Refactor MAD calculation using CustomFunctionDispatcher for improved …

c4455e0

…readability and reusability Signed-off-by: Yerzhaisang Taskali <[email protected]>

refactor: move metric identifiers to constants.py for readability

7135507

Signed-off-by: Yerzhaisang Taskali <[email protected]>

refactor: move metric identifiers to constants.py for readability

d2203be

Signed-off-by: Yerzhaisang Taskali <[email protected]>

brianf-aws suggested changes Nov 5, 2024


dhrubo-os reviewed Nov 5, 2024


Yerzhaisang added 6 commits November 5, 2024 14:00

Removed unused line

c55639d

Signed-off-by: Yerzhaisang Taskali <[email protected]>

clarify build_pd_series docstring

b1f2319

Signed-off-by: Yerzhaisang Taskali <[email protected]>

save constansts in utils.py

77c56a7

Signed-off-by: Yerzhaisang Taskali <[email protected]>

fiexd CI

25905b6

Signed-off-by: Yerzhaisang Taskali <[email protected]>

added keyword argument to to_csv method

feeb0b5

Signed-off-by: Yerzhaisang Taskali <[email protected]>

added comment to describe method

1162608

Signed-off-by: Yerzhaisang Taskali <[email protected]>
