Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bump pandas from 1.5.3 to 2.0.3 #422

Open
wants to merge 18 commits into
base: main
Choose a base branch
from

Conversation

Yerzhaisang
Copy link
Contributor

Description

  • Replaced built-in mad, _construct_axes_from_arguments, quantile, and min methods with custom implementations to ensure consistent behavior.
  • Removed deprecated arguments: check_less_precise and line_terminator.
  • Updated tests for the describe method to align with current functionality.
  • Adapted the value_counts method to match changes in the new pandas version, where the column name is now required for index calls.
  • Adjusted for differing index order behavior in pandas, accommodating both single-column and multi-column DataFrames within the OpenSearch implementation.

Issues Resolved

#270

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Yerzhaisang Taskali <[email protected]>
Signed-off-by: Yerzhaisang Taskali <[email protected]>
CHANGELOG.md Outdated
@@ -46,6 +46,7 @@ Inspired from [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
- updating listing file with three v2 sparse model - by @dhrubo-os ([#412](https://github.com/opensearch-project/opensearch-py-ml/pull/412))
- Update model upload history - opensearch-project/opensearch-neural-sparse-encoding-doc-v2-mini (v.1.0.0)(TORCH_SCRIPT) by @dhrubo-os ([#417](https://github.com/opensearch-project/opensearch-py-ml/pull/417))
- Update model upload history - opensearch-project/opensearch-neural-sparse-encoding-v2-distill (v.1.0.0)(TORCH_SCRIPT) by @dhrubo-os ([#419](https://github.com/opensearch-project/opensearch-py-ml/pull/419))
- Bump pandas from 1.5.3 to 2.0.3 bu @yerzhaisang ([#422](https://github.com/opensearch-project/opensearch-py-ml/pull/422))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

by*

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

@@ -1,5 +1,5 @@
opensearch-py>=2
pandas>=1.5,<3
pandas==2.0.3

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we upgrade to a more latest version? any reason specifically for 2.0.3?

Copy link
Contributor Author

@Yerzhaisang Yerzhaisang Nov 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bumping to a later version introduces datatype issues, including an ImportError like this:
ImportError: cannot import name 'is_datetime_or_timedelta_dtype' from 'pandas.core.dtypes.common'
Given that the issue was only to upgrade to a 2.x version, I thought 2.0.3 would be sufficient.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, let's focus on bumping to 2.0.3 for now and then we can create another issue to upgrade more if needed.

) -> pd.Series:
"""Builds a pd.Series while squelching the warning
for unspecified dtype on empty series
"""
dtype = dtype or (EMPTY_SERIES_DTYPE if not data else dtype)
if dtype is not None:
kwargs["dtype"] = dtype
if index_name:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets keep an explicit check for None - if index_name is not None:

what happens if the index is not found?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we pass an index, we see the column name for which we are counting values and how it works in pandas 2.0.3.
image
If the index is not found, we don’t see the column name, and an assertion error occurs when we compare the built-in value_counts() method with the one in pandas.
image

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets keep an explicit check for None - if index_name is not None:

Done

Comment on lines 427 to 429
# axes, _ = pd.DataFrame()._construct_axes_from_arguments(
# (index, columns), {}
# )

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can remove these comments if no longer used

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment on lines 431 to 447
if index is not None:
if isinstance(index, pd.Index):
index = index.tolist() # Convert Index to list
elif not is_list_like(index):
index = [index] # Convert to list if it's not list-like already
axes["index"] = index
else:
axes["index"] = None

if columns is not None:
if isinstance(columns, pd.Index):
columns = columns.tolist() # Convert Index to list
elif not is_list_like(columns):
columns = [columns] # Convert to list if it's not list-like already
axes["columns"] = columns
else:
axes["columns"] = None

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can this be wrapped in a method? and then we can do something like this

axes = {
    "index": to_list_if_needed(index),
    "columns": pd.Index(to_list_if_needed(columns)) if columns is not None else None
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done


if columns is not None:
if isinstance(columns, pd.Index):
columns = columns.tolist() # Convert Index to list

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

columns to list

Copy link
Contributor Author

@Yerzhaisang Yerzhaisang Nov 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

Comment on lines 450 to 451
if not is_list_like(columns):
columns = [columns]

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

repeated logic? this is handled in lines 443 and 444 right?

Copy link
Contributor Author

@Yerzhaisang Yerzhaisang Nov 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

@@ -424,9 +424,36 @@ def drop(
axis = pd.DataFrame._get_axis_name(axis)
axes = {axis: labels}
elif index is not None or columns is not None:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kind of confused here the parent branch is checking that if one of them is not None but inside its checking again
Line 431 and 440
maybe this could simplified to what @pyek-bot stated about creating a wrapper for convertToList if needed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

Copy link

@brianf-aws brianf-aws left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey left some comments. I notice you constantly reimplement the mad function. I would consider this a friction point if someone new decides to use mad and has to apply the lambda formula again. this is prone to errors I suggested to extract the same formula being applied into a utility class since its being repeated a lot.

What about the built_in mad function is causing inconsistent behavior is there an example you can provide?.

Also I feel a bit concerned you are choosing to use strictly 2.0.3 which kind of forces people to code using only that library can we refactor so that if we want to bump again not a lot of code changes are needed?

@@ -475,7 +475,7 @@ def _terms_aggs(
except IndexError:
name = None

return build_pd_series(results, name=name)
return build_pd_series(results, index_name=name, name="count")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How come its using "count" here but before it was using name?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In pandas 2.0.3, a change in the value_counts method resulted in the following behavior:

The method now uses "count" as the name for the values column, while the original column name (e.g., "Carrier") is used for the index name. This differs from earlier versions, where the values column would inherit the name of the original column.
pd_153
pd_203

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @dhrubo-os what are your thoughts? Thanks for the info @Yerzhaisang

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Yerzhaisang could you please share any documentation about the changing behavior of value_counts method from version 1.5.3 to 2.0.3. I think pandas versions 1.5.3 and 2.0.3, the value_counts() method has remained consistent in functionality. Please let me know if you think otherwise.

Copy link
Contributor Author

@Yerzhaisang Yerzhaisang Nov 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In pandas 1.5.3, the series name is used for the result series.

In pandas 2.0.3, the series name is used for the index name, while the result series name is set to this one.

@@ -106,10 +106,18 @@ def test_groupby_aggs_mad_var_std(self, pd_agg, dropna):
pd_flights = self.pd_flights().filter(self.filter_data)
oml_flights = self.oml_flights().filter(self.filter_data)

pd_groupby = getattr(pd_flights.groupby("Cancelled", dropna=dropna), pd_agg)()
if pd_agg == "mad":

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to extract these variables like "mad", "var", "std" as a constant. (can remove the friction to new comers) fro example let MEAN_ABSOLUTE_DEVATION = "mad". Not sure if that would take a lot of effort but something to think about

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

pd_groupby = getattr(pd_flights.groupby("Cancelled", dropna=dropna), pd_agg)()
if pd_agg == "mad":
pd_groupby = pd_flights.groupby("Cancelled", dropna=dropna).agg(
lambda x: (x - x.mean()).abs().mean()

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see this lambda get used in other places, could we possible have a util class that dispatches theses common function names?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

pd_metric = getattr(pd_flights, func)(
**({"numeric_only": True} if func != "mad" else {})
)
if func == "mad":

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed you introduced branching what if some day in the future we need to reimplement another function from scratch instead of creating a new branch every time can we extract this behavior out? perhaps have a utility class with our own custom implementation like

   customeFunctionMap = {"mad" : lambda x: (x - x.median()).abs().mean()}

and then dispatch it instead of branching for every new method we need to reimplent something like

if func in customFUnctionMap:
// apply the custom function
else:
// use the already given functionality.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Signed-off-by: Yerzhaisang Taskali <[email protected]>
Signed-off-by: Yerzhaisang Taskali <[email protected]>
Signed-off-by: Yerzhaisang Taskali <[email protected]>
…readability and reusability

Signed-off-by: Yerzhaisang Taskali <[email protected]>
…readability and reusability

Signed-off-by: Yerzhaisang Taskali <[email protected]>
…readability and reusability

Signed-off-by: Yerzhaisang Taskali <[email protected]>
…readability and reusability

Signed-off-by: Yerzhaisang Taskali <[email protected]>
Copy link

@brianf-aws brianf-aws left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only had one piece of feedback you could implement. Code structure, I think its good (Not a pandas expert though)

Comment on lines +430 to +432
pd.Index(to_list_if_needed(columns))
if columns is not None
else None

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed that in the implementation of

opensearch_py_ml.utils that when the value is None that you would return None already maybe we dont need a ternary operation here since its already doing that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey Brian,

We can’t remove the ternary operation at this point because calling pd.Index(None) results in a TypeError. Additionally, we need to keep this check since it’s required here.

Comment on lines +117 to +119
if pd_agg in CustomFunctionDispatcher.customFunctionMap:
pd_groupby = pd_flights.groupby("Cancelled", dropna=dropna).agg(
lambda x: CustomFunctionDispatcher.apply_custom_function(pd_agg, x)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

tests/dataframe/test_groupby_pytest.py Show resolved Hide resolved
@@ -475,7 +475,7 @@ def _terms_aggs(
except IndexError:
name = None

return build_pd_series(results, name=name)
return build_pd_series(results, index_name=name, name="count")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @dhrubo-os what are your thoughts? Thanks for the info @Yerzhaisang

@@ -1,5 +1,5 @@
opensearch-py>=2
pandas>=1.5,<3
pandas==2.0.3
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, let's focus on bumping to 2.0.3 for now and then we can create another issue to upgrade more if needed.

) -> pd.Series:
"""Builds a pd.Series while squelching the warning
for unspecified dtype on empty series
"""
dtype = dtype or (EMPTY_SERIES_DTYPE if not data else dtype)
if dtype is not None:
kwargs["dtype"] = dtype
if index_name is not None:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a comment why do we need this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

# specific language governing permissions and limitations
# under the License.

MEAN_ABSOLUTE_DEVIATION = "mad"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than creating another file, can we use utils.py

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -1326,7 +1331,6 @@ def to_csv(
compression="infer",
quoting=None,
quotechar='"',
line_terminator=None,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are we removing this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I restored it as lineterminator to align with recent pandas updates, but it’s still not actively used elsewhere in the code.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we keep as it is: line_terminator? not lineterminator?

@@ -475,7 +475,7 @@ def _terms_aggs(
except IndexError:
name = None

return build_pd_series(results, name=name)
return build_pd_series(results, index_name=name, name="count")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Yerzhaisang could you please share any documentation about the changing behavior of value_counts method from version 1.5.3 to 2.0.3. I think pandas versions 1.5.3 and 2.0.3, the value_counts() method has remained consistent in functionality. Please let me know if you think otherwise.

Logstash Airways 3331
JetBeats 3274
Kibana Airlines 3234
ES-Air 3220
Name: Carrier, dtype: int64
Name: count, dtype: int64
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what we don't want, right? Carrier is the column name which we are changing it to count and that's not right.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @dhrubo, this is actually correct. In pandas 2.0.3, Carrier is set as the index name, and count is the column name.

@@ -34,7 +34,7 @@ def test_flights_describe(self):
pd_flights = self.pd_flights()
oml_flights = self.oml_flights()

pd_describe = pd_flights.describe()
pd_describe = pd_flights.describe().drop(["timestamp"], axis=1)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are we removing this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In recent pandas versions, timestamp column is used in describe method. I could adapt our built-in method to pandas one, however I think it's kind of bug

desc_153 desc_203

def test_groupby_aggs_mad_var_std(self, pd_agg, dropna):
# For these aggs pandas doesn't support numeric_only
pd_flights = self.pd_flights().filter(self.filter_data)
oml_flights = self.oml_flights().filter(self.filter_data)

pd_groupby = getattr(pd_flights.groupby("Cancelled", dropna=dropna), pd_agg)()
if pd_agg in CustomFunctionDispatcher.customFunctionMap:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oml_groupby = getattr(oml_flights.groupby("Cancelled", dropna=dropna), pd_agg)(
numeric_only=True
)
pd_groupby = pd_groupby[oml_groupby.columns]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this? We shouldn't use any oml resource/info in pd as the goal is to how pd functionalities are same to oml functionality.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, fixed

@@ -224,15 +240,44 @@ def test_groupby_dataframe_mad(self):
pd_flights = self.pd_flights().filter(self.filter_data + ["DestCountry"])
oml_flights = self.oml_flights().filter(self.filter_data + ["DestCountry"])

pd_mad = pd_flights.groupby("DestCountry").mad()
pd_mad = pd_flights.groupby("DestCountry").apply(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we get some idea from this PR: https://github.com/elastic/eland/pull/602/files ?

Also Shouldn't we keep BWC in mind?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe we don’t need to worry about backward compatibility, as we haven’t modified our built-in methods. We only customized the deprecated pandas methods within the tests.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I was wondering currently we are doing pandas==2.0.3, but can't we do pandas>=2.0.3 ? So that we can support other versions of pandas too?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got it, let me do some research

Signed-off-by: Yerzhaisang Taskali <[email protected]>
Signed-off-by: Yerzhaisang Taskali <[email protected]>
Signed-off-by: Yerzhaisang Taskali <[email protected]>
Signed-off-by: Yerzhaisang Taskali <[email protected]>
Signed-off-by: Yerzhaisang Taskali <[email protected]>
Signed-off-by: Yerzhaisang Taskali <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants