SNOW-1805851: Add scikit-learn interoperability tests.
Signed-off-by: sfc-gh-mvashishtha <[email protected]>
sfc-gh-mvashishtha committed Dec 19, 2024
1 parent 0488021 commit 605bd76
Showing 5 changed files with 310 additions and 31 deletions.
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -33,6 +33,7 @@
- Updated integration testing for `session.lineage.trace` to exclude deleted objects
- Added documentation for `DataFrame.map`.
- Improve performance of `DataFrame.apply` by mapping numpy functions to snowpark functions if possible.
- Added documentation on the extent of Snowpark pandas interoperability with scikit-learn

## 1.26.0 (2024-12-05)

105 changes: 100 additions & 5 deletions docs/source/modin/interoperability.rst
@@ -1,5 +1,6 @@
===========================================
Interoperability with third party libraries
=============================================
===========================================

Many third party libraries are interoperable with pandas, for example by accepting pandas dataframe objects as function
inputs. Here we have a non-exhaustive list of third party library use cases with pandas and note whether each method
@@ -8,15 +9,17 @@ works in Snowpark pandas as well.
Snowpark pandas supports the `dataframe interchange protocol <https://data-apis.org/dataframe-protocol/latest/>`_, which
some libraries use to interoperate with Snowpark pandas to the same level of support as pandas.
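
As a rough illustration of what the interchange protocol provides, the sketch below uses native pandas only. ``InterchangeOnly`` is a hypothetical stand-in for a non-pandas dataframe (it is not a Snowpark pandas class), and it simply delegates to pandas' own ``__dataframe__()`` implementation. A consumer library can rebuild a native pandas dataframe from any object that exposes this entry point.

```python
import pandas as pd


class InterchangeOnly:
    """Hypothetical stand-in for a non-pandas dataframe; not a real library class."""

    def __init__(self, df: pd.DataFrame) -> None:
        self._df = df

    def __dataframe__(self, allow_copy: bool = True):
        # Delegate to pandas' own implementation of the interchange protocol.
        return self._df.__dataframe__(allow_copy=allow_copy)


source = pd.DataFrame({"feature1": [1, 5, 3], "target": [0, 0, 1]})
producer = InterchangeOnly(source)

# A consumer only needs the __dataframe__() entry point to rebuild a native
# pandas dataframe from the producer's data.
rebuilt = pd.api.interchange.from_dataframe(producer)
```

How a given third party library actually consumes the protocol varies, so this shows the mechanism rather than the behavior of any particular library.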

The following table is structured as follows: The first column contains a method name.
plotly.express
==============

The following table is structured as follows: The first column contains the name of a method in the ``plotly.express`` module.
The second column is a flag for whether or not interoperability is guaranteed with Snowpark pandas. For each of these
methods, we validate that passing in a Snowpark pandas dataframe as the dataframe input parameter behaves equivalently
to passing in a pandas dataframe.
operations, we validate that passing in Snowpark pandas dataframes or series as the data inputs behaves equivalently
to passing in pandas dataframes or series.

.. note::
    ``Y`` stands for yes, i.e., interoperability is guaranteed with this method, and ``N`` stands for no.

Plotly.express module methods

.. note::
    Currently only plotly versions <6.0.0 are supported through the dataframe interchange protocol.
@@ -56,3 +59,95 @@ Plotly.express module methods
+-------------------------+---------------------------------------------+--------------------------------------------+
| ``imshow`` | Y | |
+-------------------------+---------------------------------------------+--------------------------------------------+


scikit-learn
============

We break down scikit-learn interoperability by categories of scikit-learn
operations.

For each category, we provide a table of interoperability with the following
structure: The first column describes a scikit-learn operation that may include
multiple method calls. The second column is a flag for whether or not
interoperability is guaranteed with Snowpark pandas. For each of these methods,
we validate that passing in Snowpark pandas objects behaves equivalently to
passing in pandas objects.

.. note::
    ``Y`` stands for yes, i.e., interoperability is guaranteed with this method, and ``N`` stands for no.

.. note::
    While some scikit-learn methods accept Snowpark pandas inputs, their
    performance with Snowpark pandas inputs is often much worse than their
    performance with native pandas inputs. Generally we recommend converting
    Snowpark pandas inputs to pandas with ``to_pandas()`` before passing them
    to scikit-learn.
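
A minimal sketch of the recommended conversion, under stated assumptions: ``to_native`` is a hypothetical helper (not part of Snowpark pandas or scikit-learn) that calls ``to_pandas()`` when the input provides it. So that the example runs without a Snowflake session, a native pandas dataframe stands in for the Snowpark pandas input; the data mirrors the test fixture in this commit.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression


def to_native(obj):
    """Hypothetical helper: convert to native pandas if to_pandas() is available."""
    return obj.to_pandas() if hasattr(obj, "to_pandas") else obj


# Stand-in data. With Snowpark pandas, this would be a Snowpark pandas
# dataframe, and to_native() would materialize it once via to_pandas().
df = pd.DataFrame(
    {
        "feature1": [1, 5, 3, 4, 4, 6, 7, 2, 9, 70],
        "feature2": [2, 4, 1, 3, 5, 7, 6, 3, 10, 9],
        "target": [0, 0, 1, 0, 1, 1, 1, 0, 1, 0],
    }
)

# Convert once, up front, so that scikit-learn only ever sees native pandas.
X = to_native(df[["feature1", "feature2"]])
y = to_native(df["target"])

model = LogisticRegression().fit(X, y)
predictions = model.predict(X)
```

Converting once before calling scikit-learn keeps the number of materializations predictable, at the cost of pulling the data into local memory.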


Classification
--------------

+--------------------------------------------+---------------------------------------------+---------------------------------+
| Operation | Interoperable with Snowpark pandas? (Y/N) | Notes for current implementation|
+--------------------------------------------+---------------------------------------------+---------------------------------+
| Fitting a ``LinearDiscriminantAnalysis`` | Y | |
| classifier with the ``fit()`` method and | | |
| classifying data with the ``predict()`` | | |
| method. | | |
+--------------------------------------------+---------------------------------------------+---------------------------------+


Regression
----------

+--------------------------------------------+---------------------------------------------+---------------------------------+
| Operation | Interoperable with Snowpark pandas? (Y/N) | Notes for current implementation|
+--------------------------------------------+---------------------------------------------+---------------------------------+
| Fitting a ``LogisticRegression`` model | Y | |
| with the ``fit()`` method and predicting | | |
| results with the ``predict()`` method. | | |
+--------------------------------------------+---------------------------------------------+---------------------------------+

Clustering
----------

+--------------------------------------------+---------------------------------------------+---------------------------------+
| Clustering method | Interoperable with Snowpark pandas? (Y/N) | Notes for current implementation|
+--------------------------------------------+---------------------------------------------+---------------------------------+
| ``KMeans.fit()`` | Y | |
+--------------------------------------------+---------------------------------------------+---------------------------------+


Dimensionality reduction
------------------------

+--------------------------------------------+---------------------------------------------+---------------------------------+
| Operation | Interoperable with Snowpark pandas? (Y/N) | Notes for current implementation|
+--------------------------------------------+---------------------------------------------+---------------------------------+
| Getting the principal components of a | Y | |
| numerical dataset with ``PCA.fit()`` | | |
+--------------------------------------------+---------------------------------------------+---------------------------------+


Model selection
------------------------

+--------------------------------------------+---------------------------------------------+-----------------------------------------------+
| Operation | Interoperable with Snowpark pandas? (Y/N) | Notes for current implementation |
+--------------------------------------------+---------------------------------------------+-----------------------------------------------+
| Choosing parameters for a | Y | ``RandomizedSearchCV`` causes Snowpark pandas |
| ``LogisticRegression`` model with | | to issue many queries. We strongly recommend |
| ``RandomizedSearchCV.fit()`` | | converting Snowpark pandas inputs to pandas |
| | | before using ``RandomizedSearchCV`` |
+--------------------------------------------+---------------------------------------------+-----------------------------------------------+

Preprocessing
-------------

+--------------------------------------------+---------------------------------------------+-----------------------------------------------+
| Operation | Interoperable with Snowpark pandas? (Y/N) | Notes for current implementation |
+--------------------------------------------+---------------------------------------------+-----------------------------------------------+
| Scaling training data with | Y | |
| ``MaxAbsScaler.fit_transform()`` | | |
+--------------------------------------------+---------------------------------------------+-----------------------------------------------+
2 changes: 1 addition & 1 deletion setup.py
@@ -199,7 +199,7 @@ def run(self):
*DEVELOPMENT_REQUIREMENTS,
"scipy", # Snowpark pandas 3rd party library testing
"statsmodels", # Snowpark pandas 3rd party library testing
"scikit-learn==1.5.2", # Snowpark pandas scikit-learn tests
"scikit-learn",
# plotly version restricted due to foreseen change in query counts in version 6.0.0+
"plotly<6.0.0", # Snowpark pandas 3rd party library testing
],
208 changes: 208 additions & 0 deletions tests/integ/modin/interoperability/scikit-learn/test_scikit_learn.py
@@ -0,0 +1,208 @@
#
# Copyright (c) 2012-2024 Snowflake Computing Inc. All rights reserved.
#

from sklearn.decomposition import PCA
from sklearn.preprocessing import MaxAbsScaler

import snowflake.snowpark.modin.plugin # noqa: F401
from tests.integ.utils.sql_counter import sql_count_checker
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.cluster import KMeans
from tests.integ.modin.utils import create_test_dfs, eval_snowpark_pandas_result
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
import numpy as np
import pytest

"""
------
README
------
This test suite tests scikit-learn's interoperability with Snowpark pandas.
Generally, scikit-learn seems to work with Snowpark pandas inputs via a
combination of the dataframe interchange protocol and converting Snowpark
pandas inputs to numpy with methods like __array__() and np.asarray(). Some
scikit-learn methods may cause Snowpark pandas to execute many Snowflake
queries or to materialize Snowpark pandas data one or more times. We don't
plan to fix the performance of scikit-learn with Snowpark pandas inputs, and
we recommend that users convert their data to native pandas before passing it
to scikit-learn if scikit-learn takes too long with Snowpark pandas inputs.
We group the tests into the following use cases, listed under
https://scikit-learn.org/stable/index.html:
- Classification
- Regression
- Clustering
- Dimensionality reduction
- Model selection
- Preprocessing
Many scikit-learn methods produce non-deterministic results, and not all of
them provide a way to seed the results so that they are consistent for a test.
Generally, we only validate that 1) we can pass Snowpark pandas dataframe/series
into methods that accept native pandas inputs and 2) the outputs have the correct
type and, in case they are numpy arrays, they have the correct shape.
To test interoperability with a particular scikit-learn method:
1) Read about what the method does and how to use it.
2) Start writing a test case under the test class for the category that the
method belongs to (e.g. under TestClassification for
LinearDiscriminantAnalysis).
3) Construct a use case that works with native pandas and produces a meaningful
result (for example, train a model on pandas training data and use it to
make predictions on test data).
4) Write a test case checking that replacing the pandas input with Snowpark
pandas produces results of the same type and, in the case of array-like
outputs, of the same dimensions.
5) Wrap the test with an empty sql_count_checker() decorator to see how many
queries and joins it requires. If it requires a very large number of
queries, see if you can simplify the test case so that it causes fewer
queries, so that the test finishes quickly. If you can't reduce the number of
queries to a reasonable level, you should pass the SQL count checker the
`no_check=True` parameter because the number of queries is likely to vary
across scikit-learn and Snowpark pandas versions, and we don't gain much
insight by adjusting the query count every time it changes.
6) Add a row describing interoperability with the new method in the
[documentation](docs/source/modin/interoperability.rst)
"""


def assert_numpy_results_valid(snow_result, pandas_result) -> None:
    assert isinstance(snow_result, np.ndarray)
    assert isinstance(pandas_result, np.ndarray)
    # Generally a meaningful test case should produce a non-empty result
    assert pandas_result.size > 0
    assert snow_result.shape == pandas_result.shape


@pytest.fixture()
def test_dfs():
    data = {
        "feature1": [1, 5, 3, 4, 4, 6, 7, 2, 9, 70],
        "feature2": [2, 4, 1, 3, 5, 7, 6, 3, 10, 9],
        "target": [0, 0, 1, 0, 1, 1, 1, 0, 1, 0],
    }
    return create_test_dfs(data)


class TestClassification:
    @sql_count_checker(query_count=6)
    def test_linear_discriminant_analysis(self, test_dfs):
        def get_predictions(df) -> np.ndarray:
            X = df[["feature1", "feature2"]]
            y = df["target"]
            train_size = 8
            X_train, X_test = X.iloc[:train_size], X.iloc[train_size:]
            y_train = y.iloc[:train_size]
            return LinearDiscriminantAnalysis().fit(X_train, y_train).predict(X_test)

        eval_snowpark_pandas_result(
            *test_dfs, get_predictions, comparator=assert_numpy_results_valid
        )


class TestRegression:
    @sql_count_checker(query_count=6)
    def test_logistic_regression(self, test_dfs):
        def get_predictions(df) -> np.ndarray:
            X = df[["feature1", "feature2"]]
            y = df["target"]
            train_size = 8
            X_train, X_test = X.iloc[:train_size], X.iloc[train_size:]
            y_train = y.iloc[:train_size]
            return LogisticRegression().fit(X_train, y_train).predict(X_test)

        eval_snowpark_pandas_result(
            *test_dfs, get_predictions, comparator=assert_numpy_results_valid
        )


class TestClustering:
    @sql_count_checker(query_count=3)
    def test_clustering(self, test_dfs):
        def get_cluster_centers(df) -> np.ndarray:
            return KMeans(n_clusters=3).fit(df).cluster_centers_

        eval_snowpark_pandas_result(
            *test_dfs, get_cluster_centers, comparator=assert_numpy_results_valid
        )


class TestDimensionalityReduction:
    @sql_count_checker(query_count=3)
    def test_principal_component_analysis(self, test_dfs):
        def get_principal_components(df) -> np.ndarray:
            return PCA(n_components=2).fit(df).components_

        eval_snowpark_pandas_result(
            *test_dfs, get_principal_components, comparator=assert_numpy_results_valid
        )


class TestModelSelection:
    @sql_count_checker(
        # Model search is a complex, iterative process. Pushing it down to
        # Snowflake requires many queries (approximately 31 for this case).
        # Since the number of queries and the number of joins are so large, they
        # are likely to change due to changes in both scikit-learn and Snowpark
        # pandas. We don't get much insight from the exact number of queries, so
        # we skip the query count check. The recommended solution to this query
        # explosion is for users to convert the Snowpark pandas object to pandas
        # with to_pandas() and pass the result to scikit-learn.
        no_check=True
    )
    def test_randomized_search_cv(self, test_dfs):
        def get_best_estimator(df) -> dict:
            # Initialize the hyperparameter search with parameters that will
            # reduce the search time as much as possible.
            return (
                RandomizedSearchCV(
                    LogisticRegression(),
                    param_distributions={
                        "C": [0.001],
                    },
                    # cv=2 means 2-fold validation, which requires the fewest queries.
                    cv=2,
                    # Test just one combination of parameters.
                    n_iter=1,
                    # refit=False means that the search doesn't have to actually
                    # train a model using the parameters that it chooses. Setting
                    # refit=False should further reduce the number of queries.
                    refit=False,
                )
                .fit(df[["feature1", "feature2"]], df["target"])
                .best_params_
            )

        def validate_search_results(snow_estimator, pandas_estimator):
            assert isinstance(snow_estimator, dict)
            assert isinstance(pandas_estimator, dict)

        eval_snowpark_pandas_result(
            *test_dfs, get_best_estimator, comparator=validate_search_results
        )


class TestPreprocessing:
    @sql_count_checker(query_count=5)
    def test_maxabs(self, test_dfs):
        eval_snowpark_pandas_result(
            *test_dfs,
            MaxAbsScaler().fit_transform,
            comparator=assert_numpy_results_valid
        )


"""
------
README
------
Please see the README at the top of this file for instructions on adding test
cases.
"""
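
The README at the top of this test file notes that scikit-learn seems to consume Snowpark pandas inputs partly through `__array__()` and `np.asarray()`. The sketch below illustrates only that mechanism and the type-and-shape validation used by the tests. `ArrayBacked` is a hypothetical stand-in, not a Snowpark pandas class, and its `materializations` counter is an illustrative assumption for showing that each array conversion pulls the data again.

```python
import numpy as np
import pandas as pd


class ArrayBacked:
    """Hypothetical stand-in for a dataframe-like object exposing __array__()."""

    def __init__(self, df: pd.DataFrame) -> None:
        self._df = df
        self.materializations = 0  # counts conversions to numpy

    def __array__(self, dtype=None, copy=None):
        # Each conversion would correspond to materializing the data.
        self.materializations += 1
        return np.asarray(self._df, dtype=dtype)


def assert_numpy_results_valid(snow_result, pandas_result) -> None:
    # Same checks as the helper in the test file: type, non-empty, same shape.
    assert isinstance(snow_result, np.ndarray)
    assert isinstance(pandas_result, np.ndarray)
    assert pandas_result.size > 0
    assert snow_result.shape == pandas_result.shape


native = pd.DataFrame({"feature1": [1, 5, 3], "feature2": [2, 4, 1]})
wrapped = ArrayBacked(native)

# np.asarray() finds __array__() on the stand-in and uses it for conversion.
snow_like_result = np.asarray(wrapped)
pandas_result = np.asarray(native)

assert_numpy_results_valid(snow_like_result, pandas_result)
```

A library that converts its input more than once would increment the counter each time, which is one way to picture why repeated conversions can translate into repeated queries.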
25 changes: 0 additions & 25 deletions tests/integ/modin/test_scikit.py

This file was deleted.
