AutoML: Improve notebooks for easier usage on Google Colab #174

ckurze · 2023-11-30T16:53:42Z

About

For an upcoming presentation, this patch sands the AutoML notebooks a bit more, for improved usability. It is specifically about making it easier to run them on the Google Colab computing environment.

Details

Improve Python dependency definition by inlining it into the notebook.
Improve database connectivity definition by only using a single configuration value.

ckurze · 2023-11-30T16:59:02Z

@amotl Could you please have a final review on the changes?

topic/machine-learning/automl/automl_classification_with_pycaret.ipynb

topic/machine-learning/automl/automl_timeseries_forecasting_with_pycaret.ipynb

FAILED test.py::test_notebook[automl_timeseries_forecasting_with_pycaret.ipynb] - nbclient.exceptions.CellTimeoutError: A cell timed out while it was being executed, after 240 seconds. The message was: Cell execution timed out. Here is a preview of the cell contents: ------------------- s = setup(data, fh=15, target="total_sales", index="month", log_experiment=True)

amotl · 2023-12-02T22:06:24Z

It does not converge. I already tried with 86e77a7, but still: CellTimeoutError.

E           nbclient.exceptions.CellTimeoutError: A cell timed out while it was being executed, after 300 seconds.
E           The message was: Cell execution timed out.
E           Here is a preview of the cell contents:
E           -------------------
E           s = setup(data, fh=15, target="total_sales", index="month", log_experiment=True)
E           -------------------

/opt/hostedtoolcache/Python/3.11.6/x64/lib/python3.11/site-packages/nbclient/client.py:801: CellTimeoutError

-- https://github.com/crate/cratedb-examples/actions/runs/7072661550/job/19251742707?pr=174#step:6:2870

andnig · 2023-12-03T08:28:07Z

^^ Do you see any chance to make this spot more efficient on CI, @andnig?

The highlighted cell just does the setup of the data to train the models on. It can't run for more than a Minute or so.
As our tests previously were successful at 240s timeout and now even fail at 300s is there a chance that the database connection for mlflow logging was changed? Maybe you can try setting log_experiments to false and check if this cell executes faster. Maybe it tries to get to the logging database but fails to do so.

amotl · 2023-12-03T13:29:36Z

The highlighted cell just does the setup of the data to train the models on. It can't run for more than a Minute or so.

So, you think the cell timeout actually isn't the consequence of hogging the CPU too much, but is rather originating from a network timeout? Interesting. I will consider this idea when debugging the problem in more detail.

Maybe you can try setting log_experiments to False?

Sure, will try. Thanks.

amotl · 2023-12-03T17:06:48Z

is there a chance that the database connection for mlflow logging was changed?

All the nightly scheduled runs of the "AutoML" job succeed, so it is absolutely likely this has caused the issue. Thanks!

In fact, @ckurze was improving database connection details here, so this patch will most probably need an update to the test runner, similar to how crate/crate-python#176 was needed to bring in crate/crate-python#173. I will take care about it, and add a corresponding commit here.

amotl · 2023-12-03T22:47:26Z

[...] improving database connection details here, so this patch will most probably need an update to the test runner [...]

This is the spot where corresponding environment variables for the test suite are being defined.

cratedb-examples/topic/machine-learning/automl/pyproject.toml

Lines 7 to 14 in c4cad47

    
           env = [ 
        
               "CRATEDB_CONNECTION_STRING=crate://crate@localhost/?schema=testdrive", 
        
               "CRATE_USER=crate", 
        
               "CRATE_PASSWORD=", 
        
               "CRATE_HOST=localhost", 
        
               "CRATE_SSL=false", 
        
               "PYDEVD_DISABLE_FILE_VALIDATION=1", 
        
           ]

Most probably, taking into account what @andnig is suggesting, those settings don't harmonize.

# Use CrateDB for MLflow tracking.
os.environ["MLFLOW_TRACKING_URI"] = f"{dburi}&schema=mlflow"

https://github.com/crate/cratedb-examples/blob/c4cad47ad95e331600613cdbef56506e057973b0/topic/machine-learning/automl/automl_timeseries_forecasting_with_pycaret.ipynb?short_path=ff6d6a4#L255-L256

amotl · 2023-12-03T23:02:48Z

topic/machine-learning/automl/automl_timeseries_forecasting_with_pycaret.ipynb

-    "os.environ[\"MLFLOW_TRACKING_URI\"] = f\"{dburi}&schema=mlflow\""
+    "os.environ[\"MLFLOW_TRACKING_URI\"] = f\"{CONNECTION_STRING}&schema=mlflow\""


Problem

Previously, dburi contained an SQLAlchemy database address URI / connection string without the schema parameter, so it was good to append one.

Now that it is coming from the CRATEDB_CONNECTION_STRING environment variable, which already includes a schema parameter, things most probably go south when trying to address the correct database location.

Thanks for the guidance into this direction, @andnig. I think it is a bit sad that the process runs into a timeout situation. While I did not yet look into the details why exactly this is happening, a more explicit exception raised by the software would have been nice, for better UX.

Note that this issue is exclusively related to software testing, and does not mean the notebook is incorrect. It is only about that it currently doesn't harmonize with automated testing.

Observation

Now that we know what to look for, it is easy to spot the correct exception. It does not help too much that it is printed in green, and that I am not too familiar with analyzing pytest-notebook stacktraces yet.

2023/12/03 00:03:58 WARNING mlflow.store.db.utils: SQLAlchemy engine could not be created. The following exception is caught. (crate.client.exceptions.ProgrammingError) expected string or bytes-like object, got 'tuple' (Background on this error at: https://sqlalche.me/e/20/f405) Operation will be retried in 0.1 seconds

-- https://github.com/crate/cratedb-examples/actions/runs/7072661550/job/19253059998?pr=174#step:6:952

Evaluation

That exception exactly reflects what's happening: The machinery receives two values for the schema parameter, in form of a tuple, and trips correspondingly.

Thoughts

Well, that this operation is retried, might by okay from the perspective of MLflow, because it probably can't judge good enough about the type of the underlying error.

However, regardless of the error, we see MLflow is retrying with increasing delay, using an exponential backoff mechanism. While that is absolutely perfectly fine in production environments, we may want to adjust the maximum delay to a much shorter time when running test cases, both in sandbox and CI environments.

When doing so, the error would have been revealed instantly, and not masked too much by being deferred to so much later in time.

On my machine, it yields a slightly different error message about the same error (schema parameter is present two times). Even when configuring nb_exec_timeout = 15 in pyproject.toml, it still attempts to retry for ages.

2023/12/04 00:42:00 WARNING mlflow.store.db.utils: SQLAlchemy engine could not be created. The following exception is caught. (crate.client.exceptions.ProgrammingError) expected string or bytes-like object (Background on this error at: https://sqlalche.me/e/20/f405) Operation will be retried in 0.1 seconds

[...]

2023/12/04 00:45:24 WARNING mlflow.store.db.utils: SQLAlchemy engine could not be created. The following exception is caught. (crate.client.exceptions.ProgrammingError) expected string or bytes-like object (Background on this error at: https://sqlalche.me/e/20/f405) Operation will be retried in 204.7 seconds

2023/12/04 00:48:48 WARNING mlflow.store.db.utils: SQLAlchemy engine could not be created. The following exception is caught. (crate.client.exceptions.ProgrammingError) expected string or bytes-like object (Background on this error at: https://sqlalche.me/e/20/f405) Operation will be retried in 409.5 seconds

Apparently, there are no time-based adjustments for the reconnection loop of MLflow.

The maximum number of times to retry is configured as MAX_RETRY_COUNT = 15.
-- https://github.com/mlflow/mlflow/blob/v2.8.1/mlflow/store/db/utils.py#L55

The reconnection loop will retry for up to 55 minutes (3276.7 seconds), if I got it right.
-- https://github.com/mlflow/mlflow/blob/v2.8.1/mlflow/store/db/utils.py#L224-L243

NB: On my machine, the process does not seem to respect the nb_exec_timeout configuration setting at all, already time.sleeping() at 819.1 seconds after launch.

Concluding all of the observations shared above, I think it will be a good idea to probe the database connection before going into the reconnect loop.

-- https://github.com/crate-workbench/mlflow-cratedb

There is a followup error when naively configuring the database connection like

CRATEDB_CONNECTION_STRING=crate://crate@localhost/

Because &schema=mlflow will appended, the driver will effectively use crate://crate@localhost/&schema=mlflow, which will raise a different error, but will still go into the reconnect loop.

2023/12/04 01:20:53 WARNING mlflow.store.db.utils: SQLAlchemy engine could not be created. The following exception is caught. Connection.__init__() got an unexpected keyword argument 'database' Operation will be retried in 102.3 seconds

&schema=mlflow will be interpreted as a database parameter, and the Python driver doesn't accept it, because CrateDB does not understand the concept of databases.

NB: It will be sweet to improve the Python driver to raise an error message which reflects this detail better.

Well, this was successful. After fixing the database connectivity settings propagation, test cases signal "green" status now, and there are three more tasks to work on, in order to improve upstream library and adapter software.

https://github.com/crate/crate-python/issues/598

https://github.com/crate/crate-python/issues/599

Improve database reconnect loop behavior when connectivity configuration is incorrect mlflow-cratedb#73

amotl · 2023-12-03T23:03:19Z

topic/machine-learning/automl/automl_timeseries_forecasting_with_pycaret.ipynb

    "    \"MLFLOW_TRACKING_URI\"\n",
-    "] = f\"crate://{os.environ['CRATE_USER']}:{os.environ['CRATE_PASSWORD']}@{os.environ['CRATE_HOST']}:4200?ssl={os.environ['CRATE_SSL']}&schema=mlflow\""
+    "] = f\"{CONNECTION_STRING}&schema=mlflow\""


amotl · 2023-12-03T23:05:43Z

topic/machine-learning/automl/automl_classification_with_pycaret.ipynb

-    "] = f\"{dburi}&schema=mlflow\""
+    "] = f\"{CONNECTION_STRING}&schema=mlflow\""


The same would apply here, but on the other notebook, the error doesn't happen?

Oh well, it does.

============= short test summary info ============= FAILED test.py::test_notebook[automl_timeseries_forecasting_with_pycaret.ipynb] - nbclient.exceptions.CellTimeoutError: A cell timed out while it was being executed, after 300 seconds. FAILED test.py::test_notebook[automl_classification_with_pycaret.ipynb] - nbclient.exceptions.CellTimeoutError: A cell timed out while it was being executed, after 300 seconds. ============= 2 failed, 2 passed, 62 warnings in 880.39s (0:14:40) =============

amotl · 2023-12-03T23:07:07Z

topic/machine-learning/automl/automl_classification_with_pycaret.ipynb

-    "] = f\"crate://{os.environ['CRATE_USER']}:{os.environ['CRATE_PASSWORD']}@{os.environ['CRATE_HOST']}:4200?ssl={os.environ['CRATE_SSL']}&schema=mlflow\""
+    "] = f\"{CONNECTION_STRING}&schema=mlflow\""


Concatenating the `schema` query parameter to the SQLAlchemy connection string correctly is crucial. In order to avoid anomalies or confusion, this patch makes it so that both types of connection strings (regular data vs. MLflow tracking) are configured side-by-side now, so it is easier to understand what is going on.

ckurze added 2 commits December 2, 2023 22:02

ML/AutoML: Reworked connection string for easier usage

89abd38

ML/AutoML: Slight changes for requirements

637960d

amotl force-pushed the ckurze/automl branch from 98e2bd0 to d64a84e Compare December 2, 2023 21:03

amotl reviewed Dec 2, 2023

View reviewed changes

ckurze and others added 2 commits December 2, 2023 22:23

ML/AutoML: Harmonize notebooks, and easier connection string handling

f9658b3

amotl force-pushed the ckurze/automl branch from d64a84e to 86e77a7 Compare December 2, 2023 21:26

This comment was marked as duplicate.

Sign in to view

amotl reviewed Dec 3, 2023

View reviewed changes

amotl changed the title ~~Minor changes on AutoML notebooks for easier usage on Google Colab~~ AutoML: Improve notebooks for easier usage on Google Colab Dec 3, 2023

amotl merged commit 5f6e1f4 into main Dec 4, 2023
1 check passed

amotl deleted the ckurze/automl branch December 4, 2023 02:06

amotl mentioned this pull request Dec 4, 2023

AutoML: CI trips with CellTimeoutError / ValueError: Input contains NaN. #170

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

AutoML: Improve notebooks for easier usage on Google Colab #174

AutoML: Improve notebooks for easier usage on Google Colab #174

ckurze commented Nov 30, 2023 •

edited by amotl

Loading

ckurze commented Nov 30, 2023

amotl commented Dec 2, 2023

This comment was marked as duplicate.

andnig commented Dec 3, 2023

amotl commented Dec 3, 2023 •

edited

Loading

amotl commented Dec 3, 2023 •

edited

Loading

amotl commented Dec 3, 2023 •

edited

Loading

amotl Dec 3, 2023 •

edited

Loading

amotl Dec 3, 2023 •

edited

Loading

amotl Dec 3, 2023 •

edited

Loading

amotl Dec 3, 2023 •

edited

Loading

amotl Dec 4, 2023

amotl Dec 4, 2023 •

edited

Loading

amotl Dec 4, 2023

amotl Dec 3, 2023

amotl Dec 3, 2023

amotl Dec 3, 2023

amotl Dec 3, 2023

		"os.environ[\"MLFLOW_TRACKING_URI\"] = f\"{dburi}&schema=mlflow\""
		"os.environ[\"MLFLOW_TRACKING_URI\"] = f\"{CONNECTION_STRING}&schema=mlflow\""

		"] = f\"{dburi}&schema=mlflow\""
		"] = f\"{CONNECTION_STRING}&schema=mlflow\""

		"] = f\"crate://{os.environ['CRATE_USER']}:{os.environ['CRATE_PASSWORD']}@{os.environ['CRATE_HOST']}:4200?ssl={os.environ['CRATE_SSL']}&schema=mlflow\""
		"] = f\"{CONNECTION_STRING}&schema=mlflow\""

AutoML: Improve notebooks for easier usage on Google Colab #174

AutoML: Improve notebooks for easier usage on Google Colab #174

Conversation

ckurze commented Nov 30, 2023 • edited by amotl Loading

About

Details

ckurze commented Nov 30, 2023

amotl commented Dec 2, 2023

This comment was marked as duplicate.

andnig commented Dec 3, 2023

amotl commented Dec 3, 2023 • edited Loading

amotl commented Dec 3, 2023 • edited Loading

amotl commented Dec 3, 2023 • edited Loading

amotl Dec 3, 2023 • edited Loading

Choose a reason for hiding this comment

Problem

amotl Dec 3, 2023 • edited Loading

Choose a reason for hiding this comment

Observation

Evaluation

Thoughts

amotl Dec 3, 2023 • edited Loading

Choose a reason for hiding this comment

amotl Dec 3, 2023 • edited Loading

Choose a reason for hiding this comment

amotl Dec 4, 2023

Choose a reason for hiding this comment

amotl Dec 4, 2023 • edited Loading

Choose a reason for hiding this comment

amotl Dec 4, 2023

Choose a reason for hiding this comment

amotl Dec 3, 2023

Choose a reason for hiding this comment

amotl Dec 3, 2023

Choose a reason for hiding this comment

amotl Dec 3, 2023

Choose a reason for hiding this comment

amotl Dec 3, 2023

Choose a reason for hiding this comment

ckurze commented Nov 30, 2023 •

edited by amotl

Loading

amotl commented Dec 3, 2023 •

edited

Loading

amotl commented Dec 3, 2023 •

edited

Loading

amotl commented Dec 3, 2023 •

edited

Loading

amotl Dec 3, 2023 •

edited

Loading

amotl Dec 3, 2023 •

edited

Loading

amotl Dec 3, 2023 •

edited

Loading

amotl Dec 3, 2023 •

edited

Loading

amotl Dec 4, 2023 •

edited

Loading