-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
AutoML: Improve notebooks for easier usage on Google Colab #174
Conversation
@amotl Could you please have a final review on the changes? |
topic/machine-learning/automl/automl_classification_with_pycaret.ipynb
Outdated
Show resolved
Hide resolved
topic/machine-learning/automl/automl_timeseries_forecasting_with_pycaret.ipynb
Show resolved
Hide resolved
topic/machine-learning/automl/automl_timeseries_forecasting_with_pycaret.ipynb
Outdated
Show resolved
Hide resolved
FAILED test.py::test_notebook[automl_timeseries_forecasting_with_pycaret.ipynb] - nbclient.exceptions.CellTimeoutError: A cell timed out while it was being executed, after 240 seconds. The message was: Cell execution timed out. Here is a preview of the cell contents: ------------------- s = setup(data, fh=15, target="total_sales", index="month", log_experiment=True)
It does not converge. I already tried with 86e77a7, but still: E nbclient.exceptions.CellTimeoutError: A cell timed out while it was being executed, after 300 seconds.
E The message was: Cell execution timed out.
E Here is a preview of the cell contents:
E -------------------
E s = setup(data, fh=15, target="total_sales", index="month", log_experiment=True)
E -------------------
/opt/hostedtoolcache/Python/3.11.6/x64/lib/python3.11/site-packages/nbclient/client.py:801: CellTimeoutError -- https://github.com/crate/cratedb-examples/actions/runs/7072661550/job/19251742707?pr=174#step:6:2870 |
This comment was marked as duplicate.
This comment was marked as duplicate.
The highlighted cell just does the setup of the data to train the models on. It can't run for more than a Minute or so. |
So, you think the cell timeout actually isn't the consequence of hogging the CPU too much, but is rather originating from a network timeout? Interesting. I will consider this idea when debugging the problem in more detail.
Sure, will try. Thanks. |
All the nightly scheduled runs of the "AutoML" job succeed, so it is absolutely likely this has caused the issue. Thanks! In fact, @ckurze was improving database connection details here, so this patch will most probably need an update to the test runner, similar to how crate/crate-python#176 was needed to bring in crate/crate-python#173. I will take care about it, and add a corresponding commit here. |
This is the spot where corresponding environment variables for the test suite are being defined. cratedb-examples/topic/machine-learning/automl/pyproject.toml Lines 7 to 14 in c4cad47
Most probably, taking into account what @andnig is suggesting, those settings don't harmonize. # Use CrateDB for MLflow tracking.
os.environ["MLFLOW_TRACKING_URI"] = f"{dburi}&schema=mlflow" |
"os.environ[\"MLFLOW_TRACKING_URI\"] = f\"{dburi}&schema=mlflow\"" | ||
"os.environ[\"MLFLOW_TRACKING_URI\"] = f\"{CONNECTION_STRING}&schema=mlflow\"" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Problem
Previously, dburi
contained an SQLAlchemy database address URI / connection string without the schema
parameter, so it was good to append one.
Now that it is coming from the CRATEDB_CONNECTION_STRING
environment variable, which already includes a schema
parameter, things most probably go south when trying to address the correct database location.
Thanks for the guidance into this direction, @andnig. I think it is a bit sad that the process runs into a timeout situation. While I did not yet look into the details why exactly this is happening, a more explicit exception raised by the software would have been nice, for better UX.
Note that this issue is exclusively related to software testing, and does not mean the notebook is incorrect. It is only about that it currently doesn't harmonize with automated testing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Observation
Now that we know what to look for, it is easy to spot the correct exception. It does not help too much that it is printed in green, and that I am not too familiar with analyzing pytest-notebook stacktraces yet.
2023/12/03 00:03:58 WARNING mlflow.store.db.utils: SQLAlchemy engine could not be created. The following exception is caught.
(crate.client.exceptions.ProgrammingError) expected string or bytes-like object, got 'tuple'
(Background on this error at: https://sqlalche.me/e/20/f405)
Operation will be retried in 0.1 seconds
-- https://github.com/crate/cratedb-examples/actions/runs/7072661550/job/19253059998?pr=174#step:6:952
Evaluation
That exception exactly reflects what's happening: The machinery receives two values for the schema
parameter, in form of a tuple, and trips correspondingly.
Thoughts
Well, that this operation is retried, might by okay from the perspective of MLflow, because it probably can't judge good enough about the type of the underlying error.
However, regardless of the error, we see MLflow is retrying with increasing delay, using an exponential backoff mechanism. While that is absolutely perfectly fine in production environments, we may want to adjust the maximum delay to a much shorter time when running test cases, both in sandbox and CI environments.
When doing so, the error would have been revealed instantly, and not masked too much by being deferred to so much later in time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On my machine, it yields a slightly different error message about the same error (schema
parameter is present two times). Even when configuring nb_exec_timeout = 15
in pyproject.toml
, it still attempts to retry for ages.
2023/12/04 00:42:00 WARNING mlflow.store.db.utils: SQLAlchemy engine could not be created. The following exception is caught.
(crate.client.exceptions.ProgrammingError) expected string or bytes-like object
(Background on this error at: https://sqlalche.me/e/20/f405)
Operation will be retried in 0.1 seconds
[...]
2023/12/04 00:45:24 WARNING mlflow.store.db.utils: SQLAlchemy engine could not be created. The following exception is caught.
(crate.client.exceptions.ProgrammingError) expected string or bytes-like object
(Background on this error at: https://sqlalche.me/e/20/f405)
Operation will be retried in 204.7 seconds
2023/12/04 00:48:48 WARNING mlflow.store.db.utils: SQLAlchemy engine could not be created. The following exception is caught.
(crate.client.exceptions.ProgrammingError) expected string or bytes-like object
(Background on this error at: https://sqlalche.me/e/20/f405)
Operation will be retried in 409.5 seconds
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apparently, there are no time-based adjustments for the reconnection loop of MLflow.
The maximum number of times to retry is configured as MAX_RETRY_COUNT = 15
.
-- https://github.com/mlflow/mlflow/blob/v2.8.1/mlflow/store/db/utils.py#L55
The reconnection loop will retry for up to 55 minutes (3276.7 seconds), if I got it right.
-- https://github.com/mlflow/mlflow/blob/v2.8.1/mlflow/store/db/utils.py#L224-L243
NB: On my machine, the process does not seem to respect the nb_exec_timeout
configuration setting at all, already time.sleeping()
at 819.1 seconds after launch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Concluding all of the observations shared above, I think it will be a good idea to probe the database connection before going into the reconnect loop.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a followup error when naively configuring the database connection like
CRATEDB_CONNECTION_STRING=crate://crate@localhost/
Because &schema=mlflow
will appended, the driver will effectively use crate://crate@localhost/&schema=mlflow
, which will raise a different error, but will still go into the reconnect loop.
2023/12/04 01:20:53 WARNING mlflow.store.db.utils: SQLAlchemy engine could not be created. The following exception is caught.
Connection.__init__() got an unexpected keyword argument 'database'
Operation will be retried in 102.3 seconds
&schema=mlflow
will be interpreted as a database
parameter, and the Python driver doesn't accept it, because CrateDB does not understand the concept of databases.
NB: It will be sweet to improve the Python driver to raise an error message which reflects this detail better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, this was successful. After fixing the database connectivity settings propagation, test cases signal "green" status now, and there are three more tasks to work on, in order to improve upstream library and adapter software.
" \"MLFLOW_TRACKING_URI\"\n", | ||
"] = f\"crate://{os.environ['CRATE_USER']}:{os.environ['CRATE_PASSWORD']}@{os.environ['CRATE_HOST']}:4200?ssl={os.environ['CRATE_SSL']}&schema=mlflow\"" | ||
"] = f\"{CONNECTION_STRING}&schema=mlflow\"" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dito.
"] = f\"{dburi}&schema=mlflow\"" | ||
"] = f\"{CONNECTION_STRING}&schema=mlflow\"" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The same would apply here, but on the other notebook, the error doesn't happen?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh well, it does.
============= short test summary info =============
FAILED test.py::test_notebook[automl_timeseries_forecasting_with_pycaret.ipynb] - nbclient.exceptions.CellTimeoutError: A cell timed out while it was being executed, after 300 seconds.
FAILED test.py::test_notebook[automl_classification_with_pycaret.ipynb] - nbclient.exceptions.CellTimeoutError: A cell timed out while it was being executed, after 300 seconds.
============= 2 failed, 2 passed, 62 warnings in 880.39s (0:14:40) =============
"] = f\"crate://{os.environ['CRATE_USER']}:{os.environ['CRATE_PASSWORD']}@{os.environ['CRATE_HOST']}:4200?ssl={os.environ['CRATE_SSL']}&schema=mlflow\"" | ||
"] = f\"{CONNECTION_STRING}&schema=mlflow\"" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dito.
Concatenating the `schema` query parameter to the SQLAlchemy connection string correctly is crucial. In order to avoid anomalies or confusion, this patch makes it so that both types of connection strings (regular data vs. MLflow tracking) are configured side-by-side now, so it is easier to understand what is going on.
About
For an upcoming presentation, this patch sands the AutoML notebooks a bit more, for improved usability. It is specifically about making it easier to run them on the Google Colab computing environment.
Details