diff --git a/docs/images/deployment_errors/crashloopbackoff.png b/docs/images/deployment_errors/crashloopbackoff.png new file mode 100644 index 000000000..1a65844f3 Binary files /dev/null and b/docs/images/deployment_errors/crashloopbackoff.png differ diff --git a/docs/images/deployment_errors/image_building_oomkilled.png b/docs/images/deployment_errors/image_building_oomkilled.png new file mode 100644 index 000000000..a3b4ef8c5 Binary files /dev/null and b/docs/images/deployment_errors/image_building_oomkilled.png differ diff --git a/docs/images/deployment_errors/image_not_found.png b/docs/images/deployment_errors/image_not_found.png new file mode 100644 index 000000000..cebdd1277 Binary files /dev/null and b/docs/images/deployment_errors/image_not_found.png differ diff --git a/docs/images/deployment_errors/version_log_history_tab.png b/docs/images/deployment_errors/version_log_history_tab.png new file mode 100644 index 000000000..e36edf091 Binary files /dev/null and b/docs/images/deployment_errors/version_log_history_tab.png differ diff --git a/docs/user/generated/01_getting_started.md b/docs/user/generated/01_getting_started.md index 2b774bcd6..a5981da26 100644 --- a/docs/user/generated/01_getting_started.md +++ b/docs/user/generated/01_getting_started.md @@ -17,7 +17,7 @@ import merlin from merlin.model import ModelType # Connect to an existing Merlin deployment -merlin.set_url(merlin.example.com) +merlin.set_url("merlin.example.com") # Set the active model to the name given by parameter, if the model with the given name is not found, a new model will # be created. diff --git a/docs/user/generated/03_deploying_a_model.md b/docs/user/generated/03_deploying_a_model.md index a26040ca2..81a13aa67 100644 --- a/docs/user/generated/03_deploying_a_model.md +++ b/docs/user/generated/03_deploying_a_model.md @@ -9,4 +9,4 @@ To learn about deploying a model, please visit the following docs. 
{% page-ref page="./model_deployment/03_configuring_transformers.md" %} -{% page-ref page="./model_deployment/04_redeploying_a_model_version.md" %} +{% page-ref page="./model_deployment/04_redeploying_a_model_version.md" %} \ No newline at end of file diff --git a/docs/user/generated/09_troubleshooting_deployment_errors.md b/docs/user/generated/09_troubleshooting_deployment_errors.md new file mode 100644 index 000000000..3008d4980 --- /dev/null +++ b/docs/user/generated/09_troubleshooting_deployment_errors.md @@ -0,0 +1,61 @@ + + + +# Troubleshooting Deployment Errors + +This page discusses scenarios you may encounter during model deployment that will require troubleshooting, including: + +- Image building errors +- Deployment errors + +## Troubleshooting views + +Merlin provides the following views on the UI to troubleshoot a deployment: + +- **Logs** - the live console output when the image is building or the deployment is running +- **History** - the list of deployment history status and message + +You can navigate to these views from the Model Version page by clicking on the Logs tab or History tab. + +![Model Version's Logs & History tabs](../../images/deployment_errors/version_log_history_tab.png) + +## Known Errors + +### OOMKilled + +The "OOMKilled" error occurs when a container is terminated due to out-of-memory conditions. This typically happens when a container exceeds its allocated memory limit and the system is unable to provide additional memory. When this occurs, the container will be killed with exit code 137 to free up resources. + +This error can affect both image building and deployment steps. To resolve the OOMKilled error, follow these steps: + +1. Check which components got OOMKilled +2. Check affected component memory limits +3. Monitor memory usage +4. Optimize model memory usage +5. 
Adjust memory limits + +![Failed image building due to OOMKilled](../../images/deployment_errors/image_building_oomkilled.png) + +### Liveness or readiness probe failures + +Liveness and readiness probes are essential for ensuring the health and availability of Model services. The liveness probe is used to determine if a model is initialized and running properly, while the readiness probe indicates if a model is ready to serve traffic. When these probes fail, it can lead to service disruptions and impact the overall stability of the application. + +Troubleshooting steps: + +1. For standard model type, check pre-trained model size +2. For pyfunc model type, check how model got initialized +3. Inspect model logs +4. Monitor resource utilization + +![Model service in crash loop backoff state](../../images/deployment_errors/crashloopbackoff.png) + +### Image not found + +The "Image Not Found" error occurs when Merlin is unable to locate the specified container image. This can happen for various reasons, such as the image not being available in the specified registry, incorrect image name, or network issues preventing the image pull. + +To troubleshoot and resolve the "Image Not Found" error, follow these steps: + +1. Verify image name and tag +2. Check image registry +3. Test image pull manually + +![Image not found](../../images/deployment_errors/image_not_found.png) \ No newline at end of file diff --git a/docs/user/generated/examples/02_pyfunc_model.md b/docs/user/generated/examples/02_pyfunc_model.md index 1cafb38dd..6d44c55b5 100644 --- a/docs/user/generated/examples/02_pyfunc_model.md +++ b/docs/user/generated/examples/02_pyfunc_model.md @@ -1,20 +1,23 @@ + # Deploy PyFunc Model Try out the notebooks below to learn how to deploy PyFunc Models to Merlin. **Note on compatibility**: The Pyfunc servers are compatible with `protobuf>=3.12.0,<5.0.0`. 
Users whose models have a strong dependency on Protobuf `3.x.x` are advised to pin the library version in their conda environment, when submitting the model version. If using Protobuf `3.x.x`, users can do one of the following: -* Use `protobuf>=3.20.0` - these versions support simplified class definitions and this is the recommended approach. -* If you must use `protobuf>=3.12.0,<3.20.0`, other packages used in the Pyfunc server need to be downgraded as well. Please pin the following in your model’s conda environment: + +- Use `protobuf>=3.20.0` - these versions support simplified class definitions and this is the recommended approach. +- If you must use `protobuf>=3.12.0,<3.20.0`, other packages used in the Pyfunc server need to be downgraded as well. Please pin the following in your model’s conda environment: + ```yaml dependencies: - pip: - - protobuf==3.15.6 # Example older protobuf version - - caraml-upi-protos<=0.3.6 - - grpcio<1.49.0 - - grpcio-reflection<1.49.0 - - grpcio-health-checking<1.49.0 + - protobuf==3.15.6 # Example older protobuf version + - caraml-upi-protos<=0.3.6 + - grpcio<1.49.0 + - grpcio-reflection<1.49.0 + - grpcio-health-checking<1.49.0 ``` ## Deploy PyFunc Model diff --git a/docs/user/templates/09_troubleshooting_deployment_errors.md b/docs/user/templates/09_troubleshooting_deployment_errors.md new file mode 100644 index 000000000..238d50b84 --- /dev/null +++ b/docs/user/templates/09_troubleshooting_deployment_errors.md @@ -0,0 +1,61 @@ + + + +# Troubleshooting Deployment Errors + +This page discusses scenarios you may encounter during model deployment that will require troubleshooting, including: + +- Image building errors +- Deployment errors + +## Troubleshooting views + +Merlin provides the following views on the UI to troubleshoot a deployment: + +- **Logs** - the live console output when the image is building or the deployment is running +- **History** - the list of deployment history status and message + +You can navigate to these 
views from the Model Version page by clicking on the Logs tab or History tab. + +![Model Version's Logs & History tabs](../../images/deployment_errors/version_log_history_tab.png) + +## Known Errors + +### OOMKilled + +The "OOMKilled" error occurs when a container is terminated due to out-of-memory conditions. This typically happens when a container exceeds its allocated memory limit and the system is unable to provide additional memory. When this occurs, the container will be killed with exit code 137 to free up resources. + +This error can affect both image building and deployment steps. To resolve the OOMKilled error, follow these steps: + +1. Check which components got OOMKilled +2. Check affected component memory limits +3. Monitor memory usage +4. Optimize model memory usage +5. Adjust memory limits + +![Failed image building due to OOMKilled](../../images/deployment_errors/image_building_oomkilled.png) + +### Liveness or readiness probe failures + +Liveness and readiness probes are essential for ensuring the health and availability of Model services. The liveness probe is used to determine if a model is initialized and running properly, while the readiness probe indicates if a model is ready to serve traffic. When these probes fail, it can lead to service disruptions and impact the overall stability of the application. + +Troubleshooting steps: + +1. For standard model type, check pre-trained model size +2. For pyfunc model type, check how model got initialized +3. Inspect model logs +4. Monitor resource utilization + +![Model service in crash loop backoff state](../../images/deployment_errors/crashloopbackoff.png) + +### Image not found + +The "Image Not Found" error occurs when Merlin is unable to locate the specified container image. This can happen for various reasons, such as the image not being available in the specified registry, incorrect image name, or network issues preventing the image pull. 
+ +To troubleshoot and resolve the "Image Not Found" error, follow these steps: + +1. Verify image name and tag +2. Check image registry +3. Test image pull manually + +![Image not found](../../images/deployment_errors/image_not_found.png) diff --git a/docs/user/templates/examples/02_pyfunc_model.md b/docs/user/templates/examples/02_pyfunc_model.md index 1cafb38dd..c11190b64 100644 --- a/docs/user/templates/examples/02_pyfunc_model.md +++ b/docs/user/templates/examples/02_pyfunc_model.md @@ -1,20 +1,23 @@ + # Deploy PyFunc Model Try out the notebooks below to learn how to deploy PyFunc Models to Merlin. **Note on compatibility**: The Pyfunc servers are compatible with `protobuf>=3.12.0,<5.0.0`. Users whose models have a strong dependency on Protobuf `3.x.x` are advised to pin the library version in their conda environment, when submitting the model version. If using Protobuf `3.x.x`, users can do one of the following: -* Use `protobuf>=3.20.0` - these versions support simplified class definitions and this is the recommended approach. -* If you must use `protobuf>=3.12.0,<3.20.0`, other packages used in the Pyfunc server need to be downgraded as well. Please pin the following in your model’s conda environment: + +- Use `protobuf>=3.20.0` - these versions support simplified class definitions and this is the recommended approach. +- If you must use `protobuf>=3.12.0,<3.20.0`, other packages used in the Pyfunc server need to be downgraded as well. 
Please pin the following in your model’s conda environment: + ```yaml dependencies: - pip: - - protobuf==3.15.6 # Example older protobuf version - - caraml-upi-protos<=0.3.6 - - grpcio<1.49.0 - - grpcio-reflection<1.49.0 - - grpcio-health-checking<1.49.0 + - protobuf==3.15.6 # Example older protobuf version + - caraml-upi-protos<=0.3.6 + - grpcio<1.49.0 + - grpcio-reflection<1.49.0 + - grpcio-health-checking<1.49.0 ``` ## Deploy PyFunc Model @@ -23,4 +26,4 @@ dependencies: ## Deploy PyFunc Model with Custom Prometheus Metrics -{% embed url="https://github.com/caraml-dev/merlin/blob/main/examples/metrics/Metrics.ipynb" %} \ No newline at end of file +{% embed url="https://github.com/caraml-dev/merlin/blob/main/examples/metrics/Metrics.ipynb" %}