Add troubleshooting deployment errors page
Arief Rahmansyah committed Feb 2, 2024
1 parent 7c835f5 commit 9a56ba1
Showing 10 changed files with 145 additions and 17 deletions.
61 changes: 61 additions & 0 deletions
docs/user/generated/09_troubleshooting_deployment_errors.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,61 @@ | ||
<!-- page-title: Troubleshooting Deployment Errors --> | ||
<!-- parent-page-title: Deploying a Model --> | ||
|
||
# Troubleshooting Deployment Errors | ||
|
||
This page discusses scenarios you may encounter during model deployment that will require troubleshooting, including: | ||
|
||
- Image building errors | ||
- Deployment errors | ||
|
||
## Troubleshooting views | ||
|
||
Merlin provides the following views in the UI to troubleshoot a deployment: | ||
|
||
- **Logs** - the live console output while the image is building or the deployment is running | ||
- **History** - the history of deployments, with the status and message of each | ||
|
||
You can navigate to these views from the Model Version page by clicking on the Logs tab or History tab. | ||
|
||
![Model Version's Logs & History tabs](../../images/deployment_errors/version_log_history_tab.png) | ||
|
||
## Known Errors | ||
|
||
### OOMKilled | ||
|
||
The "OOMKilled" error occurs when a container is terminated due to out-of-memory conditions. This typically happens when a container exceeds its allocated memory limit and the system is unable to provide additional memory. When this occurs, the container will be killed with exit code 137 to free up resources. | ||
|
||
This error can affect both the image building and deployment steps. To resolve the OOMKilled error, follow these steps: | ||
|
||
1. Check which components were OOMKilled | ||
2. Check the memory limits of the affected components | ||
3. Monitor memory usage | ||
4. Optimize model memory usage | ||
5. Adjust memory limits | ||
|
||
![Failed image building due to OOMKilled](../../images/deployment_errors/image_building_oomkilled.png) | ||
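As a minimal sketch of steps 2 and 3, the following Python snippet decodes a container exit code and compares observed memory usage against a configured limit. Exit code 137 is 128 + 9, and signal 9 is SIGKILL. The helper names (`describe_exit_code`, `parse_memory_quantity`, `memory_usage_ratio`) and the sample figures are hypothetical and not part of Merlin; the parser only handles the common Kubernetes-style suffixes.

```python
import signal

# Multipliers for common Kubernetes-style memory suffixes (not exhaustive).
_SUFFIXES = {
    "Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4,
    "k": 1000, "M": 1000**2, "G": 1000**3, "T": 1000**4,
}


def describe_exit_code(code: int) -> str:
    """Exit codes above 128 conventionally mean 'terminated by signal (code - 128)'."""
    if code > 128:
        try:
            return f"killed by {signal.Signals(code - 128).name}"
        except ValueError:
            # Signal number not known on this platform.
            return f"killed by signal {code - 128}"
    return f"exited with status {code}"


def parse_memory_quantity(quantity: str) -> int:
    """Convert a quantity such as '512Mi' or '2G' to a number of bytes."""
    # Try two-character suffixes first so 'Mi' is not mistaken for 'M'.
    for suffix in sorted(_SUFFIXES, key=len, reverse=True):
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * _SUFFIXES[suffix])
    return int(quantity)  # plain byte count


def memory_usage_ratio(used: str, limit: str) -> float:
    """Fraction of the memory limit currently in use."""
    return parse_memory_quantity(used) / parse_memory_quantity(limit)


# Hypothetical figures: a container using 480Mi against a 512Mi limit.
print(describe_exit_code(137))
print(f"{memory_usage_ratio('480Mi', '512Mi'):.0%} of the memory limit in use")
```

A ratio that stays close to 1.0 under normal load suggests either reducing the model's memory footprint (step 4) or raising the limit (step 5).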
|
||
### Liveness or readiness probe failures | ||
|
||
Liveness and readiness probes are essential for ensuring the health and availability of model services. The liveness probe determines whether a model is initialized and running properly, while the readiness probe indicates whether a model is ready to serve traffic. When these probes fail, the service can be restarted or removed from traffic, which disrupts requests and affects the overall stability of the application. | ||
|
||
Troubleshooting steps: | ||
|
||
1. For standard model types, check the size of the pre-trained model | ||
2. For the pyfunc model type, check how the model is initialized | ||
3. Inspect model logs | ||
4. Monitor resource utilization | ||
|
||
![Model service in crash loop backoff state](../../images/deployment_errors/crashloopbackoff.png) | ||
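For step 2, one way to check pyfunc initialization is to time it locally and compare the result with the time the probes allow before the container is restarted. The sketch below assumes Kubernetes-style probe parameters (initial delay, period, failure threshold); the values shown and the `load_model` function are placeholders, not Merlin defaults, and the budget is a rough estimate that ignores probe timeouts and scheduling jitter.

```python
import time
from dataclasses import dataclass


@dataclass
class ProbeConfig:
    # Hypothetical values; read the real ones from the deployed service.
    initial_delay_seconds: float
    period_seconds: float
    failure_threshold: int

    def startup_budget_seconds(self) -> float:
        # The first probe fires after the initial delay, and the container is
        # restarted after `failure_threshold` consecutive failures.
        return self.initial_delay_seconds + (self.failure_threshold - 1) * self.period_seconds


def time_call(fn) -> float:
    """Return the wall-clock seconds taken by a single call to fn()."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start


def load_model() -> None:
    # Placeholder for the model's initialization logic
    # (for example, loading artifacts from disk).
    time.sleep(0.05)


probe = ProbeConfig(initial_delay_seconds=30, period_seconds=10, failure_threshold=3)
elapsed = time_call(load_model)
budget = probe.startup_budget_seconds()
status = "within" if elapsed < budget else "exceeds"
print(f"initialization took {elapsed:.2f}s, {status} the ~{budget:.0f}s probe budget")
```

If initialization exceeds the budget, the probes fail before the model is ready and the service can end up in a crash loop, so either the initialization should be made lighter or more resources allocated.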
|
||
### Image not found | ||
|
||
The "Image Not Found" error occurs when Merlin is unable to locate the specified container image. This can happen for various reasons, such as the image not being available in the specified registry, incorrect image name, or network issues preventing the image pull. | ||
|
||
To troubleshoot and resolve the "Image Not Found" error, follow these steps: | ||
|
||
1. Verify image name and tag | ||
2. Check image registry | ||
3. Test image pull manually | ||
|
||
![Image not found](../../images/deployment_errors/image_not_found.png) |
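For step 1, it can help to split the image reference into its parts and check each one against the registry. The following is a simplified sketch, not Merlin's own logic: `parse_image_ref` is a hypothetical helper, the example hostname is a placeholder, digests (`@sha256:...`) are not handled, and the `docker.io` and `latest` defaults follow the usual Docker convention.

```python
from typing import NamedTuple


class ImageRef(NamedTuple):
    registry: str
    repository: str
    tag: str


def parse_image_ref(ref: str) -> ImageRef:
    """Simplified split of 'registry/repository:tag'."""
    first, slash, rest = ref.partition("/")
    # The leading component is a registry host only if it looks like one.
    if slash and ("." in first or ":" in first or first == "localhost"):
        registry, remainder = first, rest
    else:
        registry, remainder = "docker.io", ref
    # Only a colon in the last path segment separates the tag.
    head, sep, last = remainder.rpartition("/")
    name, colon, tag = last.partition(":")
    return ImageRef(registry, f"{head}{sep}{name}", tag if colon else "latest")


# Placeholder reference; substitute the image shown in the error message.
print(parse_image_ref("registry.example.com:5000/team/model:v1"))
```

Once the registry, repository, and tag are confirmed, pulling the same reference manually (step 3), for example with `docker pull`, shows whether the problem is the reference itself or access to the registry.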
61 changes: 61 additions & 0 deletions
docs/user/templates/09_troubleshooting_deployment_errors.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,61 @@ | ||
<!-- page-title: Troubleshooting Deployment Errors --> | ||
<!-- parent-page-title: Deploying a Model --> | ||
|
||
# Troubleshooting Deployment Errors | ||
|
||
This page discusses scenarios you may encounter during model deployment that will require troubleshooting, including: | ||
|
||
- Image building errors | ||
- Deployment errors | ||
|
||
## Troubleshooting views | ||
|
||
Merlin provides the following views in the UI to troubleshoot a deployment: | ||
|
||
- **Logs** - the live console output while the image is building or the deployment is running | ||
- **History** - the history of deployments, with the status and message of each | ||
|
||
You can navigate to these views from the Model Version page by clicking on the Logs tab or History tab. | ||
|
||
![Model Version's Logs & History tabs](../../images/deployment_errors/version_log_history_tab.png) | ||
|
||
## Known Errors | ||
|
||
### OOMKilled | ||
|
||
The "OOMKilled" error occurs when a container is terminated due to out-of-memory conditions. This typically happens when a container exceeds its allocated memory limit and the system is unable to provide additional memory. When this occurs, the container will be killed with exit code 137 to free up resources. | ||
|
||
This error can affect both the image building and deployment steps. To resolve the OOMKilled error, follow these steps: | ||
|
||
1. Check which components were OOMKilled | ||
2. Check the memory limits of the affected components | ||
3. Monitor memory usage | ||
4. Optimize model memory usage | ||
5. Adjust memory limits | ||
|
||
![Failed image building due to OOMKilled](../../images/deployment_errors/image_building_oomkilled.png) | ||
|
||
### Liveness or readiness probe failures | ||
|
||
Liveness and readiness probes are essential for ensuring the health and availability of model services. The liveness probe determines whether a model is initialized and running properly, while the readiness probe indicates whether a model is ready to serve traffic. When these probes fail, the service can be restarted or removed from traffic, which disrupts requests and affects the overall stability of the application. | ||
|
||
Troubleshooting steps: | ||
|
||
1. For standard model types, check the size of the pre-trained model | ||
2. For the pyfunc model type, check how the model is initialized | ||
3. Inspect model logs | ||
4. Monitor resource utilization | ||
|
||
![Model service in crash loop backoff state](../../images/deployment_errors/crashloopbackoff.png) | ||
|
||
### Image not found | ||
|
||
The "Image Not Found" error occurs when Merlin is unable to locate the specified container image. This can happen for various reasons, such as the image not being available in the specified registry, incorrect image name, or network issues preventing the image pull. | ||
|
||
To troubleshoot and resolve the "Image Not Found" error, follow these steps: | ||
|
||
1. Verify image name and tag | ||
2. Check image registry | ||
3. Test image pull manually | ||
|
||
![Image not found](../../images/deployment_errors/image_not_found.png) |