From d94833b235393bcf192e0d5d6bb4f4a8c612ee41 Mon Sep 17 00:00:00 2001
From: Anirban Pal
Date: Wed, 30 Oct 2024 02:59:38 +0700
Subject: [PATCH] fixed [Docs] Spot/interruptible docs imply retries come from the user retry budget #3956

---
 .../concepts/main_concepts/tasks.rst          | 33 +++++++++++++++
 .../flyte_fundamentals/optimizing_tasks.md    | 41 +++++++++++++++++++
 2 files changed, 74 insertions(+)

diff --git a/docs/user_guide/concepts/main_concepts/tasks.rst b/docs/user_guide/concepts/main_concepts/tasks.rst
index 8e5cc7aaecf..0f2d5f30d6e 100644
--- a/docs/user_guide/concepts/main_concepts/tasks.rst
+++ b/docs/user_guide/concepts/main_concepts/tasks.rst
@@ -123,3 +123,36 @@ Caching/Memoization
 
 Flyte supports memoization of task outputs to ensure that identical invocations of a task are not executed repeatedly, thereby saving compute resources and execution time. For example, if you wish to run the same piece of code multiple times, you can reuse the output instead of re-computing it.
 For more information on memoization, refer to the :std:doc:`/user_guide/development_lifecycle/caching`.
+
+### Retries and Spot Instances
+
+Tasks can define a retry strategy to handle different types of failures:
+
+1. **System Retries**: Used for infrastructure-level failures outside of user control:
+   - Spot instance preemptions
+   - Network issues
+   - Service unavailability
+   - Hardware failures
+
+   *Important*: When running on spot/interruptible instances, preemptions count against the system retry budget, not the user retry budget. The last retry attempt automatically runs on a non-preemptible instance to ensure task completion.
+
+2. **User Retries**: Specified in the `@task` decorator (via `retries` parameter), used for:
+   - Application-level errors
+   - Invalid input handling
+   - Business logic failures
+
+```python
+@task(retries=3)  # Sets user retry budget to 3
+def my_task() -> None:
+    ...
+```
+
+### Alternative Retry Behavior
+
+Starting with RFC 3902, Flyte offers a simplified retry behavior where both system and user retries count towards a single retry budget defined in the task decorator. To enable this:
+
+1. Set `configmap.core.propeller.node-config.ignore-retry-cause` to `true` in helm values
+2. Define retries in the task decorator to set the total retry budget
+3. The last retries will automatically run on non-spot instances
+
+This provides a simpler, more predictable retry behavior while maintaining reliability.
\ No newline at end of file
diff --git a/docs/user_guide/flyte_fundamentals/optimizing_tasks.md b/docs/user_guide/flyte_fundamentals/optimizing_tasks.md
index 6796876bb62..5be50448eaf 100644
--- a/docs/user_guide/flyte_fundamentals/optimizing_tasks.md
+++ b/docs/user_guide/flyte_fundamentals/optimizing_tasks.md
@@ -273,6 +273,47 @@ the resources that you need. In this case, that need is distributed training,
 but Flyte also provides integrations for {ref}`Spark `, {ref}`Ray `, {ref}`MPI `, {ref}`Snowflake `,
 and more.
 
+## Retries and Spot Instances
+
+When running tasks on spot/interruptible instances, it's important to understand how retries work:
+
+```python
+from flytekit import task
+
+@task(
+    retries=3,  # User retry budget
+    interruptible=True  # Enables running on spot instances
+)
+def my_task() -> None:
+    ...
+```
+
+### Default Retry Behavior
+- Spot instance preemptions count against the system retry budget (not user retries)
+- The last system retry automatically runs on a non-preemptible instance
+- User retries (specified in `@task` decorator) are only used for application errors
+
+### Simplified Retry Behavior
+Flyte also offers a simplified retry model where both system and user retries count towards a single budget:
+
+```python
+@task(
+    retries=5,  # Total retry budget for both system and user errors
+    interruptible=True
+)
+def my_task() -> None:
+    ...
+```
+
+To enable this behavior:
+1. Set `configmap.core.propeller.node-config.ignore-retry-cause=true` in platform config
+2. Define total retry budget in task decorator
+3. Last retries automatically run on non-spot instances
+
+Choose the retry model that best fits your use case:
+- Default: Separate budgets for system vs user errors
+- Simplified: Single retry budget with guaranteed completion
+
 Even though Flyte itself is a powerful compute engine and orchestrator for
 data engineering, machine learning, and analytics, perhaps you have existing
 code that leverages other platforms. Flyte recognizes the pain of migrating code,