From fc5f5ace4b1a5855682bb6ee531af29526147483 Mon Sep 17 00:00:00 2001
From: Anirban Pal
Date: Wed, 30 Oct 2024 02:59:38 +0700
Subject: [PATCH 1/3] fixed [Docs] Spot/interruptible docs imply retries come from the user retry budget #3956

Signed-off-by: Anirban Pal
---
 .../concepts/main_concepts/tasks.rst          | 33 +++++++++++++++
 .../flyte_fundamentals/optimizing_tasks.md    | 41 +++++++++++++++++++
 2 files changed, 74 insertions(+)

diff --git a/docs/user_guide/concepts/main_concepts/tasks.rst b/docs/user_guide/concepts/main_concepts/tasks.rst
index 8e5cc7aaec..0f2d5f30d6 100644
--- a/docs/user_guide/concepts/main_concepts/tasks.rst
+++ b/docs/user_guide/concepts/main_concepts/tasks.rst
@@ -123,3 +123,36 @@ Caching/Memoization
 Flyte supports memoization of task outputs to ensure that identical invocations of a task are not executed repeatedly, thereby saving compute resources and execution time. For example, if you wish to run the same piece of code multiple times, you can reuse the output instead of re-computing it. For more information on memoization, refer to the :std:doc:`/user_guide/development_lifecycle/caching`.
+
+### Retries and Spot Instances
+
+Tasks can define a retry strategy to handle different types of failures:
+
+1. **System Retries**: Used for infrastructure-level failures outside of user control:
+   - Spot instance preemptions
+   - Network issues
+   - Service unavailability
+   - Hardware failures
+   
+   *Important*: When running on spot/interruptible instances, preemptions count against the system retry budget, not the user retry budget. The last retry attempt automatically runs on a non-preemptible instance to ensure task completion.
+
+2. **User Retries**: Specified in the `@task` decorator (via `retries` parameter), used for:
+   - Application-level errors
+   - Invalid input handling
+   - Business logic failures
+
+```python
+@task(retries=3)  # Sets user retry budget to 3
+def my_task() -> None:
+    ...
+```
+
+### Alternative Retry Behavior
+
+Starting with RFC 3902, Flyte offers a simplified retry behavior where both system and user retries count towards a single retry budget defined in the task decorator. To enable this:
+
+1. Set `configmap.core.propeller.node-config.ignore-retry-cause` to `true` in helm values
+2. Define retries in the task decorator to set the total retry budget
+3. The last retries will automatically run on non-spot instances
+
+This provides a simpler, more predictable retry behavior while maintaining reliability.
\ No newline at end of file
diff --git a/docs/user_guide/flyte_fundamentals/optimizing_tasks.md b/docs/user_guide/flyte_fundamentals/optimizing_tasks.md
index 6796876bb6..5be50448ea 100644
--- a/docs/user_guide/flyte_fundamentals/optimizing_tasks.md
+++ b/docs/user_guide/flyte_fundamentals/optimizing_tasks.md
@@ -273,6 +273,47 @@ the resources that you need. In this case, that need is distributed training, but
 Flyte also provides integrations for {ref}`Spark `, {ref}`Ray `, {ref}`MPI `, {ref}`Snowflake `, and more.
+## Retries and Spot Instances
+
+When running tasks on spot/interruptible instances, it's important to understand how retries work:
+
+```python
+from flytekit import task
+
+@task(
+    retries=3,  # User retry budget
+    interruptible=True  # Enables running on spot instances
+)
+def my_task() -> None:
+    ...
+```
+
+### Default Retry Behavior
+- Spot instance preemptions count against the system retry budget (not user retries)
+- The last system retry automatically runs on a non-preemptible instance
+- User retries (specified in `@task` decorator) are only used for application errors
+
+### Simplified Retry Behavior
+Flyte also offers a simplified retry model where both system and user retries count towards a single budget:
+
+```python
+@task(
+    retries=5,  # Total retry budget for both system and user errors
+    interruptible=True
+)
+def my_task() -> None:
+    ...
+```
+
+To enable this behavior:
+1. Set `configmap.core.propeller.node-config.ignore-retry-cause=true` in platform config
+2. Define total retry budget in task decorator
+3. Last retries automatically run on non-spot instances
+
+Choose the retry model that best fits your use case:
+- Default: Separate budgets for system vs user errors
+- Simplified: Single retry budget with guaranteed completion
+
 Even though Flyte itself is a powerful compute engine and orchestrator for data engineering, machine learning, and analytics, perhaps you have existing code that leverages other platforms. Flyte recognizes the pain of migrating code,

From 56afdfc24c041ac491b97b6f0f9f1fafa45913c4 Mon Sep 17 00:00:00 2001
From: Anirban Pal
Date: Wed, 30 Oct 2024 23:41:47 +0700
Subject: [PATCH 2/3] updated line 152 tasks.rst

Signed-off-by: Anirban Pal
---
 docs/user_guide/concepts/main_concepts/tasks.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/user_guide/concepts/main_concepts/tasks.rst b/docs/user_guide/concepts/main_concepts/tasks.rst
index 0f2d5f30d6..035957587e 100644
--- a/docs/user_guide/concepts/main_concepts/tasks.rst
+++ b/docs/user_guide/concepts/main_concepts/tasks.rst
@@ -149,7 +149,7 @@ def my_task() -> None:

 ### Alternative Retry Behavior

-Starting with RFC 3902, Flyte offers a simplified retry behavior where both system and user retries count towards a single retry budget defined in the task decorator. To enable this:
+Starting with from 1.10.0, Flyte offers a simplified retry behavior where both system and user retries count towards a single retry budget defined in the task decorator. To enable this:

 1. Set `configmap.core.propeller.node-config.ignore-retry-cause` to `true` in helm values
 2. Define retries in the task decorator to set the total retry budget

From 7f7a5a9f28406172d6220ca61645b1bb7b4ab130 Mon Sep 17 00:00:00 2001
From: Anirban Pal
Date: Thu, 31 Oct 2024 04:49:07 +0700
Subject: [PATCH 3/3] updated formatting of tasks.rst

Signed-off-by: Anirban Pal
---
 .../concepts/main_concepts/tasks.rst          | 22 +++++++++----------
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/docs/user_guide/concepts/main_concepts/tasks.rst b/docs/user_guide/concepts/main_concepts/tasks.rst
index 035957587e..90d3e1f750 100644
--- a/docs/user_guide/concepts/main_concepts/tasks.rst
+++ b/docs/user_guide/concepts/main_concepts/tasks.rst
@@ -133,26 +133,26 @@ Tasks can define a retry strategy to handle different types of failures:
    - Network issues
    - Service unavailability
    - Hardware failures
-   
-   *Important*: When running on spot/interruptible instances, preemptions count against the system retry budget, not the user retry budget. The last retry attempt automatically runs on a non-preemptible instance to ensure task completion.

-2. **User Retries**: Specified in the `@task` decorator (via `retries` parameter), used for:
+*Important*: When running on spot/interruptible instances, preemptions count against the system retry budget, not the user retry budget. The last retry attempt automatically runs on a non-preemptible instance to ensure task completion.
+
+2. **User Retries**: Specified in the ``@task`` decorator (via ``retries`` parameter), used for:
    - Application-level errors
    - Invalid input handling
    - Business logic failures

-```python
-@task(retries=3)  # Sets user retry budget to 3
-def my_task() -> None:
-    ...
-```
+.. code-block:: python
+
+    @task(retries=3)  # Sets user retry budget to 3
+    def my_task() -> None:
+        ...

 ### Alternative Retry Behavior

-Starting with from 1.10.0, Flyte offers a simplified retry behavior where both system and user retries count towards a single retry budget defined in the task decorator. To enable this:
+Starting from 1.10.0, Flyte offers a simplified retry behavior where both system and user retries count towards a single retry budget defined in the task decorator. To enable this:

-1. Set `configmap.core.propeller.node-config.ignore-retry-cause` to `true` in helm values
+1. Set ``configmap.core.propeller.node-config.ignore-retry-cause`` to ``true`` in helm values
 2. Define retries in the task decorator to set the total retry budget
 3. The last retries will automatically run on non-spot instances

-This provides a simpler, more predictable retry behavior while maintaining reliability.
\ No newline at end of file
+This provides a simpler, more predictable retry behavior while maintaining reliability.
\ No newline at end of file
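
Note on the two retry models documented by this series: the sketch below is a rough, plain-Python illustration of the accounting difference the docs describe, not Flyte's implementation. `RetryBudget`, `can_retry`, and the `system_retries=3` default are hypothetical names and values chosen for the example; only `retries` and `ignore-retry-cause` come from the patch text. It does not model the documented fallback of the last attempt(s) to non-spot instances.

```python
from dataclasses import dataclass


@dataclass
class RetryBudget:
    """Toy model of the retry accounting described in the docs (not a Flyte API)."""

    user_retries: int                 # value passed as `retries` in the @task decorator
    system_retries: int = 3           # hypothetical platform-side budget (assumed value)
    ignore_retry_cause: bool = False  # mirrors the `ignore-retry-cause` setting

    def can_retry(self, user_failures: int, system_failures: int) -> bool:
        """Return True if another attempt is allowed after the given failure counts."""
        if self.ignore_retry_cause:
            # Simplified model: every failure draws from the single task-level budget.
            return user_failures + system_failures <= self.user_retries
        # Default model: each failure type is charged to its own budget, so
        # spot preemptions (system failures) never consume user retries.
        return (
            user_failures <= self.user_retries
            and system_failures <= self.system_retries
        )


default_model = RetryBudget(user_retries=3)
simplified_model = RetryBudget(user_retries=5, ignore_retry_cause=True)
```

Under these assumptions, three preemptions leave the user budget of `default_model` untouched, while in `simplified_model` the same three preemptions use up three of the five shared retries.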