AWS Step Functions runTask retry

A demo project that shows how to implement complex retry logic around the ECS runTask API in a step function.

Scope of this project

AWS Step Functions lets you run an ECS task, on either the EC2 or the Fargate launch type, as a step of your state machine. The step function starts the task through the ECS runTask API and waits until one of the "essential" containers exits. When the task stops because of an error, you can use the retry feature of Step Functions to rerun it. In most cases, though, you want a different retry strategy for each type of error: a fatal error, for example, should not trigger another run.

With Lambda functions you can read the error type and tailor the retry strategy to it, but with ECS tasks the error type is always States.TaskFailed. The only way to find out why the task failed is to parse the Cause field of the step output and read the container exit code, which the Amazon States Language cannot do on its own. This project shows how to parse the error returned by the task and choose a retry strategy based on it.
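The parsing step described above can be sketched in a few lines of JavaScript. This is a minimal illustration, not the project's actual code: the function name getExitCode is hypothetical, and it assumes that Cause is a JSON string describing the stopped task, with a Containers array whose entries carry Name and ExitCode.

```javascript
// Sketch: extract a container exit code from a caught Step Functions error.
// Assumption: `Cause` is a JSON string describing the stopped ECS task, with a
// `Containers` array whose entries have `Name` and `ExitCode` fields.
function getExitCode(taskError, containerName) {
  let task;
  try {
    task = JSON.parse(taskError.Cause);
  } catch (err) {
    // Cause is missing or is not JSON, so no exit code can be recovered.
    return null;
  }
  const containers = task && Array.isArray(task.Containers) ? task.Containers : [];
  // Pick the named container, or the first one that reports an exit code.
  const container = containerName
    ? containers.find((c) => c.Name === containerName)
    : containers.find((c) => typeof c.ExitCode === "number");
  return container && typeof container.ExitCode === "number"
    ? container.ExitCode
    : null;
}
```

Returning null instead of throwing lets the caller treat an unreadable Cause as its own error category.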

How it works

The custom retry mechanism relies on the application running in the container returning an exit code that identifies the type of error that occurred. When the task fails, we catch the error and send the output to the ProcessError step.

"Catch": [
    {
        "ErrorEquals": [
            "States.ALL"
        ],
        "ResultPath": "$.taskError",
        "Next": "ProcessError"
    }
]

The ProcessError step is a simple Lambda function that parses the output (which contains the error cause and the exit code) and returns an errorState as part of its output. This is an example of the output:

"errorState": {
  "type": "RETRIABLE",
  "retryCount": 1,
  "waitSeconds": 60
}

Every time the Lambda function processes an error output, it increments the retry count. When the maximum number of retries is reached, it returns the FATAL error type.
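A function with this behavior could look roughly like the sketch below. It is an illustration under stated assumptions, not the project's implementation: the exit-code table, MAX_RETRIES, WAIT_SECONDS, and the function names are all hypothetical, and the Cause field is assumed to be a JSON task description with a Containers array.

```javascript
// Hypothetical exit-code table; the real application defines its own codes.
const EXIT_CODE_TYPES = { 2: "RETRIABLE", 3: "TEMPORARY" };
const MAX_RETRIES = 3; // assumed retry limit
const WAIT_SECONDS = 60; // assumed delay before a retry

// Assumption: Cause is a JSON string with a `Containers[].ExitCode` field.
function readExitCode(cause) {
  try {
    const task = JSON.parse(cause);
    const container = (task.Containers || []).find(
      (c) => typeof c.ExitCode === "number"
    );
    return container ? container.ExitCode : null;
  } catch (err) {
    return null; // missing or unparsable Cause
  }
}

// Pure function: takes the state input and returns it with an updated errorState.
function processError(event) {
  const previousCount = (event.errorState && event.errorState.retryCount) || 0;
  const retryCount = previousCount + 1;
  const exitCode = readExitCode(event.taskError && event.taskError.Cause);
  // Unknown exit codes are treated as fatal.
  let type = EXIT_CODE_TYPES[exitCode] || "FATAL";
  // Once the retry budget is spent, stop retrying regardless of the exit code.
  if (retryCount > MAX_RETRIES) type = "FATAL";
  return { ...event, errorState: { type, retryCount, waitSeconds: WAIT_SECONDS } };
}

// In a Lambda deployment, this async wrapper would be exported as the handler.
async function handler(event) {
  return processError(event);
}
```

Because the previous errorState travels in the state input, the function stays stateless and the retry count survives across loop iterations.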

The next step uses a Choice state to decide which step to run next, based on the returned error state type.

"ErrorStateChoice": {
    "Type": "Choice",
    "Choices": [
        {
            "Variable": "$.errorState.type",
            "StringEquals": "RETRIABLE",
            "Next": "RunECSTask"
        },
        {
            "Variable": "$.errorState.type",
            "StringEquals": "TEMPORARY",
            "Next": "WaitBeforeRetry"
        },
        {
            "Variable": "$.errorState.type",
            "StringEquals": "FATAL",
            "Next": "FatalError"
        }
    ]
}

RETRIABLE: rerun the ECS task immediately.
TEMPORARY: wait before rerunning the task. The next step is WaitBeforeRetry, a Wait state; the number of seconds to wait is returned as part of the previous step's output.
FATAL: fail the step function immediately. The next step is FatalError, a Fail state.
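To see how the three branches combine into a loop, the flow can be simulated locally without any AWS calls. The sketch below is only an illustration: simulate, runTask, and processError are hypothetical stand-ins for the state machine and its RunECSTask and ProcessError states, and the wait is skipped rather than actually performed.

```javascript
// Local simulation of the retry loop; nothing here talks to AWS.
// `runTask(state)` returns an error object on failure or null on success.
// `processError(state)` returns the state with an `errorState` attached.
function simulate(runTask, processError, input) {
  const visited = [];
  let state = { ...input };
  let next = "RunECSTask";
  while (next) {
    visited.push(next);
    switch (next) {
      case "RunECSTask": {
        const error = runTask(state);
        if (error) {
          // Mirrors the Catch block: the error is stored under taskError.
          state = { ...state, taskError: error };
          next = "ProcessError";
        } else {
          next = null; // task succeeded, execution ends
        }
        break;
      }
      case "ProcessError":
        state = processError(state);
        next = "ErrorStateChoice";
        break;
      case "ErrorStateChoice": {
        const routes = {
          RETRIABLE: "RunECSTask",
          TEMPORARY: "WaitBeforeRetry",
          FATAL: "FatalError",
        };
        next = routes[state.errorState.type];
        // A Choice state with no matching rule and no default fails the execution.
        if (!next) throw new Error("States.NoChoiceMatched");
        break;
      }
      case "WaitBeforeRetry":
        // A real Wait state would pause for state.errorState.waitSeconds here.
        next = "RunECSTask";
        break;
      case "FatalError":
        return { status: "FAILED", visited };
    }
  }
  return { status: "SUCCEEDED", visited };
}
```

For example, a task that fails twice with a TEMPORARY error and then succeeds passes through WaitBeforeRetry twice before the final RunECSTask.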

The full definition of the state machine can be found here.

Test the step function

The state machine expects as input the name of one of the app errors defined here. The error name is passed to the app task and causes the container to exit with the corresponding exit code.
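On the application side, the mapping from error name to exit code might be as simple as the sketch below. Only NetworkError appears in this README; the other error names, every numeric code, and the function names are hypothetical and chosen purely for illustration.

```javascript
// Hypothetical error table: only NetworkError is named in this README, and
// all exit codes here are illustrative, not the project's actual values.
const APP_ERRORS = {
  NetworkError: 2,
  ThrottlingError: 3,
  ValidationError: 4,
};

// Returns the exit code for an error name: 0 when no error is requested,
// 1 for a name that is not in the table.
function exitCodeFor(errorName) {
  if (errorName === undefined) return 0;
  return Object.prototype.hasOwnProperty.call(APP_ERRORS, errorName)
    ? APP_ERRORS[errorName]
    : 1;
}

// Entry point sketch: `node app.js NetworkError` would exit with code 2.
// Call main(process.argv) from the container's start script.
function main(argv) {
  process.exitCode = exitCodeFor(argv[2]);
}
```

Setting process.exitCode instead of calling process.exit lets pending output flush before the container stops.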

Open the step function page in the AWS console and click Start Execution. In the input section, add something like this:

{
    "commands": [
      "NetworkError"
    ]
}

Development

Publish docker image to AWS ECR

Prerequisites:

Your docker client is authenticated to your AWS ECR registry
AWS credentials, AWS_REGION and AWS_ACCOUNT_ID exported as environment variables.

To build and publish the docker image to AWS ECR run:

npm run docker:publish

Build and run locally

npm run docker:build && npm run docker:run

Deploy

The project uses the Serverless Framework and its Step Functions plugin.

To deploy the project just run:

serverless deploy --subnetId <SUBNET-ID>

The subnetId parameter specifies the subnet in which the Fargate task will run.
