added FIS content for ECS workshop #216

Empty file added .hugo_build.lock
Empty file.
@@ -0,0 +1,111 @@
---
title: "Creating the Spot Interruption Experiment"
weight: 100
---

In this section, you're going to create the experiment to [trigger the interruption of Amazon EC2 Spot Instances using AWS Fault Injection Simulator (FIS)](https://aws.amazon.com/blogs/compute/implementing-interruption-tolerance-in-amazon-ec2-spot-with-aws-fault-injection-simulator/). When using Spot Instances, you need to be prepared for interruptions. With FIS, you can test the resiliency of your workload and validate that your application reacts to the interruption notices that EC2 sends before terminating your instances. You can target individual Spot Instances or a subset of instances in clusters managed by services that tag your instances, such as Auto Scaling groups (ASG), EC2 Fleet, and Amazon EMR.

You're going to use the CLI, so launch your terminal to run the commands included in this section.

#### What do you need to get started?

Before you start launching Spot interruptions with FIS, you need to create an experiment template. This is where you define which resources you want to interrupt (targets) and when you want to interrupt them.

You're going to use the following CloudFormation template which creates the IAM role (`FISSpotRole`) with the minimum permissions FIS needs to interrupt an instance, and the experiment template (`FISExperimentTemplate`) you're going to use to trigger a Spot interruption:

```
AWSTemplateFormatVersion: 2010-09-09
Description: FIS for Spot Instances
Parameters:
  InstancesToInterrupt:
    Description: Number of instances to interrupt
    Default: 3
    Type: Number

  DurationBeforeInterruption:
    Description: Number of minutes before the interruption
    Default: 2
    Type: Number

Resources:

  FISSpotRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Statement:
          - Effect: Allow
            Principal:
              Service: [fis.amazonaws.com]
            Action: ["sts:AssumeRole"]
      Path: /
      Policies:
        - PolicyName: root
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              - Effect: Allow
                Action: 'ec2:DescribeInstances'
                Resource: '*'
              - Effect: Allow
                Action: 'ec2:SendSpotInstanceInterruptions'
                Resource: 'arn:aws:ec2:*:*:instance/*'

  FISExperimentTemplate:
    Type: AWS::FIS::ExperimentTemplate
    Properties:
      Description: "Interrupt multiple random instances"
      Targets:
        SpotIntances:
          ResourceTags:
            ResourceTagKey: ResourceTagValue
          Filters:
            - Path: State.Name
              Values:
                - running
          ResourceType: aws:ec2:spot-instance
          SelectionMode: !Join ["", ["COUNT(", !Ref InstancesToInterrupt, ")"]]
      Actions:
        interrupt:
          ActionId: "aws:ec2:send-spot-instance-interruptions"
          Description: "Interrupt multiple Spot instances"
          Parameters:
            durationBeforeInterruption: !Join ["", ["PT", !Ref DurationBeforeInterruption, "M"]]
          Targets:
            SpotInstances: SpotIntances
      StopConditions:
        - Source: none
      RoleArn: !GetAtt FISSpotRole.Arn
      Tags:
        Name: "FIS_EXP_NAME"

Outputs:
  FISExperimentID:
    Value: !GetAtt FISExperimentTemplate.Id
```

Here are some important notes about the template:

* You can configure how many instances to interrupt with the `InstancesToInterrupt` parameter. By default, the template interrupts **three** instances.
* You can configure how long the experiment waits before interrupting the instances with the `DurationBeforeInterruption` parameter. The default is two minutes, which means that as soon as you launch the experiment, the instances receive the two-minute Spot interruption notice.
* The most important section is `Targets` in the experiment template. It has two placeholders, `ResourceTagKey` and `ResourceTagValue`, which are the key and value of the tag used to choose the instances to interrupt. We're going to run a `sed` command to replace them with the proper values for this workshop.
* Notice that instances are **chosen randomly**, and only from those in the `running` state.
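
To make the two `!Join` expressions in the template concrete, here is a minimal shell sketch of the strings they produce. The helper names `selection_mode` and `duration_before_interruption` are hypothetical and exist only for illustration; they are not part of FIS or the AWS CLI.

```bash
# Sketch only: mirrors the template's !Join expressions in plain shell.
# Both function names are hypothetical helpers for illustration.

# Builds the FIS selection mode from InstancesToInterrupt.
selection_mode() {
  printf 'COUNT(%s)' "$1"
}

# Builds the ISO 8601 duration from DurationBeforeInterruption (minutes).
duration_before_interruption() {
  printf 'PT%sM' "$1"
}

selection_mode 3; echo                 # prints COUNT(3)
duration_before_interruption 2; echo   # prints PT2M
```

With the template defaults, FIS therefore receives `COUNT(3)` as the selection mode and `PT2M` as the duration before interruption.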

#### Create the EC2 Spot Interruption Experiment with FIS

Let's continue by creating the Spot interruption experiment template using CloudFormation. You can view the CloudFormation template (**fisspotinterruption.yaml**) on GitHub [here](https://raw.githubusercontent.com/awslabs/ec2-spot-workshops/master/workshops/fis/fisspotinterruption.yaml). To download it, you can run the following command:

```
wget https://raw.githubusercontent.com/awslabs/ec2-spot-workshops/master/workshops/fis/fisspotinterruption.yaml
```

Now, simply run the following commands to create the FIS experiment:

```
export FIS_EXP_NAME=fis-spot-interruption
sed -i -e "s#ResourceTagKey#aws:autoscaling:groupName#g" fisspotinterruption.yaml
sed -i -e "s#ResourceTagValue#EcsSpotWorkshop-ASG-SPOT#g" fisspotinterruption.yaml
sed -i -e "s#FIS_EXP_NAME#$FIS_EXP_NAME#g" fisspotinterruption.yaml
aws cloudformation create-stack --stack-name $FIS_EXP_NAME --template-body file://fisspotinterruption.yaml --capabilities CAPABILITY_NAMED_IAM
aws cloudformation wait stack-create-complete --stack-name $FIS_EXP_NAME
```
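
If the stack creation fails, one thing worth checking is whether the `sed` commands actually replaced every placeholder. The sketch below is one way to do that; `check_placeholders` is a hypothetical helper name, and it only looks for the three placeholder strings used in this section.

```bash
# Hypothetical helper: returns non-zero if any placeholder is still in the file.
check_placeholders() {
  local file="$1"
  # grep prints the offending lines (with line numbers) when a placeholder remains.
  if grep -nE 'ResourceTagKey|ResourceTagValue|FIS_EXP_NAME' "$file"; then
    echo "The placeholders above were not replaced in $file" >&2
    return 1
  fi
  echo "All placeholders replaced in $file"
}

# Usage, after running the sed commands:
#   check_placeholders fisspotinterruption.yaml
```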
@@ -0,0 +1,58 @@
---
title: "Interrupting a Spot Instance using FIS"
weight: 110
---

In this section, you're going to launch a Spot interruption using FIS and then verify that the capacity has been replenished and that the ECS cluster was able to continue running the tasks. This will help you confirm the low impact on your workloads when you implement Spot effectively. Moreover, you can discover hidden weaknesses and make your workloads fault-tolerant and resilient.


#### Launch the Spot Interruption Experiment
After creating the experiment template in FIS, you can start a new experiment to interrupt three (unless you changed the template) Spot instances. Run the following command:

```
FIS_EXP_TEMP_ID=$(aws cloudformation describe-stacks --stack-name $FIS_EXP_NAME --query "Stacks[0].Outputs[?OutputKey=='FISExperimentID'].OutputValue" --output text)
FIS_EXP_ID=$(aws fis start-experiment --experiment-template-id $FIS_EXP_TEMP_ID --no-cli-pager --query "experiment.id" --output text)
```

Wait around 30 seconds, and you should see that the experiment completes. Run the following command to confirm:

```
aws fis get-experiment --id $FIS_EXP_ID --no-cli-pager
```
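
Instead of waiting a fixed 30 seconds, you could poll the experiment status until it reaches a terminal state. The following is a minimal sketch, not part of the workshop: `wait_for_experiment` and the `POLL_SECONDS` variable are hypothetical names, and the loop gives up after a fixed number of attempts (24 by default, an arbitrary choice).

```bash
# Hypothetical helper: polls an FIS experiment until it completes, fails, or times out.
# Returns 0 on completed, 1 on stopped/failed, 2 if attempts run out.
wait_for_experiment() {
  local id="$1" attempts="${2:-24}" status
  while [ "$attempts" -gt 0 ]; do
    status=$(aws fis get-experiment --id "$id" \
      --query "experiment.state.status" --output text)
    echo "Experiment $id: $status"
    case "$status" in
      completed) return 0 ;;
      stopped|failed) return 1 ;;
    esac
    attempts=$((attempts - 1))
    sleep "${POLL_SECONDS:-5}"
  done
  echo "Gave up waiting for experiment $id" >&2
  return 2
}

# Usage:
#   wait_for_experiment "$FIS_EXP_ID"
```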

Once the FIS experiment has completed, FIS has sent a Spot interruption notice to the targeted Spot Instances (three in this case).
![fisExperiment](/images/running-ecs-on-spot/FIS.png)

Go to the CloudWatch Logs group `/aws/events/spotinterruptions` to see which instances are being interrupted. You will see the Spot notifications (rebalance recommendations and interruption notices) for the experiment.
![spotInterruptions](/images/running-ecs-on-spot/spotInterruption.png)

You should see a log message like this one:
![SpotInterruptionLog](/images/running-ecs-on-spot/spotInterruptionlogs.png)
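
If you prefer the terminal over the console, the same log group can be read with the AWS CLI. This is a sketch that assumes AWS CLI v2 (where `aws logs tail` is available); `recent_spot_events` is a hypothetical helper name and the 15-minute default window is an arbitrary choice.

```bash
# Hypothetical helper: prints recent events from the Spot interruption log group.
# Assumes AWS CLI v2, which provides the `aws logs tail` command.
recent_spot_events() {
  local since="${1:-15m}"
  aws logs tail /aws/events/spotinterruptions --since "$since" --format short
}

# Usage:
#   recent_spot_events        # last 15 minutes
#   recent_spot_events 1h     # last hour
```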

After two minutes, the Spot Instances are terminated. Review the CloudTrail `BidEvictedEvent` events for confirmation.
![bidEviction](/images/running-ecs-on-spot/bidEviction.png)
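
The same CloudTrail events can also be looked up from the CLI. A minimal sketch follows; `recent_evictions` is a hypothetical helper name, and the limit of 10 events is arbitrary. Keep in mind that CloudTrail events can take a few minutes to become available.

```bash
# Hypothetical helper: lists recent BidEvictedEvent entries from CloudTrail.
recent_evictions() {
  aws cloudtrail lookup-events \
    --lookup-attributes AttributeKey=EventName,AttributeValue=BidEvictedEvent \
    --max-items 10 \
    --query "Events[].[EventTime,EventName]" --output table
}

# Usage:
#   recent_evictions
```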

The ECS agent running on every cluster host monitors the Spot interruption signals and places the instance in DRAINING status. When an instance is set to DRAINING, Amazon ECS prevents new tasks from being scheduled on the instance and starts deregistering its targets from the target group. You can monitor this from the EC2 console [ EC2 -> "Target groups" -> "EcsSpotWorkshop" (TargetGroup Name) ] or through the "DeregisterTargets" events in CloudTrail.
![tgDraining](/images/running-ecs-on-spot/tgDraining.png)
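
To see which container instances are currently draining without opening the console, a sketch like the following may help. `draining_instances` is a hypothetical helper name, and the cluster name `EcsSpotWorkshop` in the usage line is an assumption; substitute the name of your own cluster.

```bash
# Hypothetical helper: lists container instance ARNs in DRAINING status for a cluster.
draining_instances() {
  local cluster="$1"
  aws ecs list-container-instances --cluster "$cluster" --status DRAINING \
    --query "containerInstanceArns" --output text
}

# Usage (the cluster name is an assumption, adjust it to your setup):
#   draining_instances EcsSpotWorkshop
```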

ECS will try to launch replacement tasks, which you can confirm by looking for tasks in `Provisioning` or `Pending` status.
![ecsProvisioning](/images/running-ecs-on-spot/ecsProvisioning.png)

Instance draining causes the `Pending Tasks` count to increase, which you can confirm in the EcsSpotWorkshop CloudWatch dashboard.
![cwPendingTasks](/images/running-ecs-on-spot/cwPendingTasks.png)

These pending ECS tasks trigger the ECS-managed target tracking scaling policy, which launches new EC2 instances to run them.
![cwScalingPolicy](/images/running-ecs-on-spot/cwScalingPolicy.png)

From EC2 -> Auto Scaling Group -> Activity History, we can see new instances being launched.
![instanceLaunch](/images/running-ecs-on-spot/instanceLaunch.png)
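
The activity history is also available from the CLI. The sketch below defaults to the Auto Scaling group name used earlier as the tag value (`EcsSpotWorkshop-ASG-SPOT`); `recent_scaling_activities` is a hypothetical helper name and the limit of 10 activities is arbitrary.

```bash
# Hypothetical helper: shows the most recent scaling activities of an Auto Scaling group.
recent_scaling_activities() {
  local asg="${1:-EcsSpotWorkshop-ASG-SPOT}"
  aws autoscaling describe-scaling-activities \
    --auto-scaling-group-name "$asg" --max-items 10 \
    --query "Activities[].[StartTime,StatusCode,Description]" --output table
}

# Usage:
#   recent_scaling_activities
```

Running it again after the scale-in events should show the terminations alongside the earlier launches.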

Now you should see all new ECS tasks in the `Running` state in the ECS console.
![ECSTasksRunning](/images/running-ecs-on-spot/ECSTasksRunning.png)

Once the required ECS tasks are relaunched on new or other existing instances, after around 15 minutes you'll observe the ECS-managed CloudWatch alarm trigger the Auto Scaling policy's scale-in events, which bring the number of Spot Instances back to its initial state. In the current example, the first scale-in event (02:59:39 AM) shrinks the Auto Scaling group's desired capacity from 7 to 5, and the next scale-in event (03:00:50 AM) drops it back to the original value of 4.
![cwScaleInPolicy](/images/running-ecs-on-spot/cwScaleInPolicy.png)
![instanceTermination](/images/running-ecs-on-spot/instanceTermination.png)



@@ -1,6 +1,6 @@
---
title: "EC2 Spot Interruption Handling in ECS"
weight: 100
weight: 94
---

The Amazon EC2 service interrupts your Spot instance when it needs the capacity back. It provides a Spot instance interruption notice, 2 minutes before the instance gets terminated. The EC2 spot interruption notification is available in two ways:
@@ -31,7 +31,11 @@ a **SIGTERM** System V signal sent to the application.
As a best practice, the application should capture the **SIGTERM** signal and implement a graceful termination mechanism. The ECS agent waits
up to `ECS_CONTAINER_STOP_TIMEOUT` (30 seconds by default) for the process to terminate gracefully. After the 30 seconds, a **SIGKILL**
signal is sent and the containers are forcibly stopped. The `ECS_CONTAINER_STOP_TIMEOUT` can be extended to provide some extra time, but
note that anything above the 120 seconds (2 minute notification for Spot) will result in a Spot termination.
note that anything above the 120 seconds (2 minute notification for Spot) will result in a Spot termination. Similarly, it is recommended to lower the `Deregistration delay` of the ELB target group from its default of 300 seconds to around 90 seconds, so that the `DeregistrationDelay + StopTimeout` interval stays below 120 seconds.
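
As a quick sanity check of that budget, here is a small arithmetic sketch. `fits_spot_notice` is a hypothetical helper name, and `TG_ARN` in the commented command is a placeholder for your own target group ARN. Note that with the default 30-second stop timeout, a deregistration delay of exactly 90 seconds adds up to 120, so a value slightly under 90 is needed to stay strictly below the limit.

```bash
# Hypothetical helper: succeeds only if both delays fit inside the 120-second Spot notice.
fits_spot_notice() {
  local deregistration_delay="$1" stop_timeout="$2"
  [ $((deregistration_delay + stop_timeout)) -lt 120 ]
}

fits_spot_notice 85 30 && echo "85 + 30 fits"             # prints: 85 + 30 fits
fits_spot_notice 300 30 || echo "300 + 30 does not fit"   # prints: 300 + 30 does not fit

# One way to apply the lower delay (TG_ARN is a placeholder for your target group ARN):
#   aws elbv2 modify-target-group-attributes --target-group-arn "$TG_ARN" \
#     --attributes Key=deregistration_delay.timeout_seconds,Value=85
```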


![HandlingSpotTerminations](/images/running-ecs-on-spot/HandlingSpotTerminations.png)


For this workshop we used a Python application. The code snippet below shows how our Python application can capture the
IPC (Inter Process Communication) relevant signals and call a specific method `exit_gracefully` to coordinate graceful termination
@@ -0,0 +1,78 @@
---
title: "Tracking Spot interruptions"
weight: 96
---

We're getting started with adopting Spot Instances for our EMR clusters. We're still not sure that our jobs are fully resilient, or what would actually happen if some of the EC2 Spot Instances in our EMR clusters were interrupted when EC2 needs the capacity back for On-Demand.

{{% notice note %}}
In most cases, when running fault-tolerant workloads, we don't really need to track the Spot interruptions as our applications should be built to handle them gracefully without any impact to performance or availability. However, when we get started with running our EMR jobs on Spot Instances this could be useful, as our organization can use these to correlate to possible EMR job failures or prolonged execution times, in case Spot Instances were interrupted during Spark run time.
{{% /notice %}}

Let's set up CloudWatch Logs to log Spot interruptions, so if there are any failures in our EMR applications, we'll be able to check if the failures correlate to a Spot interruption.

#### Creating the CloudFormation Stack to Track EC2 Spot Interruptions

We've created a CloudFormation template that includes all the resources you need to track EC2 Spot Interruptions. The stack creates the following:

* An Event Rule for tracking EC2 Spot Interruption Warnings
* A CloudWatch Log group to log interruptions and instance details
* An IAM Role to allow the event rule to log into CloudWatch Logs

You can view the CloudFormation template (**cloudwatchlogs.yaml**) on GitHub [here](https://raw.githubusercontent.com/awslabs/ec2-spot-workshops/master/workshops/track-spot-interruptions/cloudwatchlogs.yaml). To download it, you can run the following command:

```
wget https://raw.githubusercontent.com/awslabs/ec2-spot-workshops/master/workshops/track-spot-interruptions/cloudwatchlogs.yaml
```

After downloading the CloudFormation template, run the following command in a terminal:

```
aws cloudformation create-stack --stack-name track-spot-interruption --template-body file://cloudwatchlogs.yaml --capabilities CAPABILITY_NAMED_IAM
aws cloudformation wait stack-create-complete --stack-name track-spot-interruption
```

You should see an event rule in the Amazon EventBridge console, like this:

![Spot Interruption Event Rule](/images/tracking-spot/itn-event-rule.png)


If you create a new CloudWatch log group without explicitly setting a logs resource policy, AWS can create one for you automatically. To check whether one exists, run:

```
aws logs describe-resource-policies --region us-east-1
```

If no resource policy exists, the output looks like this:
```
{
    "resourcePolicies": []
}
```

If no resource policy was created for the CloudWatch log group, edit any one of the created EventBridge rules and update its "Log Group" to `/aws/events/spotinterruptions`.



![Update Spot Interruption Event Rule](/images/tracking-spot/updateEventRule.png)



Now confirm that the required resource policy has been created:

```
aws logs describe-resource-policies --region us-east-1
```
Expected output:

```
{
    "resourcePolicies": [
        {
            "policyName": "TrustEventsToStoreLogEvents",
            "policyDocument": "{\"Version\":\"2012-10-17\",\"Statement\":[{\"Sid\":\"TrustEventsToStoreLogEvent\",\"Effect\":\"Allow\",\"Principal\":{\"Service\":[\"delivery.logs.amazonaws.com\",\"events.amazonaws.com\"]},\"Action\":[\"logs:CreateLogStream\",\"logs:PutLogEvents\"],\"Resource\":\"arn:aws:logs:us-east-1:612606519026:log-group:/aws/events/*:*\"}]}",
            "lastUpdatedTime": 1664115409696
        }
    ]
}
```
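
Rather than reading the JSON by eye, you could script the check. The sketch below looks for the policy name shown in the expected output above; `has_events_log_policy` is a hypothetical helper name, and the default region simply mirrors the `us-east-1` used in this section.

```bash
# Hypothetical helper: succeeds if the TrustEventsToStoreLogEvents policy exists in the region.
has_events_log_policy() {
  local region="${1:-us-east-1}"
  # --output text prints the policy names tab-separated; put each on its own line
  # and look for an exact match.
  aws logs describe-resource-policies --region "$region" \
    --query "resourcePolicies[].policyName" --output text |
    tr '\t' '\n' | grep -qx 'TrustEventsToStoreLogEvents'
}

# Usage:
#   has_events_log_policy us-east-1 && echo "policy found" || echo "policy missing"
```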
Binary file added static/images/running-ecs-on-spot/FIS.png
Binary file added static/images/running-ecs-on-spot/bidEviction.png
Binary file added static/images/running-ecs-on-spot/tgDraining.png
Binary file added static/images/tracking-spot/updateEventRule.png