This repository demonstrates how to train a collection of machine learning models on AWS Lambda. It builds upon the LambdaML project.
Requirements
- AWS Command Line Interface (AWS CLI) version 2
- Docker
- python3.7, torch-1.0.1, boto3, numpy
Supported Algorithms
- Linear Classification
TO-DO
- use the AWS CLI (instead of the console) to prepare AWS resources
- use the AWS CLI to clean up AWS resources when finished (a sketch follows below)
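For a sense of what the cleanup item could look like once scripted, here is a hedged boto3 sketch; the function and bucket names mirror those used later in this README, and the snippet is not part of any existing script:

import boto3

# Illustrative cleanup for the TO-DO above; resource names are examples.
lambda_client = boto3.client("lambda")
for fn in ("linear_ec_1", "linear_ec_2"):
    lambda_client.delete_function(FunctionName=fn)

s3 = boto3.resource("s3")
bucket = s3.Bucket("my-data-bucket")  # your <data_bucket>
bucket.objects.all().delete()  # a bucket must be emptied before deletion
bucket.delete()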
AWS resources used:
- Lambda
  - Lambda 1: configures and invokes a pool of Lambda 2 workers
  - Lambda 2: runs the training script on each worker
- S3
- ElastiCache for Memcached
- VPC
- CloudWatch
Dataset: Higgs
Hyperparameters of the training:
- Number of workers (<n_workers>): specifies the number of workers used for parallel training. To distribute the data among <n_workers> workers, you need <n_workers> partitions of the dataset.
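For intuition, a round-robin split that yields exactly one partition per worker might look like the sketch below; the repository's partition_data.py is the authoritative implementation:

# Hypothetical round-robin partitioning: sample i goes to worker
# i % n_workers, producing exactly n_workers partitions.
def partition(lines, n_workers):
    parts = [[] for _ in range(n_workers)]
    for i, line in enumerate(lines):
        parts[i % n_workers].append(line)
    return parts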
Setup Steps
- Configure the AWS CLI
aws configure
Also, store the AWS environment variables for easier reference in later steps.
export AWS_ACCESS_KEY_ID=<AWS_ACCESS_KEY_ID>
export AWS_SECRET_ACCESS_KEY=<AWS_SECRET_ACCESS_KEY>
export AWS_ACCOUNT_ID=<AWS_ACCOUNT_ID>
export REGION_NAME=<REGION_NAME>
- Prepare AWS resources with the AWS Management Console
- A VPC. Create an endpoint to privately connect the VPC to S3 (choose com.amazonaws.<region>.s3 as the service name).
- An ElastiCache cluster for Memcached, used for gradient exchange in distributed training. For detailed configuration, refer to the video. Copy the "configuration endpoint" for later use.
- An S3 bucket, named <data_bucket>, to store the machine learning dataset.
- A Lambda execution role. Assign the necessary permissions:
- AmazonAPIGatewayInvokeFullAccess
- AmazonEC2FullAccess
- AmazonElastiCacheFullAccess
- AmazonS3FullAccess
- AWSKeyManagementServicePowerUser
- AWSLambda_FullAccess
- AWSLambdaRole
- CloudWatchFullAccess
- VPCLatticeServicesInvokeAccess
Store the ARN (Amazon Resource Name) of the role for later reference.
export LAMBDA_ROLE_ARN=<LAMBDA_ROLE_ARN>
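If you prefer to script this step as well (see the TO-DO above), a minimal boto3 sketch for creating the execution role follows; the role name is an illustrative assumption, and each managed policy listed above would be attached the same way as the one shown:

import json
import boto3

iam = boto3.client("iam")
# Trust policy that lets Lambda assume the role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "lambda.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}
role = iam.create_role(
    RoleName="lambda-ml-execution-role",  # hypothetical name
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)
iam.attach_role_policy(
    RoleName="lambda-ml-execution-role",
    PolicyArn="arn:aws:iam::aws:policy/AmazonS3FullAccess",
)
print(role["Role"]["Arn"])  # the value to export as LAMBDA_ROLE_ARN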
- Download, partition, and upload the dataset to S3
cd linear_ec
wget https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/HIGGS.xz
python partition_data.py \
--file-path HIGGS.xz \
--n-workers <n_workers> \
--bucket-name <data_bucket> \
--use-dummy-data # 10 samples per partition
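The upload half of this step is plain boto3; a minimal sketch, where the bucket name and object-key layout are illustrative rather than what partition_data.py actually uses:

import boto3

s3 = boto3.client("s3")
data_bucket = "my-data-bucket"  # your <data_bucket>
n_workers = 4                   # your <n_workers>
# Hypothetical layout: one local file and one S3 object per worker partition.
for i in range(n_workers):
    s3.upload_file(f"partition_{i}", data_bucket, f"higgs/partition_{i}")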
- Set training configurations in the handler of Lambda 1.
  - Set host to the "configuration endpoint" of the Memcached cluster
  - Set n_workers and data_bucket
  - (optional) modify other hyperparameters
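For orientation only, the relevant assignments in the Lambda 1 handler might resemble the following sketch; everything beyond host, n_workers, and data_bucket is an assumption:

# Illustrative configuration values; only host, n_workers, and data_bucket
# are named in this README -- the rest are hypothetical hyperparameters.
host = "my-cluster.xxxxxx.cfg.use1.cache.amazonaws.com:11211"  # Memcached configuration endpoint
n_workers = 4                    # must equal the number of dataset partitions
data_bucket = "my-data-bucket"   # S3 bucket holding the partitions
learning_rate = 0.01             # hypothetical additional hyperparameter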
- Deploy the function package on AWS Lambda with the build script:
bash build_lambda.sh
- Invoke the Lambda function
aws lambda invoke --function-name linear_ec_1 /dev/stdout
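The same invocation can also be issued from Python; a small boto3 sketch that mirrors the CLI call above:

import json
import boto3

client = boto3.client("lambda")
# Synchronous invocation of Lambda 1; blocks until the function returns.
response = client.invoke(
    FunctionName="linear_ec_1",
    InvocationType="RequestResponse",
)
print(json.loads(response["Payload"].read()))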
- Open CloudWatch and check the logs
- (Optional) Update the function code
You must build, tag, and upload the new image again. Then update the Lambda function with the new image, e.g.:
aws lambda update-function-code \
--function-name linear_ec_2 \
--image-uri "${AWS_ACCOUNT_ID}.dkr.ecr.${REGION_NAME}.amazonaws.com/linear_ec_2:latest"
Debug Steps
After building the image of a Lambda function, we can test it locally. For instance, consider testing the image of linear_ec_2.
- Prepare the input event as a JSON file
python linear_ec/lambda1/save_1_worker_output.py
- Start the Docker image
docker run -p 9000:8080 \
-e AWS_ACCESS_KEY_ID="$AWS_ACCESS_KEY_ID" \
-e AWS_SECRET_ACCESS_KEY="$AWS_SECRET_ACCESS_KEY" \
-e AWS_ACCOUNT_ID="$AWS_ACCOUNT_ID" \
-e REGION_NAME="$REGION_NAME" \
docker-image:test
- From a new terminal window, post the event to the following endpoint:
curl "http://localhost:9000/2015-03-31/functions/function/invocations" \
-H "Content-Type: application/json" \
-d @linear_ec/lambda1/test_input_1_worker.json
An AWS Lambda function's code consists of your handler code and any required libraries. To deploy this code to Lambda, you have two options: a .zip file archive or a container image. While a .zip package offers shorter cold-start times than a container image, AWS imposes size limits on it (50 MB zipped and 250 MB unzipped). Even layers, which share the 250 MB unzipped quota, cannot accommodate large packages. Packaging PyTorch in a .zip file, for instance, is not feasible due to the size of the torch package: torch-1.0.1-py37 without CUDA is 219 MB, while the version with CUDA support exceeds 1.1 GB.
There are two technical approaches to creating Lambda functions:
- Serverless Application Model (SAM) in YAML: A comprehensive solution for creating and configuring various AWS resources, including Lambda functions, S3, Memcached, and more.
- Creating Lambda functions using the AWS Lambda API from the command line interface (CLI), while creating other AWS resources via the AWS Management Console.
Ultimately, the second approach is chosen primarily because SAM enforces a memory size limit of 3008 MB in some regions (e.g., us-west-1). In contrast, the AWS Lambda API allows you to create functions with up to 10 GB of memory.
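For reference, a minimal boto3 sketch of creating a container-image function with 10 GB of memory, reusing the environment variables exported earlier (whether build_lambda.sh does exactly this is not shown here):

import os
import boto3

client = boto3.client("lambda")
image_uri = (f"{os.environ['AWS_ACCOUNT_ID']}.dkr.ecr."
             f"{os.environ['REGION_NAME']}.amazonaws.com/linear_ec_2:latest")
client.create_function(
    FunctionName="linear_ec_2",
    Role=os.environ["LAMBDA_ROLE_ARN"],
    PackageType="Image",
    Code={"ImageUri": image_uri},
    MemorySize=10240,  # 10 GB, well above the 3008 MB ceiling mentioned above
    Timeout=900,       # Lambda's maximum timeout (15 minutes)
)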