Architecting Machine Learning on AWS: A Practitioner’s Guide to Production-Grade ML on AWS Cloud with SageMaker
(Revision history: PA1, 2020-04-14, @akirmak: initial version)
Welcome to the Architecting Machine Learning on AWS using SageMaker workshop.
The objective of this workshop is to provide a practitioner's guide to the challenges of real-world ML problems and to demonstrate how to tackle them on the AWS Cloud.
There is an abundance of online training (Coursera, DataCamp, O'Reilly Online, A Cloud Guru, Udemy), books and articles (Medium, blog posts), and code (for Amazon SageMaker alone, there are hundreds of sample notebooks covering every imaginable use case and ML domain). So how is this workshop different? We aim to bring together theory and practice, along with datasets (small ones to try out quickly, and large ones to demonstrate big-data challenges), from an architectural perspective.
Areas covered:
- Data Engineering: You will explore examples of data exploration, feature engineering, and data cleaning using popular Python frameworks in a Jupyter notebook (pandas, NumPy, Seaborn, etc.). Since these frameworks are designed for small datasets (KBs to single-digit GBs, or double-digit GBs in extreme cases on very large instances), they are suitable for working on a subset of the actual datasets. You will also experiment with big data analytics services that can be used for data engineering and exploratory data analytics, and see how data is moved from data-lake storage to compute clusters via S3. Services covered: Amazon Athena (a Presto/Hive-based service for ad-hoc analytics), AWS Glue (data catalog management), and Amazon S3 (data-lake storage).
- ML Training: You will train various supervised learning algorithms for regression and classification (for regression: decision trees; for classification: logistic regression and artificial-neural-network-based image classification). You will observe how to run parallel ML training on clusters in the cloud to achieve scale.
- Evaluating ML Models: The key to evaluating an ML model's performance is generating the appropriate metrics (depending on the type of algorithm) and persisting those measurements throughout the ML project life cycle. You will see how model metrics, as well as metrics for the underlying compute cluster, are persisted in Amazon S3 and Amazon CloudWatch.
- Optimizing ML Models: ML optimization is a stochastic process. You will experiment with automated hyperparameter tuning using a Bayesian search strategy to find the best-performing hyperparameters.
- Framing the ML Problem: Infrastructure requirements and business context.
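As a taste of the exploratory statistics covered in the Data Engineering labs, here is a minimal pure-Python sketch using only the standard library; the toy hourly rental counts are invented for illustration (the labs themselves use pandas on the real datasets):

```python
import statistics

# Toy hourly bike-rental counts (invented for illustration).
ride_counts = [16, 40, 32, 13, 1, 1, 2, 3, 8, 14, 36, 56]

# Basic descriptive statistics, the starting point of any data exploration.
summary = {
    "count": len(ride_counts),
    "mean": statistics.mean(ride_counts),
    "median": statistics.median(ride_counts),
    "stdev": round(statistics.stdev(ride_counts), 2),
    "min": min(ride_counts),
    "max": max(ride_counts),
}
print(summary)
```

Comparing the mean and the median already hints at skew in the distribution, which is exactly the kind of observation the exploration notebooks build on.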
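To ground the regression idea mentioned in the ML Training bullet before moving to the SageMaker labs, here is a minimal sketch of ordinary least squares for a single feature in pure Python; the data points are invented for illustration:

```python
# Fit y = slope * x + intercept by ordinary least squares on toy data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 4.0, 6.2, 7.9, 10.1]  # roughly y = 2x

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Closed-form solution: slope = cov(x, y) / var(x).
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
intercept = mean_y - slope * mean_x

print(f"y = {slope:.3f}x + {intercept:.3f}")
```

Real-world datasets require the iterative, cluster-scale training that the SageMaker modules demonstrate, but the objective (minimizing squared error) is the same.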
This workshop is part of a series. Make sure you've attended the first workshop:
- Machine Learning Theory (2 Hour Workshop). Webinar Link [TBD]
- Python – you don't need to be an expert Python programmer, but you do need to know the basics. If you don't, the official Python tutorial is a good place to start.
- Scientific Python – we will be using a few popular Python libraries, in particular NumPy, matplotlib, and pandas. If you are not familiar with these libraries, you should probably start by going through the tutorials in the Tools section (especially NumPy).
- Statistics – we will also use some notions of linear statistics and probability theory. Nothing will be very advanced, so you should be able to follow along if you learned these topics in the past; if you haven't, or you need a refresher, review them before the session.
This material is not a self-service document (yet). Many of the key messages given in the session are not yet reflected in the document and will be articulated by the AWS presenter. Later revisions will address this.
This material is a Level 300 document. It assumes basic experience with AWS (e.g. navigating the S3, IAM, and SageMaker consoles).
- AWS Account: Bring your own AWS Account (with admin access to S3, SageMaker, IAM, ECR, Comprehend, Athena)
- **SageMaker Notebook** – These labs are based on SageMaker notebook instances running Jupyter. If you just plan to read without running any code, there's really nothing more to know; just keep reading!
Unlike conventional ML workshop formats, we don't pursue a single algorithm or model throughout the workshop. We believe that experimenting with different ML use cases and domains better prepares the audience for real-world problems. Each lab addresses a different use case and dataset, and each notebook focuses on a different stage of the ML project life cycle.
Module | ML Project Stage | Open Dataset | Big Data / ML Domain | Algorithm | Concepts | Services
---|---|---|---|---|---|---
Module 2 | ML Data Engineering | Kaggle Bike Sharing | ETL | – | Descriptive statistics | SageMaker
Module 3 | ML Modeling on Local Notebook | Kaggle Bike Sharing | Supervised Learning | Linear Regression & Decision Trees | Challenges of ML development on notebooks | SageMaker
Module 4 | ML Data Engineering | Amazon.com Customer Reviews | Big Data Pipelines & ETL | – | Bridging the gap between big data & ML with Presto, Hue, Hive, (Spark) | S3, Athena, Glue, Comprehend
Module 5 | ML Modeling on Cloud | Bank Direct Marketing | Supervised Learning | Binary Classification with Logistic Regression | Benefits of training in the cloud | SageMaker
Module 6 - Part I | ML Optimization | Bank Direct Marketing | – | – | ML metrics for classification | SageMaker
Module 6 - Part II | ML Optimization with Hyperparameter Tuning | Bank Direct Marketing | – | – | Bayesian search HPO strategy | SageMaker
Module 7 | ML Model Deployment | Iris | – | – | Model hosting, A/B testing, multi-model endpoints, auto scaling | SageMaker
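Modules 5 and 6 revolve around binary classification and its evaluation metrics. As a minimal preview of the metrics involved, here is a pure-Python sketch; the labels and predictions are invented for illustration:

```python
# Toy ground-truth labels and model predictions (invented for illustration).
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

# Confusion-matrix cells.
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

# Standard binary-classification metrics derived from the cells.
accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(f"accuracy={accuracy}, precision={precision}, recall={recall}, f1={f1:.3f}")
```

In the labs these metrics are produced by the training jobs and persisted to S3 and CloudWatch; this sketch only shows how they are defined.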
- Module 1: AWS ML Outline
- Module 2: Local Data Engineering on the Bike Sharing dataset
- Module 3: Local Modeling: Supervised Learning (Regression) on the Bike Sharing dataset
- Module 4: Big Data Engineering on the Amazon Reviews dataset
- Module 5: ML Modeling: Supervised Learning (Binary Classification) on the Bank Direct Marketing dataset
- Module 6.1: ML Model Optimization: Hyperparameter Tuning on the Bank Direct Marketing dataset
- Module 6.2: ML Model Optimization: Analyzing Hyperparameter Tuning
- Module 7: ML Model Deployment into Production for Batch & Real-Time Predictions
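Module 6 uses SageMaker's managed Bayesian search; as a minimal stand-in that conveys the search idea (this is plain random search over a toy objective, not Bayesian optimization, and the objective function is invented for illustration):

```python
import random

random.seed(42)  # deterministic for reproducibility

# Toy "validation error" as a function of two hyperparameters;
# its minimum is at lr=0.1, depth=6 (invented for illustration).
def validation_error(lr, depth):
    return (lr - 0.1) ** 2 + 0.01 * (depth - 6) ** 2

# Random search: sample hyperparameters from their ranges and keep the best.
best = None
for _ in range(200):
    lr = random.uniform(0.001, 1.0)   # continuous hyperparameter range
    depth = random.randint(1, 12)     # integer hyperparameter range
    err = validation_error(lr, depth)
    if best is None or err < best[0]:
        best = (err, lr, depth)

print(f"best error={best[0]:.4f} at lr={best[1]:.3f}, depth={best[2]}")
```

A Bayesian strategy differs in that it models the objective from past trials and proposes promising candidates instead of sampling blindly, which is why it typically needs far fewer training jobs.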
Create a notebook instance as described here: https://docs.aws.amazon.com/sagemaker/latest/dg/gs-setup-working-env.html. While specifying the notebook, select:
- `Instance Type`: `ml.m5.xlarge`
- `Additional Configuration -> Volume Size in GB`: enter `5`
- Add the following IAM policies to the IAM role attached to the SageMaker notebook instance:
- `AmazonSageMakerFullAccess`
- `AmazonS3FullAccess`
- `AmazonEC2ContainerRegistryReadOnly`
- `AmazonEC2ContainerRegistryFullAccess`
- `AmazonEC2ContainerServiceforEC2Role`
- `AmazonAthenaFullAccess`
- `ComprehendFullAccess`
- Open `~/.bashrc` in an editor (e.g. `vi ~/.bashrc`)
- Append the following:
```shell
export PS1="\[$(tput setaf 6)\]\u@\h:\w $ \[$(tput sgr0)\]"
export CLICOLOR=1
export LSCOLORS=ExFxCxDxBxegedabagacad
alias ll='ls -lah'
export EDITOR=vim
```
- Run `source ~/.bashrc`
- Run `sudo yum install htop -y`
- Open a SageMaker terminal
- Run `cd SageMaker`
- Clone the lab guides: `git clone https://github.com/CloudaYolla/ArchitectingMLonAWS.git`
- Clone the SageMaker examples: `git clone https://github.com/awslabs/amazon-sagemaker-examples.git`
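The console steps above can also be scripted with the AWS SDK. Below is a hedged sketch of the equivalent `boto3` call; the notebook name and role ARN are placeholders, not values from the workshop, and the call itself is left commented out because it requires AWS credentials:

```python
# Parameters mirroring the console settings above; the name and
# role ARN are placeholders -- substitute your own.
params = {
    "NotebookInstanceName": "ml-workshop-notebook",
    "InstanceType": "ml.m5.xlarge",
    "VolumeSizeInGB": 5,
    "RoleArn": "arn:aws:iam::123456789012:role/SageMakerWorkshopRole",
}

# To actually create the instance (requires boto3 and AWS credentials):
# import boto3
# boto3.client("sagemaker").create_notebook_instance(**params)
print(params["InstanceType"], params["VolumeSizeInGB"])
```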
Very important: SageMaker notebook instances run on EC2, so you are billed by the second for as long as the instance is running. Save your work (by downloading it to your local computer) and terminate the notebook instance when you are done.
Please:
- Download the notebook (if you made any changes) to your computer via `File -> Download as -> Notebook (.ipynb)`.
- Terminate the instance. Remember that you can always recreate it from the AWS SageMaker console.
- Delete any SageMaker endpoints you created.
Thank you.