Scaling Up a Data Science Workflow/Pipeline with Ray Using the NOAA GHCN Dataset
(Optional) Create a conda environment:
conda create -n aws-asdi python=3.7
conda activate aws-asdi
Install dependencies:
pip3 install -r requirements.txt
This project was originally developed on a single node with 32 cores/64 threads and 512 GB of memory. An r5.16xlarge EC2 instance is sufficient to run it.
(Optional, but recommended) Download the data (~100 GB):
mkdir data
aws s3 cp --recursive s3://noaa-ghcn-pds/csv/ --no-sign-request data
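The by-year CSV files in this bucket have no header row. Below is a minimal parsing sketch using only the standard library; the column layout (station ID, date, element, value, M/Q/S flags, observation time) and the convention that temperature elements are stored in tenths of a degree Celsius are assumptions based on the GHCN-Daily documentation, not something this project defines:

```python
import csv
import io

# Inline sample rows in the assumed GHCN-Daily by-year layout
# (no header; 8 columns).
sample = """\
USW00094728,20200101,TMAX,44,,,W,2400
USW00094728,20200101,TMIN,-11,,,W,2400
USW00094728,20200101,PRCP,0,,,W,2400
"""

columns = ["id", "date", "element", "value",
           "m_flag", "q_flag", "s_flag", "obs_time"]

rows = [dict(zip(columns, r)) for r in csv.reader(io.StringIO(sample))]

# Convert temperature elements from tenths of a degree C to degrees C.
temps = {
    r["element"]: int(r["value"]) / 10.0
    for r in rows
    if r["element"] in ("TMAX", "TMIN")
}
print(temps)  # → {'TMAX': 4.4, 'TMIN': -1.1}
```

The same row-level logic scales out directly once the files are read in parallel with Ray tasks.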
Run the notebook:
jupyter notebook #or jupyter lab
Make sure your AWS credentials and keys are stored and set up correctly; then you can start a cluster from your local machine with:
ray up autoscaler.yaml
Additionally, make sure the directory paths to your local machine are configured correctly on line 105.
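For orientation, a Ray autoscaler config matching this setup (1 head node, 4 worker nodes in us-east-1) looks roughly like the sketch below. Every value here is illustrative; the instance types, node-type names, and mount paths are assumptions, and the project's actual autoscaler.yaml is the source of truth:

```yaml
# Illustrative sketch only -- adjust to match the real autoscaler.yaml.
cluster_name: aws-asdi
max_workers: 4
provider:
    type: aws
    region: us-east-1
auth:
    ssh_user: ec2-user
available_node_types:
    head_node:
        node_config:
            InstanceType: r5.16xlarge
    worker_node:
        min_workers: 4
        max_workers: 4
        node_config:
            InstanceType: r5.16xlarge
head_node_type: head_node
# Local paths (the ones to check around line 105) are mapped to the
# cluster here, as "remote path": "local path".
file_mounts:
    "/home/ec2-user/data": "/path/to/local/data"
```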
You can connect to the remote EC2 head node via SSH, which Ray handles automatically with:
ray attach autoscaler.yaml
The autoscaler script is configured with 1 head node and 4 worker nodes in us-east-1. It is important that the worker nodes have initialized and started properly; otherwise the cluster may fail to recognize them. To verify that the Ray cluster has started properly, run this command from your local machine and view localhost:8265 (there should be 5 nodes):
ray dashboard autoscaler.yaml
or check it through:
ray status #once connected to remote EC2 instance
Then, you should be able to start up the notebook from the remote machine:
jupyter notebook --no-browser #or jupyter lab --no-browser
To run the notebook, port-forward Jupyter to your local machine's browser (make sure the port numbers match):
ssh -i <path/to/pem/file> -N -f -L localhost:8888:localhost:8888 ec2-user@<public.ipv4.address>
Once done, you can shut down the cluster by running:
ray down autoscaler.yaml