AWS-ASDI Project

Scaling up a Data Science Workflow/Pipeline with Ray Using the NOAA-GHCN Dataset

Getting Started

(Optional) Create a conda environment:

conda create -n aws-asdi python=3.7
conda activate aws-asdi

Install dependencies:

pip3 install -r requirements.txt

Running on a Single Node

This project was originally developed on a single node with 32 cores/64 threads and 512 GB of memory. An r5.16xlarge EC2 instance is sufficient to run it.

(Optional, but recommended) Download the dataset (~100 GB):

mkdir data
aws s3 cp --recursive s3://noaa-ghcn-pds/csv/ --no-sign-request data
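
As a quick sanity check, one year of the downloaded data can be inspected with pandas. A minimal sketch, assuming the bucket's per-year file layout (e.g. data/2020.csv) and the standard GHCN-Daily column order; adjust if your files include a header row:

import pandas as pd

# Standard GHCN-Daily columns (assumption: the per-year CSVs have no header row)
columns = ["ID", "DATE", "ELEMENT", "DATA_VALUE",
           "M_FLAG", "Q_FLAG", "S_FLAG", "OBS_TIME"]

df = pd.read_csv("data/2020.csv", names=columns)  # one year of observations
print(df.head())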

Run the notebook:

jupyter notebook # or jupyter lab
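
Inside the notebook, start Ray before any parallel work. A minimal sketch; ray.init() with no arguments uses all local cores by default:

import ray

ray.init()  # starts a local Ray runtime on this machine
print(ray.cluster_resources())  # CPUs and memory visible to Ray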

Scaling to Multiple Nodes with the AWS Autoscaler

Make sure your AWS credentials and keys are stored and set up correctly; then you can conveniently start a cluster from your local machine with:

ray up autoscaler.yaml

Additionally, make sure the directory paths to your local machine are configured correctly in autoscaler.yaml (around line 105).
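
For reference, local directory paths in a Ray autoscaler config typically live under the file_mounts key, which maps a path on the cluster nodes to a path on your machine. A hedged sketch with placeholder paths (whether this is exactly what line 105 of autoscaler.yaml contains is an assumption):

file_mounts:
    # <remote path on each node>: <local path to sync>  -- placeholders
    "/home/ec2-user/aws-asdi": "~/aws-asdi"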

You can connect to the remote EC2 head node via SSH, which Ray handles automatically with:

ray attach autoscaler.yaml

The autoscaler script is configured with 1 head node and 4 worker nodes in us-east-1. Make sure the worker nodes have initialized and started properly; otherwise, the cluster configuration may fail to recognize them. To verify that the Ray cluster has started properly, run this command from your local machine and view localhost:8265 (there should be 5 nodes):

ray dashboard autoscaler.yaml

or check it from the head node with:

ray status # once connected to the remote EC2 instance

Then, you should be able to start up the notebook from the remote machine:

jupyter notebook --no-browser # or jupyter lab --no-browser

To run and execute the notebook, port-forward the Jupyter server so it can be reached from your local machine's browser (make sure the port numbers match):

ssh -i <path/to/pem/file> -N -f -L localhost:8888:localhost:8888 ec2-user@<public.ipv4.address>
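
Inside the notebook, attach to the running cluster rather than starting a new local Ray instance. A minimal sketch, assuming the notebook is running on the head node:

import ray

ray.init(address="auto")  # connect to the cluster started by `ray up`
print(ray.nodes())  # should list the head node and the 4 workers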

Once done, you can properly shut down the cluster by running:

ray down autoscaler.yaml
