Scaling Up a Data Science Workflow/Pipeline with Ray Using the NOAA GHCN Dataset
(Optional) Create a conda environment:
conda create -n aws-asdi python=3.7
conda activate aws-asdi
Install dependencies:
pip3 install -r requirements.txt
This project was originally developed on a single node with 32 cores/64 threads and 512 GB of memory. An r5.16xlarge EC2 instance is sufficient to run it.
(Optional, but recommended) Download the data (~100 GB):
mkdir data
aws s3 cp --recursive s3://noaa-ghcn-pds/csv/ --no-sign-request data
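The by-year CSV files in this bucket have no header row. Below is a minimal parsing sketch using only the standard library; the column layout (station ID, date, element, value, M/Q/S flags, observation time) and the convention that temperature elements are stored in tenths of a degree Celsius are assumptions based on the GHCN-Daily documentation, not something this project defines:

```python
import csv
import io

# Inline sample rows in the assumed GHCN-Daily by-year layout
# (no header; 8 columns).
sample = """\
USW00094728,20200101,TMAX,44,,,W,2400
USW00094728,20200101,TMIN,-11,,,W,2400
USW00094728,20200101,PRCP,0,,,W,2400
"""

columns = ["id", "date", "element", "value",
           "m_flag", "q_flag", "s_flag", "obs_time"]

rows = [dict(zip(columns, r)) for r in csv.reader(io.StringIO(sample))]

# Convert temperature elements from tenths of a degree C to degrees C.
temps = {
    r["element"]: int(r["value"]) / 10.0
    for r in rows
    if r["element"] in ("TMAX", "TMIN")
}
print(temps)  # → {'TMAX': 4.4, 'TMIN': -1.1}
```

The same row-level logic scales out directly once the files are read in parallel with Ray tasks.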
Run the notebook:
jupyter notebook #or jupyter lab
Make sure your AWS credentials and keys are stored and set up correctly; then you can start a cluster from your local machine with:
ray up autoscaler.yaml
Additionally, make sure the directory paths to your local machine are configured correctly on line 105.
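For orientation, a Ray autoscaler config matching this setup (1 head node, 4 worker nodes in us-east-1) looks roughly like the sketch below. Every value here is illustrative; the instance types, node-type names, and mount paths are assumptions, and the project's actual autoscaler.yaml is the source of truth:

```yaml
# Illustrative sketch only -- adjust to match the real autoscaler.yaml.
cluster_name: aws-asdi
max_workers: 4
provider:
    type: aws
    region: us-east-1
auth:
    ssh_user: ec2-user
available_node_types:
    head_node:
        node_config:
            InstanceType: r5.16xlarge
    worker_node:
        min_workers: 4
        max_workers: 4
        node_config:
            InstanceType: r5.16xlarge
head_node_type: head_node
# Local paths (the ones to check around line 105) are mapped to the
# cluster here, as "remote path": "local path".
file_mounts:
    "/home/ec2-user/data": "/path/to/local/data"
```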
You can connect to the remote EC2 head node via SSH, which Ray handles automatically with:
ray attach autoscaler.yaml
The autoscaler script is configured with 1 head node and 4 worker nodes in us-east-1. It is important that the worker nodes have initialized and started properly; otherwise the cluster may fail to recognize them. To verify that the Ray cluster has started properly, run this command from your local machine and view localhost:8265 (there should be 5 nodes):
ray dashboard autoscaler.yaml
or check it through:
ray status #once connected to remote EC2 instance
Then, you should be able to start up the notebook from the remote machine:
jupyter notebook --no-browser #or jupyter lab --no-browser
To run the notebook, port-forward Jupyter to your local machine's browser (make sure the port numbers match):
ssh -i <path/to/pem/file> -N -f -L localhost:8888:localhost:8888 ec2-user@<public.ipv4.address>
Once done, you can shut down the cluster by running:
ray down autoscaler.yaml