Original repo details are in this blog post: https://medium.com/@datitran/quickstart-pyspark-with-anaconda-on-aws-660252b88c9a
- Bind PySpark to Anaconda2 on AWS.
- Set up Zeppelin and Jupyter to use Anaconda2 on AWS.
- Set up S3-backed Zeppelin notebooks.
- Point the Hive metastore to a MySQL RDS instance.
- Install the required Python libraries on all nodes of the cluster.
Getting started:
- Create the conda environment: `cd emr-bootstrap-pyspark && conda env create -f environment.yml`
- Fill in all the required information, e.g. AWS access key, secret access key, etc., into the `config.yml.example` file and rename it to `config.yml` (see the config-loading sketch after this list).
- Run it in the newly created environment: `source activate emr-bootstrap-pyspark && python emr_loader.py` (a sketch of the kind of launch call `emr_loader.py` makes follows below).
- Make sure the default EMR roles exist in the AWS account by running `aws emr create-default-roles` in the AWS CLI.
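
For reference, here is a minimal sketch of loading the finished `config.yml` with PyYAML. This README only implies credential fields such as the AWS access key and secret access key; the exact key names below are assumptions, so check `config.yml.example` for the real ones.

```python
import yaml

# Load the finished config.yml. The key names checked here are assumptions;
# config.yml.example defines the real ones.
with open("config.yml") as f:
    config = yaml.safe_load(f)

# Fail early if the credentials were left blank.
for key in ("aws_access_key_id", "aws_secret_access_key"):
    if not config.get(key):
        raise ValueError(f"Missing value for '{key}' in config.yml")
```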
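
And a hedged sketch of the kind of cluster launch `emr_loader.py` presumably performs, using boto3's `run_job_flow` API; the actual script may differ, and the region, cluster name, release label, and instance settings here are purely illustrative. Note that `JobFlowRole` and `ServiceRole` are exactly the roles created by `aws emr create-default-roles` above.

```python
import boto3

# Region, cluster name, release label, and instance types are assumptions.
emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="emr-bootstrap-pyspark",  # hypothetical cluster name
    ReleaseLabel="emr-5.2.0",      # hypothetical EMR release
    Applications=[{"Name": "Spark"}, {"Name": "Zeppelin"}],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m3.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m3.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    # These are the roles created by `aws emr create-default-roles`.
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("Cluster started:", response["JobFlowId"])
```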