This GitHub repository contains resources to deploy the solution described in the HPC Blog for CryoSPARC on AWS ParallelCluster. It was updated September 2024 to align with the latest ParallelCluster configuration file format.
NOTE: This repository was updated to add resources to run CryoSPARC on AWS Parallel Computing Service (PCS), but retains the original instructions for running on AWS ParallelCluster. Please see the last section if you'd like to try PCS.
This solution includes the following resources:
- YAML configuration file for AWS ParallelCluster, either:
  - parallel-cluster-cryosparc.yaml, if your account permissions allow you (and by extension, ParallelCluster on your behalf) to create new IAM Roles and Policies, or
  - parallel-cluster-cryosparc-custom-roles.yaml, if your account permissions restrict the creation of new IAM resources.
- Post-install script to install CryoSPARC: parallel-cluster-post-install.sh
- Policy file that allows automatic data export back to FSx: FSxLustreDataRepoTasksPolicy.yaml
There are several prerequisites to fulfill before deploying this solution. You'll use the outputs of these prerequisites to fill in the values in the AWS ParallelCluster configuration file.
First, you’ll need to request a license from Structura. It can take a day or two to obtain the license, so request it before you get started.
Paste the license ID as an input to the ParallelCluster configuration file, replacing the <CRYOSPARC-LICENSE> placeholder.
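The placeholder substitutions in this and the following prerequisite steps can also be scripted. The following is a minimal sketch and is not part of the repository: fill_placeholder is a hypothetical helper name, the license value shown is made up, and the sketch assumes the placeholder token contains no characters that are special in a basic regular expression.

```shell
# fill_placeholder FILE PLACEHOLDER VALUE
# Hypothetical helper: replaces every occurrence of PLACEHOLDER in FILE with VALUE.
fill_placeholder() {
  local file="$1" placeholder="$2" value="$3"
  local escaped tmp
  # Escape characters that are special in the replacement part of sed's s command
  # (backslash, ampersand, and the | delimiter used below).
  escaped=$(printf '%s' "$value" | sed -e 's/[\\&|]/\\&/g')
  tmp=$(mktemp)
  # Write to a temp file first so a failed substitution never truncates FILE.
  sed "s|${placeholder}|${escaped}|g" "$file" > "$tmp" && cat "$tmp" > "$file"
  rm -f "$tmp"
}

# Example (the license value is made up):
#   fill_placeholder parallel-cluster-cryosparc.yaml '<CRYOSPARC-LICENSE>' 'your-license-id'
```

The same helper can be reused for the other placeholders described below.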
A typical default VPC has public and private subnets balanced across multiple Availability Zones (AZs). However, HPC clusters (like ParallelCluster) usually use a single AZ so they can keep communication latency low and use Cluster Placement Groups. For the compute nodes, you can create a large private subnet with a relatively large number of IP addresses. Then, you can create a public subnet with minimal IP addresses, since it will only contain the head node.
HPC EC2 instances like the P4d family aren’t available in every AZ. That means we need to determine which AZ in a given Region has all the compute families we need. We can do that with the AWS CLI describe-instance-type-offerings command. The easiest way to do this is to use CloudShell, which provides, within a few minutes, a shell environment ready to issue AWS CLI commands. If you want a more permanent development environment for ParallelCluster CLI calls, you can use Cloud9, which persists an IDE environment, including a terminal in which you can run CLI commands. After the environment is provisioned, copy and paste the following command into the shell.
aws ec2 describe-instance-type-offerings \
--location-type availability-zone \
--region <region> \
--filters Name=instance-type,Values=p4d.24xlarge \
--query "InstanceTypeOfferings[*].Location" \
--output text
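If the cluster needs more than one instance type, the command can be repeated per type and the results intersected. The sketch below is one possible way to do that, not something shipped in the repository: azs_for_type and azs_common_to_all are hypothetical helper names, and the region and second instance type in the usage comment are illustrative only.

```shell
# azs_for_type REGION INSTANCE_TYPE
# Wraps the describe-instance-type-offerings call shown above.
azs_for_type() {
  aws ec2 describe-instance-type-offerings \
    --location-type availability-zone \
    --region "$1" \
    --filters "Name=instance-type,Values=$2" \
    --query "InstanceTypeOfferings[*].Location" \
    --output text
}

# azs_common_to_all LIST...
# Each LIST is a whitespace-separated string of AZ names.
# Prints the AZs that appear in every LIST, one per line.
azs_common_to_all() {
  local n=$# list
  for list in "$@"; do
    # One AZ per line, de-duplicated within each list (word splitting is intended)
    printf '%s\n' $list | sort -u
  done | sort | uniq -c | awk -v n="$n" '$1 == n && NF == 2 { print $2 }'
}

# Example (region and second instance type are illustrative):
#   azs_common_to_all "$(azs_for_type us-east-1 p4d.24xlarge)" \
#                     "$(azs_for_type us-east-1 c5n.18xlarge)"
```

Any AZ printed offers all of the listed instance types; an empty result means no single AZ in that Region does.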
Using the output showing which AZs have the compute instances you need, you can create your VPC and subnets. Populate the <REGION>, <SMALL-PUBLIC-SUBNET-ID>, and <LARGE-PRIVATE-SUBNET-ID> inputs in the configuration file.
You’ll also need to create an EC2 SSH key pair so that you can SSH into the head node once your cluster has been deployed, and populate the <EC2-KEY-PAIR-NAME> input in the configuration file.
Create a new S3 bucket for your input data. Replace the <S3-BUCKET> placeholders in the ParallelCluster configuration file with the name of your bucket.
The data transfer mechanism to move data from instruments into S3 depends on the connectivity in the lab environment and the volume of data to be transferred. We recommend AWS DataSync, which automates secure data transfer from on-premises into the cloud with minimal development effort. Storage Gateway File Gateway is another viable option, especially if lab connectivity is limited or continued two-way access from on-premises to the transferred data sets is required. Both DataSync and Storage Gateway support bandwidth throttling, so transfers don't crowd out business-critical, non-HPC network traffic.
Alternatively, you can use the AWS CLI S3 commands to transfer individual files, or use a partner solution to get started quickly.
While ParallelCluster creates its own least-privilege roles and policies by default, many enterprises limit their AWS account users’ access to IAM actions. ParallelCluster also supports using or adding pre-created IAM resources, which you can ask your IT services team to create for you. The required permissions and roles are listed in the ParallelCluster documentation. To get started quickly, use parallel-cluster-cryosparc-custom-roles.yaml, which has additional IAM fields.
Note: In ParallelCluster 3.4, the config file interface was updated to accept EITHER S3Access OR InstanceRole parameters, but not both. Make sure the roles you create have access to S3 in addition to the policies outlined in the documentation linked above.
You can provision your FSx file system as persistent or scratch. Persistent file systems can automatically export data back to Amazon S3, but scratch file systems can’t. The example in this GitHub repo uses scratch, since it provisions a benchmark environment rather than a production environment. You can also integrate a data export task into the ParallelCluster job scheduler so that every time a job completes, a data export runs transparently in the background. This requires additional IAM Policy statements to be attached to the instance profile of the head node. The policy is in the file FSxLustreDataRepoTasksPolicy.yml. Make sure the role that you’re using to execute your ParallelCluster provisioning includes this policy if you intend to run the export.
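As a rough illustration of what such an export step might look like, the sketch below wraps the AWS CLI create-data-repository-task command. It is an assumption about how the export could be triggered, not the repository's actual implementation: export_to_s3 is a hypothetical function name, the DRY_RUN switch is a convention of this sketch only, and it assumes the file system is linked to an S3 data repository.

```shell
# export_to_s3 FILE_SYSTEM_ID PATH
# Hypothetical helper: asks FSx for Lustre to export PATH (relative to the
# file system root) back to the linked S3 data repository.
# Set DRY_RUN=1 to print the command instead of running it.
export_to_s3() {
  ${DRY_RUN:+echo} aws fsx create-data-repository-task \
    --file-system-id "$1" \
    --type EXPORT_TO_REPOSITORY \
    --paths "$2" \
    --report Enabled=false
}

# Example (file system ID and path are made up):
#   DRY_RUN=1 export_to_s3 fs-0123456789abcdef0 results/job-42
```

A call like this could be placed at the end of a job script so that the export starts when the job finishes; the caller needs the permissions granted by the policy file described above.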
Upload the parallel-cluster-cryosparc.yaml (or parallel-cluster-cryosparc-custom-roles.yaml) configuration file (with all of the placeholders filled in) and the parallel-cluster-post-install.sh script to your S3 bucket.
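Before uploading, it can help to confirm that no placeholder was missed. The check below is a heuristic sketch: unfilled_placeholders is a hypothetical helper name, and it assumes every placeholder follows the upper-case <NAME-WITH-DASHES> pattern used in this README.

```shell
# unfilled_placeholders FILE
# Prints each line (with its line number) that still contains a token such as
# <S3-BUCKET>, and returns 1 if any were found, 0 otherwise.
unfilled_placeholders() {
  if grep -n '<[A-Z][A-Z0-9-]*>' "$1"; then
    return 1
  fi
  return 0
}

# Example: upload only when the check passes (bucket name is a placeholder):
#   unfilled_placeholders parallel-cluster-cryosparc.yaml \
#     && aws s3 cp parallel-cluster-cryosparc.yaml "s3://<S3-BUCKET>/"
```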
We recommend using AWS CloudShell to quickly set up an environment that already has the credentials and command line tools you'll need to get started. The AWS CloudShell Console already has credentials to your AWS account, the AWS CLI, and Python installed. If you're not using CloudShell, make sure you have these installed in your local environment before continuing.
Follow the instructions in the AWS ParallelCluster documentation to install AWS ParallelCluster into a virtual environment.
Copy the config file from S3:
aws s3api get-object --bucket cryosparc-parallel-cluster --key parallel-cluster-cryosparc.yaml parallel-cluster-cryosparc.yaml
If you were starting from scratch, you would run pcluster configure to generate a config file. For this solution, we're providing that config file for you, so you can create the cluster immediately using the create-cluster command.
pcluster create-cluster --cluster-name cryosparc-cluster --cluster-configuration parallel-cluster-cryosparc.yaml
Check the status of the cluster creation using the pcluster CLI or the AWS CloudFormation console:
pcluster describe-cluster --cluster-name cryosparc-cluster
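Cluster creation takes a while, so you may prefer to poll rather than re-run the command by hand. The following is a minimal sketch: wait_for_status is a hypothetical helper name, the interval and retry count are arbitrary, and it assumes the status command prints a single value such as "CREATE_COMPLETE" (for example via the pcluster CLI's --query option).

```shell
# wait_for_status TARGET INTERVAL_SECONDS MAX_TRIES COMMAND...
# Runs COMMAND repeatedly until it prints TARGET.
# Returns 0 on TARGET, 1 if the status contains FAILED, 2 if tries run out.
wait_for_status() {
  local target="$1" interval="$2" tries="$3"
  local status i=0
  shift 3
  while [ "$i" -lt "$tries" ]; do
    # Strip the JSON quotes and whitespace around the returned value
    status=$("$@" | tr -d '"[:space:]')
    echo "status: $status" >&2
    case "$status" in
      "$target") return 0 ;;
      *FAILED*) return 1 ;;
    esac
    sleep "$interval"
    i=$((i + 1))
  done
  return 2
}

# Example: check every 30 seconds, for up to an hour:
#   wait_for_status CREATE_COMPLETE 30 120 \
#     pcluster describe-cluster --cluster-name cryosparc-cluster --query clusterStatus
```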
Hint: If the stack rolls back because the head node failed to provision, first verify that your public subnet automatically assigns IPv4 addresses and allows DNS. If you're still having issues, re-create the cluster using the --rollback-on-failure false flag. This keeps CloudFormation from immediately de-provisioning the resources in the cluster. In the CloudFormation console, search for "HeadNode" in the list of Stack resources and click on the instance ID link. Then check the box to the left of the node, and select Actions > Monitor and troubleshoot > Get system log.
Once your cluster has been provisioned, you are ready to continue using AWS ParallelCluster to run your CryoSPARC jobs as described in the CryoSPARC documentation!
To clean up your cluster, use ParallelCluster's delete-cluster command to de-provision the underlying resources in your cluster.
pcluster delete-cluster --cluster-name cryosparc-cluster
Once the cluster has been deleted, you can delete the files you uploaded to S3 and the S3 bucket itself, along with the data transfer solution you chose in the prerequisite sections.
AWS will publish a blog post on how to run CryoSPARC on AWS Parallel Computing Service (PCS), a managed service that makes it easier for you to run and scale your high performance computing (HPC) workloads and build scientific and engineering models on AWS using Slurm. You can find the post-install sample code, intended for installation on the login node, in parallel-computing-service/pcs-cryosparc-post-install.sh, as referenced in the Scalable Cryo-EM on AWS Parallel Computing Service (PCS) blog post. The full architecture is shown in that post.
This library is licensed under the MIT-0 License. See the LICENSE file.