This repository provides resources for molecular dynamics simulations and other long-running tasks (such as model fine-tuning and hyperparameter tuning) on SaladCloud. It includes blogs, reference designs, benchmarking code, demonstration applications, and test reports.
If you are new to SaladCloud, we recommend starting with the SCE Architectural Overview and Docker Run on SaladCloud. The Build High-Performance Applications tutorial shares best practices and proven insights from customers who have built large-scale AI inference applications and run molecular dynamics simulations on tens to thousands of Salad GPU nodes.
Long-Running Tasks - Demo App 1
Use Kelpie as the job queue along with its built-in data management.
Long-Running Tasks - Demo App 3
Use Kelpie solely as a job queue, while implementing custom data management (Cloudflare R2 + rclone).
Demo App 3 improves on Demo App 2 (v2) in two key areas:
- Simplified Architecture: It reduces application complexity by eliminating the need for custom job and lease management, cutting the demo app from about 600 to about 400 lines of Python (roughly a one-third reduction).
- Enhanced Task Duration: It removes the constraint imposed by AWS SQS's 12-hour maximum visibility timeout, which caps how long a single job can run at a time, so longer-running tasks are supported seamlessly on SaladCloud.
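The custom data management in Demo App 3 (Cloudflare R2 + rclone) can be pictured with a minimal sketch like the one below. It is not the demo app's actual code: the remote name `r2`, the bucket `my-bucket`, the `jobs/<id>/checkpoints` layout, and the helper names are all hypothetical, and it assumes an rclone remote has already been configured for R2. Only the `rclone copy <source> <dest>` command itself is taken from the rclone CLI.

```python
import subprocess
from typing import Callable, List

# Hypothetical names: "r2" is an rclone remote assumed to be configured for
# Cloudflare R2, and "my-bucket" is a placeholder bucket.
REMOTE = "r2"
BUCKET = "my-bucket"


def remote_path(key: str) -> str:
    """Map an object key prefix to an rclone path such as r2:my-bucket/jobs/1."""
    return f"{REMOTE}:{BUCKET}/{key.lstrip('/')}"


def build_rclone_copy(src: str, dst: str) -> List[str]:
    """Build the argv for `rclone copy`, which copies src to dst and skips identical files."""
    return ["rclone", "copy", src, dst]


def upload_checkpoint(local_dir: str, job_id: str,
                      run: Callable = subprocess.run) -> None:
    """Push a job's local checkpoint directory to R2; raises if rclone fails."""
    dst = remote_path(f"jobs/{job_id}/checkpoints")
    run(build_rclone_copy(local_dir, dst), check=True)


def download_checkpoint(job_id: str, local_dir: str,
                        run: Callable = subprocess.run) -> None:
    """Pull a job's checkpoints from R2 so an interrupted job can resume."""
    src = remote_path(f"jobs/{job_id}/checkpoints")
    run(build_rclone_copy(src, local_dir), check=True)
```

A worker built this way would typically call `download_checkpoint` when it picks up a job and `upload_checkpoint` at regular intervals, so that a job interrupted on one node can resume on another.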
Long-Running Tasks - Demo App 2 (v2) (deprecated)
Use AWS SQS as a job queue, while implementing custom data management (Cloudflare R2 + boto3).
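To illustrate the kind of lease management an SQS-based design needs, here is a minimal sketch of a background thread that keeps extending a message's visibility timeout while a job runs. It is an illustration under stated assumptions, not the code from Demo App 2: the `LeaseRenewer` class and its parameters are hypothetical, and `sqs` is assumed to be a boto3 SQS client (or any object exposing `change_message_visibility` with the same keyword arguments).

```python
import threading

# SQS caps the visibility timeout at 12 hours, measured from when the message
# was received; renewing the lease cannot extend a job past that limit.
MAX_VISIBILITY_SECONDS = 12 * 60 * 60


class LeaseRenewer:
    """Periodically extends an SQS message's visibility timeout (hypothetical helper)."""

    def __init__(self, sqs, queue_url, receipt_handle,
                 lease_seconds=300, interval_seconds=120):
        self._sqs = sqs  # assumed: boto3.client("sqs") or a compatible object
        self._queue_url = queue_url
        self._receipt_handle = receipt_handle
        self._lease_seconds = min(lease_seconds, MAX_VISIBILITY_SECONDS)
        self._interval = interval_seconds
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def renew_once(self):
        """Hide the message from other consumers for another lease period."""
        self._sqs.change_message_visibility(
            QueueUrl=self._queue_url,
            ReceiptHandle=self._receipt_handle,
            VisibilityTimeout=self._lease_seconds,
        )

    def _run(self):
        # Event.wait returns False on timeout, so this renews until stop() is called.
        while not self._stop.wait(self._interval):
            self.renew_once()

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()
```

A worker would start the renewer after receiving a message, run the job, then stop the renewer and delete the message. If the node disappears mid-job, renewals stop and the message becomes visible to other workers once the current lease expires.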
This implementation uses separate threads for I/O operations (including health checks) and AI inference, so it can handle concurrent requests efficiently and serve them with batched inference. It can be used for image generation, transcription, and non-streaming LLM tasks.
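The split between I/O threads and a dedicated inference thread can be sketched as follows, using only the Python standard library. This is a simplified illustration, not the repository's implementation: `BatchingWorker`, `max_batch`, `max_wait`, and the `infer_batch` callable are hypothetical names. I/O threads call `submit` and stay free to answer health checks, while a single inference thread groups pending requests into batches.

```python
import queue
import threading
import time
from concurrent.futures import Future
from typing import Callable, List


class BatchingWorker:
    """Collects requests from I/O threads and runs them in batches on one thread."""

    def __init__(self, infer_batch: Callable[[List[str]], List[str]],
                 max_batch: int = 4, max_wait: float = 0.05):
        self._infer_batch = infer_batch  # stand-in for the real model call
        self._max_batch = max_batch
        self._max_wait = max_wait
        self._requests: queue.Queue = queue.Queue()
        threading.Thread(target=self._loop, daemon=True).start()

    def submit(self, item: str) -> Future:
        """Called from I/O threads; returns immediately with a Future."""
        fut: Future = Future()
        self._requests.put((item, fut))
        return fut

    def _loop(self) -> None:
        while True:
            batch = [self._requests.get()]  # block until one request arrives
            deadline = time.monotonic() + self._max_wait
            # Keep filling the batch until it is full or the wait window closes.
            while len(batch) < self._max_batch:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    break
                try:
                    batch.append(self._requests.get(timeout=remaining))
                except queue.Empty:
                    break
            try:
                outputs = self._infer_batch([item for item, _ in batch])
                for (_, fut), out in zip(batch, outputs):
                    fut.set_result(out)
            except Exception as exc:  # propagate model errors to every caller
                for _, fut in batch:
                    fut.set_exception(exc)
```

A request handler would call `worker.submit(prompt).result(timeout=...)`. Larger `max_wait` values trade a little latency for fuller batches.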
Benchmarks and best practices for designing a high-performance and cost-effective storage solution for applications on SaladCloud.
A summary of the common challenges in migrating workloads from hyperscalers to SaladCloud, along with best practices for successful application deployments.