nix-ml-ops is a collection of flake parts for setting up a machine learning development environment and deploying machine learning jobs and services onto cloud platforms.
See the options documentation for all available options.

In the example below, `."${"key"}"` denotes a name picked by the user, while `."key"` denotes a union, i.e. a field for which there are multiple choices. These notations are used for documentation purposes only; in the Nix language they are all identical to plain `.key`.
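As a minimal illustration of that equivalence, the following attribute sets are all identical in Nix:

```nix
# Each of these evaluates to { foo = { bar = 1; }; }
{ foo.bar = 1; }
{ "foo".bar = 1; }
{ "${"foo"}".bar = 1; }
```

Here is an example of the ml-ops config: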
```nix
{
  ml-ops = {
    # Common volumes shared between devcontainer and jobs
    common.volumeMounts."nfs"."${"/mnt/ml-data/"}" = {
      server = "my-server.example.com";
      path = "/ml_data";
    };
    # Common environment variables shared between devcontainer and jobs
    common.environmentVariables = { };
    # Environment variables in addition to the ml-ops.common.environmentVariables
    devcontainer.environmentVariables = { };
    devcontainer.volumeMounts = {
      # Volumes in addition to the ml-ops.common.volumeMounts
      "emptyDir"."${"/mnt/my-temporary-data/"}" = {
        medium = "Memory";
      };
    };
    # TODO: Support elastic
    # jobs."${"training"}".resources."elastic"
    # This is the configuration for single node training or static distributed
    # training, not for elastic distributed training
    jobs."${"training"}".resources."static".accelerators."A100" = 16;
    jobs."${"training"}".resources."static".cpus = 32;
    jobs."${"training"}".run = ''
      torchrun ...
    '';
    # Environment variables in addition to the ml-ops.common.environmentVariables
    jobs."${"training"}".environmentVariables = {
      HF_DATASETS_IN_MEMORY_MAX_SIZE = "200000000";
    };
    # Volumes in addition to the ml-ops.common.volumeMounts
    jobs."${"training"}".volumeMounts = { };
    jobs."${"training"}".launchers."${"my-aks-launcher"}"."kubernetes".imageRegistry.host = "us-central1-docker.pkg.dev/ml-solutions-371721/training-images";
    jobs."${"training"}".launchers."${"my-aks-launcher"}"."kubernetes".namespace = "default";
    jobs."${"training"}".launchers."${"my-aks-launcher"}"."kubernetes".aks = {
      cluster = "mykubernetescluster";
      resourceGroup = "myresourcegroup";
      registryName = "mycontainerregistry";
    };
    # TODO: Other types of launcher
    # jobs."${"training"}".launchers."${"my-aws-ec2-launcher"}"."skypilot" = { ... };
    # Extra package available in both the runtime and development environments:
    pythonEnvs."pep508".common.extraPackages."${"peft"}"."buildPythonPackage".src = peft-src;
    # Extra packages available in the development environment only:
    pythonEnvs."pep508".development.extraPackages = { };
    # TODO: Support poetry projects:
    # pythonEnvs."poetry" = { ... };
  };
}
```
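Since nix-ml-ops is a collection of flake parts, a configuration like the one above typically lives in a flake wired together with flake-parts. The following is a minimal sketch; the exact flake-module attribute exported by nix-ml-ops is an assumption here, so consult the options documentation for the actual imports:

```nix
{
  inputs = {
    flake-parts.url = "github:hercules-ci/flake-parts";
    nix-ml-ops.url = "github:Preemo-Inc/nix-ml-ops";
  };
  outputs = inputs@{ flake-parts, ... }:
    flake-parts.lib.mkFlake { inherit inputs; } {
      systems = [ "x86_64-linux" ];
      # NOTE: the exported module attribute below is an assumption; check the
      # options documentation for the actual flake modules to import.
      imports = [ inputs.nix-ml-ops.flakeModules.default ];
      ml-ops = {
        # ...configuration as in the example above...
      };
    };
}
```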
Then, run the following command to start the job:

```sh
nix build .#training-my-aks-launcher-helmUpgrade
```
The command will internally do the following things:

- Build an image including a Python script with the environment of dependencies specified in `requirements.txt`
- Push the image to the Azure Container Registry `mycontainerregistry.azurecr.io`
- Create a Helm chart including a job to run the image
- Upgrade the Helm chart on the AKS cluster `mykubernetescluster` in the resource group `myresourcegroup`
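Once the Helm release is upgraded, the job can be inspected with the usual Azure and Kubernetes tooling. A hypothetical session using the names from the example config above (the exact release and job names generated by the chart are assumptions):

```sh
# Fetch kubeconfig credentials for the example AKS cluster
az aks get-credentials --resource-group myresourcegroup --name mykubernetescluster
# The upgraded release should show up in the configured namespace
helm list --namespace default
# List the Kubernetes jobs created by the chart (names are chart-generated)
kubectl get jobs --namespace default
```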
This repository also includes packages to build VM images as a NixOS-based devserver:

```sh
nix build .#devserver-gce
nix build .#devserver-amazon
# Azure Image Generation 1
nix build .#devserver-azure
# Azure Image Generation 2
nix build .#devserver-hyperv
```
Note that KVM must be enabled on the devserver. See this document for enabling KVM on GCE. In addition, the following steps are required on Debian to install the KVM kernel modules and enable the `kvm` system feature in Nix:

```sh
sudo apt-get install qemu-kvm
sudo tee -a /etc/nix/nix.conf <<EOF
extra-experimental-features = nix-command flakes
extra-system-features = kvm
EOF
```
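Before building, you can confirm that KVM is actually available (a quick sanity check):

```sh
# The image build runs inside a VM; /dev/kvm must exist for hardware acceleration
test -e /dev/kvm && echo "KVM available" || echo "KVM missing"
```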
```sh
nix run .#upload-devserver-gce-image
```

Note that in order to upload the built image, the `nix run` command must be executed on a GCP VM whose service account has permission to upload images, or after a successful `gcloud auth login`.
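If you are unsure whether the environment is authenticated, a typical sequence looks like this (a sketch; the image name produced by the upload script is project-specific):

```sh
# Authenticate when not running on a GCP VM with a suitable service account
gcloud auth login
# After uploading, the devserver image should appear among the project's custom images
gcloud compute images list --no-standard-images
```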
```sh
nix run .#upload-devserver-azure-image
nix run .#upload-devserver-azure-hyperv
```

Note that in order to upload the built image, the `nix run` command must be executed on an Azure VM whose identity has permission to upload images, or after a successful `az login`.
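The Azure side follows the same pattern (a sketch; the image name and resource group depend on the upload script):

```sh
# Authenticate when not running on an Azure VM with a permitted managed identity
az login
# After uploading, the devserver image should appear in the subscription
az image list --output table
```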
If you have already checked out this repository, run the following command in the work tree. For a VM on GCE:

```sh
sudo nixos-rebuild switch --flake .#devserverGce
```

For an Azure VM:

```sh
sudo nixos-rebuild switch --flake .#devserverAzure
```

Or, from an arbitrary path, run:

```sh
sudo nixos-rebuild switch --flake github:Preemo-Inc/nix-ml-ops#devserverGce
```

or:

```sh
sudo nixos-rebuild switch --flake github:Preemo-Inc/nix-ml-ops#devserverAzure
```
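After the switch completes, you can confirm that the devserver is running the freshly built configuration:

```sh
# Report the active NixOS version after the rebuild
nixos-version
```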