Perceiver IO is a generalization of Perceiver that handles arbitrary inputs and arbitrary outputs. This example shows how to autoencode multimodal video with audio from the Kinetics dataset on AWS Trainium and EC2 GPU instances, orchestrated by Amazon EKS and launched by Karpenter.
It is necessary to plan your application's build-time, deployment-time, and run-time so that it remains flexible across CPUs and AI accelerators. We used the PerceiverForMultimodalAutoencoding sample notebook to demonstrate how to compile and run the Hugging Face Multimodal Perceiver model to classify and autoencode video inputs on Neuron, alongside the GPU version.
To emphasize the build-time considerations, we started with a standalone instance (trn1n, p4d, g5) that downloads the Kinetics dataset to the instance's local NVMe SSD storage, prepares the data, and trains and evaluates a model. Later, we enabled training to resume from interruptions by storing the dataset and training state on Amazon FSx, which also resolved data-loading and performance bottlenecks. Finally, we used Volcano, a Kubernetes-native batch scheduler, to improve training orchestration.
We demonstrate how to simplify the build process by using a single Docker image for Trainium and GPU instances. We start with the amazon-eks-gpu-node AMI on Amazon Linux or Ubuntu. Then we build a Docker image that supports x86/AMD instances such as G5, P4, and Trn/Inf, as well as Graviton-based instances such as G5g. To abstract the AI accelerator chips, we use Python venvs: CUDA for P and G instances, and the Neuron SDK for Trn and Inf instances.
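The venv-based abstraction can be sketched as a container entrypoint that probes for the accelerator at run-time. This is an illustrative sketch, not the exact script used here: the venv paths and probe order are assumptions.

```shell
#!/bin/sh
# Pick the accelerator stack at run-time so one image serves all instance types.
# Neuron devices appear as /dev/neuron*; NVIDIA GPUs expose nvidia-smi.
if [ -e /dev/neuron0 ]; then
  ACCEL=neuron
elif command -v nvidia-smi >/dev/null 2>&1; then
  ACCEL=cuda
else
  ACCEL=cpu
fi
echo "selected accelerator: $ACCEL"
# Activate the matching venv (paths are assumptions for this sketch):
# . "/opt/venv-$ACCEL/bin/activate"
```

On a Trn1 node this selects the Neuron venv, on a G5/P4 node the CUDA venv, and elsewhere it falls back to CPU.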
At deploy-time, we use Karpenter to simplify deployment: we specify a single deployment specification for the training job and let Karpenter prioritize instance types based on availability and cost. For example, we use one Karpenter provisioner for Trn instances and another for GPU instances. The .spec.weight field indicates the priority we want to give each provisioner. We can make instance selection more granular by adding another Karpenter provisioner for G5g, a Graviton-based instance with NVIDIA T4G Tensor Core GPUs.
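A pair of weighted provisioners could look like the following sketch (Karpenter v1alpha5 Provisioner API; the names, weights, and instance types below are illustrative assumptions):

```yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: neuron-trn
spec:
  weight: 100            # higher weight: Karpenter tries Trn capacity first
  requirements:
    - key: node.kubernetes.io/instance-type
      operator: In
      values: ["trn1.32xlarge", "trn1n.32xlarge"]
---
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: nvidia-gpu
spec:
  weight: 50             # fallback when Trn capacity is unavailable
  requirements:
    - key: node.kubernetes.io/instance-type
      operator: In
      values: ["g5.12xlarge", "p4d.24xlarge"]
```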
- Create EKS cluster and deploy Karpenter
- Deploy Karpenter provisioner for Trainium
kubectl apply -f amd-neuron-provisioner.yaml
- Deploy Karpenter provisioner for NVIDIA-based instances
- Deploy FSx for Lustre CSI driver
- Deploy the Volcano CRD
- Deploy Container Insights on Amazon EKS
- Deploy NVIDIA device plugin for Kubernetes
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.1/nvidia-device-plugin.yml
- Deploy Neuron device plugin for Kubernetes
kubectl apply -f k8s-neuron-device-plugin-rbac.yml
kubectl apply -f k8s-neuron-device-plugin.yml
The build process creates OCI images for x86-based instances. You add another build step to create OCI images for Graviton-based instances. This new build process creates an OCI image manifest list that references both OCI images. The container runtime (Docker Engine or containerd) pulls the correct platform-specific image at deployment time. To automate the OCI image build process, we use AWS CodePipeline. AWS CodePipeline starts by building an OCI image from the code in AWS CodeBuild, which is then pushed to Amazon Elastic Container Registry (Amazon ECR).
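The manifest-list step can be sketched as follows. The repository name and tags are placeholders, and the `run` dry-run wrapper only prints each command so the sketch is safe to execute as-is; drop it to run the build for real.

```shell
# Placeholder repository; substitute your own ECR repository URI.
ECR_REPO="${ECR_REPO:-123456789012.dkr.ecr.us-west-2.amazonaws.com/perceiver}"
run() { echo "+ $*"; }   # dry-run wrapper: prints the command instead of running it

# Build and push one image per architecture.
run docker build --platform linux/amd64 -t "$ECR_REPO:train-amd64" .
run docker build --platform linux/arm64 -t "$ECR_REPO:train-arm64" .
run docker push "$ECR_REPO:train-amd64"
run docker push "$ECR_REPO:train-arm64"

# The manifest list ties both images to one tag; the container runtime
# resolves the platform-specific image at pull time.
run docker manifest create "$ECR_REPO:train" \
  "$ECR_REPO:train-amd64" "$ECR_REPO:train-arm64"
run docker manifest push "$ECR_REPO:train"
```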
We based the hyperparameter settings on the GLUE benchmark (Table 1 - Perceiver IO on language), where SPS denotes train-time steps per second, M the number of inputs, and N the number of latents. The settings are defined in config/main.yaml and perceiver-trn-job.yaml.
--------TRAINING CONFIG----------
Namespace(batch_size=1, config_file_path='config/main.yaml', dataset='kinetics-small', dataset_dir='/dataset', do_eval=False, drop_last=False, enable_pt_autocast=False, expected_average_throughput=0, image_dim=224, log_steps=4, logdir='log_training', lr=1e-05, max_steps=100, metrics_debug=False, momentum=0.9, num_epochs=1, num_workers=2, target_accuracy=0, test_batch_size=8)
---------------------------------
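Assuming config/main.yaml mirrors the flags printed in the dump above (the key names below are an assumption based on that dump), it might look like:

```yaml
# Hypothetical config/main.yaml, reconstructed from the TRAINING CONFIG dump.
batch_size: 1
dataset: kinetics-small
dataset_dir: /dataset
do_eval: false
drop_last: false
enable_pt_autocast: false
image_dim: 224
log_steps: 4
logdir: log_training
lr: 1.0e-05
max_steps: 100
momentum: 0.9
num_epochs: 1
num_workers: 2
test_batch_size: 8
```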
- Deploy training job on Trainium
kubectl apply -f ./perceiver-trn-job.yaml
The training includes the following phases:
1/ Download the training dataset.
2/ Run PerceiverForMultimodalAutoencoding using torch.distributed.run.
3/ Run TrainiumTrainer to prepare the data; the dataset is selected by FLAGS.dataset:
kubectl logs $POD_NAME | grep "Preparing data"
Results are stored in ~/.torch/vision/datasets/kinetics/, e.g., /root/.torch/vision/datasets/kinetics/1723303957.pt. Look for "Successfully built the dataset", i.e.:
kubectl logs $POD_NAME | grep "Successfully built the dataset"
4/ Training starts using Neuron devices (xla):
kubectl logs perceiver-trn-58c77f446c-vprlp | grep -A30 "Compiler status PASS"
Compiler status PASS
2023-09-30 19:55:36.000961: INFO ||NCC_WRAPPER||: Exiting with a successfully compiled graph
2023-09-30 19:55:38.000516: INFO ||NCC_WRAPPER||: No candidate found under /var/tmp/neuron-compile-cache/USER_neuroncc-2.10.0.35+3817a0c8c/MODULE_6307988913499384240.
2023-09-30 19:55:38.000517: INFO ||NCC_WRAPPER||: Cache dir for the neff: /var/tmp/neuron-compile-cache/USER_neuroncc-2.10.0.35+3817a0c8c/MODULE_6307988913499384240/MODULE_1_SyncTensorsGraph.4360_6307988913499384240_perceiver-trn-58c77f446c-vprlp-e2f5ebd3-526-60698ecd528fd/9c321fcf-f62f-4630-819e-e18a0e001854
.......
Compiler status PASS
2023-09-30 19:57:45.000412: INFO ||NCC_WRAPPER||: Exiting with a successfully compiled graph
2023-09-30 19:57:45.000413: INFO ||NCC_WRAPPER||: No candidate found under /var/tmp/neuron-compile-cache/USER_neuroncc-2.10.0.35+3817a0c8c/MODULE_875038235640162619.
2023-09-30 19:57:45.000414: INFO ||NCC_WRAPPER||: Cache dir for the neff: /var/tmp/neuron-compile-cache/USER_neuroncc-2.10.0.35+3817a0c8c/MODULE_875038235640162619/MODULE_2_SyncTensorsGraph.18963_875038235640162619_perceiver-trn-58c77f446c-vprlp-ee1fe64d-526-60698eced10ee/ee2ad0db-252f-40e9-83fc-a3f751bbd984
| Training Device=xla:1 Epoch=1 Step=0 Learning_Rate=1e-05 Loss=0.09277 Throughput=8.76648 Time=2023-09-30 19:57:48.654494
- Deploy training job on G5 or P5 instances (TBD)
- Deploy training job on G5g (TBD)