KubeDL is short for Kubernetes-Deep-Learning. It is a unified operator that supports running multiple types of distributed deep learning and machine learning workloads on Kubernetes. For more information, see the website: https://kubedl.io
Currently, KubeDL supports the following ML/DL jobs:
- TensorFlow
- PyTorch
- XGBoost
- XDL
- Mars
- MPI Job

KubeDL provides the following features:
- Support running prevalent deep learning workloads in a single operator.
- Support running jobs with custom artifacts from a remote repository such as GitHub, saving users from manually baking the artifacts into the image.
- Instrumented with unified Prometheus metrics for different types of DL jobs, such as job launch delay and the number of pending/running jobs.
- Support job metadata persistence with a pluggable storage backend such as MySQL.
- Provide more granular job status information on the kubectl command line.
- Support advanced scheduling features such as gang scheduling with pluggable backend schedulers.
- A modular architecture that can be easily extended to more types of DL/ML workloads with shared libraries; see how to add a custom job workload.
- Run jobs with host network.
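
To build and test KubeDL from source:
- Build the controller manager binary: `make manager`
- Run the tests: `make test`
- Generate manifests (e.g. CRD and RBAC YAML files): `make manifests`
- Build the Docker image: `export IMG=<your_image_name> && make docker-build`
- Push the image: `docker push <your_image_name>`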

Check the Makefile in the root directory for more details.
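
The build commands above can be chained into one script. The following is a minimal sketch, not part of the repository: the `run` and `build_all` helpers, the `DRY_RUN` switch, and the default image name are hypothetical names introduced here for illustration. It passes `IMG` to `make docker-build` through `env` instead of `export`, and assumes it is run from the repository root. By default it only prints the commands it would execute.

```shell
#!/usr/bin/env bash
# Sketch: run the developer workflow in order and stop at the first failure.
set -euo pipefail

# Image name consumed by `make docker-build`; this default is a placeholder.
IMG="${IMG:-example.com/kubedl:dev}"
# Hypothetical switch of this script: 1 prints the commands, 0 executes them.
DRY_RUN="${DRY_RUN:-1}"

# Print a command, then execute it unless DRY_RUN is 1.
run() {
  echo "+ $*"
  if [ "$DRY_RUN" != "1" ]; then
    "$@"
  fi
}

# The five steps from the list above, in the same order.
build_all() {
  run make manager
  run make test
  run make manifests
  run env IMG="$IMG" make docker-build
  run docker push "$IMG"
}

build_all
```

To execute the steps instead of printing them, set `DRY_RUN=0` and `IMG=<your_image_name>` in the environment before running the script.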