Run your deep learning training workloads on Kubernetes

littletiger123/kubedl

KubeDL

KubeDL is short for Kubernetes-Deep-Learning. It is a unified operator for running multiple types of distributed deep learning and machine learning workloads on Kubernetes. See the website: https://kubedl.io


Currently, KubeDL supports the following ML/DL jobs:

Features

  • Run prevalent deep learning workloads in a single operator.
  • Run jobs with custom artifacts from a remote repository such as GitHub, saving users from manually baking the artifacts into the image.
  • Unified Prometheus metrics for different types of DL jobs, such as job launch delay and the number of pending/running jobs.
  • Persist job metadata with a pluggable storage backend such as MySQL.
  • Show more granular job status information on the kubectl command line.
  • Support advanced scheduling features such as gang scheduling with pluggable backend schedulers.
  • A modular architecture that can be easily extended to more types of DL/ML workloads with shared libraries; see how to add a custom job workload.
  • Run jobs with host network.

Build right away

make manager

Run the tests

make test

Generate manifests, e.g. CRD and RBAC YAML files

make manifests

Build the docker image

export IMG=<your_image_name> && make docker-build

Push the image

docker push <your_image_name>

Check the Makefile in the root directory for more details.
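
The steps above can be chained in a small wrapper script. The following is a minimal sketch and not part of the repository: the function names `run` and `kubedl_build_all`, the `DRY_RUN` switch, and the default image name are assumptions made for illustration. Only the `make` targets and the `docker push` command come from the steps above. By default the script only prints the commands it would run.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with KubeDL): runs the build steps
# listed above in order and stops at the first failing step.

# Print a command; execute it only when DRY_RUN is set to something other than 1.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" != "1" ]; then
        "$@"
    fi
}

kubedl_build_all() {
    # IMG is the image name used by `make docker-build`.
    # The default below is a placeholder, not a real registry path.
    IMG="${IMG:-example.com/kubedl:dev}"
    export IMG

    run make manager &&
    run make test &&
    run make manifests &&
    run make docker-build &&
    run docker push "$IMG"
}
```

After sourcing the file, `kubedl_build_all` prints the five commands without running them. Setting `DRY_RUN=0` and `IMG=<your_image_name>` before the call would execute them for real.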
