Kubeflow Operator helps deploy, monitor, and manage the lifecycle of Kubeflow. It is built using the Operator Framework, an open source toolkit to build, test, and package operators and manage their lifecycle.
The Operator is currently in the incubation phase and is based on this design doc. It is built on top of the KfDef CR and uses kfctl as the nucleus of the controller. The current roadmap for this Operator is listed here. The Operator is also published on OperatorHub.
- Install kustomize
- Clone this repository, build the manifests and install the operator
git clone https://github.com/kubeflow/kfctl.git && cd kfctl
export OPERATOR_NAMESPACE=operators
kubectl create ns ${OPERATOR_NAMESPACE}
cd deploy/
kustomize edit set namespace ${OPERATOR_NAMESPACE}
# kustomize edit add resource kustomize/include/quota # only deploy this if the k8s cluster is 1.15+ and has resource quota support, which will allow only one _kfdef_ instance or one deployment of Kubeflow on the cluster. This follows the singleton model, and is the current recommended and supported mode.
kustomize build | kubectl apply -f -
- Deploy KfDef
KfDef can point to a remote URL or to a local kfdef file. To use the set of default kfdefs from Kubeflow, follow the Deploy with default kfdefs section below.
KUBEFLOW_NAMESPACE=kubeflow
kubectl create ns ${KUBEFLOW_NAMESPACE}
kubectl create -f <kfdef> -n ${KUBEFLOW_NAMESPACE}
To use the set of default kfdefs from Kubeflow, you must insert the metadata.name field before you can apply a kfdef to Kubernetes. The commands below apply a Kubeflow kfdef using the Operator, with OpenShift and IBM Cloud as examples.
If your kfdef file is on the local machine, set KFDEF to the kfdef file path and skip the curl command.
First, point to your cloud provider's kfdef. For example, for OpenShift, point to the kfdef in the OpenDataHub repo:
export KFDEF_URL=https://raw.githubusercontent.com/opendatahub-io/manifests/v0.7-branch-openshift/kfdef/kfctl_openshift.yaml
Similarly, for GCP, IBM Cloud, etc., you can point to the respective kfdefs in the Kubeflow repository, for example:
export KFDEF_URL=https://raw.githubusercontent.com/kubeflow/manifests/master/kfdef/kfctl_ibm.yaml
Then specify the KUBEFLOW_DEPLOYMENT_NAME you want to give to your deployment. Note that multi-user deployments currently have a hard dependency on using kubeflow as the deployment name.
export KUBEFLOW_DEPLOYMENT_NAME=kubeflow
export KFDEF=$(echo "${KFDEF_URL}" | rev | cut -d/ -f1 | rev)
curl -L ${KFDEF_URL} > ${KFDEF}
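The file-name derivation and the "skip curl for a local file" rule can be folded into two small helpers. This is only a sketch: the function names kfdef_filename and resolve_kfdef are hypothetical and not part of kfctl, and the curl flags are one reasonable choice rather than a requirement.

```shell
# Hypothetical helpers; the names are illustrative and not part of kfctl.

# Print the last path segment of a URL or path
# (the same result as the rev | cut | rev pipeline).
kfdef_filename() {
  local src="$1"
  printf '%s\n' "${src##*/}"
}

# Print the local kfdef path for a source:
# remote URLs are downloaded into the current directory, local paths pass through.
resolve_kfdef() {
  local src="$1" out
  case "${src}" in
    http://*|https://*)
      out="$(kfdef_filename "${src}")"
      # -f makes curl fail on HTTP errors instead of saving an error page.
      curl -fsSL "${src}" -o "${out}" || return 1
      printf '%s\n' "${out}"
      ;;
    *)
      printf '%s\n' "${src}"
      ;;
  esac
}
```

With these defined, `export KFDEF=$(resolve_kfdef "${KFDEF_URL}")` would cover both the remote and the local case.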
Next, update the KFDEF file with the KUBEFLOW_DEPLOYMENT_NAME. We strongly recommend installing the yq tool and running the yq command. However, if you can't install yq, you can run the perl command to do the same thing, assuming you are using one of the kfdefs under the manifests repository.
yq w ${KFDEF} 'metadata.name' ${KUBEFLOW_DEPLOYMENT_NAME} > ${KFDEF}.tmp && mv ${KFDEF}.tmp ${KFDEF}
# perl -pi -e $'s@metadata:@metadata:\\\n name: '"${KUBEFLOW_DEPLOYMENT_NAME}"'@' ${KFDEF}
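If neither yq nor perl is available, the same insertion can be approximated with awk. The helper below is a sketch under stated assumptions: the function name set_kfdef_name is hypothetical, the kfdef is assumed to use two-space indentation with a top-level metadata: line on its own, and metadata.name is assumed not to be set already.

```shell
# Hypothetical helper: insert "  name: <name>" right after the first
# top-level "metadata:" line of a kfdef file.
# Assumes two-space indentation and that metadata.name is not already present.
set_kfdef_name() {
  local file="$1" name="$2"
  awk -v name="${name}" '
    { print }
    /^metadata:[[:space:]]*$/ && !done { print "  name: " name; done = 1 }
  ' "${file}" > "${file}.tmp" && mv "${file}.tmp" "${file}"
}
```

It would be called as `set_kfdef_name "${KFDEF}" "${KUBEFLOW_DEPLOYMENT_NAME}"`. Unlike yq, it does not parse YAML, so check the result before applying it.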
Lastly, deploy the kfdef resource to the cluster.
kubectl create -f ${KFDEF} -n ${KUBEFLOW_NAMESPACE}
One of the major benefits of running kfctl as an Operator is that it can watch and reconcile your Kubeflow deployments. The Operator watches all cluster events for the KfDef instance, as well as the Delete event for every resource owned by the KfDef instance. Each such event is queued as a request for the reconciler to apply changes to the KfDef instance. For example, if one of the Kubeflow resources is deleted, the reconciler is triggered to re-apply the KfDef instance and re-create the deleted resource on the cluster. The Kubeflow deployment managed by this KfDef instance therefore recovers automatically from an unexpected delete event.
Try the following to see the operator watcher and reconciler in action:
- Check that the tf-job-operator deployment is running
kubectl get deploy -n ${KUBEFLOW_NAMESPACE} tf-job-operator
# NAME READY UP-TO-DATE AVAILABLE AGE
# tf-job-operator 1/1 1 1 7m15s
- Delete the tf-job-operator deployment
kubectl delete deploy -n ${KUBEFLOW_NAMESPACE} tf-job-operator
# deployment.extensions "tf-job-operator" deleted
- Wait for 10 to 15 seconds, then check the tf-job-operator deployment again
You will be able to see that the deployment is being recreated by the Operator's reconciliation logic.
kubectl get deploy -n ${KUBEFLOW_NAMESPACE} tf-job-operator
# NAME READY UP-TO-DATE AVAILABLE AGE
# tf-job-operator 0/1 0 0 10s
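Rather than waiting a fixed 10 to 15 seconds, the check can be polled until the deployment is available again. The helper below is a sketch: the function name wait_for_deployment, the 5-second interval, and the 120-second default timeout are illustrative assumptions, not values taken from the operator.

```shell
# Hypothetical helper: poll until a deployment reports at least one
# available replica, or give up after a timeout (in seconds).
wait_for_deployment() {
  local name="$1" ns="$2" timeout="${3:-120}" waited=0 available
  while [ "${waited}" -lt "${timeout}" ]; do
    # availableReplicas is empty while the deployment is missing or not ready.
    available="$(kubectl get deploy "${name}" -n "${ns}" \
      -o jsonpath='{.status.availableReplicas}' 2>/dev/null)"
    if [ "${available:-0}" -ge 1 ]; then
      echo "${name} is available after ${waited}s"
      return 0
    fi
    sleep 5
    waited=$((waited + 5))
  done
  echo "${name} not available after ${timeout}s" >&2
  return 1
}
```

After deleting the deployment, `wait_for_deployment tf-job-operator "${KUBEFLOW_NAMESPACE}"` should return once the reconciler has recreated it.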
The Kubeflow operator also supports deploying multiple KfDef instances. It watches all KfDef instances and handles reconcile requests for each of them. To learn more about the operator controller behavior, refer to this controller-runtime link.
The operator responds to the following events:
- When a KfDef instance is created or updated, the operator's reconciler is notified of the event and invokes the Apply function provided by the kfctl package to deploy Kubeflow. The Kubeflow resources specified in the manifests are given the following annotation to indicate that they are owned by this KfDef instance:
  annotations:
    kfctl.kubeflow.io/kfdef-instance: <kfdef-name>.<kfdef-namespace>
- When a KfDef instance is deleted, the operator's reconciler is notified of the event and invokes the finalizer to run the Delete function provided by the kfctl package, which goes through all applications and components owned by the KfDef instance.
- When any resource deployed as part of a KfDef instance is deleted, the operator's reconciler is notified of the event and invokes the Apply function provided by the kfctl package to re-deploy Kubeflow. The deleted resource is recreated with the same manifest that was specified when the KfDef instance was created.
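The ownership annotation described above can be inspected from the command line. The helpers below are a sketch: the function names kfdef_owner and is_owned_by_kfdef are hypothetical, and only deployments are checked here, although other resource kinds carry the same annotation.

```shell
# Hypothetical helper: print the KfDef owner annotation of a deployment
# (prints nothing if the annotation is absent).
# The dots in the annotation key are escaped for kubectl's JSONPath parser.
kfdef_owner() {
  local name="$1" ns="$2"
  kubectl get deploy "${name}" -n "${ns}" \
    -o jsonpath='{.metadata.annotations.kfctl\.kubeflow\.io/kfdef-instance}'
}

# Hypothetical helper: succeed if the deployment is annotated as owned by
# the KfDef instance <kfdef-name> in namespace <kfdef-namespace>.
is_owned_by_kfdef() {
  local name="$1" ns="$2" kfdef_name="$3" kfdef_ns="$4"
  [ "$(kfdef_owner "${name}" "${ns}")" = "${kfdef_name}.${kfdef_ns}" ]
}
```

For the deployment used earlier, `is_owned_by_kfdef tf-job-operator "${KUBEFLOW_NAMESPACE}" "${KUBEFLOW_DEPLOYMENT_NAME}" "${KUBEFLOW_NAMESPACE}"` would indicate whether a delete event on it is expected to trigger a re-apply.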
- Delete the Kubeflow deployment (the KfDef instance)
kubectl delete kfdef -n ${KUBEFLOW_NAMESPACE} --all
Note that the user profile namespaces created by profile-controller will not be deleted. The ${KUBEFLOW_NAMESPACE} namespace, created outside of the operator, will not be deleted either.
- Delete Kubeflow Operator
kubectl delete -f deploy/operator.yaml -n ${OPERATOR_NAMESPACE}
kubectl delete clusterrolebinding kubeflow-operator
kubectl delete -f deploy/service_account.yaml -n ${OPERATOR_NAMESPACE}
kubectl delete -f deploy/crds/kfdef.apps.kubeflow.org_kfdefs_crd.yaml
kubectl delete ns ${OPERATOR_NAMESPACE}
Please follow the instructions here to register your Operator with OLM if you are using that to install and manage the Operator. If you want to leverage the OperatorHub, please use the default Kubeflow Operator registered there.
- When deleting the Kubeflow deployment, some mutatingwebhookconfigurations resources are cluster-wide and may not be removed, because their owner is not the KfDef instance. To remove them, run the following:
kubectl delete mutatingwebhookconfigurations admission-webhook-mutating-webhook-configuration
kubectl delete mutatingwebhookconfigurations inferenceservice.serving.kubeflow.org
kubectl delete mutatingwebhookconfigurations istio-sidecar-injector
kubectl delete mutatingwebhookconfigurations katib-mutating-webhook-config
kubectl delete mutatingwebhookconfigurations mutating-webhook-configurations
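The same cleanup can be written as a loop that tolerates configurations that are already gone. This is a sketch: the function name delete_kubeflow_webhooks is hypothetical, and the list is copied from the commands above, so it may not match every Kubeflow version.

```shell
# Hypothetical helper: delete the leftover cluster-wide webhook
# configurations listed above.
# --ignore-not-found keeps the loop going when one is already removed.
delete_kubeflow_webhooks() {
  local cfg
  for cfg in \
    admission-webhook-mutating-webhook-configuration \
    inferenceservice.serving.kubeflow.org \
    istio-sidecar-injector \
    katib-mutating-webhook-config \
    mutating-webhook-configurations; do
    kubectl delete mutatingwebhookconfigurations "${cfg}" --ignore-not-found
  done
}
```

Be careful on shared clusters: a name such as istio-sidecar-injector may belong to an Istio installation that other workloads still use.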
- Install operator-sdk
- Install golang
- Install kustomize
These steps are based on the operator-sdk, with modifications that are specific to this Kubeflow operator.
- Clone this repository under your $GOPATH (e.g. ~/go/src/github.com/kubeflow/)
git clone https://github.com/kubeflow/kfctl
cd kfctl
- Build and push the operator
export OPERATOR_IMG=<docker_repo>
make build-operator
make push-operator
Note: replace <docker_repo> with the image repo name and tag, for example, docker.io/example/kubeflow-operator:latest.
- Follow the Deployment Instructions section to test the operator with the newly built image
The Kubeflow Operator controller logic is based on the kfctl package, so for each major release of kfctl, an operator image is built and tested with that version of manifests to deploy a KfDef instance. The following table shows which releases have been tested.
| branch tag | operator image | manifests version | kfdef example | note |
|---|---|---|---|---|
| v1.0 | aipipeline/kubeflow-operator:v1.0.0 | 1.0.0 | kfctl_k8s_istio.v1.0.0.yaml | |
| v1.0.1 | aipipeline/kubeflow-operator:v1.0.1 | 1.0.1 | kfctl_k8s_istio.v1.0.1.yaml | |
| v1.0.2 | aipipeline/kubeflow-operator:v1.0.2 | 1.0.2 | kfctl_k8s_istio.v1.0.2.yaml | |
| master | aipipeline/kubeflow-operator:master | master | kfctl_k8s_istio.yaml | as of 05/15/2020 |
Note: if you want to build a customized operator for a specific version of Kubeflow, run git checkout to switch to that specific branch tag. Remember to use the matching version of manifests.
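When scripting against a specific release, the tested pairs from the table above can be captured in a small lookup. This is a sketch: the function name operator_image_for is hypothetical, and the mapping reflects only the releases listed in the table.

```shell
# Hypothetical helper: map a branch tag from the table above to the
# operator image that was tested with it.
operator_image_for() {
  case "$1" in
    v1.0)   echo "aipipeline/kubeflow-operator:v1.0.0" ;;
    v1.0.1) echo "aipipeline/kubeflow-operator:v1.0.1" ;;
    v1.0.2) echo "aipipeline/kubeflow-operator:v1.0.2" ;;
    master) echo "aipipeline/kubeflow-operator:master" ;;
    *)
      echo "no tested operator image for '$1'" >&2
      return 1
      ;;
  esac
}
```

An unknown tag returns a non-zero status, so a calling script can stop instead of deploying an untested combination.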