feat: Support debugging webhooks locally
shalousun committed Aug 17, 2024
1 parent 3f9b0a4 commit 2044b41
Showing 1 changed file with 89 additions and 4 deletions.
93 changes: 89 additions & 4 deletions docs/development/developer_guide.md
@@ -11,8 +11,9 @@ Kubeflow Training Operator is currently at v1.
- [Python](https://www.python.org/) (3.11 or later)
- [kustomize](https://kustomize.io/) (4.0.5 or later)
- [Kind](https://kind.sigs.k8s.io/) (0.22.0 or later)
- [Lima](https://github.com/lima-vm/lima?tab=readme-ov-file#adopters) (an alternative to DockerDesktop) (0.21.0 or later)
- [Colima](https://github.com/abiosoft/colima) (Lima specifically for MacOS) (0.6.8 or later)
- [pre-commit](https://pre-commit.com/)

Note for Lima the link is to the Adopters, which supports several different container environments.
@@ -49,38 +50,50 @@ Running the operator locally (as opposed to deploying it on a K8s cluster) is co
First, you need to run a Kubernetes cluster locally. We recommend [Kind](https://kind.sigs.k8s.io).

You can create a `kind` cluster by running

```sh
kind create cluster
```

This will load your kubernetes config file with the new cluster.
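
You can verify that `kubectl` is now pointing at the new cluster by checking the current context (a quick check; with the default cluster name, Kind names the context `kind-kind`):

```sh
kubectl config current-context
# kind-kind
```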

After creating the cluster, you can check the nodes with the command below, which should show the `kind-control-plane` node.

```sh
kubectl get nodes
```

The output should look something like below:

```
$ kubectl get nodes
NAME                 STATUS   ROLES           AGE   VERSION
kind-control-plane   Ready    control-plane   32s   v1.27.3
```

Note that the example PyTorchJob below uses the `kubeflow` namespace.

From here we can apply the manifests to the cluster.

```sh
kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone"
```
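
Before patching or submitting jobs, it can help to confirm that the operator deployment came up (a quick check; the names below assume the standalone overlay defaults):

```sh
kubectl get pods -n kubeflow
kubectl wait --for=condition=Available deployment/training-operator -n kubeflow --timeout=120s
```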

Then we can patch it with the latest operator image.

```sh
kubectl patch -n kubeflow deployments training-operator --type json -p '[{"op": "replace", "path": "/spec/template/spec/containers/0/image", "value": "kubeflow/training-operator:latest"}]'
```
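
If you want to double-check that the patch took effect, you can read the container image back from the deployment spec:

```sh
kubectl get deployment training-operator -n kubeflow \
  -o jsonpath='{.spec.template.spec.containers[0].image}'
```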

Then we can run the job with the following command.

```sh
kubectl apply -f https://raw.githubusercontent.com/kubeflow/training-operator/master/examples/pytorch/simple.yaml
```
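
Before looking at the logs, you can check that the PyTorchJob was created and that its pods are starting (the job name `pytorch-simple` comes from the example manifest, and the same label selector is used for the logs below):

```sh
kubectl get pytorchjobs -n kubeflow
kubectl get pods -n kubeflow -l training.kubeflow.org/job-name=pytorch-simple
```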

And we can see the output of the job from the logs, which may take some time to produce but should look something like below.

```
$ kubectl logs -n kubeflow -l training.kubeflow.org/job-name=pytorch-simple --follow
Defaulted container "pytorch" out of: pytorch, init-pytorch (init)
@@ -112,12 +125,15 @@ Now that you confirmed you can spin up an operator locally, you can try to test
You do this by building a new operator image and loading it into your kind cluster.

### Build Operator Image

```sh
make docker-build IMG=my-username/training-operator:my-pr-01
```

You can swap `my-username/training-operator:my-pr-01` with whatever you would like.

### Load Docker Image

```sh
kind load docker-image my-username/training-operator:my-pr-01
```
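
To verify the image actually made it onto the cluster node, you can list the images inside the Kind node container (a sketch assuming the default cluster name `kind`, whose control-plane container is `kind-control-plane`):

```sh
docker exec kind-control-plane crictl images | grep training-operator
```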

@@ -128,24 +144,93 @@ kind load docker-image my-username/training-operator:my-pr-01

```sh
cd ./manifests/overlays/standalone
kustomize edit set image my-username/training-operator=my-username/training-operator:my-pr-01
```

Update the `newTag` key in `./manifests/overlays/standalone/kustomization.yaml` with the new image.
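
For reference, after the edit the `images` stanza of the kustomization should look roughly like this (a sketch; the exact `name`/`newName` values depend on what is already in the overlay):

```yaml
images:
  - name: my-username/training-operator
    newName: my-username/training-operator
    newTag: my-pr-01
```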

Deploy the operator with:

```sh
kubectl apply -k ./manifests/overlays/standalone
```
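
As a quick sanity check, you can wait for the rollout to finish before submitting jobs:

```sh
kubectl rollout status deployment/training-operator -n kubeflow --timeout=120s
```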

And now we can submit jobs to the operator.

```sh
kubectl patch -n kubeflow deployments training-operator --type json -p '[{"op": "replace", "path": "/spec/template/spec/containers/0/image", "value": "my-username/training-operator:my-pr-01"}]'
kubectl apply -f https://raw.githubusercontent.com/kubeflow/training-operator/master/examples/pytorch/simple.yaml
```

You should be able to see the logs from the training job with:

```sh
kubectl logs -n kubeflow -l training.kubeflow.org/job-name=pytorch-simple
```

## Testing changes locally without building an image

Building and testing changes through container images can be time-consuming, so here is a simpler method that lets you start and test the operator directly from the command line or from your development tools. Note that this approach only works for clusters created with Kind on your local machine (e.g., on a Mac).

### Install cert-manager and generate a certificate

Deploy cert-manager to manage the webhook's certificate:

```sh
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.15.3/cert-manager.yaml
```
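
cert-manager takes a short while to start; before creating any certificates, you can wait for its deployments to become available:

```sh
kubectl wait --for=condition=Available deployment --all -n cert-manager --timeout=300s
```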

To generate a certificate for local debugging of webhooks using cert-manager, create a `certificate.yaml` file with the following content:

```yaml
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: selfsigned-issuer
  namespace: kubeflow
spec:
  selfSigned: { }

---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: serving-cert # This name should match the one appearing in kustomizeconfig.yaml
  namespace: kubeflow
spec:
  # $(SERVICE_NAME) and $(SERVICE_NAMESPACE) will be substituted by kustomize
  dnsNames:
    - $(SERVICE_NAME).$(SERVICE_NAMESPACE).svc
    - $(SERVICE_NAME).$(SERVICE_NAMESPACE).svc.cluster.local
    - host.docker.internal # host.docker.internal is the hostname for Docker Desktop on macOS
  ipAddresses: # New configuration about node IP addresses
    - "172.17.0.1" # IP address for Docker on Linux
  issuerRef:
    kind: Issuer
    name: selfsigned-issuer
  secretName: webhook-server-cert # This secret will not be prefixed, since it's not managed by kustomize
```

Create the certificate:

```sh
kubectl apply -f certificate.yaml
```
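
You can confirm that cert-manager issued the certificate and created the secret before copying the files out (names follow the manifest above):

```sh
kubectl wait --for=condition=Ready certificate/serving-cert -n kubeflow --timeout=120s
kubectl get secret webhook-server-cert -n kubeflow
```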

The generated `tls.*` files need to be stored in the directory the webhook server reads its serving certificates from, `${TMPDIR}/k8s-webhook-server/serving-certs` (typically `/tmp/k8s-webhook-server/serving-certs` on Linux).

```sh
# Create the directory if it does not exist yet
mkdir -p ${TMPDIR}/k8s-webhook-server/serving-certs
kubectl get secret -n kubeflow webhook-server-cert -o=jsonpath='{.data.tls\.key}' | base64 -d >${TMPDIR}/k8s-webhook-server/serving-certs/tls.key
kubectl get secret -n kubeflow webhook-server-cert -o=jsonpath='{.data.tls\.crt}' | base64 -d >${TMPDIR}/k8s-webhook-server/serving-certs/tls.crt
```
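
If the webhook later rejects connections, it is worth checking that the extracted certificate really contains the expected SANs (for example `host.docker.internal`); assuming `openssl` is installed:

```sh
openssl x509 -in ${TMPDIR}/k8s-webhook-server/serving-certs/tls.crt -noout -text | grep -A1 "Subject Alternative Name"
```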

## Go version

On Ubuntu, the default `go` package appears to be gccgo-go, which has problems (see [this issue](https://github.com/golang/go/issues/15429)). The golang-go package is also quite old, so install Go from the official tarballs instead.
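
For reference, installing Go from the official tarball looks roughly like this (a sketch; pick a current version and architecture from https://go.dev/dl/):

```sh
# Download and unpack the official Go tarball (adjust the version as needed)
curl -LO https://go.dev/dl/go1.22.5.linux-amd64.tar.gz
sudo rm -rf /usr/local/go
sudo tar -C /usr/local -xzf go1.22.5.linux-amd64.tar.gz
export PATH=$PATH:/usr/local/go/bin
go version
```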

## Generate Python SDK

