
issues with self hosted kubernetes deployment (zrok, ziti-controller, ziti-router) #272

Open
pavars opened this issue Nov 8, 2024 · 10 comments

@pavars

pavars commented Nov 8, 2024

Hi,

I'm trying to deploy self-hosted zrok with OpenZiti. The idea and the product seem nice, but there is a clear lack of documentation for a properly secured, working configuration, and it feels like this is still at the PoC stage. First of all, the Helm charts don't support adding extra environment variables from Secret mounts (we use the external-secrets operator to pull secret data from GCP Secret Manager so we don't expose keys and secrets in plaintext manifests), which means the enrollment JWT and the ziti admin secret have to be passed into the Helm chart as plaintext values.
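
What I'd like to be able to do looks roughly like this; note these are hypothetical values I'm proposing, not keys the charts support today:

        # hypothetical values -- not keys the charts support today
        existingSecret: zrok-controller-secrets     # Secret created by external-secrets from GCP Secret Manager
        extraEnvFrom:
          - secretRef:
              name: zrok-controller-secrets         # would surface e.g. the admin token and enrollment JWT as env vars
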
Secondly, the Helm hooks which create users etc. sometimes misbehave when deploying with ArgoCD. Because of all the configuration issues I keep redeploying zrok/ziti; users get created as part of the bootstrap process, and this leads to config drift between the secrets and the ziti controller. The biggest issue is that secrets get regenerated, so what is written in the Ziti controller database doesn't match what is in the Kubernetes secrets and the initial login doesn't work. I see that zrok supports a Postgres database, so it can be scaled horizontally and still retain the same data, but the config section responsible for the data store doesn't provide any flexibility to make the required changes; it is hardcoded to use sqlite3. It is also unclear whether ziti-router is needed or whether it is enough to expose the ziti-controller edge API as a LoadBalancer service (the docs say one thing, but after testing my conclusion is that ziti-router is required).
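
For reference, the store section I'd want to be able to render in ctrl.yaml looks roughly like this (my sketch based on zrok's Postgres support; the connection values are placeholders):

        # sketch of a zrok ctrl.yaml store stanza for Postgres; connection values are placeholders
        store:
          type: postgres
          path: "host=postgres.zrok.svc port=5432 user=zrok password=CHANGEME dbname=zrok"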

I tried mounting the enrollment JWT as an additionalVolume and set .Values.enrollJwtFile to the mounted file, but that fails miserably. I can see and read the mounted token on the pod filesystem, but for some reason the zrok controller fails; falling back to setting the same token explicitly in enrollmentJwt works fine. I might be wrong, but it feels like the enrollmentJwt for ziti-router could also be bootstrapped from a script, so there would be no need to manually log in to the ziti controller and create the router.

Another problem I ran into was creating new identities and configs. When zrok initially starts, it tries to bootstrap and create the required identity; I hit an issue where the identity "public" already existed, so I had to manually delete the identity and create it again. Additionally, the new identity was created with ID -D3xLHGw2, and when the zrok frontend tried to start it failed because it doesn't recognise the configuration flag "-D3xLHGw2" passed on the CLI; this needs proper escaping, as it seems to be an edge case. In a DR scenario where these resources get recreated, all persistent data would be lost and all clients would have to reauthenticate with new tokens/passwords, correct me if I'm wrong.
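
For illustration, the usual fix is the `--` end-of-options separator so that a value with a leading hyphen is read as a positional argument rather than a flag (a generic sketch, not the exact failing command):

        # without "--" the ID is parsed as an unknown flag and the command fails
        zrok admin create frontend -D3xLHGw2 public "https://{token}.zrok.dev.company"
        # with "--" everything after it is treated as positional arguments
        zrok admin create frontend -- -D3xLHGw2 public "https://{token}.zrok.dev.company"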

The initial config might be good enough to serve zrok/ziti locally, but it is far from production-ready, or even from serving dev resources on a GKE cluster.

Below is our current config for the Helm charts; however, we will probably have to keep our own version of these charts since they seem to be lacking vital configuration options. My only concern is with maintaining the scripts which are called for bootstrap etc. I could open a PR for the Helm charts to add proper support for mounting envFrom secrets/configmaps when existingSecret is defined, and also an option to configure zrok's ctrl.yaml with a Postgres DB.

--- ziti-controller values
        clientApi:
          advertisedHost: ziti-controller-client.ziti
          advertisedPort: 443
          service:
            enabled: true
            type: ClusterIP
          ingress:
            enabled: false
        ctrlPlaneCasBundle:
          namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: "ziti"
        trust-manager:
          enabled: true
          crds:
            enabled: true
          app:
            trust:
              namespace: ziti
              
--- ziti-router values
        advertisedHost: ziti.dev.company
        # enrollJwtFile: /etc/enrollment-jwt/enrollmentJwt # NOT working
        enrollmentJwt: plainTextToken
        edge:
          advertisedPort: 443
          service:
            enabled: true
            type: ClusterIP
        ctrl:
          endpoint: ziti-controller-ctrl.ziti:443
        tunnel:
          mode: host
        additionalVolumes:
          - name: enrollment-jwt
            volumeType: secret
            mountPath: /etc/enrollment-jwt
            secretName: ziti-router-secret
        env:
          DEBUG: "1"
          
--- ziti-router IngressRouteTCP (for Traefik)
            apiVersion: traefik.containo.us/v1alpha1
            kind: IngressRouteTCP
            metadata:
              name: ziti-router-ingress-dev
              namespace: ziti
              annotations:
                kubernetes.io/ingress.class: traefik-public
            spec:
              entryPoints:
                - websecure
              routes:
              - match: HostSNI(`ziti.dev.company`)
                services:
                  - name: ziti-router-edge
                    port: 443
              tls:
                passthrough: true

--- zrok values
        controller:
          ingress:
            enabled: true
            className: "traefik-public"
            hosts:
              - "zrok.dev.company"
            tls:
               - secretName: wildcard-cert
                 hosts:
                   - zrok.dev.company
        frontend:
          ingress:
            enabled: true
            className: "traefik-public"
            tls:
               - secretName: wildcard-cert
        ziti:
          advertisedHost: ziti-controller-client.ziti
          password: plainTextPassword

        dnsZone: "zrok.dev.company"
@qrkourier qrkourier self-assigned this Nov 8, 2024
@qrkourier
Member

the new identity was created with ID -D3xLHGw2 and when zrok frontend tried to start it was failing because it doesn't recognise configuration flag "-D3xLHGw2" passed on cli, this needs some proper escaping as seems like this is one of edge cases

Pull request to mitigate leading hyphens in Ziti ID strings - #274

Issue to raise concern about the underlying problem - openziti/ziti#2534

@qrkourier
Member

qrkourier commented Nov 12, 2024

helm charts don't support adding extraEnv variables from Secret Mounts (We are using external-secrets operator that pulls in secret data from GCP Secret Manager so we don't expose keys and secrets in plaintext manifests) which means that enrollmentJWT, ziti admin secret needs to be passed into helm chart as plaintext values

I see that there is support for postgres database for Zrok, so it can be scaled horizontally and still retain the same data, however the config part responsible for data store doesn't provide any flexibility to make required changes, it is hardcoded to use sqlite3.

I could open a PR for helm charts to include support for mount envFrom: secrets/ configmaps properly if existingSecret is defined and also option to configure zrok ctrl.yaml with postgres DB.

That would be most welcome. ☺️

There's a pattern in the ziti-controller and ziti-router charts for mounting additional volumes, but you may already have a better-established way to supply extraEnv vars from secret mounts, existing identities, or both.
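
For example, the kind of wiring I'd expect such a change to render in the Deployment looks roughly like this (illustrative names; existingSecret is not a current chart input, and the env var name is just an example):

        # sketch of a rendered container spec consuming an existing Secret (illustrative)
        envFrom:
          - secretRef:
              name: {{ .Values.existingSecret }}     # e.g. managed by external-secrets
        env:
          - name: ZITI_ADMIN_PASSWORD                # illustrative variable name
            valueFrom:
              secretKeyRef:
                name: {{ .Values.existingSecret }}
                key: admin-password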

Another user reported this issue too in #273

My only concern is with maintaining scripts which are called for bootstrap etc.

That's understandable. The scripts for the zrok controller and zrok frontend show a bias for simplicity at the cost of flexibility. If we need significantly more flexibility, it would be wise to consider whether there's another approach better suited than shell scripts. I'm reluctant to try to accomplish too much with shell scripts because they can become quite challenging to maintain.

@qrkourier
Member

qrkourier commented Nov 12, 2024

Secondly, the helm hooks which create users, etc sometimes misbehave when deploying with ArgoCD. ...snip... Biggest issue is that secrets get regenerated and what is written in Ziti controller database doesn't match up to what is in the K8S secrets so initial login doesn't work.

Do you have any theories about how this breaks with ArgoCD? Does ArgoCD significantly depart from the typical workflow of running the Helm CLI to render templates and calling the Kube API to create the Helm release?

Zrok initially starts it tries to bootstrap and create required identity - ran into issue that identity "public" already exists

This sounds like it could stem from the same root. I haven't seen the problem you're describing myself, so I'm guessing it could be related to how ArgoCD works.

Here's the part of bootstrap-frontend.bash that atomically provisions the first zrok account if the account token secret does not already exist.

        # granted permission to read secrets in namespace by SA managed by this chart
        if kubectl -n {{ .Release.Namespace }} get secret \
            {{ include "zrok.fullname" . }}-ziggy-account-token &>/dev/null; then
            echo "INFO: ziggy account enable token secret exists"
        else
            echo "INFO: ziggy account enable token secret does not exist, creating secret"
            # create a default user account named "ziggy" and save the enable token in a Secret resource
            zrok admin create account \
                ziggy@{{ .Values.dnsZone }} \
                {{ $ziggyPassword | b64dec | quote }} \
            | xargs -I TOKEN kubectl -n {{ .Release.Namespace }} create secret generic \
                {{ include "zrok.fullname" . }}-ziggy-account-token \
                --from-literal=token=TOKEN
            # xargs -r is NOT used here because this command must fail loudly if the account token was not created
        fi

And, here's the part of that script that creates the zrok "public" frontend if it does not already exist in Ziti.

        # if default "public" frontend already exists
        ZROK_PUBLIC_TOKEN=$(getZrokPublicFrontend token)
        if [[ -n "${ZROK_PUBLIC_TOKEN}" ]]; then
            
            # ensure the Ziti ID of the public frontend's identity is the same in Ziti and zrok
            ZROK_PUBLIC_ZID=$(getZrokPublicFrontend zid)
            if [[ "${ZITI_PUBLIC_ID}" != "${ZROK_PUBLIC_ZID}" ]]; then
                echo "ERROR: existing Ziti Identity named 'public' with id '$ZITI_PUBLIC_ID' is from a previous zrok"\
                "instance life cycle. Delete it then re-run zrok." >&2
                exit 1
            fi

            echo "INFO: updating frontend"
            zrok admin update frontend "${ZROK_PUBLIC_TOKEN}" \
                --url-template "{{ .Values.frontend.ingress.scheme }}://{token}.{{ .Values.dnsZone }}"
        else
            echo "INFO: creating frontend"
            zrok admin create frontend -- "${ZITI_PUBLIC_ID}" public \
                "{{ .Values.frontend.ingress.scheme }}://{token}.{{ .Values.dnsZone }}"
        fi

@qrkourier
Member

it is also unclear whether ziti-router is needed or is it enough to set ziti-controller-edge api as LoadBalancer service

A zrok instance requires a Ziti network, and a Ziti network requires at least one router and controller. The router(s) and controller(s) are typically separate deployments, and we're starting to explore using StatefulSets to describe multi-router and multi-controller deployments.

feels like enrollmentJwt for ziti-router could also be bootstrapped from a script, so there is no need to manually login to ziti controller and create the router.

I was thinking the same thing but never finished working on that branch. I like the idea of the Ziti controller immediately creating a first router named something like "default" or "public" and storing the enrollment token in a K8S secret, to simplify the router deployment that typically follows on its heels. Another option I have in mind is a separate umbrella chart like "ziti-stack" that ties the router enrollment to the controller deployment. That might work, but an Operator feels like the better tool for the job of automating life cycle, ops, etc.
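
Roughly, a controller-side hook could do something like this (a sketch of the idea, not an existing script; the router and secret names are placeholders):

        # sketch: create a first router and stash its enrollment JWT in a Secret for the router chart to consume
        ziti edge create edge-router default \
          --role-attributes public --tunneler-enabled \
          --jwt-output-file /tmp/default-router.jwt
        kubectl -n {{ .Release.Namespace }} create secret generic ziti-router-enrollment-jwt \
          --from-file=enrollmentJwt=/tmp/default-router.jwt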

@qrkourier
Member

set .Values.enrollJwtFile to the mounted volume but that fails

Now I see enrollJwtFile is obsolete. Whatever strategy emerges for mounting extra secrets will easily adapt to meet the same need that value must've originally met. For example, if input value existingSecretEnrollmentJwt is passed, then the template should mount that secret on a predictable path and use it during enrollment.
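
Sketching that idea (existingSecretEnrollmentJwt is a proposed value, not an existing chart input):

        # proposed value
        existingSecretEnrollmentJwt: ziti-router-secret   # Secret with key "enrollmentJwt"

        # which the template would mount on a predictable path, roughly:
        volumes:
          - name: enrollment-jwt
            secret:
              secretName: {{ .Values.existingSecretEnrollmentJwt }}
        volumeMounts:
          - name: enrollment-jwt
            mountPath: /etc/enrollment-jwt
            readOnly: true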

pull request to prune the obsolete value: #275

@pavars
Author

pavars commented Nov 20, 2024

Sorry for the late response; after some fiddling around I managed to start zrok with ziti. The ziti-controller needs to start first, then the ziti-router, plus the router policies need to be created on the ziti-controller, and only then can zrok start, which in turn will successfully create a private/public share.

ziti edge create edge-router router-dev \
  --role-attributes "public" --tunneler-enabled --jwt-output-file /tmp/router-dev.jwt

ziti edge create edge-router-policy all-endpoints-public-routers --edge-router-roles "#public" --identity-roles "#all"

ziti edge create service-edge-router-policy all-routers-all-services --edge-router-roles "#all" --service-roles "#all"

Do you have any theories how this breaks with ArgoCD? Does ArgoCD significantly depart from the typical workflow of running the helm CLI to render templates and call KubeAPI to create the Helm Release?

Issues with ArgoCD mostly arose when deleting resources, which in turn also deleted the secret and the PVC storing the sqlite database; zrok then tried to bootstrap once again, but the identities already existed in ziti, which caused the error. In theory the hooks and all the resources could be managed better with a Kubernetes operator pattern and some custom CRDs; I'm not very experienced with that, but it would probably make the most sense. An umbrella chart would probably be easier to maintain but gives less flexibility. ArgoCD essentially renders the Helm chart with helm template and applies those manifests with kubectl; most hooks work the same way and are mapped to Argo CD hooks on injection.
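
For example, the mapping I mean looks roughly like this, as far as I understand how Argo CD translates Helm hook annotations:

        # Helm hook annotation on a bootstrap Job...
        annotations:
          "helm.sh/hook": post-install,post-upgrade
        # ...is treated by Argo CD roughly as:
        annotations:
          argocd.argoproj.io/hook: PostSync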

That would be most welcome. ☺️

There's a pattern in the ziti-controller and ziti-router charts for mounting additional volumes, but you may have a better way already established you could use for extraEnv vars from secret mounts or existing identities, or both.

I will try to get to it this week and open PR.

@pavars
Author

pavars commented Nov 21, 2024

There might be a problem with setting the db password for zrok using ArgoCD. I am not sure whether zrok supports environment variables in the ctrl.yaml file, which is generated here: https://github.com/openziti/helm-charts/blob/main/charts/zrok/templates/controller-secrets-configmap.yaml#L272

I wanted to use a lookup for the secret to replace the db password value, but I'm afraid that won't work with ArgoCD: argoproj/argo-cd#5202

An easier way would be to just set env variables there and have the application read them from the environment. If that is not an option, then as a dirty workaround an initContainer could expand the config with envsubst and mount it into the zrok pod.
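
Something along these lines is what I mean by the envsubst workaround (a sketch; the image, secret and volume names are placeholders):

        # sketch: initContainer renders ctrl.yaml from a template before the zrok controller starts
        initContainers:
          - name: render-ctrl-config
            image: alpine:3.20                       # any image with envsubst (gettext) would do
            command:
              - sh
              - -c
              - apk add --no-cache gettext && envsubst < /in/ctrl.yaml > /out/ctrl.yaml
            env:
              - name: ZROK_DB_PASSWORD               # referenced as ${ZROK_DB_PASSWORD} in the template
                valueFrom:
                  secretKeyRef:
                    name: zrok-db-credentials
                    key: password
            volumeMounts:
              - name: ctrl-config-template           # ConfigMap containing the templated ctrl.yaml
                mountPath: /in
              - name: ctrl-config-rendered           # emptyDir shared with the zrok controller container
                mountPath: /out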

@qrkourier
Member

Correct, zrok doesn't support env vars in its configs yet. Here are a couple of GitHub issues tracking improved config handling, including env vars:

I used envsubst for the Docker zrok sample.

Does this accurately summarize the password issue with ArgoCD?

ArgoCD consumes and applies the manifests generated by the helm template command. The template command does not query the Kube API, so any Helm/Sprig functions in the templates are unable to implement logic like "generate a password unless the mySecretPassword Secret resource already exists." In this scenario a new password is always generated, so there's a mismatch between the assumptions built into the templates and the assumptions of a template-to-GitOps workflow like Helm+ArgoCD.
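
Concretely, the kind of template logic that breaks under helm template is the common lookup-or-generate pattern (an illustrative snippet, not the exact template in these charts):

        {{/* lookup returns an empty result under `helm template`, so the else branch always runs */}}
        {{- $existing := lookup "v1" "Secret" .Release.Namespace "ziti-controller-admin-secret" }}
        {{- if $existing }}
        admin-password: {{ index $existing.data "admin-password" }}
        {{- else }}
        admin-password: {{ randAlphaNum 32 | b64enc }}
        {{- end }}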

@pavars
Author

pavars commented Nov 26, 2024

Does this accurately summarize the password issue with ArgoCD?

Yes, if the secret exists it shouldn't try to recreate it, but since the Helm lookup doesn't work properly there, the password keeps being regenerated and ArgoCD shows ziti-controller as out-of-sync.
[screenshot: ArgoCD showing the ziti-controller app out-of-sync]

Another issue is with hooks: in the case of zrok there is a pre-delete hook which is not actually supported by ArgoCD and should probably be moved to a post-delete hook; I don't see how that would break anything.

@qrkourier
Member

I couldn't think of a way to refactor the charts to be compatible with a GitOps workflow without adding manual steps to the main Helm-driven workflow, which involves calling the Kube API to manage existing resources and trigger life cycle hooks. I'm not giving up on GitOps by any means.

In the meantime, maybe you could insert a Kustomize step in your GitOps workflow like this (a sketch of the Kustomize piece follows the list):

  1. Render the manifest with helm template
  2. On the first run, save the generated values in a patch.yaml file
  3. On subsequent runs, patch the manifest with Kustomize's patchesStrategicMerge from patch.yaml
  4. Commit the manifest to Git
  5. Push to the Git remote for ArgoCD to apply
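
A minimal kustomization.yaml for steps 2-3 could look roughly like this (a sketch; the file names are placeholders):

        # sketch: overlay the rendered Helm output with the values captured on the first run
        resources:
          - rendered-manifest.yaml        # output of `helm template`
        patchesStrategicMerge:
          - patch.yaml                    # e.g. pins the generated admin password Secret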
