
ziti-router: introduced health checks fail #226

Open
marvkis opened this issue Jun 16, 2024 · 5 comments

@marvkis
Contributor

marvkis commented Jun 16, 2024

Hi,

I tried to upgrade to the latest Helm charts. For my router, the deployment never became active and kept restarting. After some investigation I noticed that the health checks introduced in commit 3c2afc5 failed to succeed. After manually removing them from the deployment, it became active and everything worked as expected.
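
Roughly, the probes in the rendered deployment exec the agent inside the container, something like this (typed from memory, so the exact command and timing values may differ from the chart template):

  livenessProbe:
    exec:
      command:
        - ziti
        - agent
        - stats
    initialDelaySeconds: 10   # illustrative values, not necessarily the chart's defaults
    periodSeconds: 10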

I tried to execute them manually in the pod and it seems ziti agent stats fails to find the ziti router process:

[ziggy@core-router-858f76f8fb-fvtqm ~]$ ziti agent stats
Error: no processes found matching filter, use 'ziti agent list' to list candidates
Usage:
  ziti agent stats [flags]

Flags:
  -a, --app-alias string      Alias of host application to talk to (specified in host application)
  -i, --app-id string         Id of host application to talk to (like controller or router id)
  -t, --app-type string       Type of host application to talk to (like controller or router)
  -h, --help                  help for stats
  -p, --pid uint32            Process ID of host application to talk to
  -n, --process-name string   Process name of host application to talk to
      --tcp-addr string       Type of host application to talk to (like controller or router)
      --timeout duration      Operation timeout (default 5s)


no processes found matching filter, use 'ziti agent list' to list candidates
[ziggy@core-router-858f76f8fb-fvtqm ~]$ ziti agent list
╭─────┬────────────┬────────┬─────────────┬──────────┬─────────────┬───────────╮
│ PID │ EXECUTABLE │ APP ID │ UNIX SOCKET │ APP TYPE │ APP VERSION │ APP ALIAS │
├─────┼────────────┼────────┼─────────────┼──────────┼─────────────┼───────────┤
╰─────┴────────────┴────────┴─────────────┴──────────┴─────────────┴───────────╯

Hint: I'm using security contexts as an additional security layer on my containers:

  podSecurityContext:
    allowPrivilegeEscalation: false
    capabilities:
      drop:
      - ALL
    privileged: false
    readOnlyRootFilesystem: true
    runAsGroup: 2171
    runAsNonRoot: true
    runAsUser: 2171
    seccompProfile:
      type: RuntimeDefault
  securityContext:
    fsGroup: 2171
    runAsNonRoot: true

Might this prevent ziti agent stats from finding its process?

bye,
Chris

@qrkourier
Member

Yes, I think the read-only root filesystem might have prevented placing the IPC socket used by the probe. It would be more elegant to use the controller and router HTTP health checks, but a little work is needed to template those on a separate web listener's binding. That way they'll be published to the kubelet but not to all IP addresses, especially avoiding publishing them alongside the client API, because they could represent a DoS vector.
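
To illustrate, the dedicated listener could look something like this in the router config, assuming the router exposes the same health-checks web binding as the controller; the name, interface, and port here are just placeholders:

web:
  - name: health-checks
    bindPoints:
      - interface: 0.0.0.0:8081   # placeholder; narrowing where this binds is the templating work mentioned above
        address: localhost:8081
    apis:
      - binding: health-checks

The probes would then be httpGet checks against that listener instead of exec'ing the agent inside the container.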

@marvkis
Contributor Author

marvkis commented Jun 17, 2024

What is the path to the IPC socket? We could mount a tmpfs for it as an initial solution before migrating to the HTTP checks.

@qrkourier
Member

That might work. I've only observed the IPC socket in paths like this: /tmp/gops-agent.{pid}.sock.

@marvkis
Contributor Author

marvkis commented Jun 18, 2024

That was it. I just added

additionalVolumes: 
- name: tmp
  volumeType: emptyDir
  mountPath: /tmp

and the health checks started working. What do you think? Should we add an emptyDir on /tmp by default, or should we keep it this way? At the least I would extend it to limit the size of the emptyDir with sizeLimit: 1Mi. Call me paranoid, but just in case something 'unexpected' happens and someone finds a way to upload data to /tmp, the kubelet would evict the pod and we would be informed that something is going on here...
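
In plain Kubernetes terms, the size-limited variant would render to something like this in the pod spec (whether the chart's additionalVolumes passes sizeLimit through is something I'd still have to check):

volumes:
  - name: tmp
    emptyDir:
      sizeLimit: 1Mi   # exceeding the limit makes the kubelet evict the pod
volumeMounts:
  - name: tmp
    mountPath: /tmp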

@qrkourier
Member

It seems reasonable for us chart maintainers to define a non-root emptyDir volume for temp, or to find a way to influence the socket path and ensure a writable volume is available there. I'll support using /tmp until we have a reason not to.
