
Web container crashes when health probes are configured #104

Open
lindhe opened this issue Feb 12, 2024 · 6 comments
Labels
bug Something isn't working

Comments

@lindhe
Contributor

lindhe commented Feb 12, 2024

Package version (if known): 0.2.1 (commit f7fd3c6).

Describe the bug

The crashes of the web container in the web pod seem to begin just after helm install and continue even after wipe_recreate.sh (so it's not just that the system is uninitialized). These are uncaught errors, so it's not very easy to know exactly what goes wrong here. I have attached the logs from the web container here, which show two notable things:

  1. Despite this being Python, we get a segfault. So there's obviously a buggy binary included somewhere here (maybe uwsgi).
  2. It prints SIGINT/SIGTERM received...killing workers... and KeyboardInterrupt. Those signals are not coming from Kubernetes, because we have not configured a liveness probe (only startup and readiness). This indicates to me that there may be some built-in health checks that are getting in the way for us here.

I want to emphasize that while a liveness probe may cause a SIGTERM signal, neither the startup nor the readiness probe will cause that. The only thing they really do is prevent traffic from being routed there via the Service in Kubernetes. This further convinces me that we have some built-in health probe in Invenio and/or uWSGI that freaks out, possibly under the condition that traffic to Invenio's hostname does not reach its destination.
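For reference, the probes in question are exec probes on the web container that poke uWSGI directly with uwsgi_curl, roughly along these lines (a simplified sketch rather than the exact chart template; the Host header is the configured hostname, and the startupProbe runs the same command):

readinessProbe:
  exec:
    command:
      - /bin/bash
      - -c
      - 'uwsgi_curl -H "Host: invenio.example.com" $(hostname):5000 /ping'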

This is a major issue, as far as I'm concerned, and we really need to fix it before this chart can be considered stable.

Steps to Reproduce

/tmp/values.yaml

haproxy:
  enabled: false

host: invenio.example.com

invenio:
  secret_key: "secret-key"
  security_login_salt: "security_login_salt"
  csrf_secret_salt: "csrf_secret_salt"

ingress:
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
  enabled: true
  class: nginx

postgresql:
  enabled: true

search:
  enabled: true

web:
  image: ghcr.io/inveniosoftware/demo-inveniordm/demo-inveniordm@sha256:2193abc2caec9bc599061d6a5874fd2d7d201f55d1673a545af0a0406690e8a4
  replicas: 1
  resources:
    requests:
      cpu: 500m
      memory: 4Gi
    limits:
      cpu: 1000m
      memory: 4Gi

worker:
  image: ghcr.io/inveniosoftware/demo-inveniordm/demo-inveniordm@sha256:2193abc2caec9bc599061d6a5874fd2d7d201f55d1673a545af0a0406690e8a4
  replicas: 1
  resources:
    requests:
      cpu: "2"
      memory: 1Gi
    limits:
      cpu: "2"
      memory: 1Gi

workerBeat:
  resources:
    requests:
      cpu: "2"
      memory: 1Gi
    limits:
      cpu: "2"
      memory: 1Gi

helm install -n invenio-dev invenio ./charts/invenio/ -f /tmp/values.yaml --set 'rabbitmq.auth.password=hn6mqDjygkjhgkhjgzBKrsNNkao,postgresql.auth.password=P6ph7jkhgkjhgkjNkao'

Expected behavior

The web container should be able to have health probes configured without crashing.

Additional context

In #52, @avivace posted a similar issue. Because he talks about "the liveness probe", I assume it's a different issue (since there is no liveness probe configured for the web pod). Unfortunately, the link he posted was not a permalink, so it is broken now and I cannot verify what he means. So I opened this new issue instead.

While working on improving this Helm chart, I have had to remove the health probes every time I install it to test something. For reference, this is what I run to remove the health probes and check that the pod is replaced:

OLD_WEB_RS="$(kubectl get rs -l app=web -o name)"
kubectl patch deployment web --type=json -p='[{"op": "remove", "path": "/spec/template/spec/containers/0/startupProbe"},{"op": "remove", "path": "/spec/template/spec/containers/0/readinessProbe"}]'
kubectl scale "${OLD_WEB_RS:?}" --replicas 0
kubectl rollout status deployment web

Logs

web.log

@lindhe lindhe added the bug Something isn't working label Feb 12, 2024
@avivace
Member

avivace commented Feb 12, 2024

Hi @lindhe, did you mean to reference #52 instead of #51? If so, this is what my link was pointing to at the time:

@lindhe
Contributor Author

lindhe commented Feb 12, 2024

Yes, that's a typo! Sorry. It's #52 I mean.

Ah, right. So you did mean readiness probe and not liveness probe then? Typos are great! 😄

Should I close this ticket then and repost as a comment in your original Issue, or do we keep this one?

@Samk13 Samk13 moved this to To review :hourglass_flowing_sand: in PR Community Mar 4, 2024
@Samk13 Samk13 removed this from PR Community Mar 4, 2024
@mirekys

mirekys commented Apr 11, 2024

Hi @lindhe, the logs you attached suggest it is triggering a known race condition in the greenlet library used by uWSGI when shutting down the Invenio app. That seems to be fixed in greenlet>=3, but your demo image still has:

greenlet==2.0.2

(I also checked the latest demo-inveniordm:12.0.0-beta.1.3 release, but it seems to use the same version and also suffers from segfaults.)

After upgrading it to the latest 3.x:

# Dockerfile
FROM ghcr.io/inveniosoftware/demo-inveniordm/demo-inveniordm:12.0.0-beta.1.3

# Update greenlet to a fixed version
RUN pip install "greenlet>=3"
# See info below
ADD probe.sh /usr/local/bin/probe.sh
RUN chmod +x /usr/local/bin/probe.sh

docker build . -t demo-inveniordm:local

and replaced
...../demo-inveniordm@sha256:2193abc2caec9bc599061d6a5874fd2d7d201f55d1673a545af0a0406690e8a4 with docker.io/library/demo-inveniordm:local.
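In the values file that is just a web image override along these lines (assuming the cluster's container runtime can reach the locally built image):

web:
  image: docker.io/library/demo-inveniordm:local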

I'm now getting some even more interesting parts of the original Python traceback:
web-greenlet3.log

And it's for sure triggered by the uwsgi_curl probe.

UPDATE: I managed to fix the probe. It has to do with the DB not yet being created, the hostname, and the shell env when it is run by k8s. If I put it inside a script like this:

#!/usr/bin/env bash
# probe.sh
/usr/local/bin/uwsgi_curl -X HEAD -H "Host: invenio.example.com" 127.0.0.1:5000 /ping

and change the probe commands to:

exec:
  command:
    - /bin/bash
    - -c
    - "/usr/local/bin/probe.sh"

and initialize the DB by running:

invenio db create

it seems to work properly (do not ask me why 😅)
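For context, wired into the web deployment's container spec this ends up roughly as follows (a sketch only; the container name is an assumption and the chart may lay this out differently):

containers:
  - name: web
    image: docker.io/library/demo-inveniordm:local
    startupProbe:
      exec:
        command: ["/bin/bash", "-c", "/usr/local/bin/probe.sh"]
    readinessProbe:
      exec:
        command: ["/bin/bash", "-c", "/usr/local/bin/probe.sh"]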

@lindhe
Contributor Author

lindhe commented Apr 15, 2024

Excellent findings, thank you for looking into this! I will try to verify and investigate further based on this. I'll be back! 😎

@ntarocco
Contributor

> UPDATE: I managed to fix the probe, it has to do with DB being not yet created, hostname, and shell env when it is run by k8s. [...]

Thank you for checking this and finding that it is related to the probe; I think we were struggling to find the culprit :)
We discussed the DB part in the forum on Discord: all probes will only work after the DB init, basically once the Invenio app is up and running correctly.
BTW, we did not use the bash file but the command directly, so we could easily use a variable for the Host header (see the sketch below).
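Something along these lines, with the Host header taken from the chart's host value instead of a hard-coded hostname (a sketch only; the exact templating may differ, and the value could equally be injected through an env var):

readinessProbe:
  exec:
    command:
      - /bin/bash
      - -c
      - 'uwsgi_curl -H "Host: {{ .Values.host }}" 127.0.0.1:5000 /ping'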

@mirekys

mirekys commented Apr 16, 2024

I share and summarize here my observations from further testing and verification:

  • With threads >= 4 & processes >= 6, it always fails to initialize the Flask app at pretty much random spots when loading entrypoints. This happens on my setup with 16 detected cores.
  • When threads and/or processes are lower (4&5, 3&5, ...), it depends on whether invenio db create has been called:
  1. not called - throws the following SQLAlchemy error and dies miserably:
sqlalchemy.exc.ProgrammingError: (psycopg2.errors.UndefinedTable) relation "banners" does not exist
LINE 2: FROM banners 
             ^

[SQL: SELECT banners.created AS banners_created, banners.updated AS banners_updated, banners.id AS banners_id, banners.message AS banners_message, banners.url_path AS banners_url_path, banners.category AS banners_category, banners.start_datetime AS banners_start_datetime, banners.end_datetime AS banners_end_datetime, banners.active AS banners_active 
FROM banners 
WHERE banners.active IS true AND banners.start_datetime <= %(start_datetime_1)s AND (banners.end_datetime IS NULL OR banners.end_datetime >= %(end_datetime_1)s) AND (banners.url_path IS NULL OR (%(param_1)s LIKE banners.url_path || '%%'))]
[parameters: {'start_datetime_1': datetime.datetime(2024, 4, 16, 11, 28, 16, 684029), 'end_datetime_1': datetime.datetime(2024, 4, 16, 11, 28, 16, 684029), 'param_1': '/ping'}]
(Background on this error at: https://sqlalche.me/e/14/f405)
[pid: 7|app: 0|req: 3/3]  () {16 vars in 188 bytes} [Tue Apr 16 13:28:16 2024] HEAD /ping => generated 0 bytes in 8 msecs (HTTP/1.1 500) 0 headers in 0 bytes (0 switches on core 3)
SIGINT/SIGTERM received...killing workers...
terminate called after throwing an instance of 'std::runtime_error'
  what():  Accessing state after destruction.
worker 1 buried after 1 seconds
  2. not called, using the API probe - when I changed uwsgi_curl to query /api/ping instead of /ping, it works independently of whether the DB is fully initialized or not
  3. called - both the API & UI endpoint probes work
  • It also seems to depend on whether I use $(hostname) or 127.0.0.1 in the uwsgi_curl command:
  1. 127.0.0.1 (localhost) - it works
  2. $(hostname) - dies miserably after the first probe check:
SIGINT/SIGTERM received...killing workers...
terminate called after throwing an instance of 'std::runtime_error'
  what():  Accessing state after destruction.terminate called after throwing an instance of '
std::runtime_error'
[deadlock-detector] a process holding a robust mutex died. recovering...
  what():  Accessing state after destruction.
worker 1 buried after 1 seconds
worker 2 buried after 1 seconds
goodbye to uWSGI.
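Taken together, the combination that seems to survive every scenario above is querying the API blueprint over the loopback address, i.e. a probe roughly like this (same exec-probe layout as above; a sketch, not a tested chart change):

readinessProbe:
  exec:
    command:
      - /bin/bash
      - -c
      - '/usr/local/bin/uwsgi_curl -X HEAD -H "Host: invenio.example.com" 127.0.0.1:5000 /api/ping'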
