
Web container crashes when health probes are configured #104

Open
lindhe opened this issue Feb 12, 2024 · 6 comments
Labels
bug Something isn't working

Comments

@lindhe
Contributor

lindhe commented Feb 12, 2024

Package version (if known): 0.2.1 (commit f7fd3c6).

Describe the bug

The crashes of the web container in the web pod seem to begin just after helm install and continue even after wipe_recreate.sh (so it's not just that the system is uninitialized). These are uncaught errors, so it's not very easy to know exactly what goes wrong here. I have attached the logs from the web container here, which show two notable things:

  1. Despite this being Python, we get a segfault. So there's obviously a buggy binary included somewhere here (maybe uwsgi).
  2. It prints SIGINT/SIGTERM received...killing workers... and KeyboardInterrupt. Those signals are not coming from Kubernetes, because we have not configured a liveness probe (only startup and readiness). This indicates to me that there may be some built-in health checks that are getting in the way for us here.

I want to emphasize that while a liveness probe may cause a SIGTERM signal, neither the startup nor the readiness probe will cause that. The only thing they really do is prevent traffic from being routed there via the Service in Kubernetes. This further convinces me that we have some built-in health probe in Invenio and/or uWSGI that freaks out, possibly under the condition that traffic to Invenio's hostname does not reach its destination.
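For reference, the probes in question are exec probes on the web container that poke uWSGI directly with uwsgi_curl, roughly along these lines (a simplified sketch rather than the exact chart template; the Host header is the configured hostname, and the startupProbe runs the same command):

readinessProbe:
  exec:
    command:
      - /bin/bash
      - -c
      - 'uwsgi_curl -H "Host: invenio.example.com" $(hostname):5000 /ping'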

This is a major issue, as far as I'm concerned, and we really need to fix it before this chart can be considered stable.

Steps to Reproduce

/tmp/values.yaml

haproxy:
  enabled: false

host: invenio.example.com

invenio:
  secret_key: "secret-key"
  security_login_salt: "security_login_salt"
  csrf_secret_salt: "csrf_secret_salt"

ingress:
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
  enabled: true
  class: nginx

postgresql:
  enabled: true

search:
  enabled: true

web:
  image: ghcr.io/inveniosoftware/demo-inveniordm/demo-inveniordm@sha256:2193abc2caec9bc599061d6a5874fd2d7d201f55d1673a545af0a0406690e8a4
  replicas: 1
  resources:
    requests:
      cpu: 500m
      memory: 4Gi
    limits:
      cpu: 1000m
      memory: 4Gi

worker:
  image: ghcr.io/inveniosoftware/demo-inveniordm/demo-inveniordm@sha256:2193abc2caec9bc599061d6a5874fd2d7d201f55d1673a545af0a0406690e8a4
  replicas: 1
  resources:
    requests:
      cpu: "2"
      memory: 1Gi
    limits:
      cpu: "2"
      memory: 1Gi

workerBeat:
  resources:
    requests:
      cpu: "2"
      memory: 1Gi
    limits:
      cpu: "2"
      memory: 1Gi

helm install -n invenio-dev invenio ./charts/invenio/ -f /tmp/values.yaml --set 'rabbitmq.auth.password=hn6mqDjygkjhgkhjgzBKrsNNkao,postgresql.auth.password=P6ph7jkhgkjhgkjNkao'

Expected behavior

The web container should be able to have health probes configured without crashing.

Additional context

In #52, @avivace posted a similar issue. Because he talks about "the liveness probe", I assume it's a different issue (since there is no liveness probe configured for the web pod). Unfortunately, the link he posted was not a permalink, so it is broken now and I cannot verify what he means. So I opened this new issue instead.

While working on improving this Helm chart, I have had to remove the health probes every time I install it to test something. For reference, this is what I run to remove the health probes and check that the pod is replaced:

OLD_WEB_RS="$(kubectl get rs -l app=web -o name)"
kubectl patch deployment web --type=json -p='[{"op": "remove", "path": "/spec/template/spec/containers/0/startupProbe"},{"op": "remove", "path": "/spec/template/spec/containers/0/readinessProbe"}]'
kubectl scale "${OLD_WEB_RS:?}" --replicas 0
kubectl rollout status deployment web

Logs

web.log

@lindhe lindhe added the bug Something isn't working label Feb 12, 2024
@avivace
Member

avivace commented Feb 12, 2024

Hi @lindhe, did you mean to reference #52 instead of #51? If so, this is what my link was pointing to at the time:

@lindhe
Contributor Author

lindhe commented Feb 12, 2024

Yes, that's a typo! Sorry. It's #52 I mean.

Ah, right. So you did mean readiness probe and not liveness probe then? Typos are great! 😄

Should I close this ticket then and repost as a comment in your original Issue, or do we keep this one?

@Samk13 Samk13 moved this to To review :hourglass_flowing_sand: in PR Community Mar 4, 2024
@Samk13 Samk13 removed this from PR Community Mar 4, 2024
@mirekys

mirekys commented Apr 11, 2024

Hi @lindhe, the logs you attached suggest it is triggering a known race condition in the greenlet library used by uWSGI when shutting down the Invenio app. That seems to be fixed in greenlet>=3, but your demo image still has:

greenlet==2.0.2

(I also checked the latest demo-inveniordm:12.0.0-beta.1.3 release, but it seems to use the same version and also suffers from segfaults.)

After upgrading it to the latest 3.x:

# Dockerfile
FROM ghcr.io/inveniosoftware/demo-inveniordm/demo-inveniordm:12.0.0-beta.1.3

# Update greenlet to a fixed version
RUN pip install "greenlet>=3"
# See info below
ADD probe.sh /usr/local/bin/probe.sh
RUN chmod +x /usr/local/bin/probe.sh

docker build . -t demo-inveniordm:local

and replaced
...../demo-inveniordm@sha256:2193abc2caec9bc599061d6a5874fd2d7d201f55d1673a545af0a0406690e8a4 with docker.io/library/demo-inveniordm:local.
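In the values file that is just a web image override along these lines (assuming the cluster's container runtime can reach the locally built image):

web:
  image: docker.io/library/demo-inveniordm:local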

I'm now getting some even more interesting parts of the original Python traceback:
web-greenlet3.log

And it's for sure triggered by the uwsgi_curl probe.

UPDATE: I managed to fix the probe. It has to do with the DB not yet being created, the hostname, and the shell env when it is run by k8s. If I put it inside a script like this:

#!/usr/bin/env bash
# probe.sh
/usr/local/bin/uwsgi_curl -X HEAD -H "Host: invenio.example.com" 127.0.0.1:5000 /ping

and change the probe commands to:

exec:
  command:
    - /bin/bash
    - -c
    - "/usr/local/bin/probe.sh"

and initialize the DB by running:

invenio db create

it seems to work properly (do not ask me why 😅)
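For context, wired into the web deployment's container spec this ends up roughly as follows (a sketch only; the container name is an assumption and the chart may lay this out differently):

containers:
  - name: web
    image: docker.io/library/demo-inveniordm:local
    startupProbe:
      exec:
        command: ["/bin/bash", "-c", "/usr/local/bin/probe.sh"]
    readinessProbe:
      exec:
        command: ["/bin/bash", "-c", "/usr/local/bin/probe.sh"]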

@lindhe
Contributor Author

lindhe commented Apr 15, 2024

Excellent findings, thank you for looking into this! I will try to verify and investigate further based on this. I'll be back! 😎

@ntarocco
Contributor

> UPDATE: I managed to fix the probe, it has to do with DB being not yet created, hostname, and shell env when it is run by k8s. [...]

Thank you for checking this and finding that it is related to the probe; I think we were struggling to find the culprit :)
We discussed the DB part in the forum on Discord: all probes will only work after the DB init, basically once the Invenio app is up and running correctly.
BTW, we did not use the bash file but the command directly, so we could easily use a variable for the Host header (see the sketch below).
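Something along these lines, with the Host header taken from the chart's host value instead of a hard-coded hostname (a sketch only; the exact templating may differ, and the value could equally be injected through an env var):

readinessProbe:
  exec:
    command:
      - /bin/bash
      - -c
      - 'uwsgi_curl -H "Host: {{ .Values.host }}" 127.0.0.1:5000 /ping'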

@mirekys

mirekys commented Apr 16, 2024

I share and summarize here my observations from further testing and verification:

  • With threads >= 4 & processes >= 6, it always fails to initialize the Flask app at pretty much random spots when loading entrypoints. This happens on my setup with 16 detected cores.
  • When threads and/or processes are lower (4&5, 3&5, ...), it depends on whether invenio db create has been called:
  1. not called - throws the following SQLAlchemy error and dies miserably:
sqlalchemy.exc.ProgrammingError: (psycopg2.errors.UndefinedTable) relation "banners" does not exist
LINE 2: FROM banners 
             ^

[SQL: SELECT banners.created AS banners_created, banners.updated AS banners_updated, banners.id AS banners_id, banners.message AS banners_message, banners.url_path AS banners_url_path, banners.category AS banners_category, banners.start_datetime AS banners_start_datetime, banners.end_datetime AS banners_end_datetime, banners.active AS banners_active 
FROM banners 
WHERE banners.active IS true AND banners.start_datetime <= %(start_datetime_1)s AND (banners.end_datetime IS NULL OR banners.end_datetime >= %(end_datetime_1)s) AND (banners.url_path IS NULL OR (%(param_1)s LIKE banners.url_path || '%%'))]
[parameters: {'start_datetime_1': datetime.datetime(2024, 4, 16, 11, 28, 16, 684029), 'end_datetime_1': datetime.datetime(2024, 4, 16, 11, 28, 16, 684029), 'param_1': '/ping'}]
(Background on this error at: https://sqlalche.me/e/14/f405)
[pid: 7|app: 0|req: 3/3]  () {16 vars in 188 bytes} [Tue Apr 16 13:28:16 2024] HEAD /ping => generated 0 bytes in 8 msecs (HTTP/1.1 500) 0 headers in 0 bytes (0 switches on core 3)
SIGINT/SIGTERM received...killing workers...
terminate called after throwing an instance of 'std::runtime_error'
  what():  Accessing state after destruction.
worker 1 buried after 1 seconds
  2. not called, using the API probe - when I changed uwsgi_curl to query /api/ping instead of /ping, it works independently of whether the DB is fully initialized or not
  3. called - both the API & UI endpoint probes work
  • It also seems to depend on whether I use $(hostname) or 127.0.0.1 in the uwsgi_curl command:
  1. 127.0.0.1 (localhost) - it works
  2. $(hostname) - dies miserably after the first probe check:
SIGINT/SIGTERM received...killing workers...
terminate called after throwing an instance of 'std::runtime_error'
  what():  Accessing state after destruction.terminate called after throwing an instance of '
std::runtime_error'
[deadlock-detector] a process holding a robust mutex died. recovering...
  what():  Accessing state after destruction.
worker 1 buried after 1 seconds
worker 2 buried after 1 seconds
goodbye to uWSGI.
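Taken together, the combination that seems to survive every scenario above is querying the API blueprint over the loopback address, i.e. a probe roughly like this (same exec-probe layout as above; a sketch, not a tested chart change):

readinessProbe:
  exec:
    command:
      - /bin/bash
      - -c
      - '/usr/local/bin/uwsgi_curl -X HEAD -H "Host: invenio.example.com" 127.0.0.1:5000 /api/ping'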
