-
Notifications
You must be signed in to change notification settings - Fork 21
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Web container crashes when health probes are configured #104
Comments
Yes, that's a typo! Sorry. It's #52 I mean. Ah, right. So you did mean readiness probe and not liveness probe then? Typos are great! 😄 Should I close this ticket then and repost as a comment in your original Issue, or do we keep this one? |
Hi @lindhe, the logs you attached suggests it is triggering some known race condition in greenlet lib used by uwsgi when shutting down invenio app. That seems to be fixed in greenlet>=3, but your demo image still has:
(I checked also the latest After upgrading it to latest 3.x: #Dockerfile
FROM ghcr.io/inveniosoftware/demo-inveniordm/demo-inveniordm:12.0.0-beta.1.3
# Update greenlet to fixed version
RUN pip install greenlet">=3"
# See info below
ADD probe.sh /usr/local/bin/probe.sh
RUN chmod +x /usr/local/bin/probe.sh docker build . -t demo-inveniordm:local and replace: I'm now getting some even more interesting parts of the original Python traceback: And its for sure triggered by uwsgi_curl probe. UPDATE: I managed to fix the probe, it has to do with DB being not yet created, hostname, and shell env when it is run by k8s. If I put it inside script as: #!/usr/bin/env bash
# probe.sh
/usr/local/bin/uwsgi_curl -X HEAD -H "Host: invenio.example.com" 127.0.0.1:5000 /ping
and change probe cmds to: exec:
command:
- /bin/bash
- -c
- "/usr/local/bin/probe.sh" and initialize DB by running: invenio db create it seems to work properly (do not ask me why 😅) |
Excellent findings, thank you for looking into this! I will try to verify and investigate further based on this. I'll be back! 😎 |
Thank you for checking this and finding that it is related to the probe, I think that we were missing to find the perpetrator :) |
I share & summarize here my observations from further testing and verifications:
|
Package version (if known): 0.2.1 (commit f7fd3c6).
Describe the bug
The crashes of the web container in the web pod seems to begins just after
helm install
and continues even afterwipe_recreate.sh
(so it's not just that the system is unintialized). These are uncaught errors, so it's not very easy to know exactly what goes wrong here. I have attached the logs from the web container here, which shows two notable things:SIGINT/SIGTERM received...killing workers...
andKeyboardInterrupt
. Those signals are not coming from Kubernetes, because we have not configured a liveness probe (only startup and readiness). This indicates to me that there may be some built in health checks that's getting in the way for us here.I want to emphasize that while a liveness probe may cause a
SIGTERM
signal, neither the startup nor readiness probes will cause that. The only thing they do, really, is to prevent traffic being routed there via the Service in Kubernetes. This further convinces me that we have a some built-in health probe in Invenio and/or uWSGI that freaks out, possibly under the condition that traffic to Invenio's hostname does not reach its destination.This is a major issue, as far as I'm concerned, and we really need to fix it before this chart can be considered stable.
Steps to Reproduce
/tmp/values.yaml
helm install -n invenio-dev invenio ./charts/invenio/ -f /tmp/values.yaml --set 'rabbitmq.auth.password=hn6mqDjygkjhgkhjgzBKrsNNkao,postgresql.auth.password=P6ph7jkhgkjhgkjNkao'
Expected behavior
The web container should be able to have health probes configured without crashing.
Additional context
In #52, @avivace has posted a similar issue. Because he talks about "the liveness probe", I'm assume it's a different issue (since there is no liveness probe configured for the web pod). Unfortunately, the link he posted was not a permalink so it is broken now and I cannot verify what he means. So I opened this new issue instead.
When I've been working on improving this Helm chart, I have had to remove the health probes every time I install it to test something. For reference, this is what I run to remove the health probes and check that the pod is replaced:
Logs
web.log
The text was updated successfully, but these errors were encountered: