Apache prefork problems with systemctl #140

Open
calh opened this issue Mar 4, 2022 · 0 comments

I'm investigating an odd and difficult to recreate problem with Apache using prefork MPM, and it seems to only happen inside Docker when using systemctl.

The two main issues I've observed are:

  1. When first starting up, Apache will not fork any new children beyond its StartServers + MinSpareServers setting. Sometimes it performs a single round of forking and then never forks again
  2. When Apache shuts down children via MaxRequestsPerChild to cycle through them, the children become zombies, but still account for an idle slot. Eventually the zombies suck up all of the slots and DoS the whole server
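For anyone less familiar with the term in item 2: a zombie is a child that has exited but has not yet been waited on by its parent. The sketch below is only an illustration of that state, not a reproduction of the Apache problem. It is Linux-only, assumes `/proc` is mounted, and `make_zombie` and `child_state` are hypothetical helper names.

```python
import os


def child_state(pid: int) -> str:
    """Return the one-letter process state from /proc/<pid>/stat (Linux only)."""
    with open(f"/proc/{pid}/stat") as f:
        # The command name is wrapped in parentheses and may contain spaces,
        # so take the first field after the last ')'.
        return f.read().rsplit(")", 1)[1].split()[0]


def make_zombie():
    """Fork a child, observe it as a zombie, then reap it."""
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child exits immediately
    # Block until the child has exited, but leave it unreaped (WNOWAIT).
    os.waitid(os.P_PID, pid, os.WEXITED | os.WNOWAIT)
    state = child_state(pid)  # the child is now defunct
    os.waitpid(pid, 0)  # reaping removes the process table entry
    return state, os.path.exists(f"/proc/{pid}")


if __name__ == "__main__":
    print(make_zombie())
```

Until the parent calls `waitpid`, the exited child keeps its entry in the process table, which is what the `<defunct>` rows in `ps` show.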

Since both of these are intermittent problems, they are really frustrating to isolate and debug. The most reliable way I've found to recreate them is:

Dockerfile

# syntax=docker/dockerfile:1.3-labs
FROM centos:centos7
RUN yum install -y httpd
RUN curl https://raw.githubusercontent.com/gdraheim/docker-systemctl-replacement/master/files/docker/systemctl.py > /usr/bin/systemctl \
  && systemctl enable httpd

COPY --chmod=755 <<EOF /var/www/cgi-bin/sleeper.cgi
#!/bin/bash
/bin/sleep 0.2
echo Content-Type: text/plain
echo
echo Hello World
EOF

COPY <<EOF /etc/httpd/conf.d/extra-config.conf
ExtendedStatus on
<Location /server-status>
 SetHandler server-status
 Order allow,deny
 Deny from none
 Allow from all
</Location>

StartServers       2
MinSpareServers    5
MaxSpareServers 20
ServerLimit     2048
MaxClients      2048
MaxRequestWorkers 2048
MaxRequestsPerChild  10
EOF

# This CMD recreates the issue
CMD ["/usr/bin/systemctl", "-vvv"]
# Comment out the CMD above and uncomment these two lines to see it work fine
#STOPSIGNAL SIGWINCH
#CMD ["/usr/sbin/httpd", "-DFOREGROUND"]

On the client side, something like this gave me the best chance of recreating the problem:

ab -n 1000000 -c 64 http://localhost:8081/cgi-bin/sleeper.cgi

Use no keep-alive requests, and hammer on it right after startup. With a concurrency of 8 it happens more slowly, and it takes a few minutes before the zombies build up and DoS the server.

After things are locked up, the process table looks like this:

  PID TTY      STAT   TIME COMMAND
    1 ?        Ss     0:00 /usr/bin/python2 /usr/bin/systemctl -vvv
    8 ?        Ss     0:00 /usr/sbin/httpd -DFOREGROUND
  625 ?        Z      0:00 [httpd] <defunct>
 1808 ?        Z      0:00 [httpd] <defunct>
 1809 ?        Z      0:00 [httpd] <defunct>
 1811 ?        Z      0:00 [httpd] <defunct>
 1814 ?        Z      0:00 [httpd] <defunct>
 1815 ?        Z      0:00 [httpd] <defunct>
 1821 ?        Z      0:00 [httpd] <defunct>
 1822 ?        Z      0:00 [httpd] <defunct>
 1823 ?        Z      0:00 [httpd] <defunct>
 1828 ?        Z      0:00 [httpd] <defunct>
 1832 ?        Z      0:00 [httpd] <defunct>
 1836 ?        Z      0:00 [httpd] <defunct>
 1840 ?        Z      0:00 [httpd] <defunct>
 1842 ?        Z      0:00 [httpd] <defunct>
. . . 
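To track how quickly the defunct children accumulate, output like the table above can be tallied with a small script. This is a minimal sketch, not part of the original setup: `count_defunct` is a hypothetical helper, and it assumes the `PID TTY STAT TIME COMMAND` column layout shown above.

```python
def count_defunct(ps_output: str, command: str = "httpd") -> int:
    """Count rows whose STAT starts with 'Z' and whose COMMAND mentions `command`."""
    count = 0
    for line in ps_output.splitlines():
        fields = line.split(None, 4)  # PID, TTY, STAT, TIME, COMMAND
        if len(fields) < 5 or not fields[0].isdigit():
            continue  # skip the header, blank lines, and elided rows
        stat, cmd = fields[2], fields[4]
        if stat.startswith("Z") and command in cmd:
            count += 1
    return count


# A few rows copied from the process table above.
SAMPLE = """\
  PID TTY      STAT   TIME COMMAND
    1 ?        Ss     0:00 /usr/bin/python2 /usr/bin/systemctl -vvv
    8 ?        Ss     0:00 /usr/sbin/httpd -DFOREGROUND
  625 ?        Z      0:00 [httpd] <defunct>
 1808 ?        Z      0:00 [httpd] <defunct>
"""

print(count_defunct(SAMPLE))  # prints 2
```

Running this periodically against fresh `ps` output from the container would give a rough growth rate to compare between test permutations.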

And if you can catch the server-status page in time, it looks something like this:

[image: screenshot of the server-status page]

I've been testing this for around a week now, and have gone through many permutations. Nothing has worked so far, but some of my tests at least delayed the inevitable for a while.

Some of the things I've tried:

  • More StartServers and MinSpareServers
  • Changing Apache's systemd unit to use Type=simple instead of Type=notify
  • Adding a sleep call with ExecStartPre to see if there's some kind of race condition with file descriptors
  • Switch CMD to start Apache in the foreground, which seems to work and not recreate the problems

I'm out of ideas on what to try next. Watching the children die via strace suggests it has something to do with waiting on closing file descriptors... but it's difficult to get more information from a zombie process.
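One small piece of information a zombie does still expose is its parent PID, which shows which process is expected to reap it. The sketch below reads that from the text of `/proc/<pid>/status`. The helper names are hypothetical, the sample text is hand-written rather than captured from the affected container, and whether the parent relationship matters here is an assumption, not a finding.

```python
from pathlib import Path


def parse_status(text: str) -> dict:
    """Turn the 'Key: value' lines of /proc/<pid>/status into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def zombie_parent(text: str):
    """Return the parent PID if the status text describes a zombie, else None."""
    fields = parse_status(text)
    if fields.get("State", "").startswith("Z"):
        return int(fields["PPid"])
    return None


def read_status(pid: int) -> str:
    """Read the live status file for `pid` (Linux only)."""
    return Path(f"/proc/{pid}/status").read_text()


# Hand-written sample, not real output from the container.
SAMPLE = "Name:\thttpd\nState:\tZ (zombie)\nPid:\t625\nPPid:\t8\n"
print(zombie_parent(SAMPLE))  # prints 8
```

Inside the container, `zombie_parent(read_status(625))` would show whether a defunct child still belongs to httpd (PID 8 in the table above) or has been reparented to PID 1.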

Any advice or help would be appreciated on this!
