TCP connection resets when CPU is limited #4164
Comments
Edit: Fixed the link to the demo app.
I expect that in the real application you actually consume those request bodies? This seems like a problem with the sample. Do you see all 600 connections coming through and actually hitting the service, or does the GCP load balancer terminate them and open fewer connections to the backend? Is there any particular reason you are tuning the max connections up from the default (rather than down) when resources are limited?
Thanks for looking at this @johanandren! I appreciate this is a tricky one to investigate.
The akka-http framework might start to consume the request bodies, but our application code (the route handler) does not receive the requests. I know this because in testing I can match up the number of requests sent with the number of requests logged by our http service.
I believe that all 600 connections reach the service. There are a few reasons I think this:
The load balancer does not do fewer connections. It just keeps sending a split of the traffic as long as the backend service is healthy.
During normal operation (i.e. after the problematic first minute), the akka-http backend server can handle many hundreds of requests per second, even with the limited CPU. We find we need to increase the number of connections in order to make full use of the CPU. If we do not increase the number of connections, then much of the CPU is unused, which is wasteful. We would have to run more servers to handle the traffic, whereas we want to run as few servers as possible and maximise CPU usage on each server. But that all refers to the second minute and onwards. It's only in the first minute that we get the 502 responses.
Thanks for the extra details. That the RSTs are reproducible locally, without a load balancer in front, should at least simplify the investigation a little.
I am finding that when an akka-http server is run with very limited CPU, we get some TCP RSTs in the period soon after clients first start sending requests. These are the conditions that cause the errors:
Most requests are successful, but for some requests I see this error message in the client logs: "connection reset by peer". The errors all come within the first minute of the clients sending requests; after that there are no more errors. There are no errors in the server logs, even with debug logging enabled.
Demo
I created this extremely simple server app to demonstrate the problem. It uses akka-http version 10.2.10 and akka version 2.6.20. I use the default akka configuration settings. The server has a single route which extracts the request and then responds with "OK".
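For readers who cannot open the linked demo, a server matching that description might look roughly like the sketch below. This is not the original demo code: the object name `DemoServer`, the actor system name, and the bind address and port are assumptions made for illustration, and it expects akka-http 10.2.x with akka-actor and akka-stream 2.6.x on the classpath.

```scala
import akka.actor.ActorSystem
import akka.http.scaladsl.Http
import akka.http.scaladsl.server.Directives._
import akka.http.scaladsl.server.Route

// Hypothetical stand-in for the linked demo app, not the original code.
object DemoServer {
  def main(args: Array[String]): Unit = {
    // Default configuration: nothing is overridden here.
    implicit val system: ActorSystem = ActorSystem("demo-server")

    // Single route: extract the request, ignore it, respond with "OK".
    // Note that this handler never reads the request entity explicitly.
    val route: Route =
      extractRequest { _ =>
        complete("OK")
      }

    // Host and port are assumed values for illustration.
    Http().newServerAt("0.0.0.0", 8080).bind(route)
  }
}
```

The linked demo app remains the authoritative version for reproducing the issue.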
First, I run it locally like this, pinning it to a single CPU:
Next, I limit the available CPU even further:
Then I run fortio to simulate load using 600 parallel connections, posting a 7 KB payload file:
Is this a bug?
I know, on the one hand I am slightly abusing this server by making it handle so many large requests with limited resources. But on the other hand, that doesn't seem like an excuse to see a RST. I would understand seeing a connection refused or a timeout error due to high load, but a RST seems more like a bug somewhere.
The reason I care about this... we see some errors in production, which I think is due to the same problem. The production setup is like this:
We find that when a new pod gets added to the service (and once it responds healthy), the load balancer routes a surge of requests very quickly to the new pod in one big hit. Some end clients then receive 502 error responses, and I believe this is because the akka server sends a RST to the load balancer.
By the way, to check if my demo is "fair" I ran the same test using a simple server written with http4s instead of akka. The alternative app is over here. With the alternative app, and with identical test conditions, all HTTP responses were successful.