# VIP31: Backend connection queue
We want to avoid the instant-failure behavior of fetch transactions targeting a backend that has reached its `max_connections` setting.
Currently, we can mitigate a backend overload by limiting the number of concurrent connections, which for HTTP/1 implies the same limit on concurrent requests. However, once a backend is saturated with work, any attempt at a new fetch fails immediately. For traffic spikes we would rather get a chance to wait for an opportunity before effectively failing on the `max_connections` criterion.
A transaction trying to acquire a connection when the backend is already saturated should be queued. We don't want this queue to grow forever, so we need both a limit and a timeout. Additionally, while waiting for a connection the task should disembark the fetch state machine instead of blocking a worker thread. When a transaction reembarks successfully, it should be guaranteed to reuse a connection or attempt a new one; in other words, a successful reembark should not run into `max_connections` saturation again.
New global parameters are needed to define a default queue limit and timeout:

- `backend_wait_timeout` (defaults to 0, meaning no timeout)
- `backend_wait_limit` (defaults to 0, meaning no queuing)
The default values don't change the current `max_connections` behavior. The parameter names were inspired by the existing `backend_idle_timeout`, which is also directly related to backend connection management.
It should be possible to override the global parameters on a per-backend basis:

```vcl
backend unreliable {
    .host = "unreliable.example.com";
    .max_connections = 100;
    .wait_timeout = 10s;
    .wait_limit = 20;
}
```
In addition, it should be possible to override the timeout on a per-transaction basis:

```vcl
sub vcl_backend_fetch {
    if (bereq.backend == unreliable && bereq.url ~ "/non/critical") {
        set bereq.wait_timeout = 1m; # we can afford to wait longer
    }
}
```
Since the timeout can vary per transaction, waiters may expire out of order; this means that the `max_connections` queue cannot be a mere FIFO.
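To illustrate, here is a sketch (hypothetical names, not Varnish code) of a wait list that still serves waiters in arrival order, but supports removing an expired entry from anywhere in the list, which a plain array or ring-buffer FIFO cannot do cheaply:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch, not Varnish code. */
struct waiter {
	double		deadline;	/* now + effective wait_timeout */
	struct waiter	*prev, *next;	/* doubly linked for mid-list removal */
};

struct wait_queue {
	struct waiter	*head, *tail;
};

static void
wq_push(struct wait_queue *q, struct waiter *w)
{
	w->next = NULL;
	w->prev = q->tail;
	if (q->tail != NULL)
		q->tail->next = w;
	else
		q->head = w;
	q->tail = w;
}

static void
wq_remove(struct wait_queue *q, struct waiter *w)
{
	if (w->prev != NULL) w->prev->next = w->next; else q->head = w->next;
	if (w->next != NULL) w->next->prev = w->prev; else q->tail = w->prev;
}

/* Drop every waiter whose deadline has passed, wherever it sits. */
static int
wq_expire(struct wait_queue *q, double now)
{
	int expired = 0;
	struct waiter *w = q->head, *next;

	while (w != NULL) {
		next = w->next;
		if (w->deadline <= now) {
			wq_remove(q, w);
			expired++;
		}
		w = next;
	}
	return (expired);
}

/* Serve the oldest still-valid waiter when a connection frees up. */
static struct waiter *
wq_pop(struct wait_queue *q)
{
	struct waiter *w = q->head;

	if (w != NULL)
		wq_remove(q, w);
	return (w);
}
```

Arrival order is preserved for fairness, while expiry is driven by each waiter's own deadline, so an entry in the middle of the queue can disappear before the head does.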
We will need to change the backend API and the fetch state machine to introduce a step dedicated to attempting a backend connection. This could involve the waiter facility.
A director should ideally not consider a backend that will neither connect immediately nor wait for a connection to become available. We could teach regular backends to report sick when saturated, but only in the context of a transaction (note: `vbe_healthy()` currently disregards its `ctx` argument).