
Backport series for 7.6.1 #4222

Merged: 14 commits into varnishcache:7.6 on Nov 7, 2024
Conversation

walid-git (Member)

This is a set of backports in preparation for the 7.6.1 release.
Refs: #4195

Review threads (resolved):
doc/changes.rst (outdated)
doc/sphinx/reference/vsm.rst
bin/varnishd/common/common_vsmw.c
@walid-git (Member Author)

Seems like our fedora job is broken:

Unknown argument "groupinstall" for command "dnf5". Add "--help" for more information about the arguments.
It could be a command provided by a plugin, try: dnf5 install 'dnf5-command(groupinstall)'

nigoroll added two commits that referenced this pull request Nov 4, 2024
@nigoroll (Member) commented Nov 4, 2024

Just so my apology has a place: I just pushed two commits to master which were meant to be just one. I just wanted to make it clear that we should not add anything else to VRT 20.1.

dridi and others added 14 commits November 4, 2024 18:06
There is a race between the time we connect to a backend and when we increase
the connection count. This race is not normally observed, but when the
.max_connections property is used we could end up creating more connections to
the backend than this limit permits. This is problematic as we end up consuming
more system resources, e.g. file descriptors, than reserved for such backends.

The problem is that we release the backend lock and wait for the connect
to finish. This can take up to connect_timeout, which defaults to 3.5 seconds,
and only then do we count the connection as consumed towards the
.max_connections property.

This creates a recipe for bursty behavior where an already struggling backend
could be overwhelmed by a significant number of connections. Then, as the
connection count rises above the ceiling, we either fail fetches or queue them
up. However, as soon as we drop below the limit again, we over-commit once more
and the cycle repeats.

The test case uses a blackhole backend to simulate the connect_timeout; this
exercises the race, but under normal circumstances the likelihood of the race
occurring depends on the traffic (fetches), networking conditions, and the
performance of the backend.

The solution is to increase the connection count before connecting to the
backend, and to revert the count before failing the fetch if we fail to get a
connection. This makes any fetches past the .max_connections limit either fail
outright or observe the presence of a queue to the backend.

Signed-off-by: Asad Sajjad Ahmed <[email protected]>
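
For illustration, here is a simplified sketch of the count-before-connect
pattern described above (hypothetical names and locking, not the actual
Varnish source):

/*
 * Count the connection against .max_connections *before* the potentially
 * slow connect(), and revert the count if the connect fails.
 * All names here are illustrative.
 */
#include <pthread.h>

struct backend {
	pthread_mutex_t	mtx;
	unsigned	n_conn;			/* connections counted so far */
	unsigned	max_connections;	/* 0 means unlimited */
};

/* hypothetical helper doing the (slow) TCP connect, up to connect_timeout */
extern int do_connect(struct backend *);

static int
backend_connect_counted(struct backend *be)
{
	int fd;

	pthread_mutex_lock(&be->mtx);
	if (be->max_connections > 0 && be->n_conn >= be->max_connections) {
		pthread_mutex_unlock(&be->mtx);
		return (-1);	/* over the limit: fail or queue the fetch */
	}
	be->n_conn++;		/* reserve the slot before connecting */
	pthread_mutex_unlock(&be->mtx);

	fd = do_connect(be);	/* may block for up to connect_timeout */
	if (fd < 0) {
		pthread_mutex_lock(&be->mtx);
		be->n_conn--;	/* revert the count before failing the fetch */
		pthread_mutex_unlock(&be->mtx);
	}
	return (fd);
}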
Pondering solutions for varnishcache#4183, it became apparent that we need a way
to notify directors of changes to the admin_health without crossing the VDI/VBE
layering.

Besides adding another director callback, one option is to add an event to the
existing event facility, which so far had only been used for VCL-related events.
We now add VDI_EVENT_SICK as a director-specific event, which is cheap and also
helps us maintain clean layering. Using the new event, we can now selectively
notify interested backend types (so far: VBE only) when the admin health changes.

This fixes the layer violation introduced with
e46f972, where a director private pointer was
used with a VBE-specific function.

Fixes varnishcache#4183
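
To illustrate the layering idea, a minimal sketch with hypothetical types (not
the real VDI/VBE API): the health-administration code only broadcasts an event,
and a backend implementation opts in by registering an event callback, so
neither layer has to reach into the other's private data.

#include <stddef.h>

enum dir_event { DIR_EVENT_SICK };	/* illustrative stand-in for VDI_EVENT_SICK */

struct director {
	const char	*name;
	void		(*event)(struct director *, enum dir_event);
	void		*priv;		/* owned by the implementation only */
};

/* called when the admin health of a director changes */
static void
dir_notify_sick(struct director *d)
{
	if (d->event != NULL)	/* only interested backend types register one */
		d->event(d, DIR_EVENT_SICK);
}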
Try to raise the hard limit to infinity, then set the soft (current) limit to
whatever the hard (max) limit is.

Motivated by varnishcache#4193
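
A sketch of that dance with the standard getrlimit()/setrlimit() calls;
RLIMIT_MEMLOCK is only an assumption here, based on the mlock() context of
varnishcache#4193:

#include <sys/resource.h>

static void
raise_memlock_limit(void)
{
	struct rlimit rl;

	if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0)
		return;
	rl.rlim_max = RLIM_INFINITY;
	if (setrlimit(RLIMIT_MEMLOCK, &rl) != 0) {
		/* raising the hard limit failed (not privileged); re-read it */
		if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0)
			return;
	}
	rl.rlim_cur = rl.rlim_max;	/* soft limit = whatever the hard limit is */
	(void)setrlimit(RLIMIT_MEMLOCK, &rl);
}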
Report the current resource limits to help with diagnosis.

Example output:

Child launched OK
Info: Child (792279) said Child starts
Info: Child (792279) said Warning: mlock() of VSM failed: Cannot allocate memory (12)
Info: Child (792279) said Info: max locked memory (soft): 1048576 bytes
Info: Child (792279) said Info: max locked memory (hard): 1048576 bytes

Motivated by varnishcache#4193
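
A sketch of how such info lines could be produced (hypothetical helper, not the
actual varnishd code), again assuming the limit in question is RLIMIT_MEMLOCK:

#include <stdint.h>
#include <stdio.h>
#include <sys/resource.h>

static void
report_limit(const char *what, rlim_t lim)
{
	if (lim == RLIM_INFINITY)
		printf("Info: max locked memory (%s): unlimited\n", what);
	else
		printf("Info: max locked memory (%s): %ju bytes\n",
		    what, (uintmax_t)lim);
}

static void
report_memlock_limits(void)
{
	struct rlimit rl;

	if (getrlimit(RLIMIT_MEMLOCK, &rl) == 0) {
		report_limit("soft", rl.rlim_cur);
		report_limit("hard", rl.rlim_max);
	}
}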
Solaris does not have it and I overlooked the CONFORMING TO section in the Linux
manpage.

Ref varnishcache#4193
Not my day for details today, it seems. :(
It turns out varnishd can be the main process of a Docker container when
another process like varnishlog is 'docker exec'uted in the same PID
namespace.

Spotted by @gquintard.
@walid-git (Member Author)

Pushed 61f386d (thanks @dridi) to the 7.6 branch and rebased this PR on top of it.

@walid-git walid-git merged commit 4ef61c0 into varnishcache:7.6 Nov 7, 2024
11 checks passed