-
Notifications
You must be signed in to change notification settings - Fork 202
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support for cgroupsv2 #791
Conversation
still prototyping
still prototyping
By default, even when not explicitly enabled the cpu subsystem in v2 provides all the metrics we use. If cpu.{max,uclamp,weight} were to be used in the future this needs to be extended and made more specific.
still not working correctly, as apparently only the parent cgroup outputs to the io.stat file?
This allows the following to be run without any need for adminitrative privileges. ``` $ systemd-run --user --scope -p Delegate=true runexec --read-only-dir / date ``` This also moves the main runexec process into a child cgroup due to it not being possible to delegate controllers if a process exists in the parent. This will make it possible to report benchexec system usage.
This prevents the benchmarked process from changing the configured limits. So far, the cgroup with the limits it the same where we add the benchmarked process. But if we then delegate it into the container, the benchmarked process can access it and change the limits. So now we create a child cgroup of the cgroup with the limits, move the benchmarked process into the child, and make the child the root cgroup of the container. Then the limits are configured outside of the container and cannot be changed. This finishes #436 for runexec. We just need to take care that for some special operations we also use the child cgroup instead of the main one. An alternative would be the "nsdelegate" mount option for the cgroup2 fs. However, this needs to be set in the initial namespace, so we cannot enforce this. And at least on my Ubuntu system, it is missing, so we also not just declare it as a requirement.
The cgroup namespaces from the previous commit provide better isolation, so using them also makes sense for containerexec, not just for runexec. However, containerexec currently does not use and require cgroups at all, and we do not want to make them mandatory, such that containerexec stays usable for users without cgroup access. So we add a new argument --cgroup-access to containerexec that triggers use of cgroups for the run (without any limits etc.) and uses cgroup namespaces to provide a usable cgroup for the process inside the container. We support this flag only on systems with cgroupsv2 because with cgroupsv1 the cgroup namespaces do not work as well. runexec and benchexec do not get this argument, because they use cgroups anyway and thus --cgroup-access is implied. This closes #436.
This is a kernel problem that is hopefully fixed in the future, but until then likely to cause problems for users who use BenchExec on their own machines (and not dedicated servers). So we at least give them a quick workaround.
With benchexec/debian/benchexec-cgroup.service Lines 12 to 13 in 38b27bd
|
Yes, exactly. |
I am trying to package BenchExec for NixOS (see #920). I decided to go for the
|
Thanks a lot!
This sounds like the kernel problem from 6cbd9fc. If you force
Yes, this is a well-known kernel regression: #776
I seen only one remaining warning, which is the one discussed above.
The last log of |
Yes, you're right. I'd prefer the workaround to be printed even for the warning, but that's a matter of taste. Also, I had to run a few commands (see below) in alternation with echo +cpuset | sudo tee /sys/fs/cgroup/user.slice/cgroup.subtree_control
echo +cpuset | sudo tee /sys/fs/cgroup/user.slice/user-1000.slice/cgroup.subtree_control
echo +cpuset | sudo tee /sys/fs/cgroup/user.slice/user-1000.slice/[email protected]/cgroup.subtree_control
echo +cpuset | sudo tee /sys/fs/cgroup/user.slice/user-1000.slice/[email protected]/benchexec.slice/cgroup.subtree_control Now the warning is gone! And it looks like this is fixed in kernel 6.6.
I invoked
Hmm yeah, sorry. I was confused.
Yes. Some background: In NixOS, the
I don't think that's necessary. So, it seems that all warnings are resolved, CPU Energy Meter, LXCFS, and libseccomp are available. One thing that NixOS reviewers will ask is why I want to package an unreleased version. Are there any plans on releasing BenchExec with support for cgroups2? Do you need any more data or reports? |
Interesting. In my tests I think that systemd did that automatically once the initial command was run. But as long as BenchExec at least prints the next step each time it is run, I guess I will keep it that way because detecting what exactly to run upfront could be difficult.
Ah, nice! It is really hard to keep track of patches floating around in Linux land. I guess I can add a recommendation for Linux 6.6.
Which btw. is not strictly necessary, but does improve how the container's system config looks.
Right. The typical alternative would be to use something like
Ah, thanks for the explanation.
Ok, as you prefer. Maybe add
The most important reason was that I didn't get any feedback from others so far, so I wanted to do more internal tests. But thanks to you, I think I can just go ahead and release a version with cgroupsv2 and tell people to still treat it as somewhat experimental in the release notes. Ah, and I have to write more docs... So I hope I can do a release this week (but then without the polishings discussed in the last comments). |
OK, that's great news. I am not in a rush, so please take your time and feel free to polish things up before releasing in the coming weeks. I'll wait. |
We want to recommend installation of pystemd and provide appropriate documentation for users.
Instead of crashing, we continue and provide nice error messages if cgroups are required. For Podman we can even tell the user exactly what to do.
Mention @globin as contributor as this work is from him.
There is this kernel bug for cgroups v2 that prevents us from using CPUSET but everything else is working. In this case we provide a nice error message with a workaround (6cbd9fc), but check_cgroups did not do this so far because it terminated itself too early. Now we continue and let RunExecutor print the message.
So far, BenchExec prints a warning about every missing cgroup controller, and later on an error message if that controller is strictly required. But CPUSET is only required for core limits, and has no other use. This is different from e.g. MEMORY which is required for memory limits but also provides memory measurements. So lets silence the warning about missing CPUSET and keep only the error message. check_cgroups is not affected because if forces CPUSET to be required.
I thought about that more now, and decided to completely remove the warning about missing |
This mystery is now also solved. I had previously configured systemd on my machine with I have now updated the documentation and our Ubuntu package with this systemd configuration: 5b5bc49 @lorenzleutgeb You might want to add this to the Nix package as well. |
Instead of modifying |
Yes, this is correct. I decided to not do this in order to not overcomplicate things, because for example when installing a Debian package it is not easy to answer the question "which users should be allowed to do this" and we cannot even use a regular group or a single file, but have to do this individually for every single allowed user. Furthermore, delegating cgroups v2 should be safe in principle (even Podman considers enabling this by default for containers). Also cf. systemd/systemd@b8df7f8. But if Nix allows you to do this more easily, you can do so. |
Yes. The following expression with the free variable systemd.slices."user-${builtins.toString config.users.users.${username}.uid}".serviceConfig.Delegate = "yes"; |
This adds support for cgroupsv2 to BenchExec. Support for cgroupsv1 is kept as is.
There are currently open TODOs:
However, testing is already possible and would be highly appreciated. The actual functionality should be working fine.
There are two ways to let BenchExec use cgroupsv2:
Start
benchexec
/runexec
inside their own cgroup (as the only process). On systems with systemd, this can be done by prefixing the command line with:Let BenchExec create their own cgroup. This requires systemd and the
pystemd
Python package installed. On Ubuntu, for example installpython3-pystemd
.Please test and provide information in a comment here: