Overlay handling with fuse-overlayfs #1062
By preserving all capabilities granted in the parent user namespace for the child process, we can use fuse-overlayfs (fusermount) to perform overlay mounts. This currently works with containerexec and runexec only, because benchexec creates containers using unshare rather than by cloning a new process. Support for benchexec is under development.
By setting up the container's filesystem in the child process, bind mount-related errors in benchexec caused by fuse-overlayfs can be avoided. This change will not affect the normal operation of kernel overlayfs.
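As a rough illustration of the fallback described here, the following sketch tries a kernel overlay mount first and only starts fuse-overlayfs if that fails with "Invalid argument" (the error shown in the debug logs of this thread). It is not the code from this PR: `kernel_mount` is a hypothetical callable standing in for the real mount(2) wrapper, the function names are made up for this example, and restricting the fallback to `EINVAL` is an assumption.

```python
import errno
import subprocess


def build_fuse_overlayfs_command(executable, lower, upper, work, target):
    """Return the argv list for mounting an overlay with fuse-overlayfs."""
    options = f"lowerdir={lower},upperdir={upper},workdir={work}"
    return [executable, "-o", options, target]


def mount_overlay(lower, upper, work, target, kernel_mount,
                  run=subprocess.run, fuse_executable="fuse-overlayfs"):
    """Try kernel overlayfs first, fall back to fuse-overlayfs on EINVAL.

    kernel_mount is a hypothetical callable that performs the kernel mount
    and raises OSError on failure. Returns the name of the backend used.
    """
    try:
        kernel_mount(lower, upper, work, target)
        return "kernel"
    except OSError as e:
        # Assumption: only "Invalid argument" triggers the fallback.
        if e.errno != errno.EINVAL:
            raise
    command = build_fuse_overlayfs_command(
        fuse_executable, lower, upper, work, target
    )
    run(command, check=True)
    return "fuse"
```

Passing `run` as a parameter keeps the sketch testable without actually mounting anything; a real implementation would also need to handle a missing fuse-overlayfs binary.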
Your code in general looks really good! I have hardly any suggestions for improvements about the code quality, only very minor stuff. 👍
Of course we can discuss a few conceptual questions further, for example how exactly to decide on when to use the fallback, do we need a configuration option for the user, etc. But let us first get everything working.
If I try this out on Ubuntu 22.04 with |
It's working fine on my Ubuntu Server 24.04. First, let me post my log. (Later, I even created a new regular user and used it to build BenchExec from scratch according to the documentation, and I did not encounter any errors.)
haoranyang@haoranyang-fudan:~$ ls
benchexec benchexec-venv BOLT branchy branchy.c go perf.data podman sleep.fdata sleep.yaml util-linux v2ray
haoranyang@haoranyang-fudan:~$ source benchexec-venv/bin/activate
(benchexec-venv) haoranyang@haoranyang-fudan:~$ pip install -e benchexec/
Obtaining file:///home/haoranyang/benchexec
Installing build dependencies ... done
Checking if build backend supports build_editable ... done
Getting requirements to build editable ... done
Preparing editable metadata (pyproject.toml) ... done
Requirement already satisfied: PyYAML>=3.12 in ./benchexec-venv/lib/python3.12/site-packages (from BenchExec==3.22.dev0) (6.0.1)
Building wheels for collected packages: BenchExec
Building editable for BenchExec (pyproject.toml) ... done
Created wheel for BenchExec: filename=BenchExec-3.22.dev0-0.editable-py3-none-any.whl size=21486 sha256=d8c4068011771de172d93b289fdcc8eb9760cf845a00a1070906472c33358930
Stored in directory: /tmp/pip-ephem-wheel-cache-4tp07kag/wheels/f9/0a/ba/54baf05cf463ec13ae25c2146ce7921a4dfd5a323c55306c95
Successfully built BenchExec
Installing collected packages: BenchExec
Attempting uninstall: BenchExec
Found existing installation: BenchExec 3.22.dev0
Uninstalling BenchExec-3.22.dev0:
Successfully uninstalled BenchExec-3.22.dev0
Successfully installed BenchExec-3.22.dev0
(benchexec-venv) haoranyang@haoranyang-fudan:~$ ./benchexec/bin/containerexec --debug bash
2024-07-09 23:12:45 - DEBUG - This is containerexec 3.22-dev.
2024-07-09 23:12:45 - INFO - Starting command bash
2024-07-09 23:12:45 - DEBUG - Available Cgroups: {}
2024-07-09 23:12:45 - DEBUG - Starting process.
2024-07-09 23:12:45 - DEBUG - Parent: child process of RunExecutor with PID 14690 started.
2024-07-09 23:12:45 - DEBUG - Child: child process of RunExecutor with PID 14690 started
2024-07-09 23:12:45 - DEBUG - Failed to make b'/tmp/BenchExec_run_ccbkktpf/mount/home/benchexec' a bind mount: [Errno 2] mount(b'/tmp/BenchExec_run_ccbkktpf/mount/home/benchexec', b'/tmp/BenchExec_run_ccbkktpf/mount/home/benchexec', None, 4096, None) failed: No such file or directory
2024-07-09 23:12:45 - DEBUG - Mounting '/' as overlay
2024-07-09 23:12:45 - DEBUG - [Errno 16] umount(b'/tmp/BenchExec_run_ccbkktpf/mount/') failed: Device or resource busy
2024-07-09 23:12:45 - DEBUG - Creating overlay mount: target=b'/tmp/BenchExec_run_ccbkktpf/mount/', lower=b'/', upper=b'/tmp/BenchExec_run_ccbkktpf/temp/', work=b'/tmp/BenchExec_run_ccbkktpf/overlayfs/1'
2024-07-09 23:12:45 - DEBUG - Cannot use kernel overlay for /: [Errno 22] mount(b'none', b'/tmp/BenchExec_run_ccbkktpf/mount/', b'overlay', 0, b'lowerdir=/,upperdir=/tmp/BenchExec_run_ccbkktpf/temp/,workdir=/tmp/BenchExec_run_ccbkktpf/overlayfs/1') failed: Invalid argument. Trying to use fuse-overlayfs instead.
2024-07-09 23:12:45 - DEBUG - Creating overlay mount with fuse-overlayfs: target=b'/tmp/BenchExec_run_ccbkktpf/mount/', lower=b'/', upper=b'/tmp/BenchExec_run_ccbkktpf/temp/', work=b'/tmp/BenchExec_run_ccbkktpf/overlayfs/1'
2024-07-09 23:12:45 - DEBUG - Mounting '/dev' as read-only
2024-07-09 23:12:45 - DEBUG - Mounting '/dev/pts' as read-only
2024-07-09 23:12:45 - DEBUG - Mounting '/dev/shm' as read-only
2024-07-09 23:12:45 - DEBUG - Mounting '/dev/mqueue' as read-only
2024-07-09 23:12:45 - DEBUG - Mounting '/dev/hugepages' as read-only
2024-07-09 23:12:45 - DEBUG - Mounting '/run' as hidden
2024-07-09 23:12:45 - DEBUG - Mounting '/sys' as read-only
2024-07-09 23:12:45 - DEBUG - Mounting '/sys/kernel/security' as read-only
2024-07-09 23:12:45 - DEBUG - Mounting '/sys/fs/cgroup' as read-only
2024-07-09 23:12:45 - DEBUG - Mounting '/sys/fs/pstore' as read-only
2024-07-09 23:12:45 - DEBUG - Mounting '/sys/fs/bpf' as read-only
2024-07-09 23:12:45 - DEBUG - Mounting '/sys/kernel/tracing' as read-only
2024-07-09 23:12:45 - DEBUG - Mounting '/sys/kernel/debug' as read-only
2024-07-09 23:12:45 - DEBUG - Mounting '/sys/fs/fuse/connections' as read-only
2024-07-09 23:12:45 - DEBUG - Mounting '/sys/kernel/config' as read-only
2024-07-09 23:12:45 - DEBUG - Mounting '/proc' as read-only
2024-07-09 23:12:45 - DEBUG - Mounting '/var' as overlay
2024-07-09 23:12:45 - DEBUG - Creating overlay mount: target=b'/tmp/BenchExec_run_ccbkktpf/mount/var', lower=b'/var', upper=b'/tmp/BenchExec_run_ccbkktpf/temp/var', work=b'/tmp/BenchExec_run_ccbkktpf/overlayfs/2'
2024-07-09 23:12:45 - DEBUG - Cannot use kernel overlay for /var: [Errno 22] mount(b'none', b'/tmp/BenchExec_run_ccbkktpf/mount/var', b'overlay', 0, b'lowerdir=/var,upperdir=/tmp/BenchExec_run_ccbkktpf/temp/var,workdir=/tmp/BenchExec_run_ccbkktpf/overlayfs/2') failed: Invalid argument. Trying to use fuse-overlayfs instead.
2024-07-09 23:12:45 - DEBUG - Creating overlay mount with fuse-overlayfs: target=b'/tmp/BenchExec_run_ccbkktpf/mount/var', lower=b'/var', upper=b'/tmp/BenchExec_run_ccbkktpf/temp/var', work=b'/tmp/BenchExec_run_ccbkktpf/overlayfs/2'
2024-07-09 23:12:45 - DEBUG - Mounting '/var/lib/lxcfs' as read-only
2024-07-09 23:12:45 - DEBUG - Mounting '/boot' as overlay
2024-07-09 23:12:45 - DEBUG - Creating overlay mount: target=b'/tmp/BenchExec_run_ccbkktpf/mount/boot', lower=b'/boot', upper=b'/tmp/BenchExec_run_ccbkktpf/temp/boot', work=b'/tmp/BenchExec_run_ccbkktpf/overlayfs/3'
2024-07-09 23:12:45 - DEBUG - Mounting '/snap/core22/1380' as overlay
2024-07-09 23:12:45 - DEBUG - Creating overlay mount: target=b'/tmp/BenchExec_run_ccbkktpf/mount/snap/core22/1380', lower=b'/snap/core22/1380', upper=b'/tmp/BenchExec_run_ccbkktpf/temp/snap/core22/1380', work=b'/tmp/BenchExec_run_ccbkktpf/overlayfs/4'
2024-07-09 23:12:45 - DEBUG - Mounting '/snap/core22/1122' as overlay
2024-07-09 23:12:45 - DEBUG - Creating overlay mount: target=b'/tmp/BenchExec_run_ccbkktpf/mount/snap/core22/1122', lower=b'/snap/core22/1122', upper=b'/tmp/BenchExec_run_ccbkktpf/temp/snap/core22/1122', work=b'/tmp/BenchExec_run_ccbkktpf/overlayfs/5'
2024-07-09 23:12:45 - DEBUG - Mounting '/snap/snapd/21759' as overlay
2024-07-09 23:12:45 - DEBUG - Creating overlay mount: target=b'/tmp/BenchExec_run_ccbkktpf/mount/snap/snapd/21759', lower=b'/snap/snapd/21759', upper=b'/tmp/BenchExec_run_ccbkktpf/temp/snap/snapd/21759', work=b'/tmp/BenchExec_run_ccbkktpf/overlayfs/6'
2024-07-09 23:12:45 - DEBUG - Mounting '/snap/snapd/21465' as overlay
2024-07-09 23:12:45 - DEBUG - Creating overlay mount: target=b'/tmp/BenchExec_run_ccbkktpf/mount/snap/snapd/21465', lower=b'/snap/snapd/21465', upper=b'/tmp/BenchExec_run_ccbkktpf/temp/snap/snapd/21465', work=b'/tmp/BenchExec_run_ccbkktpf/overlayfs/7'
2024-07-09 23:12:45 - DEBUG - Mounting '/snap/lxd/25846' as overlay
2024-07-09 23:12:45 - DEBUG - Creating overlay mount: target=b'/tmp/BenchExec_run_ccbkktpf/mount/snap/lxd/25846', lower=b'/snap/lxd/25846', upper=b'/tmp/BenchExec_run_ccbkktpf/temp/snap/lxd/25846', work=b'/tmp/BenchExec_run_ccbkktpf/overlayfs/8'
2024-07-09 23:12:45 - DEBUG - Mounting '/snap/lxd/26200' as overlay
2024-07-09 23:12:45 - DEBUG - Creating overlay mount: target=b'/tmp/BenchExec_run_ccbkktpf/mount/snap/lxd/26200', lower=b'/snap/lxd/26200', upper=b'/tmp/BenchExec_run_ccbkktpf/temp/snap/lxd/26200', work=b'/tmp/BenchExec_run_ccbkktpf/overlayfs/9'
2024-07-09 23:12:45 - DEBUG - Mounting '/tmp' as hidden
2024-07-09 23:12:45 - DEBUG - Mounting '/run' as hidden
2024-07-09 23:12:45 - DEBUG - Parent: executing bash in grand child with PID 14698 via child with PID 14690.
2024-07-09 23:12:45 - DEBUG - Waiting for signals
2024-07-09 23:12:45 - DEBUG - Waiting for signals
benchexec@benchexec:/home/haoranyang$ ls /run/
shm
benchexec@benchexec:/home/haoranyang$ exit
exit
2024-07-09 23:20:06 - DEBUG - Child: process bash terminated with exit code 0.
2024-07-09 23:20:06 - DEBUG - 0 output files matched the patterns and were transferred.
2024-07-09 23:20:06 - DEBUG - Waiting for process bash with pid 15540
2024-07-09 23:20:06 - DEBUG - Parent: child process of RunExecutor with PID 15540 terminated with return value 0.
2024-07-09 23:20:06 - DEBUG - Process terminated, exit code 0.
2024-07-09 23:20:06 - DEBUG - Cleaning up temporary directory.
(benchexec-venv) haoranyang@haoranyang-fudan:~$
You can post the entire log. I will test it again tomorrow on a machine with another Ubuntu version to rule out the possibility that the issue is caused by version differences. And I have to say this is a bit strange.
Sure, but there is nothing interesting in it I guess:
(with 85f02ca) |
I tried on Ubuntu 22.04/Linux 5.15 (Windows Subsystem for Linux), and it failed with the same log, pending after
I conducted another test on a different Tencent Cloud server instance with Ubuntu 24.04 and kernel version 6.8.0-36-generic. I built and ran it from scratch using a regular user. There is now reason to believe that the BenchExec hang is caused by differences between Ubuntu versions.
Ok, good to know. I guess it will be either fixable or at least we can declare old kernels as unsupported or so. |
An interesting question just came to my mind: The On the other hand, if fuse-overlayfs is not in that cgroup, will it still work that files written are counted against the memory limit? All files written by the benchmarked process typically end up in a |
This is a very good question indeed, so I've tested the mentioned two things:
I am wondering about two more things. Right now we create one overlay mount per mountpoint in the system (if configured so), and with fuse-overlayfs this means one fuse-overlayfs instance (process) per mountpoint. For the kernel overlayfs this is required, because it does not support being used across mountpoint boundaries. But for fuse-overlayfs this should not be required. I think that it is sufficient to have one single fuse-overlayfs mount (for I think it is not much effort: Create one fuse-overlayfs mount for Do you think that makes sense and would you try it?

This is also related to another question I have. Right now we use fuse-overlayfs only if required, and automatically. This means that (a) users might not be aware of what is happening and (b) we could have a mixture of mount points, some using fuse-overlayfs and some using kernel overlayfs. I am not sure whether we want this, in particular (b).

For (b) I think we should at least try to avoid this. We could for example check the list of mountpoints first. If there is a mountpoint that is configured for overlay mode and for which another mountpoint below it exists, then we use fuse-overlayfs for all overlay mounts. Otherwise we only use kernel overlayfs. What do you think?

For (a) I am wondering whether we want a command-line argument to influence the behavior, but maybe it is not necessary.
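The mountpoint check proposed for (b) could be sketched as follows. This is only an illustration under assumptions: the function names are hypothetical, all paths are taken to be absolute, and the list of mountpoints is passed in rather than read from the system.

```python
import os


def has_mount_below(directory, mountpoints):
    """Return True if any mountpoint lies strictly below directory."""
    directory = os.path.normpath(directory)
    # Append a separator so that "/var" does not match "/variable".
    prefix = directory if directory.endswith(os.sep) else directory + os.sep
    return any(
        os.path.normpath(m) != directory
        and os.path.normpath(m).startswith(prefix)
        for m in mountpoints
    )


def should_use_fuse_for_all(overlay_dirs, mountpoints):
    """Use fuse-overlayfs everywhere if any overlay directory is non-leaf."""
    return any(has_mount_below(d, mountpoints) for d in overlay_dirs)
```

With the mounts from the log earlier in this thread, `/var` (which has `/var/lib/lxcfs` below it) would make the check answer "fuse for all", whereas a system where every overlay directory is a leaf mountpoint would stay on kernel overlayfs.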
I just noticed that we have another use case of BenchExec (#1067) where this could help, and tried it out to see whether it does. And indeed, it already works nicely.
I agree with having one single However, I don't fully understand the concept of avoiding a mixture of mount points. The example you provided suggests checking the mount points first; if a mount point that needs an overlay mount contains another mount point below it, we should use a fuse-based mount for all overlay mount points. I don't quite grasp the logic here. Could you please explain this in a bit more detail? |
Thanks!
Sorry for the incomplete explanation. Your understanding is correct so far. The check I had in mind would be something like:
Is this explanation better? |
Sorry, I realized I forgot why we needed to use |
Actually, I now realized that if we do what I suggested, we would miss the detection of nested overlay mounts as explained in #1062 (comment). Whether the nesting limit for kernel overlayfs is already reached is something that I don't think we can check upfront. But it could still be that we are asked to mount an overlay to So I am wondering whether we can implement something such that we fall back to fuse-overlayfs for all overlayfs mounts even if only some late mount point fails. Do you have an idea? |
I'm not sure what we might soon be doing has to do with the detection of nested overlay mounts. If we take measures to avoid mixing usage of kernel overlayfs and fuse-overlayfs, then in cases of multi-layer nesting, either every layer will be based on kernel overlayfs or every layer will be based on fuse-overlayfs. If it’s the former, the kernel will report an error and benchexec will throw an exception. If it’s the latter, I’m not sure how many layers fuse-overlayfs can nest—maybe it's unlimited. I don't quite understand the connection. Please forgive my constant confusion. :-( |
Don't worry, I can understand. Your description is correct. What I mean is the following: In my nesting example we have 3 layers and the third one needs to use fuse-overlayfs. However, the detection mechanism that I described above for whether we should use kernel or fuse would answer "kernel" in all three layers (because it only looks at submounts and there are none). So the third layer would fail, but we want it to automatically use fuse-overlayfs and not throw the exception you mentioned. So far understandable now? The problem is that I do not know how to detect upfront (before creating the mounts) that fuse-overlayfs is required because the nesting limit is reached. Only once we attempt a kernel overlayfs mount and it fails with "invalid argument" will we notice.
Sorry for the delay, I was a little busy yesterday. Thanks for your explanation; I understand now. The test command is If we change the implementation so that all overlay mount points either use kernel overlayfs or fuse-overlayfs entirely, and the criterion for choosing between the two shifts from "use fuse when kernel overlayfs fails" to "use fuse if there’s a mount point below an overlay mode mount point," then kernel overlayfs errors would no longer be recoverable by handing off to fuse-overlayfs. |
If my understanding is correct, we should keep the previous implementation unchanged, while adding your suggested approach of checking beforehand whether any overlay mount points are non-leaf. Reconsidering the use case you proposed: in the first and second layers, kernel overlayfs would handle all the overlay mounts. In the third layer, when kernel overlayfs fails, we switch to fuse-overlayfs in the exception handling process. This approach doesn’t seem to pose any issues either. |
Hm, yes, that could be a good choice until we find something better. It would be consistent in the majority of cases, but still work for all cases. Thanks! |
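The combined strategy agreed on here (pre-check for non-leaf overlay directories, plus the existing per-mount fallback) might be summarized by the sketch below. It is a simplified model, not BenchExec code: `choose_backends` and `kernel_mount_succeeds` are hypothetical names, paths are assumed to be absolute, and the actual mounting is replaced by a callable that reports whether a kernel overlay mount would succeed.

```python
import os


def _is_below(child, parent):
    """Return True if child is a path strictly below parent."""
    child, parent = os.path.normpath(child), os.path.normpath(parent)
    return child != parent and os.path.commonpath([child, parent]) == parent


def choose_backends(overlay_dirs, mountpoints, kernel_mount_succeeds):
    """Map each overlay directory to "kernel" or "fuse".

    If any overlay directory has a mountpoint below it, fuse-overlayfs is
    used for all of them. Otherwise kernel overlayfs is tried per directory
    and fuse-overlayfs is used only where the kernel mount fails.
    """
    non_leaf = any(
        _is_below(m, d) for d in overlay_dirs for m in mountpoints
    )
    backends = {}
    for d in overlay_dirs:
        if non_leaf:
            backends[d] = "fuse"
        elif kernel_mount_succeeds(d):
            backends[d] = "kernel"
        else:
            # Late failure, e.g. the kernel nesting limit was reached.
            backends[d] = "fuse"
    return backends
```

As noted in the discussion, this stays consistent in the common case but can still produce a mixture when only a late kernel mount fails.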
Thanks. If we can reproduce it locally, it is not difficult to debug. I started to print out
This we can find in the BenchExec source: benchexec/benchexec/container.py Lines 641 to 649 in 3ddb913
Because fuse filesystems could be some weird stuff, we decided to play safe and not use them for overlayfs (could be problematic especially for the kernel overlayfs). So we just need to extend the linked check with |
Thank you very much for your debugging! I have made changes based on the above suggestions and submitted the code. Are there any other issues with the code currently submitted? Perhaps we could merge it? Before doing so, I would like to review the handling of various error cases once more, such as checking if the error messages provided are user-friendly. |
I think there are no known issues left. 🎉 Really cool! Thank you a lot for your work! BenchExec users will be really grateful. It's been some time since I had a comprehensive look over the code, so I would prefer to do one last review before merging. Unfortunately I am really busy right now because I have two upcoming talks to give, so it could take a week. I hope I find the time to squeeze it in sooner. But I do not really expect any problems to come up. If you could post the last state of the error messages, that would indeed be great. |
I'm also very happy to see that we're close to completing this work. It's truly been an enjoyable and rewarding journey! I'll also review the code and summarize what I've learned during this GSoC. I'll post the error messages ASAP. |
In 88db419, I primarily made some modifications to the logic for handling overlay errors. When the kernel overlayfs fails, based on the values of the variables
Additionally, when using
I believe this is inappropriate. In fact, within the benchexec/benchexec/container.py Lines 989 to 991 in 88db419
Wow, great! I have to thank you for so many things.
First, 🙏 a lot for your patience while I was traveling!
Second, 🙏 for polishing the error messages and providing the nice summary, which makes it super convenient for me to review.
Third, 🙏 for explicitly thinking about the possible cases and providing this great overview. This again is so nice for me as reviewer, and also certainly helpful at some point in the future when we revisit this code for some reason.
Of course, I fully agree with what you wrote. 👍
I also did a full code review again and I really like the code. All the comments and explanations are so helpful for understanding what is done and (more importantly even) why. 👍
This is now really really close to being merged 🚀 , the only things I found are the following:
- Seems like Overlay handling with fuse-overlayfs #1062 (comment) was overlooked? (I blame GitHub here, not you, its UI for long PR discussions is really not good. 😞) This is not critical for the merge, but of course easy to fix.
- And I investigated a little bit of what happens with non-standard modes for `/tmp` and think there should be an easy fix, would you like to implement this (cf. my other comment)? This seems to be important to handle before we merge.
- I also noticed something else that is weird: `benchexec test/tasks/benchmark.xml` is noticeably slower than `benchexec --read-only-dir / test/tasks/benchmark.xml` for me. Is this the same for you? I am wondering where the performance difference could come from.
  I tried to cache `get_fuse_overlayfs_executable` by adding `@functools.lru_cache` but it is still slow. All the calls to `setup_fuse_overlay` also took only 0.05s together, which is much less than the time difference. Any idea what this could be?
  This would be nice to solve, of course, but if we do not get an understanding of what is going on quickly, I do not see this as blocker. For me, whenever kernel overlayfs is used, it is as fast as before.
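The caching attempt mentioned in the last point could look roughly like the sketch below. The function `find_fuse_overlayfs` is a hypothetical stand-in, not BenchExec's actual `get_fuse_overlayfs_executable`; it assumes the lookup is a plain `PATH` search.

```python
import functools
import shutil


@functools.lru_cache(maxsize=None)
def find_fuse_overlayfs(name="fuse-overlayfs"):
    """Return the path of the fuse-overlayfs binary, or None if not found.

    The result is cached, so the PATH search runs once per distinct name.
    """
    return shutil.which(name)
```

Since the lookup is cheap compared to the observed slowdown, it is plausible that caching alone does not change the timing, which matches what was reported.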
Thank you for your kind words! I've been celebrating the Mid-Autumn Festival recently, so I didn't have the chance to reply to the comments earlier. I have submitted two new patches to address the two points that needed modification from your comments.
I don't think there is a noticeable difference in execution speed between the two on my machine. I conducted the tests on Ubuntu 24.04 with fuse-overlayfs 1.13. I've used
No worries.
Thanks!
Hm, but you also have a difference from 1.2s to 1.8s. For me it was like 1.2s to 2.2s, so same order of magnitude. But I think we can accept this for now and see how it works in practice. Optimizations are fine to add later. Thanks for testing! |
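To compare such timings more reproducibly, one could use a small helper like the following. It is only a sketch: `time_command` is a hypothetical name, and taking the minimum over a few repetitions is one possible design choice to reduce noise.

```python
import subprocess
import time


def time_command(argv, repeats=3):
    """Run argv several times and return the fastest wall time in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        subprocess.run(
            argv,
            check=True,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        durations.append(time.perf_counter() - start)
    return min(durations)


# Example usage with the two commands compared in this thread:
# time_command(["benchexec", "test/tasks/benchmark.xml"])
# time_command(["benchexec", "--read-only-dir", "/", "test/tasks/benchmark.xml"])
```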
I think I know why using According to containers/fuse-overlayfs#386, this could be due to the use of benchexec/benchexec/containerexecutor.py Lines 1083 to 1084 in 3ddb913
A fuse-overlayfs developer tends to believe that this is a kernel issue (see containers/fuse-overlayfs#386 (comment)).
It's still not working... I've added this before self._dir_modes.update({temp_dir: DIR_HIDDEN}) And I ran
Good way to implement my suggestion! Sad that it does not work. I had the idea to try make all I really don't understand why: It fails during creating Maybe we just have to declare this as unsupported for now? I.e., if the mode for |
The temp_base directory (.../temp) is the one that BenchExec uses to store output files of the tool, and after a run we iterate through it and copy files from there to the output directory. Thus we should not use it for internal stuff. But the work_base directory is fine for that. So let's move the fuse mountpoint to work_base as well.
Somehow this causes deadlocks that we did not manage to solve even by making our own temp directory hidden. So let's at least avoid the deadlock and provide a proper error message. More background is in the discussions: #1062 (comment) #1062 (comment)
Ok, so I implemented an error message that prevents running into the deadlock and did some very minor polishing. @younghojan Do you want to have a look at what I did? GitHub doesn't let me mark this PR as approved because I committed to it, but I am happy with the current state and think it is ready. 🚀 I would merge it and make a BenchExec release at the latest tomorrow. 🥳 🎉 Thanks a lot again, this was really great work and is highly valuable! It may not look that much when looking at the line count, but one should know that when working on such a low-level technical layer it is often a lot of effort to find and handle all special cases etc., and this was no exception. |
Lots of thx! I’m very happy to see that our efforts have finally led to a satisfying conclusion. 🥳 The code you added looks really great! I would like to thank everyone who has helped me throughout this GSoC again! 🎉 |
This implements #928: Use fuse-overlayfs as fallback for the kernel overlayfs, with the goal of providing a solution for #776 and #1067.
Discussion of progress is in #1036.