dmz: don't use runc-dmz in complicated capability setups #4137

Closed

cyphar
Member

@cyphar cyphar commented Dec 8, 2023

Due to the fact that runc-dmz is an intermediate binary without any special set-capability file attributes, using runc-dmz for containers with a non-root user can result in different capability sets being applied after the second execve().

Linux capabilities are quite complicated, and there are loads of different interactions between file and process capability sets, so we should just go with the most conservative rule to determine if we can't use runc-dmz -- if the inheritable, permitted, and bounding sets are not equal to the ambient set then we don't use runc-dmz.

Fixes: dac4171 ("runc-dmz: reduce memfd binary cloning cost with small C binary")
Fixes #4125 and is a safe alternative to #4129.
Signed-off-by: Aleksa Sarai [email protected]
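
The conservative rule in the description can be sketched as a plain set comparison. This is only an illustration of the stated rule, not the code in this PR: the `capSets` type, its field names, and the `capsAllowDmz` helper are hypothetical, and capabilities are modelled as simple name strings.

```go
package main

import (
	"fmt"
	"sort"
)

// capSets is a hypothetical stand-in for the capability lists of a
// container process; the type and field names are illustrative only.
type capSets struct {
	Bounding, Inheritable, Permitted, Ambient []string
}

// sameSet reports whether a and b hold the same capability names,
// ignoring order and duplicates.
func sameSet(a, b []string) bool {
	norm := func(in []string) []string {
		seen := make(map[string]struct{}, len(in))
		out := make([]string, 0, len(in))
		for _, c := range in {
			if _, ok := seen[c]; !ok {
				seen[c] = struct{}{}
				out = append(out, c)
			}
		}
		sort.Strings(out)
		return out
	}
	x, y := norm(a), norm(b)
	if len(x) != len(y) {
		return false
	}
	for i := range x {
		if x[i] != y[i] {
			return false
		}
	}
	return true
}

// capsAllowDmz applies the rule from the description: the intermediate
// binary is considered only when the inheritable, permitted, and bounding
// sets are all equal to the ambient set.
func capsAllowDmz(c capSets) bool {
	return sameSet(c.Inheritable, c.Ambient) &&
		sameSet(c.Permitted, c.Ambient) &&
		sameSet(c.Bounding, c.Ambient)
}

func main() {
	// Bounding and permitted are set but ambient is left empty,
	// so the sets differ and the rule rejects runc-dmz.
	c := capSets{
		Bounding:  []string{"CAP_NET_BIND_SERVICE"},
		Permitted: []string{"CAP_NET_BIND_SERVICE"},
	}
	fmt.Println(capsAllowDmz(c))
}
```

Under this rule, a configuration that fills in bounding and permitted but leaves ambient empty falls back to the non-dmz path.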

@113xiaoji

Is memfd still supported? How can it be enabled once this PR has been merged?

@cyphar
Member Author

cyphar commented Dec 10, 2023

@113xiaoji RUNC_DMZ=legacy still works to disable runc-dmz, and this adds one more case where we use a /proc/self/exe memfd.

return true
}

func shouldUseDmzBinary(p *Process, c *configs.Config) bool {
Member

Member Author

That would be the simplest solution, but it seems like a bit of a shame to have this code and not use it... Should we remove the SELinux logic too?

Member

Probably we should define a ternary env var like RUNC_USE_DMZ=(1|0|auto).

The default value should be auto, however, for runc v1.2, I'd suggest to just treat this as an alias for 0 (false) to minimize the incompatibility.

In a future version of runc, we may implement more clever logic for auto.
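
A minimal sketch of how such a ternary variable could be parsed, assuming the `RUNC_USE_DMZ=(1|0|auto)` name and values proposed above. The `dmzMode` type and both helpers are hypothetical, and treating an unset variable as `auto` is an assumption of this sketch.

```go
package main

import (
	"fmt"
	"os"
)

// dmzMode is a hypothetical tri-state for the proposed RUNC_USE_DMZ variable.
type dmzMode int

const (
	dmzAuto dmzMode = iota // decide automatically (the proposed default)
	dmzOff                 // never use runc-dmz
	dmzOn                  // always use runc-dmz
)

// parseDmzMode maps the proposed values "1", "0", and "auto" to a mode.
// An empty value is treated as "auto" (an assumption of this sketch).
func parseDmzMode(v string) (dmzMode, error) {
	switch v {
	case "", "auto":
		return dmzAuto, nil
	case "0":
		return dmzOff, nil
	case "1":
		return dmzOn, nil
	default:
		return dmzAuto, fmt.Errorf("invalid RUNC_USE_DMZ value %q", v)
	}
}

// useDmz turns a mode into a decision. autoDecision is whatever "auto"
// resolves to; the comment above suggests false for runc v1.2.
func useDmz(m dmzMode, autoDecision bool) bool {
	switch m {
	case dmzOn:
		return true
	case dmzOff:
		return false
	default:
		return autoDecision
	}
}

func main() {
	mode, err := parseDmzMode(os.Getenv("RUNC_USE_DMZ"))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(useDmz(mode, false))
}
```

Keeping the `auto` decision as a separate parameter means a later release could change what `auto` does without touching the parsing.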

Member

cc @lifubang WDYT?

Member

RUNC_DMZ=legacy can already disable the dmz feature. Do you mean you are worried that there will be more incompatibilities that are not covered by #4158? But we should keep in mind that if we set the default value to legacy, the k8s e2e test case in this area will fail? How should we improve this test case in k8s?

Member

I think we might mask this in k8s if we disable runc-dmz when the container is not running as root. I think if it runs as root we don't need to change the capabilities.

I'm not sure if the root detection is hard or not safe and that is why it wasn't done here. I haven't looked into it.

Member Author

Someone mentioned that checking if we are root would be sufficient. To be honest, I struggle to understand all of the interactions of capabilities with everything else in the kernel (some of the functions in commoncap are actual line noise to my eyes).

The issue is that runc binary overwrites are only relevant for uid 0 in most cases. However, if runc-dmz is only used for unprivileged container users maybe that'd be okay for now (not uid 0 and no caps).

@rata
Member

rata commented Dec 18, 2023

Please ping when this is ready for review now :-D

@rata
Member

rata commented Dec 18, 2023

From the PR description:

if the inheritable, permitted, and bounding sets are not equal to the ambient set then we don't use runc-dmz.

btw, it seems Kubernetes sets only bounding, effective and permitted (therefore, those will differ from ambient, which isn't set). I think this will open the "Kubernetes e2e tests fail with new runc" can of worms. That seems better than regressing on this, but I'm mentioning it so we are warned :-)

@rata
Member

rata commented Jan 8, 2024

@cyphar friendly ping?

@cyphar cyphar closed this by deleting the head repository Jan 11, 2024
@rata
Member

rata commented Jan 11, 2024

@cyphar everything fine? Was the repo delete a mistake?

@cyphar
Member Author

cyphar commented Jan 11, 2024

I will recreate PRs in a bit, yeah it was a mistake 😅. I still have everything locally.

@kolyshkin
Contributor

@cyphar perhaps we'd better make runc-dmz non-default.

@cyphar
Member Author

cyphar commented Jan 17, 2024

Yeah, I agree.

@lifubang
Member

lifubang commented Jan 20, 2024

I did some more research and found that all of this comes from CVE-2019-5736: when we were dealing with it, we didn't want to deal with #!script name args.... Looking at the execve code, I found that Linux supports only two kinds of executable formats: binfmt (flat/elf/misc/elf_fdpic) and the script format. Before the kernel starts the program in execve, it performs roughly two steps:

  1. Read 256 bytes from the program file into a buffer; (https://github.com/torvalds/linux/blob/9d64bf433c53cab2f48a3fff7a1f2a696bc5229a/fs/exec.c#L1662-L1668)
  2. Check whether the first two bytes are #!; if so, the file is in script format and the real entrypoint can be read from it. (https://github.com/torvalds/linux/blob/master/fs/binfmt_script.c#L34-L42)

If this is workable, I think we could use another way to defeat the attack. For example:

  1. Declare in runtime-spec that the container's entrypoint must be a file in the container or mounted into the container, or else we MUST return an error;
  2. Read 256 bytes from the program file into a buffer;
  3. Check whether the first two bytes are #!; if so, the file is in script format and we can read the real entrypoint.
  4. Check whether the root of the entrypoint and the root of the container are the same; if not, return an error. (It's easy, I have written it in a private fork.)
  5. Remove all the code related to CVE-2019-5736.

If this is worth researching, I will continue working on it. Looking forward to your feedback. @cyphar @kolyshkin
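
The header check described in the steps above can be sketched in userspace as follows. This only illustrates the "read 256 bytes, look for #!" idea and is not runc code: `readHeader` and `parseShebang` are hypothetical helpers, the 256-byte size is taken from the comment, and splitting on any whitespace is a simplification of what the kernel does.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"os"
)

// headerSize is the number of bytes inspected, as described in the comment.
const headerSize = 256

// readHeader returns up to headerSize bytes from the start of the file.
func readHeader(path string) ([]byte, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	buf := make([]byte, headerSize)
	n, err := io.ReadFull(f, buf)
	// Files shorter than headerSize are fine; only real I/O errors matter.
	if err != nil && err != io.EOF && err != io.ErrUnexpectedEOF {
		return nil, err
	}
	return buf[:n], nil
}

// parseShebang reports whether hdr starts with "#!" and, if so, returns
// the interpreter path named on the first line.
func parseShebang(hdr []byte) (string, bool) {
	if !bytes.HasPrefix(hdr, []byte("#!")) {
		return "", false
	}
	line := hdr[2:]
	if i := bytes.IndexByte(line, '\n'); i >= 0 {
		line = line[:i]
	}
	fields := bytes.Fields(line) // simplification: split on any whitespace
	if len(fields) == 0 {
		return "", false
	}
	return string(fields[0]), true
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: shebang <file>")
		return
	}
	hdr, err := readHeader(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	interp, isScript := parseShebang(hdr)
	fmt.Println(isScript, interp)
}
```

Note that a check like this inspects the file at one point in time; nothing in the sketch prevents the file from changing before a later execve.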

@cyphar
Member Author

cyphar commented Jan 20, 2024

@lifubang There is a TOCTOU race if you read the file and then execute it afterwards. The file can change underneath you and the only way to exec the file is to actually call execve. It's not that we "didn't want to deal with #!" -- there is no other solution for CVE-2019-5736 unless we add support for restricting execve to the kernel. The entrypoint problem is recursive, and the core issue is we shouldn't be parsing executable files in userspace because the only way to have the correct behaviour for execve is to actually do the execve (which is unsafe).

Myself and several other maintainers spent several months dealing with this issue, if you think you've found a simple workaround it probably means you haven't considered all aspects of the attack. It's possible that we missed something, but I doubt we missed something as simple as you describe.

@lifubang
Member

@cyphar perhaps we'd better make runc-dmz non-default.

Even if we make runc-dmz non-default, we still need to recreate this PR to fix this problem, and perhaps we can make the check stricter when deciding whether to disable runc-dmz.

@cyphar
Member Author

cyphar commented Jan 22, 2024

If we make it non-default we only need to handle the security-related cases that affect runc-dmz. If a user enables runc-dmz and something breaks, they've found out they'll need to disable it.

@lifubang
Member

@lifubang There is a TOCTOU race if you read the file and then execute it afterwards. The file can change underneath you and the only way to exec the file is to actually call execve. It's not that we "didn't want to deal with #!" -- there is no other solution for CVE-2019-5736 unless we add support for restricting execve to the kernel. The entrypoint problem is recursive, and the core issue is we shouldn't be parsing executable files in userspace because the only way to have the correct behaviour for execve is to actually do the execve (which is unsafe).

Myself and several other maintainers spent several months dealing with this issue, if you think you've found a simple workaround it probably means you haven't considered all aspects of the attack. It's possible that we missed something, but I doubt we missed something as simple as you describe.

Thanks, I know it's very difficult and complex. I'd like to have one last discussion; if it's not valuable, I will give up.
I looked into the execveat syscall: if we open the container's entrypoint program file as an fd with the O_CLOEXEC flag, execveat returns ENOENT when the fd points to an interpreted program (such as a script starting with "#!"). So maybe we can use this behaviour to protect against CVE-2019-5736?

  1. Declare in runtime-spec that the container's entrypoint must be a file in the container or mounted into the container, or else we MUST return an error;
  2. Read the first 2 bytes from the program file into a buffer;
  3. Check whether the first two bytes are #!.
    If yes, it is a shebang-format (script) file; we can follow the link to get the final real entrypoint file name, then open it as an fd with the O_CLOEXEC flag;
    If no, it is a binfmt file; we can open it as an fd with the O_CLOEXEC flag;
  4. Check whether the root of the fd and the root of the container are the same; if not, return an error.
  5. Use the fd from step 3 to call execveat to start or exec into the container.
  6. I think the malicious process can only modify the content of the file referred to by the fd; if it becomes a script-format file, execveat will block it because of O_CLOEXEC.
    Could you point out where the race condition is in this flow?

@cyphar
Member Author

cyphar commented Jan 23, 2024

@lifubang For the record it would've been helpful if you mentioned BINPRM_FLAGS_PATH_INACCESSIBLE, which is what makes your O_CLOEXEC approach block #! scripts. I'd completely forgotten that this behaviour exists. 😅

My honest opinion is that I don't think BINPRM_FLAGS_PATH_INACCESSIBLE is a strong enough security boundary. I suspect it'd be possible to use the ELF loader to get /proc/self/exe loaded as a library, which would also give the container access to the file. It also just feels messy and I don't feel comfortable removing code we know protects against this issue (as well as DirtyCOW-style issues).

For the record, the race condition was that after step 3 the file contents can be changed, but since BINPRM_FLAGS_PATH_INACCESSIBLE blocks #! execution the race condition doesn't exist. Also, the interpreter of a #! script can also be a #! script AFAIK, so we would need to resolve things recursively, but that's a minor issue.

The long-term solution to removing the CVE-2019-5736 code is to restrict re-opening through magic-links in the kernel. Here is a talk I gave on this topic in May of last year. I have some prototypes and will work on them when I get back from vacation at the beginning of March.
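
The recursion point above (an interpreter that is itself a #! script) can be sketched as a bounded loop. This is purely illustrative: `resolveInterpreter`, the `interpreterOf` callback, and the `maxDepth` bound are hypothetical names and values chosen for this sketch, and any userspace resolution like this is still subject to the file changing before the real execve.

```go
package main

import (
	"errors"
	"fmt"
)

// maxDepth is a hypothetical bound chosen for this sketch.
const maxDepth = 5

var errTooDeep = errors.New("too many levels of #! interpreters")

// resolveInterpreter follows path -> interpreter -> interpreter's
// interpreter until it reaches something that is not a script.
// interpreterOf returns (interpreter, true) when the given path is a
// #! script and ("", false) otherwise.
func resolveInterpreter(path string, interpreterOf func(string) (string, bool)) (string, error) {
	for depth := 0; depth <= maxDepth; depth++ {
		next, isScript := interpreterOf(path)
		if !isScript {
			return path, nil
		}
		path = next
	}
	// The chain is longer than maxDepth or contains a cycle.
	return "", errTooDeep
}

func main() {
	// An in-memory stand-in for the filesystem: script -> interpreter.
	chain := map[string]string{
		"/entry.sh": "/wrap",
		"/wrap":     "/bin/sh",
	}
	lookup := func(p string) (string, bool) {
		next, ok := chain[p]
		return next, ok
	}
	final, err := resolveInterpreter("/entry.sh", lookup)
	fmt.Println(final, err)
}
```

The depth bound matters because a script may name itself, directly or indirectly, as its own interpreter.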

@rata
Member

rata commented Jan 26, 2024

@lifubang very nice research, but I think your approach doesn't work with the exploit we created at Kinvolk in 2019 for that CVE: https://kinvolk.io/blog/2019/02/runc-breakout-vulnerability-mitigated-on-flatcar-linux/.

It uses LD_PRELOAD and __attribute__((constructor)) to exploit it, it is not using an interpreter (it executes a binary).

I agree with @cyphar that BINPRM_FLAGS_PATH_INACCESSIBLE, sadly, doesn't seem like a strong-enough barrier. But very nice research, though!

@lifubang
Member

lifubang commented Feb 1, 2024

It uses LD_PRELOAD and __attribute__((constructor)) to exploit it, it is not using an interpreter (it executes a binary).

Thanks, I tested it and found that the LD_PRELOAD and __attribute__((constructor)) exploit still needs to use /proc/self/exe, but in my approach I changed runc's behavior: runc will reject any entrypoint that does not belong to the container's file system jail. For example, if you use /proc/self/exe as the value of .process.args, runc will be able to detect it and refuse to execve it.

@lifubang
Member

lifubang commented Feb 1, 2024

I suspect it'd be possible to use the ELF loader to get /proc/self/exe loaded as a library,

I don't know how to test this behavior.
I tested the LD_PRELOAD and __attribute__((constructor)) approach mentioned by @rata and found that, when the preload program runs, /proc/self/exe has already become the entrypoint of the container, not the runc init binary. So maybe /proc/self/exe will also be erased and given the correct value by the time we start loading the ELF.

@rata
Member

rata commented Feb 1, 2024

@lifubang

Thanks, I tested it and found that the LD_PRELOAD and __attribute__((constructor)) exploit still needs to use /proc/self/exe, but in my approach I changed runc's behavior: runc will reject any entrypoint that does not belong to the container's file system jail. For example, if you use /proc/self/exe as the value of .process.args, runc will be able to detect it and refuse to execve it.

I don't follow, the example I linked to doesn't use /proc/self/exe in the entrypoint, it uses /usr/bin/sh. The self-exe thingy is used in a library that is LD_PRELOAD'ed, and that is it. What am I missing?

@lifubang
Member

lifubang commented Feb 1, 2024

What am I missing?

/exe -> /proc/self/exe
image

@lifubang
Member

lifubang commented Feb 1, 2024

I found that, when the preload program in foo.so runs, /proc/self/exe has become the entrypoint of the container, not the runc init binary, if we don't use /proc/self/exe as the entrypoint of the container.

@rata

@rata
Member

rata commented Feb 1, 2024

@lifubang ohh, sorry, I didn't understand what that meant. Thanks!

Successfully merging this pull request may close these issues.

runc-dmz: Inheritable capabilities are dropped when they previously weren't
6 participants