-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
dmz: don't use runc-dmz in complicated capability setups #4137
Conversation
Due to the fact that runc-dmz is an intermediate binary without any special set-capability file attributes, using runc-dmz for containers with a non-root user can result in different capability sets being applied after the second execve(). Linux capabilities are quite complicated, and there are loads of different interactions between file and process capability sets, so we should just go with the most conservative rule to determine if we can't use runc-dmz -- if the inheritable, permitted, and bounding sets are not equal to the ambient set then we don't use runc-dmz. Fixes: dac4171 ("runc-dmz: reduce memfd binary cloning cost with small C binary") Signed-off-by: Aleksa Sarai <[email protected]>
Does memfd still support?How to open it if has merged this pr. |
@113xiaoji |
return true | ||
} | ||
|
||
func shouldUseDmzBinary(p *Process, c *configs.Config) bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we just make dmz opt-in via an env or a CLI flag?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would be the simplest solution, but it seems like a bit of a shame to have this code and not use it... Should we remove the SELinux logic too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably we should define a ternary env var like RUNC_USE_DMZ=(1|0|auto)
.
The default value should be auto
, however, for runc v1.2, I'd suggest to just treat this as an alias for 0
(false) to minimize the incompatibility.
In a future version of runc, we may implement more clever logic for auto
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @lifubang WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
RUNC_DMZ=legacy
can disable the dmz feature now. You mean you worry about there will be more imcompatible reasons not included in #4158 ? But we should know that if we set the default value to legacy
, the k8s e2e test case about this area will fail? How to improve this test case in k8s?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we might masquerade this in k8s if we disable runc-dmz if the container is not running as root. I think if it runs as root we don't need to change the capabilities.
I'm not sure if the root detection is hard or not safe and that is why it wasn't done here. I haven't looked into it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Someone mentioned that checking if we are root would be sufficient. To be honest, I struggle to understand all of the interactions of capabilities with everything else in the kernel (some of the functions in commoncap are actual line noise to my eyes).
The issue is that runc binary overwrites are only relevant for uid 0 in most cases. However, if runc-dmz
is only used for unprivileged container users maybe that'd be okay for now (not uid 0 and no caps).
Please ping when this is ready for review now :-D |
From the PR description:
btw, it seems Kubernetes sets only bounding, effective and permitted (therefore, those will be different to ambient, that isn't set). I think this will open the "Kubernetes e2e tests fail with new runc" can of worms. That seems better than regressing on this, but saying it so we are warned :-) |
@cyphar friendly ping? |
@cyphar everything fine? Was the repo delete a mistake? |
I will recreate PRs in a bit, yeah it was a mistake 😅. I still have everything locally. |
@cyphar perhaps we'd better make runc-dmz non-default. |
Yeah, I agree. |
I had another research, I found that all things came from
If this is workable, I think maybe we can using another way to defeat it. For example:
If it is worth to research, I will continue to do it. Looking forward to your feedback. @cyphar @kolyshkin |
@lifubang There is a TOCTOU race if you read the file and then execute it afterwards. The file can change underneath you and the only way to exec the file is to actually call Myself and several other maintainers spent several months dealing with this issue, if you think you've found a simple workaround it probably means you haven't considered all aspects of the attack. It's possible that we missed something, but I doubt we missed something as simple as you describe. |
Though we will make runc-dmz non-default, we still need to recreate this PR to fix this problem, and pephaps we can check it more strickly to disable runc-dmz. |
If we make it non-default we only need to handle the security-related cases that affect |
Thanks, I know it's very diffcult and complex. I want to do a last discussion, if it's not valuable, I will give up.
|
@lifubang For the record it would've been helpful if you mentioned My honest opinion is that I don't think For the record, the race condition was that after step 3 the file contents can be changed, but since The long-term solution to removing the CVE-2019-5736 code is to restrict re-opening through magic-links in the kernel. Here is a talk I gave on this topic in May of last year. I have some prototypes and will work on them when I get back from vacation at the beginning of March. |
@lifubang very nice research, but I think your approach doesn't work with the exploit we created at Kinvolk in 2019 for that CVE: https://kinvolk.io/blog/2019/02/runc-breakout-vulnerability-mitigated-on-flatcar-linux/. It uses LD_PRELOAD and I agree with @cyphar that |
Thanks, I tested it, and found that |
I don't know how to test this behavior. |
I don't follow, the example I linked to doesn't use /proc/self/exe in the entrypoint, it uses /usr/bin/sh. The self-exe thingy is used in a library that is LD_PRELOAD'ed, and that is it. What am I missing? |
|
@lifubang ohh, sorry, I didn't understand what that meant. Thanks! |
Due to the fact that runc-dmz is an intermediate binary without any special set-capability file attributes, using runc-dmz for containers with a non-root user can result in different capability sets being applied after the second execve().
Linux capabilities are quite complicated, and there are loads of different interactions between file and process capability sets, so we should just go with the most conservative rule to determine if we can't use runc-dmz -- if the inheritable, permitted, and bounding sets are not equal to the ambient set then we don't use runc-dmz.
Fixes: dac4171 ("runc-dmz: reduce memfd binary cloning cost with small C binary")
Fixes #4125 and is a safe alternative to #4129.
Signed-off-by: Aleksa Sarai [email protected]