-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for creating a composefs from a directory #36
Conversation
6caf05f
to
35b9a57
Compare
We've been doing this incorrectly by storing all non-empty files externally. Add a constant and use it internally. This means that currently-existing splitstreams need to be regenerated: they'll have also stored small files as external references. Add some extra checks at the splitstream-to-image stage that verifies that the splitstream has followed the rules correctly: this will help identify older streams that were built with incorrect rules. This is another "delete your respository and start over" change. We *could* provide bridging code here: in case of too-small external files in the splitstream, we could read them from the repository and convert them to inline, but let's save ourselves the bother. Closes #26 Signed-off-by: Allison Karlitskaya <[email protected]>
The `.insert()` operation for adding directory entries to the in-memory filesystem structure previously only supported adding "leaf" inodes, with the `.mkdir()` operation to be used for directories. Made `.insert()` take any Inode instead of just Leaf. This will be helpful for our incoming filesystem-scanning code. Signed-off-by: Allison Karlitskaya <[email protected]>
What's wrong with Vec<(OsString, Vec<u8>)>? It's hard to do lookups, or additions that replace existing values. What's wrong with HashMap? It's not sorted. We'd like to sort the output when producing dumpfiles. We also move from OsString to Box<OsStr> and from Vec<u8> to Box<[u8]>. These types better represent the immutability of these values once they're in the map. OsString and Vec are both based around being mutable values and even might have extra space allocated. Propagate these changes through the PAX handling in the tar code and move it over to using more reasonable types (with similar rationale). Tweak our SELinux relabel code to work in the presence of a filesystem tree where labels may already be partially present: avoid adding duplicate xattrs and remove labels in case none should be set. Signed-off-by: Allison Karlitskaya <[email protected]>
The / entry doesn't appear in many layer tarballs. Until now, we've arbitrarily created it root:root, 0755, with mtime set to the epoch. Let's start thinking about this a bit more rigorously. Add a doc/oci.md with some of these decisions spelled out more explicitly. The upshot: we now use 0555 instead of 0755 and we set the mtime to the mtime of the newest file in the filesystem (instead of the epoch). Signed-off-by: Allison Karlitskaya <[email protected]>
src/fs.rs contains code for writing the in-memory filesystem tree to a directory on disk, so let's add the other direction: converting an on-disk directory to an in-memory filesystem tree. This will let us scan container images from inside containers. This is necessary because we can't get access to the OCI layer tarballs during a container build (even from a later stage in a multi-stage build) but we can bindmount the root filesystem. See containers/buildah#5837 With our recent changes to how we handle metadata on the root directory we should now be producing the same image on the inside and the outside, which gives us a nice way to produce a UKI with a built-in `composefs=` command-line parameter. Add a new 'unified' example. This does the container build as a single `podman build` command with no special arguments. Closes #34 Signed-off-by: Allison Karlitskaya <[email protected]>
35b9a57
to
b7c5c7a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really nice work overall, just some nits
st_uid: 0, | ||
st_gid: 0, | ||
st_mtim_sec: 0, | ||
st_mode: u32::MAX, // assigned later |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not Option
(here and below)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would require adding Option
on the Stat
structure, which doesn't make sense for any case other than this one. This is ugly, but it works.
@@ -0,0 +1,53 @@ | |||
# How to create a composefs from an OCI image |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a really useful document!
@@ -0,0 +1,6 @@ | |||
# we want to make sure the virtio disk drivers get included | |||
hostonly=no |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This type of stuff is also in the fedora-bootc base image.
@@ -0,0 +1,34 @@ | |||
# Copyright (C) 2013 Colin Walters <[email protected]> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this can be dropped, I am not sure I'd consider it "derived enough" honestly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
...and it's already been copied from the other two copies of it already kicking around in the examples/
directory...
# Copyright (C) 2013 Colin Walters <[email protected]> | ||
# | ||
# This library is free software; you can redistribute it and/or | ||
# modify it under the terms of the GNU Lesser General Public |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Related to above probably for consistency we should use the overall repo license.
But I guess this is all a demo that may end up in a separate repo anyways.
let mut xattrs = BTreeMap::new(); | ||
|
||
let names_size = listxattr(&filename, &mut [])?; | ||
let mut names = vec![0; names_size]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why isn't this vec![0u8; names_size]
to start?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This API is annoying and awful. We don't know the names size in advance, so we have to query it in order to allocate the buffer. Fair enough: you do that by submitting a request with an empty buffer and get the correct size returned. Then you allocate and do it again. There's a race, of course, if the xattrs changed on disk in the meantime.
I was using a static buffer here at first (which I also shared with the static buffer I use for reading the actual values) but that brings us to the next problem: this API is defined in terms of c_char
which is i8
— not u8
. A slice of c_char
might sound like it kinda makes sense when you're dealing with a C API, but when you write it (equivalently) as [i8]
you notice how bizarre it is. I can't really see this as anything other than a rustix bug, to be honest...
"xattrs changed during read" | ||
); | ||
|
||
let names: Vec<u8> = names.into_iter().map(|c| c as u8).collect(); // fml |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then we could drop this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd love to drop it, but your above suggestion, for example, gets me:
mismatched types
--> src/fs.rs:140:54
|
140 | let actual_names_size = listxattr(&filename, &mut names)?;
| --------- ^^^^^^^^^^ expected `&mut [i8]`, found `&mut Vec<u8>`
| |
| arguments to this function are incorrect
|
= note: expected mutable reference `&mut [i8]`
found mutable reference `&mut Vec<u8>`
note: function defined here
--> /var/home/lis/.cargo/registry/src/index.crates.io-6f17d22bba15001f/rustix-0.38.40/src/fs/xattr.rs:140:8
|
140 | pub fn listxattr<P: path::Arg>(path: P, list: &mut [c::c_char]) -> io::Result<usize> {
| ^^^^^^^^^
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could also transmute, but I've been absolutely loathe to write the word unsafe
... there are only a very small number of counterexamples in this entire repository.
|
||
let names: Vec<u8> = names.into_iter().map(|c| c as u8).collect(); // fml | ||
|
||
let mut buffer = [0; 65536]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should also be 0u8
here right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one is inferred to u8, because that API (getxattr
) wants [u8]
, not [i8]
.
let mut reader = FilesystemReader { | ||
repo, | ||
inodes: HashMap::new(), | ||
st_dev: u64::MAX, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto re Option
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could do Option
here indeed, and it would be cleaner. I did it the way that I did it because I wanted to compare directly because the common case is that they will indeed be equal. Excessive micro-optimization, to be sure.
if self.st_dev == u64::MAX { | ||
self.st_dev = buf.st_dev; | ||
} else { | ||
bail!("Attempting to cross devices while importing filesystem"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is really interesting, and I didn't know about it. Thanks.
rustix binds that via https://doc.servo.org/rustix/backend/fs/types/struct.ResolveFlags.html
The trouble with this is that it would make reading the root directory be different from all the other inodes: entering the bindmount from CWD
would be considered a violation of the NO_XDEV
rule. Still, though: we have a bunch of other ugly code here that's ugly specifically because the root node isn't handled separately, and it ought to be.
I'm going to try to see if I can use this flag and unroll the root directory case a bit to get something a bit nicer all around. Thanks!
I have a bunch of pending changes for this based on the review comments, but they're better dealt with as follow-ups. |
This is basically the default mode of operation for
mkcomposefs
, but we're interested in getting exactly the same image as if we had created it from the layer tarballs.This requires a bit more work, including adding selinux labels and some very careful thinking about the root directory.
This introduces a new
examples/unified
which builds the container in a singlepodman build
. It gets the same result inside and outside (or at least it does for me) and boots up normally.A preparatory commit defines the inlining constant to 64 and uses it, which means that repositories need to be recreated.
Closes #34
Closes #26