Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for creating a composefs from a directory #36

Merged
merged 5 commits into from
Nov 20, 2024

Conversation

allisonkarlitskaya
Copy link
Collaborator

This is basically the default mode of operation for mkcomposefs, but we're interested in getting exactly the same image as if we had created it from the layer tarballs.

This requires a bit more work, including adding selinux labels and some very careful thinking about the root directory.

This introduces a new examples/unified which builds the container in a single podman build. It gets the same result inside and outside (or at least it does for me) and boots up normally.

A preparatory commit defines the inlining constant to 64 and uses it, which means that repositories need to be recreated.

Closes #34
Closes #26

We've been doing this incorrectly by storing all non-empty files
externally.  Add a constant and use it internally.

This means that currently-existing splitstreams need to be regenerated:
they'll have also stored small files as external references.  Add some
extra checks at the splitstream-to-image stage that verifies that the
splitstream has followed the rules correctly: this will help identify
older streams that were built with incorrect rules.  This is another
"delete your respository and start over" change.

We *could* provide bridging code here: in case of too-small external
files in the splitstream, we could read them from the repository and
convert them to inline, but let's save ourselves the bother.

Closes #26

Signed-off-by: Allison Karlitskaya <[email protected]>
The `.insert()` operation for adding directory entries to the in-memory
filesystem structure previously only supported adding "leaf" inodes,
with the `.mkdir()` operation to be used for directories.  Made
`.insert()` take any Inode instead of just Leaf.

This will be helpful for our incoming filesystem-scanning code.

Signed-off-by: Allison Karlitskaya <[email protected]>
What's wrong with Vec<(OsString, Vec<u8>)>?  It's hard to do lookups, or
additions that replace existing values.

What's wrong with HashMap?  It's not sorted.  We'd like to sort the
output when producing dumpfiles.

We also move from OsString to Box<OsStr> and from Vec<u8> to Box<[u8]>.
These types better represent the immutability of these values once
they're in the map.  OsString and Vec are both based around being
mutable values and even might have extra space allocated.

Propagate these changes through the PAX handling in the tar code and
move it over to using more reasonable types (with similar rationale).

Tweak our SELinux relabel code to work in the presence of a filesystem
tree where labels may already be partially present: avoid adding
duplicate xattrs and remove labels in case none should be set.

Signed-off-by: Allison Karlitskaya <[email protected]>
The / entry doesn't appear in many layer tarballs. Until now, we've
arbitrarily created it root:root, 0755, with mtime set to the epoch.
Let's start thinking about this a bit more rigorously.

Add a doc/oci.md with some of these decisions spelled out more
explicitly.

The upshot: we now use 0555 instead of 0755 and we set the mtime to the
mtime of the newest file in the filesystem (instead of the epoch).

Signed-off-by: Allison Karlitskaya <[email protected]>
src/fs.rs contains code for writing the in-memory filesystem tree to a
directory on disk, so let's add the other direction: converting an
on-disk directory to an in-memory filesystem tree.  This will let us
scan container images from inside containers.  This is necessary because
we can't get access to the OCI layer tarballs during a container build
(even from a later stage in a multi-stage build) but we can bindmount
the root filesystem.

See containers/buildah#5837

With our recent changes to how we handle metadata on the root directory
we should now be producing the same image on the inside and the outside,
which gives us a nice way to produce a UKI with a built-in `composefs=`
command-line parameter.

Add a new 'unified' example.  This does the container build as a single
`podman build` command with no special arguments.

Closes #34

Signed-off-by: Allison Karlitskaya <[email protected]>
Copy link
Collaborator

@cgwalters cgwalters left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really nice work overall, just some nits

st_uid: 0,
st_gid: 0,
st_mtim_sec: 0,
st_mode: u32::MAX, // assigned later
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not Option (here and below)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would require adding Option on the Stat structure, which doesn't make sense for any case other than this one. This is ugly, but it works.

@@ -0,0 +1,53 @@
# How to create a composefs from an OCI image
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a really useful document!

@@ -0,0 +1,6 @@
# we want to make sure the virtio disk drivers get included
hostonly=no
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This type of stuff is also in the fedora-bootc base image.

@@ -0,0 +1,34 @@
# Copyright (C) 2013 Colin Walters <[email protected]>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this can be dropped, I am not sure I'd consider it "derived enough" honestly.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

...and it's already been copied from the other two copies of it already kicking around in the examples/ directory...

# Copyright (C) 2013 Colin Walters <[email protected]>
#
# This library is free software; you can redistribute it and/or
# modify it under the terms of the GNU Lesser General Public
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Related to above probably for consistency we should use the overall repo license.

But I guess this is all a demo that may end up in a separate repo anyways.

let mut xattrs = BTreeMap::new();

let names_size = listxattr(&filename, &mut [])?;
let mut names = vec![0; names_size];
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why isn't this vec![0u8; names_size] to start?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This API is annoying and awful. We don't know the names size in advance, so we have to query it in order to allocate the buffer. Fair enough: you do that by submitting a request with an empty buffer and get the correct size returned. Then you allocate and do it again. There's a race, of course, if the xattrs changed on disk in the meantime.

I was using a static buffer here at first (which I also shared with the static buffer I use for reading the actual values) but that brings us to the next problem: this API is defined in terms of c_char which is i8 — not u8. A slice of c_char might sound like it kinda makes sense when you're dealing with a C API, but when you write it (equivalently) as [i8] you notice how bizarre it is. I can't really see this as anything other than a rustix bug, to be honest...

"xattrs changed during read"
);

let names: Vec<u8> = names.into_iter().map(|c| c as u8).collect(); // fml
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then we could drop this

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd love to drop it, but your above suggestion, for example, gets me:

mismatched types
   --> src/fs.rs:140:54
    |
140 |         let actual_names_size = listxattr(&filename, &mut names)?;
    |                                 ---------            ^^^^^^^^^^ expected `&mut [i8]`, found `&mut Vec<u8>`
    |                                 |
    |                                 arguments to this function are incorrect
    |
    = note: expected mutable reference `&mut [i8]`
               found mutable reference `&mut Vec<u8>`
note: function defined here
   --> /var/home/lis/.cargo/registry/src/index.crates.io-6f17d22bba15001f/rustix-0.38.40/src/fs/xattr.rs:140:8
    |
140 | pub fn listxattr<P: path::Arg>(path: P, list: &mut [c::c_char]) -> io::Result<usize> {
    |        ^^^^^^^^^


Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could also transmute, but I've been absolutely loathe to write the word unsafe... there are only a very small number of counterexamples in this entire repository.


let names: Vec<u8> = names.into_iter().map(|c| c as u8).collect(); // fml

let mut buffer = [0; 65536];
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should also be 0u8 here right?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one is inferred to u8, because that API (getxattr) wants [u8], not [i8].

let mut reader = FilesystemReader {
repo,
inodes: HashMap::new(),
st_dev: u64::MAX,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto re Option

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could do Option here indeed, and it would be cleaner. I did it the way that I did it because I wanted to compare directly because the common case is that they will indeed be equal. Excessive micro-optimization, to be sure.

if self.st_dev == u64::MAX {
self.st_dev = buf.st_dev;
} else {
bail!("Attempting to cross devices while importing filesystem");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really interesting, and I didn't know about it. Thanks.

rustix binds that via https://doc.servo.org/rustix/backend/fs/types/struct.ResolveFlags.html

The trouble with this is that it would make reading the root directory be different from all the other inodes: entering the bindmount from CWD would be considered a violation of the NO_XDEV rule. Still, though: we have a bunch of other ugly code here that's ugly specifically because the root node isn't handled separately, and it ought to be.

I'm going to try to see if I can use this flag and unroll the root directory case a bit to get something a bit nicer all around. Thanks!

@allisonkarlitskaya allisonkarlitskaya merged commit ecb1e96 into main Nov 20, 2024
2 checks passed
@allisonkarlitskaya
Copy link
Collaborator Author

I have a bunch of pending changes for this based on the review comments, but they're better dealt with as follow-ups.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Import and compute composefs' fs-verity in a multi-staged build use canonical recommend inlining
2 participants