src: add internal erofs writer code

This introduces experimental code for writing erofs images ourselves,
instead of using the external mkcomposefs CLI.

It's currently disabled by default. You can test it by setting the
`COMPOSEFS_FORMAT=new` environment variable.

This currently produces a different output than the output of
mkcomposefs, which is why it's gated behind an environment variable.
The plan is to add a compatibility mode to our internal writer code so
that it produces as similar of an output as possible and then switch
over to using it once we are convinced that it's equivalent. Then the
`COMPOSEFS_FORMAT=` variable will disable this compatibility mode.

There's also a `COMPOSEFS_DUMP_EROFS=1` environment variable (which
works with both `mkcomposefs` and our internal code) which will dump the
erofs layout for diffing. There's also a standalone `erofs-debug` binary
that will do the same.

Additionally, this introduces two new files in docs:

 - a detailed description of the parts of erofs that we use
 - a document which attempts to describe the decisions made in creating
   an erofs composefs image (in terms of which order the files are in,
   etc).

The main idea here is to start a serious effort towards standardizing
the composefs label we want to start adding to container images: it
should be possible to define what will be in that label by way of
documentation instead of saying "run this software and use the output".
These two new documents, taken together with the existing "oci.md" form
a rough (and still incomplete) outline for that.

Many thanks to Gao Xiang <xiang@kernel.org> for helping clarify many
points about the erofs file format for the documentation.

Closes containers#56

Signed-off-by: Allison Karlitskaya <allison.karlitskaya@redhat.com>
allisonkarlitskaya · Dec 16, 2024 · 60eeba8
1 parent 55ae2e9
Showing 14 changed files with 2,614 additions and 31 deletions.
diff --git a/Cargo.toml b/Cargo.toml
@@ -16,8 +16,10 @@ anyhow = { version = "1.0.89", default-features = false }
 async-compression = { version = "0.4.17", default-features = false, features = ["tokio", "gzip"] }
 clap = { version = "4.5.19", default-features = false, features = ["std", "help", "usage", "derive"] }
 containers-image-proxy = "0.7.0"
+env_logger = "0.11.5"
 hex = "0.4.3"
 indicatif = { version = "0.17.8", features = ["tokio"] }
+log = "0.4.22"
 oci-spec = "0.7.0"
 regex-automata = { version = "0.4.8", default-features = false }
 rustix = { version = "0.38.37", features = ["fs", "mount", "process"] }
@@ -26,7 +28,8 @@ tar = { version = "0.4.42", default-features = false }
 tempfile = "3.13.0"
 thiserror = "2.0.4"
 tokio = "1.41.0"
-zerocopy = "0.8.13"
+xxhash-rust = { version = "0.8.12", features = ["xxh32"] }
+zerocopy = { version = "0.8.13", features = ["derive"] }
 zstd = "0.13.2"
 
 [dev-dependencies]

diff --git a/doc/erofs.md b/doc/erofs.md
diff --git a/doc/image-format.md b/doc/image-format.md
@@ -0,0 +1,276 @@
+# Canonical composefs file format
+
+## Prelude
+
+We expect the process of creating an erofs from a filesystem image to be
+deterministic.  `erofs` is very free-form and there are many ways things could
+be organized.
+
+Here's where we try to document some of the decisions we make.  This documents
+the erofs images produced by the `composefs` rust crate, which are currently
+different from the official `composefs` repository (ie: `libcomposefs`, in C).
+It would be very desirable to try to make this implementation exactly match the
+`libcomposefs` implementation so that we could check them against each other to
+ensure that they produce bitwise identical output.  On the other hand, we've
+been discussing creating a "version 1.1" format, and this might be a good
+jumping-off spot for that.
+
+The goal of this document is to completely and unambiguously document every
+decision we made in such a way that you could use this document as a guide to
+produce a new composefs erofs writer implementation, from scratch, which
+produces exactly the same output.  However, this document is probably currently
+very incomplete, and maybe even incorrect.  We should strive to cover every
+possible detail here, but it's hard.  Hopefully things will improve with time,
+but until then, you might need to check the implementation.
+
+In cases of ambiguity or incorrectness, issues and patches are extremely
+welcome.
+
+## Overall layout concept
+
+The composefs header and superblock are the only things that need to be at
+fixed offsets.  How do we organize everything else?
+
+Generally speaking, we perform these steps:
+*    collect the filesystem into a flat list of inodes
+*    collect and "share" xattrs, as appropriate
+*    write the composefs header and the superblock
+*    write the inodes directly following the superblock
+*    write the shared xattrs directly following the inodes
+*    then the blocks (only for directories)
+
+## Collecting inodes
+
+We collect the inodes into a flat list according to the following algorithm:
+*   our goal is to visit each inode, collecting it into the inode list as we
+    visit it, in the order that we visited it
+*   start at the root directory
+*   for each directory that we visit:
+    -   the directory is stored first, then the children
+    -   we visit the children in asciibetical order, regardless of file type
+        (ie: we interleave directories and regular files)
+    -   when visiting a child directory, we store all content of the child
+        directory before returning to the parent directory (ie: depth first)
+*   in the case of hardlinks, the inode gets added to the list at the spot that
+    the first link was encountered
+
+Consider a filesystem tree
+
+```
+ /
+   bin/
+     cfsctl
+   usr/
+     lib/
+       libcomposefs.so
+       libglib-2.0.so
+     libexec/
+       cfsctl
+```
+
+where `/bin/cfsctl` and `/usr/libexec/cfsctl` are hardlinks.
+
+In that case, we'd collect the inodes in this order:
+1.  `/`
+1.  `/bin/`
+1.  `/bin/cfsctl` (aka `/usr/libexec/cfsctl`)
+1.  `/usr/`
+1.  `/usr/lib/`
+1.  `/usr/lib/libcomposefs.so`
+1.  `/usr/lib/libglib-2.0.so`
+1.  `/usr/libexec/`
+
+(skipping `/usr/libexec/cfsctl` because we already had it by the time we encountered it).
+
+So that's 8 inodes, in that order.
+
+## Special handling for overlayfs
+
+Ultimately, the erofs image that we produce needs to be used as a layer in an
+overlayfs stack.  There are a lot of cases where the thing that we write out
+only makes sense to overlayfs.  There are other cases where we need to avoid
+writing out things that overlayfs would treat as "special".
+
+`libcomposefs` writes 256 files named from `00` to `ff` into the root directory
+as character devices with major/minor of (0, 0).  Those are overlayfs whiteouts
+and they are needed for older versions of overlayfs which don't support "data
+only" layers.  We don't target these versions, so *we don't add these files*.
+We also don't mark the root directory as opaque or do anything else special
+with it.
+
+Conversely, if we encounter a character device with major/minor (0, 0) then we
+need to escape it to make sure that it appears as such in the final composed
+image (and does not get handled by overlayfs as a whiteout).  We do that by:
+TODO (not implemented yet).
+
+We also need to make sure that the only `trusted.overlay.*` attributes which we
+write are ones that came from us.  If we encounter any `trusted.overlay.*`
+attributes in the source, we escape them to `trusted.overlay.overlay.`, causing
+them to lose their special meaning.
+
+## Extended attribute handling
+
+For each inode, we collect and write the extended attributes in asciibetical
+order, by full name.  Note: this is different than the shared xattr table which
+has a more complicated sorting, but maybe we want to unify the two.
+
+We use the hardcoded prefix indexes (which is actually mandatory).
+
+We don't use "long prefixes", but we might start doing that at some point,
+because it would sure be nice to not have to write `"overlay.redirect"`,
+`"overlay.metacopy"` and `"selinux"` over and over again. The feature seems
+complicated, though...
+
+## Collecting shared xattrs
+
+`erofs` has a facility for sharing xattrs where the name and the value are
+identical, and we use it.  After we've collected all of our inodes, we iterate
+the list and take note of all (name, value) pairs.  If any (name, value) pair
+appears more than once, we share it.
+
+The process of "sharing" involves modifying the original inode.  We iterate the
+present xattrs, and for each attribute that we share, we remove it from the
+"inline" list and add it to the "shared" list, in the same order as it appeared
+in the inline list.
+
+NB: this operation is performed on the flattened inode list, not the directory
+tree.  That means that if a particular (name, value) pair appears uniquely on
+an inode with multiple hardlinks, we'll count that as a single occurrence and
+it won't be shared.
+
+Note also: the attributes that we add ourselves are considered candidates for
+sharing.  That means that if we had two external files which were not hardlinks
+but nevertheless contained the same data, we'd end up sharing their
+`trusted.overlay.` attributes.
+
+## The composefs header
+
+`erofs` leaves the first 1024 bytes of the file free to us, and we store a
+32-byte header at offset 0.  The kernel ignores this, and our mount code
+doesn't actually do anything with it at the moment, either.  We try to fill it
+out in the same way as `libcomposefs`:
+
+*   `magic` (`u32`): `0xd078629a`
+*   `version` (`u32`): I think this is something like the overall file format
+    version.  If this changes, then things are possibly incompatible, and maybe
+    this isn't even an `erofs` anymore.  Currently `1`.
+*   `flags`: `0`
+*   `composefs_version`: I think this is something like a statement about the
+    current strategy for layout decisions.  If this changes, the algorithm for
+    building the file has probably decided to put things in different places
+    (and the checksum of the file will have changed), but the result is still
+    understandable as an `erofs`.  Currently `1`.
+
+## The superblock
+
+*   `checksum`: we don't fill that out
+*   `feature_compat`: we set `MTIME` and `XATTR_FILTER`
+*   `blkszbits`: we use 12, for a block size of 4096
+*   `root_nid`: that's going to end up being 36, which follows from the fact
+    that we put the root inode directly following the superblock, at offset
+    `1024 + 128` = `1152`.  `1152 / 32` = `36`.
+*   `inos`: we currently set that to the number of inodes in the filesystem.
+    `libcomposefs` adds some extra file content (the `00`..`ff` whiteouts) so
+    it gets a larger number than we do.
+*   `blocks`: the total filesize, divided by 4096.
+*   `build_time`, `build_time_nsec`: since we only use extended format inodes,
+    these fields are meaningless and we currently set them to 0 (which is
+    different from `libcomposefs`).
+*   `meta_blkaddr`, `xattr_blkaddr`.  We currently set both of these to 0 to
+    keep things simple. `libcomposefs` performs a complicated calculation to
+    set `meta_blkaddr` to zero as well (since the first inode directly follows
+    the superblock, it will always be within the first 4096 byte filesystem
+    block), but its complicated calculation for `xattr_blkaddr` might well land
+    on a non-zero value, so that's different from us.
+
+## The inodes
+
+After the superblock, we write the inodes.  Some notes:
+
+*   we only use extended inodes, because mtime is important to us and we
+    generally expect every file to have a unique mtime.  This is a difference
+    from `libcomposefs`.
+
+*   we use a "chunk based" data layout for non-inline regular files:
+
+    -   the way this works in overlayfs, we want to store a correctly-sized
+        sparse file in the upper layer.  This lets us have the correct `size`
+        field on the inode, so we don't need to interact with the data layer in
+        order to do `stat()`.
+
+    -   we set the chunk format (ie: the `i_u` field) to 31, the maximum
+
+    -   we store a single "null" chunk pointer
+
+    -   this corresponds to a chunk size of 8TB, which is then the upper limit
+        of files we can store
+
+    -   `libcomposefs` tries to take the smallest chunk format value which will
+        get the job done with a single chunk pointer, and will write multiple
+        chunk pointers if necessary (for extremely large files). Maybe we
+        should do that too.
+
+    -   in this case we set the `trusted.overlay.metacopy` and
+        `trusted.overlay.redirect` attributes (in that order) on the file.
+        These attributes are written first, before the other attributes that
+        would be present on the same file (which are otherwise in sorted
+        order).
+
+    -   the `trusted.overlay.metacopy` attribute is 36 bytes long, and is set to:
+        +   the 4-byte header: [0, 36, 0, 1]
+        +   the 32-byte SHA256 fs-verity digest
+
+    -   the `trusted.overlay.redirect` attribute is set to the string
+        `"/xx/yyyy..."` where `xx` is the first two lowercase hexadecimal digits
+        of the fs-verity digest and the `yyyy...` is the rest.  That's just a
+        reference into the `objects/` subdirectory of the repository (which is
+        mounted in the overlayfs stack as the data layer).
+
+*   we use a "flat inline" data layout for all other inodes:
+
+    -   for character and block devices, as well as fifos and sockets this is
+        meaningless, but we need to set something
+
+    -   for inline regular files we store the content inline.  This will break
+        if we try to inline a file larger than 4095 characters, but our current
+        cut-off is 64.
+
+    -   for symlinks this means that the link target gets stored inline.
+        Hopefully we don't have symlinks with targets longer than 4095
+        characters, or we're gonna get in trouble.
+
+    -   directories may well be larger than 4096 bytes, so we might end up
+        needing to store blocks for those.  These follow the "shared xattrs"
+        area.  We could probably set "flat plain" for directories that are an
+        exact multiple of 4096 bytes in size, and `libcomposefs` does that, but
+        we don't bother.
+
+We pad the last inode to the required alignment for inodes, even though it is
+generally followed by a shared xattr (which has a less stringent alignment
+requirement).
+
+## The shared xattrs
+
+There's not much left to be said about these.  We currently write them out in
+the order that `collections::BTreeMap` applies to our `struct XAttr`, which I
+think basically ends up sorting them by prefix index, then by suffix, then by
+value.  We might like to firm that up at some point.  This is notably different
+than the sorting applied to the attributes as they appear in the inodes, and we
+also don't give any special treatment to the `trusted.overlay.` attributes that
+we added: they're sorted here in the usual way.
+
+After we do this, and even if there were no shared xattrs, we always pad up to a
+4096 byte boundary, even if there are no data blocks.  That means that the
+filesystem image will always be a multiple of 4096.
+
+## The blocks
+
+Now come the data blocks.  These are written in sequence for each inode,
+according to the sequence of the inode in the inode list.  Due to our use of
+"flat inline" data layout, only full blocks are stored (although they may have
+included inter-block padding in directories), so we keep 4096-byte alignment
+from here on out.
+
+## The end
+
+That's it.  The file is over now.  We'll have ended on a multiple of 4096.
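
The inode ordering described under "Collecting inodes" in `doc/image-format.md` can be sketched roughly as follows. This is not the code from this commit: the `Node` type, the `id` field standing in for inode identity, and the helpers `dir`, `example_tree` and `collect_inodes` are all hypothetical names invented for illustration. The sketch assumes that "asciibetical" order is the byte-wise string order that `BTreeMap<String, _>` iterates in.

```rust
use std::collections::{BTreeMap, HashSet};

/// Hypothetical in-memory tree, for illustration only.
/// Two entries with the same `id` are hardlinks to the same inode.
enum Node {
    Dir { id: u32, children: BTreeMap<String, Node> },
    Leaf { id: u32 },
}

/// Returns one path per inode, in collection order: the parent directory
/// first, then its children in byte-wise name order, depth first.
/// A hardlinked inode is listed once, at the first link encountered.
fn collect_inodes(root: &Node) -> Vec<String> {
    let mut seen = HashSet::new();
    let mut out = Vec::new();
    visit(root, "/", &mut seen, &mut out);
    out
}

fn visit(node: &Node, path: &str, seen: &mut HashSet<u32>, out: &mut Vec<String>) {
    let id = match node {
        Node::Dir { id, .. } | Node::Leaf { id } => *id,
    };
    // Already collected via an earlier hardlink: skip it.
    if !seen.insert(id) {
        return;
    }
    out.push(path.to_string());
    if let Node::Dir { children, .. } = node {
        // BTreeMap iterates keys in sorted order, interleaving file types.
        for (name, child) in children {
            let child_path = match child {
                Node::Dir { .. } => format!("{path}{name}/"),
                Node::Leaf { .. } => format!("{path}{name}"),
            };
            visit(child, &child_path, seen, out);
        }
    }
}

/// Helper to build a directory node from (name, node) pairs.
fn dir(id: u32, entries: Vec<(&str, Node)>) -> Node {
    let children = entries
        .into_iter()
        .map(|(name, node)| (name.to_string(), node))
        .collect();
    Node::Dir { id, children }
}

/// The example tree from the document; both `cfsctl` entries share id 3.
fn example_tree() -> Node {
    dir(1, vec![
        ("bin", dir(2, vec![("cfsctl", Node::Leaf { id: 3 })])),
        ("usr", dir(4, vec![
            ("lib", dir(5, vec![
                ("libcomposefs.so", Node::Leaf { id: 6 }),
                ("libglib-2.0.so", Node::Leaf { id: 7 }),
            ])),
            ("libexec", dir(8, vec![("cfsctl", Node::Leaf { id: 3 })])),
        ])),
    ])
}
```

Calling `collect_inodes(&example_tree())` yields the eight paths in the order listed in the document, with the second `cfsctl` link skipped.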
diff --git a/src/bin/cfsctl.rs b/src/bin/cfsctl.rs
@@ -97,6 +97,8 @@ enum Command {
 }
 
 fn main() -> Result<()> {
+    env_logger::init();
+
     let args = App::parse();
 
     let repo = (if let Some(path) = args.repo {

diff --git a/src/bin/erofs-debug.rs b/src/bin/erofs-debug.rs
@@ -0,0 +1,25 @@
+use std::{fs::File, io::Read, path::PathBuf};
+
+use clap::Parser;
+
+use composefs::erofs::debug::debug_img;
+
+/// Produce a detailed dump of an entire erofs image
+///
+/// The output is in a diff-friendly format, such that every distinct image produces a distinct
+/// output (ie: an injective mapping).  This is useful for determining the exact ways in which two
+/// different images are different.
+#[derive(Parser)]
+struct Args {
+    /// The path to the image file to dump
+    image: PathBuf,
+}
+
+fn main() {
+    let args = Args::parse();
+    let mut image = File::open(args.image).expect("Opening file");
+
+    let mut data = vec![];
+    image.read_to_end(&mut data).expect("read_to_end() failed");
+    debug_img(&data);
+}
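
Before handing a buffer to a dump routine like the one above, it can be useful to check that it plausibly is a composefs erofs image. The following is a minimal sketch of such a check and is not part of this commit: `read_u32_le` and `looks_like_composefs_image` are hypothetical helpers. It uses the composefs magic given in `doc/image-format.md`, the standard erofs superblock magic at offset 1024, and assumes that both fields are stored little-endian.

```rust
/// Magic value of the composefs header, as listed in doc/image-format.md.
const COMPOSEFS_MAGIC: u32 = 0xd078629a;
/// Magic value at the start of the erofs superblock.
const EROFS_MAGIC: u32 = 0xe0f5e1e2;
/// The erofs superblock starts 1024 bytes into the image.
const SUPERBLOCK_OFFSET: usize = 1024;

/// Reads a little-endian u32 at `offset`, or returns None if out of bounds.
/// (Little-endian storage of the composefs header is an assumption here.)
fn read_u32_le(data: &[u8], offset: usize) -> Option<u32> {
    let bytes = data.get(offset..offset + 4)?;
    let mut buf = [0u8; 4];
    buf.copy_from_slice(bytes);
    Some(u32::from_le_bytes(buf))
}

/// Hypothetical sanity check: composefs magic at offset 0 and erofs
/// superblock magic at offset 1024.  This says nothing about whether the
/// rest of the image is valid.
fn looks_like_composefs_image(data: &[u8]) -> bool {
    read_u32_le(data, 0) == Some(COMPOSEFS_MAGIC)
        && read_u32_le(data, SUPERBLOCK_OFFSET) == Some(EROFS_MAGIC)
}
```

A caller could run this check on the bytes read from the image file and report a clear error early, rather than failing partway through a dump.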