src: add internal erofs writer code #57
base: main
Conversation
This introduces experimental code for writing erofs images ourselves, instead of using the external mkcomposefs CLI. It's currently disabled by default. You can test it by setting the `COMPOSEFS_FORMAT=new` environment variable.

This currently produces a different output than mkcomposefs, which is why it's gated behind an environment variable. The plan is to add a compatibility mode to our internal writer code so that it produces as similar an output as possible, and then switch over to using it once we are convinced that it's equivalent. Then the `COMPOSEFS_FORMAT=` variable will disable this compatibility mode.

There's also a `COMPOSEFS_DUMP_EROFS=1` environment variable (which works with both `mkcomposefs` and our internal code) which will dump the erofs layout for diffing. There's also a standalone `erofs-debug` binary that will do the same.

Additionally, this introduces two new files in docs:

- a detailed description of the parts of erofs that we use
- a document which attempts to describe the decisions made in creating an erofs composefs image (in terms of which order the files are in, etc.)

The main idea here is to start a serious effort towards standardizing the composefs label we want to start adding to container images: it should be possible to define what will be in that label by way of documentation instead of saying "run this software and use the output". These two new documents, taken together with the existing "oci.md", form a rough (and still incomplete) outline for that.

Many thanks to Gao Xiang <[email protected]> for helping clarify many points about the erofs file format for the documentation.

Closes containers#56

Signed-off-by: Allison Karlitskaya <[email protected]>
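A minimal sketch of how such a gate might be read, assuming only the variable name from the description (the names and structure here are illustrative, not the PR's actual API):

```rust
use std::env;

/// Which erofs writer to use, selected via the COMPOSEFS_FORMAT
/// environment variable described above.
enum ErofsWriter {
    /// Shell out to the external mkcomposefs CLI (the current default).
    External,
    /// Use the new internal erofs writer code.
    Internal,
}

fn select_writer() -> ErofsWriter {
    match env::var("COMPOSEFS_FORMAT").as_deref() {
        Ok("new") => ErofsWriter::Internal,
        _ => ErofsWriter::External,
    }
}
```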
Force-pushed from c7e6c53 to 60eeba8 (Compare)
cc @hsiangkao
> ignores it.
> - `XATTR_FILTER` (`0x0004`): set if the xattr bloom filter should be
>   used. Read about this in the inode section.
> * `blkszbits`: log2 of the block size. Better set this to 12 (4096).
Yes. It may also be helpful to indicate that this is what keeps the image mountable on any page size.
Can you please clarify what you meant by this? Should the block size be larger or smaller than the page size in order to be able to mount it?
On modern platforms, the smallest page size is 4KiB, and EROFS now supports blocks smaller than the page size on all platforms. So leaving it at 4KiB will keep images mountable everywhere.
> * `blkszbits`: log2 of the block size. Better set this to 12 (4096).
> * `root_nid`: the reference to the root inode. See the inodes section for
>   what that means. Normally inodes are stored in u64, but this is somewhat
>   randomly a u16, which means that you're gonna need to put the root
For our initial use cases, `__le16` is enough. However, for the recent incremental builds, if the root inode is updated and an out-of-place update is applied, it may not be enough. Currently this is worked around by updating the root inode in place, but I will find a chance to extend it to 64 bits later.
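For context, a sketch of the nid computation under discussion: an erofs nid counts 32-byte inode slots from the start of the metadata area, so a u16 `root_nid` can only reach inodes in the first 32 × 65536 = 2 MiB of metadata (function and parameter names here are illustrative):

```rust
/// An erofs nid addresses inodes in 32-byte slots relative to the start
/// of the metadata area (meta_blkaddr * block_size). Because the
/// superblock's root_nid field is a u16, the root inode must sit within
/// the first 32 * 65536 = 2 MiB of metadata.
fn root_nid(root_inode_offset: u64, meta_offset: u64) -> Option<u16> {
    let nid = (root_inode_offset - meta_offset) / 32;
    u16::try_from(nid).ok() // None => root inode placed too far in
}
```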
> The basic idea is that you find the prefix for your xattr from the list (like
> `user.` or `security.`) and then you store only the "suffix" part, along with
> the prefix index. If you can't find a prefix, you use 0 (which is conceptually
> a prefix of ""). If the prefix matches the entire name then the suffix is `""`.
We could define `0` as a valid prefix number, although 0 will be ignored and cannot be listed at all in the kernel.
That's good to know. I think I didn't fully understand how the "prefix" part of xattrs is so special when I was writing this...
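To make the split described above concrete, here's a minimal sketch using the predefined prefix table from the kernel's fs/erofs/xattr.h (index 5, lustre, omitted since it's unused here):

```rust
/// Predefined xattr name prefixes from the erofs on-disk format
/// (fs/erofs/xattr.h); index 0 means "no prefix".
const PREFIXES: &[(u8, &str)] = &[
    (1, "user."),
    (2, "system.posix_acl_access"),
    (3, "system.posix_acl_default"),
    (4, "trusted."),
    (6, "security."),
];

/// Split an xattr name into (prefix index, suffix). The kernel ignores
/// index 0 on lookup, so a writer must always perform this split.
fn split_xattr_name(name: &str) -> (u8, &str) {
    for &(index, prefix) in PREFIXES {
        if let Some(suffix) = name.strip_prefix(prefix) {
            return (index, suffix);
        }
    }
    (0, name)
}

// e.g. split_xattr_name("user.xyz") == (1, "xyz")
//      split_xattr_name("security.selinux") == (6, "selinux")
```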
> Note: you really need to do this "compression" step, because it's assumed
> during the lookup phase. ie: if we're looking for an xattr `"user.xyz"` then
> we'll only consider the entries that have the prefix index for `user.` set on
> them. If you didn't properly "compress" your xattr names, they won't be found.
Yes, that is because `0` will always be ignored by the kernel.
> There's support in the erofs format for custom prefixes. That's when the high
> bit of the prefix index is set. These got added circa kernel version 6.4 with
> a patch series ending with `6a318ccd7e08` ("erofs: enable long extended
> attribute name prefixes") but aren't documented here because we don't use them.
Anyway, you could consider using them, since this feature landed before the overlayfs data-only layer feature. So if the new on-disk format uses the data-only layer feature, I guess you could safely use this one too.
> For some reason a bit value of 1 here indicates the absence of a particular
> xattr, which is opposite to the usual arrangement. You'd think it was for
> compatibility, but the filter is only engaged if the feature bit is present in
> the superblock.
That is due to incremental builds too: some on-disk inodes could have been generated previously, and we don't need to touch them to do an incremental build.
Right! Makes sense, thanks!
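To pin down the inverted semantics, here's a sketch of how a writer might build the filter field, following my reading of the kernel's fs/erofs/xattr.c logic (the `xxh32` call assumes the `xxhash-rust` crate with its `xxh32` feature; treat the details as an interpretation of the kernel code, not this PR's implementation):

```rust
use xxhash_rust::xxh32::xxh32;

const EROFS_XATTR_FILTER_BITS: u32 = 32;
const EROFS_XATTR_FILTER_SEED: u32 = 0x25BBE08F;

/// Build the per-inode xattr name filter from (prefix index, suffix)
/// pairs. The field starts as all-ones ("everything absent") and a bit
/// is *cleared* for each xattr that is present -- hence a set bit means
/// the xattr is definitely absent.
fn build_name_filter(xattrs: &[(u8, &str)]) -> u32 {
    let mut filter = u32::MAX;
    for &(index, suffix) in xattrs {
        let bit = xxh32(suffix.as_bytes(), EROFS_XATTR_FILTER_SEED + index as u32)
            & (EROFS_XATTR_FILTER_BITS - 1);
        filter &= !(1u32 << bit);
    }
    filter
}
```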
> the kernel.
> * `mkfs.erofs` also tries to ensure that the inline data ends in the same
>   disk block as the last byte of the inode metadata (ie: inode header plus
>   xattrs). This is theoretically not required by the kernel.
Yes, but the whole concept of inlined data is to save extra I/O as well as storage. The kernel supports other layouts, but that was never a design goal of tail-inline inodes...
I've been thinking about this IO amplification question a bit. I have basically zero understanding of how filesystems actually work in the kernel, so this is possibly quite naive.
Maybe thinking about how much IO you need to do to load a single inode is a bit of a skewed metric. I imagine that (during boot, for example) more or less the entire inode table is going to end up paged in and cached. And in that case the best way to reduce IO would be to reduce the number of pages that need to be loaded, by using the least amount of padding possible, even if it hurts the alignment of a single inode.
Do you have benchmarks or anything like that in this direction?
> I've been thinking about this IO amplification question a bit. I have basically zero understanding of how filesystems actually work in the kernel, so this is possibly quite naive.
>
> Maybe thinking about how much IO you need to do to load a single inode is a bit of a skewed metric. I imagine that (during boot, for example) more or less the entire inode table is going to end up paged in and cached. And in that case the best way to reduce IO would be to reduce the number of pages that need to be loaded, by using the least amount of padding possible, even if it hurts the alignment of a single inode.
>
> Do you have benchmarks or anything like that in this direction?
In principle, the kernel doesn't apply policy strategies like this, because what's good for your workload might not be good for others (e.g. users who just want to read specific inodes). So you might win in some cases but lose in other useful ones; it's not something that serves generic use cases.

EROFS always reads just the needed block-sized metadata when users request it, for the best random metadata performance, like other kernel fses. If you need to warm up metadata, you could use fadvise (maybe from a separate thread) in advance.
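For reference, the fadvise warm-up suggested here could look something like this (a sketch using the `libc` crate; a `len` of 0 means "to the end of the file"):

```rust
use std::fs::File;
use std::io;
use std::os::fd::AsRawFd;

/// Ask the kernel to start reading part of the image (e.g. its metadata
/// region) into the page cache ahead of time.
fn warm_up(file: &File, offset: libc::off_t, len: libc::off_t) -> io::Result<()> {
    let ret = unsafe {
        libc::posix_fadvise(file.as_raw_fd(), offset, len, libc::POSIX_FADV_WILLNEED)
    };
    if ret == 0 {
        Ok(())
    } else {
        // posix_fadvise returns the error number directly
        Err(io::Error::from_raw_os_error(ret))
    }
}
```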
> incorrect symlink detection in fast symlink").
> * In general, when faced with the task of writing out an inode with inline
>   data present, you may need to add padding bytes before the start of the
>   inode in order to ensure that the inline data falls within a single block.
But the negative impact of tail inline is that previously you could, in principle, read 64 (4096/64) inodes at once for each 4k metadata I/O; now you can only read 1 or 2 inodes if the tail inline is too big. It could have some impact on `ls` workloads (because those only read metadata), so I tend to inline only small data.
This is a pretty good point.
Keep in mind, though, that for composefs, this is only about directories. We inline symlinks unconditionally but they're usually very small. As for regular files, we only inline those when they're less than 32 bytes: otherwise they're handled as redirects to the data layer.
Being about directories, it again turns into a question of the cost of a single operation vs the value of reducing the overall amount of IO (because adjacent operations effectively end up prepopulating the cache for future operations when things are more tightly packed). Imagine the metric is `find /` instead of `ls`, for example.
Unlike the question of padding between inodes, though, my gut feeling is that the question could go in the other direction here: wasting more space (in the blocks sections) helps keep the inode table smaller, which helps load it faster and keep it in the cache....
This is again where it would be nice to say something about benchmarks...
What's your gut feeling about the appropriate number for inlining directories? or: at what point do we do another block?
Oh: on that topic there is a slight ambiguity here that I never resolved. Let's say I'm writing a small directory in 100 bytes and I put it into a single block. What is the size field? 100 or 4096? If it was inline it would obviously be 100. This has a potential impact in knowing where the name of the last entry is terminated (end of file vs scanning for nul). I'd say 100 but composefs currently does 4096.
Does it matter at all?
> Keep in mind, though, that for composefs, this is only about directories. We inline symlinks unconditionally but they're usually very small. As for regular files, we only inline those when they're less than 32 bytes: otherwise they're handled as redirects to the data layer.
>
> Being about directories, it again turns into a question of the cost of a single operation vs the value of reducing the overall amount of IO (because adjacent operations effectively end up prepopulating the cache for future operations when things are more tightly packed). Imagine the metric is `find /` instead of `ls`, for example.
>
> Unlike the question of padding between inodes, though, my gut feeling is that the question could go in the other direction here: wasting more space (in the blocks sections) helps keep the inode table smaller, which helps load it faster and keep it in the cache....
>
> This is again where it would be nice to say something about benchmarks...
So from the on-disk arrangement side, there is no silver bullet either. But I guess you could also consider fadvise warm-up.
> What's your gut feeling about the appropriate number for inlining directories? or: at what point do we do another block?
>
> Oh: on that topic there is a slight ambiguity here that I never resolved. Let's say I'm writing a small directory in 100 bytes and I put it into a single block. What is the size field? 100 or 4096? If it was inline it would obviously be 100. This has a potential impact in knowing where the name of the last entry is terminated (end of file vs scanning for nul). I'd say 100 but composefs currently does 4096.
>
> Does it matter at all?
It should be 100. 4096 might work if the trailing filename has a `\0`, but it's not recommended...
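Putting the padding rule quoted earlier together with this answer, here's a sketch of inode placement under those constraints (illustrative names, not this PR's actual code). It slides the inode forward one 32-byte slot at a time until the inline tail no longer crosses a block boundary; per the answer above, the directory's size field would stay the actual byte count (100 in the example), not the block size:

```rust
const BLOCK_SIZE: u64 = 4096; // blkszbits = 12
const SLOT: u64 = 32;         // inodes live on 32-byte slot boundaries

/// Choose the byte offset for an inode with `meta_size` bytes of header
/// plus xattrs and `inline_size` bytes of tail-inline data, adding
/// padding so that the inline data falls within a single block.
fn place_inode(cursor: u64, meta_size: u64, inline_size: u64) -> u64 {
    debug_assert!(inline_size < BLOCK_SIZE); // a tail is less than one block
    let mut pos = cursor.next_multiple_of(SLOT);
    while inline_size > 0 {
        let tail_start = pos + meta_size;
        let tail_end = tail_start + inline_size - 1;
        if tail_start / BLOCK_SIZE == tail_end / BLOCK_SIZE {
            break; // inline tail fits within a single block
        }
        pos += SLOT; // pad by one slot and retry
    }
    pos
}
```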