src: add internal erofs writer code #57
base: main
Conversation
This introduces experimental code for writing erofs images ourselves, instead of using the external mkcomposefs CLI. It's currently disabled by default. You can test it by setting the `COMPOSEFS_FORMAT=new` environment variable.

This currently produces a different output than mkcomposefs, which is why it's gated behind an environment variable. The plan is to add a compatibility mode to our internal writer code so that it produces as similar an output as possible, and then switch over to using it once we are convinced that it's equivalent. Then the `COMPOSEFS_FORMAT=` variable will disable this compatibility mode.

There's also a `COMPOSEFS_DUMP_EROFS=1` environment variable (which works with both `mkcomposefs` and our internal code) which will dump the erofs layout for diffing. There's also a standalone `erofs-debug` binary that will do the same.

Additionally, this introduces two new files in docs:

- a detailed description of the parts of erofs that we use
- a document which attempts to describe the decisions made in creating an erofs composefs image (in terms of which order the files are in, etc.)

The main idea here is to start a serious effort towards standardizing the composefs label we want to start adding to container images: it should be possible to define what will be in that label by way of documentation instead of saying "run this software and use the output". These two new documents, taken together with the existing "oci.md", form a rough (and still incomplete) outline for that.

Many thanks to Gao Xiang <[email protected]> for helping clarify many points about the erofs file format for the documentation.

Closes containers#56

Signed-off-by: Allison Karlitskaya <[email protected]>
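A minimal sketch of how such a gate might be read, assuming only the variable name from the description (the names and structure here are illustrative, not the PR's actual API):

```rust
use std::env;

/// Which erofs writer to use, selected via the COMPOSEFS_FORMAT
/// environment variable described above.
enum ErofsWriter {
    /// Shell out to the external mkcomposefs CLI (the current default).
    External,
    /// Use the new internal erofs writer code.
    Internal,
}

fn select_writer() -> ErofsWriter {
    match env::var("COMPOSEFS_FORMAT").as_deref() {
        Ok("new") => ErofsWriter::Internal,
        _ => ErofsWriter::External,
    }
}
```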
Force-pushed from c7e6c53 to 60eeba8 (Compare)
cc @hsiangkao
> ignores it.
> - `XATTR_FILTER` (`0x0004`): set if the xattr bloom filter should be
>   used. Read about this in the inode section.
> * `blkszbits`: log2 of the block size. Better set this to 12 (4096).
Yes. It may also be helpful to indicate that this is what keeps the image mountable on any page size.
Can you please clarify what you meant by this? Should the block size be larger or smaller than the page size in order to be able to mount it?
On modern platforms, the smallest page size is 4KiB, and EROFS now supports blocks smaller than the page size on all platforms. So leaving it at 4KiB will keep images mountable everywhere.
> * `blkszbits`: log2 of the block size. Better set this to 12 (4096).
> * `root_nid`: the reference to the root inode. See the inodes section for
>   what that means. Normally inodes are stored in u64, but this is somewhat
>   randomly a u16, which means that you're gonna need to put the root
For our initial use cases, `__le16` is enough. However, for the recent incremental builds, if the root inode is updated and an out-of-place update is applied, it may not be enough. Currently this is worked around by updating the root inode in place, but I will find a chance to extend it to 64 bits later.
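For context, a sketch of the nid computation under discussion: an erofs nid counts 32-byte inode slots from the start of the metadata area, so a u16 `root_nid` can only reach inodes in the first 32 × 65536 = 2 MiB of metadata (function and parameter names here are illustrative):

```rust
/// An erofs nid addresses inodes in 32-byte slots relative to the start
/// of the metadata area (meta_blkaddr * block_size). Because the
/// superblock's root_nid field is a u16, the root inode must sit within
/// the first 32 * 65536 = 2 MiB of metadata.
fn root_nid(root_inode_offset: u64, meta_offset: u64) -> Option<u16> {
    let nid = (root_inode_offset - meta_offset) / 32;
    u16::try_from(nid).ok() // None => root inode placed too far in
}
```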
> The basic idea is that you find the prefix for your xattr from the list (like
> `user.` or `security.`) and then you store only the "suffix" part, along with
> the prefix index. If you can't find a prefix, you use 0 (which is conceptually
> a prefix of ""). If the prefix matches the entire name then the suffix is `""`.
We could define `0` as a valid prefix number, although 0 will be ignored and cannot be listed at all in the kernel.
That's good to know. I think I didn't fully understand how the "prefix" part of xattrs is so special when I was writing this...
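To make the split described above concrete, here's a minimal sketch using the predefined prefix table from the kernel's fs/erofs/xattr.h (index 5, lustre, omitted since it's unused here):

```rust
/// Predefined xattr name prefixes from the erofs on-disk format
/// (fs/erofs/xattr.h); index 0 means "no prefix".
const PREFIXES: &[(u8, &str)] = &[
    (1, "user."),
    (2, "system.posix_acl_access"),
    (3, "system.posix_acl_default"),
    (4, "trusted."),
    (6, "security."),
];

/// Split an xattr name into (prefix index, suffix). The kernel ignores
/// index 0 on lookup, so a writer must always perform this split.
fn split_xattr_name(name: &str) -> (u8, &str) {
    for &(index, prefix) in PREFIXES {
        if let Some(suffix) = name.strip_prefix(prefix) {
            return (index, suffix);
        }
    }
    (0, name)
}

// e.g. split_xattr_name("user.xyz") == (1, "xyz")
//      split_xattr_name("security.selinux") == (6, "selinux")
```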
> Note: you really need to do this "compression" step, because it's assumed
> during the lookup phase. ie: if we're looking for an xattr `"user.xyz"` then
> we'll only consider the entries that have the prefix index for `user.` set on
> them. If you didn't properly "compress" your xattr names, they won't be found.
Yes, that is because `0` will always be ignored by the kernel.
> There's support in the erofs format for custom prefixes. That's when the high
> bit of the prefix index is set. These got added circa kernel version 6.4 with
> a patch series ending with `6a318ccd7e08` ("erofs: enable long extended
> attribute name prefixes") but aren't documented here because we don't use them.
Anyway, you could consider using them, since this feature landed before the overlayfs data-only layer feature. So if the new on-disk format uses the data-only layer feature, I guess you could safely use this one too.
> For some reason a bit value of 1 here indicates the absence of a particular
> xattr, which is opposite to the usual arrangement. You'd think it was for
> compatibility, but the filter is only engaged if the feature bit is present in
> the superblock.
That is due to incremental builds too: some on-disk inodes could have been generated previously, and we don't need to touch them to do an incremental build.
Right! Makes sense, thanks!
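To pin down the inverted semantics, here's a sketch of how a writer might build the filter field, following my reading of the kernel's fs/erofs/xattr.c logic (the `xxh32` call assumes the `xxhash-rust` crate with its `xxh32` feature; treat the details as an interpretation of the kernel code, not this PR's implementation):

```rust
use xxhash_rust::xxh32::xxh32;

const EROFS_XATTR_FILTER_BITS: u32 = 32;
const EROFS_XATTR_FILTER_SEED: u32 = 0x25BBE08F;

/// Build the per-inode xattr name filter from (prefix index, suffix)
/// pairs. The field starts as all-ones ("everything absent") and a bit
/// is *cleared* for each xattr that is present -- hence a set bit means
/// the xattr is definitely absent.
fn build_name_filter(xattrs: &[(u8, &str)]) -> u32 {
    let mut filter = u32::MAX;
    for &(index, suffix) in xattrs {
        let bit = xxh32(suffix.as_bytes(), EROFS_XATTR_FILTER_SEED + index as u32)
            & (EROFS_XATTR_FILTER_BITS - 1);
        filter &= !(1u32 << bit);
    }
    filter
}
```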
> the kernel.
> * `mkfs.erofs` also tries to ensure that the inline data ends in the same
>   disk block as the last byte of the inode metadata (ie: inode header plus
>   xattrs). This is theoretically not required by the kernel.
Yes, but the whole concept of inlined data is to save extra I/O as well as storage. The kernel supports other layouts, but that was never a design goal of tail-inline inodes...
I've been thinking about this IO amplification question a bit. I have basically zero understanding of how filesystems actually work in the kernel, so this is possibly quite naive.
Maybe thinking about how much IO you need to do to load a single inode is a bit of a skewed metric. I imagine that (during boot, for example) more or less the entire inode table is going to end up paged in and cached. And in that case the best way to reduce IO would be to reduce the number of pages that need to be loaded, by using the least amount of padding possible, even if it hurts the alignment of a single inode.
Do you have benchmarks or anything like that in this direction?
> I've been thinking about this IO amplification question a bit. I have basically zero understanding of how filesystems actually work in the kernel, so this is possibly quite naive.
>
> Maybe thinking about how much IO you need to do to load a single inode is a bit of a skewed metric. I imagine that (during boot, for example) more or less the entire inode table is going to end up paged in and cached. And in that case the best way to reduce IO would be to reduce the number of pages that need to be loaded, by using the least amount of padding possible, even if it hurts the alignment of a single inode.
>
> Do you have benchmarks or anything like that in this direction?
In principle, the kernel doesn't apply policy strategies like this, because what's good for your workload might not be good for others (e.g. users who just want to read specific inodes). So you might win in some cases but lose in other useful ones; it's not something that serves generic use cases.

EROFS always reads just the needed block-sized metadata when users request it, for the best random metadata performance, like other kernel fses. If you need to warm up metadata, you could use fadvise (maybe from a separate thread) in advance.
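For reference, the fadvise warm-up suggested here could look something like this (a sketch using the `libc` crate; a `len` of 0 means "to the end of the file"):

```rust
use std::fs::File;
use std::io;
use std::os::fd::AsRawFd;

/// Ask the kernel to start reading part of the image (e.g. its metadata
/// region) into the page cache ahead of time.
fn warm_up(file: &File, offset: libc::off_t, len: libc::off_t) -> io::Result<()> {
    let ret = unsafe {
        libc::posix_fadvise(file.as_raw_fd(), offset, len, libc::POSIX_FADV_WILLNEED)
    };
    if ret == 0 {
        Ok(())
    } else {
        // posix_fadvise returns the error number directly
        Err(io::Error::from_raw_os_error(ret))
    }
}
```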
> incorrect symlink detection in fast symlink").
> * In general, when faced with the task of writing out an inode with inline
>   data present, you may need to add padding bytes before the start of the
>   inode in order to ensure that the inline data falls within a single block.
But the negative impact of tail inline is that previously you could, in principle, read 64 (4096/64) inodes at once for each 4k metadata I/O; now you can only read 1 or 2 inodes if the tail inline is too big. It could have some impact on `ls` workloads (because those only read metadata), so I tend to inline only small data.
This is a pretty good point.
Keep in mind, though, that for composefs, this is only about directories. We inline symlinks unconditionally but they're usually very small. As for regular files, we only inline those when they're less than 32 bytes: otherwise they're handled as redirects to the data layer.
Being about directories, it again turns into a question of the cost of a single operation vs the value of reducing the overall amount of IO (because adjacent operations effectively end up prepopulating the cache for future operations when things are more tightly packed). Imagine the metric is `find /` instead of `ls`, for example.
Unlike the question of padding between inodes, though, my gut feeling is that the question could go in the other direction here: wasting more space (in the blocks sections) helps keep the inode table smaller, which helps load it faster and keep it in the cache....
This is again where it would be nice to say something about benchmarks...
What's your gut feeling about the appropriate number for inlining directories? or: at what point do we do another block?
Oh: on that topic there is a slight ambiguity here that I never resolved. Let's say I'm writing a small directory in 100 bytes and I put it into a single block. What is the size field? 100 or 4096? If it was inline it would obviously be 100. This has a potential impact in knowing where the name of the last entry is terminated (end of file vs scanning for nul). I'd say 100 but composefs currently does 4096.
Does it matter at all?
> Keep in mind, though, that for composefs, this is only about directories. We inline symlinks unconditionally but they're usually very small. As for regular files, we only inline those when they're less than 32 bytes: otherwise they're handled as redirects to the data layer.
>
> Being about directories, it again turns into a question of the cost of a single operation vs the value of reducing the overall amount of IO (because adjacent operations effectively end up prepopulating the cache for future operations when things are more tightly packed). Imagine the metric is `find /` instead of `ls`, for example.
>
> Unlike the question of padding between inodes, though, my gut feeling is that the question could go in the other direction here: wasting more space (in the blocks sections) helps keep the inode table smaller, which helps load it faster and keep it in the cache....
>
> This is again where it would be nice to say something about benchmarks...
So from the on-disk arrangement side, there is no silver bullet either. But I guess you could also consider fadvise warm-up.
> What's your gut feeling about the appropriate number for inlining directories? or: at what point do we do another block?
>
> Oh: on that topic there is a slight ambiguity here that I never resolved. Let's say I'm writing a small directory in 100 bytes and I put it into a single block. What is the size field? 100 or 4096? If it was inline it would obviously be 100. This has a potential impact in knowing where the name of the last entry is terminated (end of file vs scanning for nul). I'd say 100 but composefs currently does 4096.
>
> Does it matter at all?
It should be 100. 4096 might work if the trailing filename has a `\0`, but it's not recommended...
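Putting the padding rule quoted earlier together with this answer, here's a sketch of inode placement under those constraints (illustrative names, not this PR's actual code). It slides the inode forward one 32-byte slot at a time until the inline tail no longer crosses a block boundary; per the answer above, the directory's size field would stay the actual byte count (100 in the example), not the block size:

```rust
const BLOCK_SIZE: u64 = 4096; // blkszbits = 12
const SLOT: u64 = 32;         // inodes live on 32-byte slot boundaries

/// Choose the byte offset for an inode with `meta_size` bytes of header
/// plus xattrs and `inline_size` bytes of tail-inline data, adding
/// padding so that the inline data falls within a single block.
fn place_inode(cursor: u64, meta_size: u64, inline_size: u64) -> u64 {
    debug_assert!(inline_size < BLOCK_SIZE); // a tail is less than one block
    let mut pos = cursor.next_multiple_of(SLOT);
    while inline_size > 0 {
        let tail_start = pos + meta_size;
        let tail_end = tail_start + inline_size - 1;
        if tail_start / BLOCK_SIZE == tail_end / BLOCK_SIZE {
            break; // inline tail fits within a single block
        }
        pos += SLOT; // pad by one slot and retry
    }
    pos
}
```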