Game out a plan for a 1.1 format #198
Comments
It may actually be worth adding a stub 1.1 format now that has a trivial change as a hidden option just to really test things out.
There is already a
If we wanted a stupid optional feature we could have one that skips the 00-ff whiteouts in the image. That means it's only going to work well (i.e. the basedir would not be visible) with kernels that have data-only overlayfs layers, but for those it would be more efficient.
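To make the size of this concrete, here is a small sketch of what the 00-ff whiteouts are: one overlayfs whiteout per two-hex-digit prefix directory, where overlayfs represents a whiteout as a character device with device number 0:0 (the per-prefix-directory layout is an assumption about the image here, not something stated above):

```python
def whiteout_entries():
    """Names of the 00-ff whiteout entries masking the basedir.

    Assumption: one whiteout (an overlayfs char device 0:0) per
    two-hex-digit prefix, "00" through "ff".
    """
    return ["%02x" % i for i in range(256)]

entries = whiteout_entries()
# 256 dentries plus their inodes, all of which a 1.1 format could
# drop entirely when targeting data-only overlayfs layers.
```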
Other potential wishlist item for a trivial change to make things more efficient: more aggressive list of xattr prefixes. We should really have "prefixes" for the complete length of all of the overlayfs xattrs we output.
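A rough sketch of the space argument: erofs xattr entries record a prefix index plus the remaining name suffix, so the longer the matched prefix, the fewer name bytes stored per entry. The example names and the exact byte accounting below are illustrative assumptions, not measurements:

```python
# Built-in prefix vs. full-length custom prefixes (example names assumed).
BUILTIN_PREFIX = "trusted."
FULL_PREFIXES = {"trusted.overlay.redirect", "trusted.overlay.metacopy"}

def stored_name_bytes(name, prefixes):
    """Bytes of the xattr name left to store after the longest matching prefix."""
    best = max((p for p in prefixes if name.startswith(p)), key=len, default="")
    return len(name) - len(best)

name = "trusted.overlay.redirect"
short = stored_name_bytes(name, {BUILTIN_PREFIX})  # 16 suffix bytes per inode
full = stored_name_bytes(name, FULL_PREFIXES)      # 0 suffix bytes per inode
```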
The use of custom prefixes would be nice, but it does bump up the kernel requirements to 6.4.
Having implemented a second erofs writer, this is something like my list of proposed changes for composefs erofs v1.1 file format:
Thanks, that's a good list!
Did you mean no extended inodes?
No. Compact inodes don't have an mtime field, which means we need extended inodes. If you write a compact inode then the mtime is equal to the mtime set in the superblock, which means that we basically get to write a single compact inode in the general case*, and the rest of them will be extended. It just seems like it's not worth the trouble.
@hsiangkao is looking at adding a way to put mtime into compact inodes as a 32-bit relative offset to the value stored in the superblock (i.e. the superblock time becomes an epoch). That would let you capture a moderately sized range of mtime values that are close together (which is likely to cover a lot of the cases we see in practice) instead of it being an all-or-nothing affair. I don't expect this feature to land in the kernel soon enough for us to be able to use it any time soon, though.
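The proposed encoding can be sketched as a simple fit check: an mtime can live in a compact inode only if its distance from the superblock epoch fits in an unsigned 32 bits. This is a sketch of the idea as described, not of any landed on-disk format:

```python
# Proposed relative-mtime scheme (assumption: compact inodes would store
# mtime as a u32 offset from an epoch recorded in the superblock).
U32_MAX = 2**32 - 1

def fits_compact(mtime, sb_epoch):
    """True if this mtime could be encoded in a compact inode."""
    delta = mtime - sb_epoch
    return 0 <= delta <= U32_MAX

sb_epoch = 1_700_000_000          # hypothetical superblock epoch
fits_compact(sb_epoch + 3600, sb_epoch)   # nearby mtimes fit
fits_compact(sb_epoch - 1, sb_epoch)      # anything before the epoch does not
```

With a u32 offset the window is about 136 years forward from the epoch, so the practical limitation is mtimes earlier than whatever epoch the writer picks.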
Yes, currently the EROFS core on-disk format is still the same as the initial version. I'm considering gathering all the new ideas and requirements to refine a revised on-disk format in a completely compatible way (there shouldn't be any major change). But I tend to land these on-disk changes in exactly one kernel version (in other words, avoiding changes scattered across several versions, which is bad for all mkfs implementations); I think I will sort them out in 2025. I will invite everyone interested to review these changes, to get a nicer solution for all use cases.
It occurs to me that the current order used by libcomposefs is harder to implement but probably has performance benefits. Having all of the inodes present in one directory always immediately adjacent to each other (and therefore likely sharing only one or a few blocks) is probably nice for the

Another proposal in terms of keeping inodes tightly packed, though (after some IRC conversation with @hsiangkao): it might be nice to substantially decrease the amount of inlining we do and then try our hardest to make sure that we always fit complete inodes into blocks. This means that

We might also try to take a more holistic approach to allocating inodes within a single directory so that they all fit into a single page. This is getting into substantially more complicated territory, though, so it might make sense to take a pass on it. As it is, the current ordering that libcomposefs employs is already pretty good.

We could also make inlining dependent on the alignment that we find ourselves in when we go to write the inode. For example: if we see that we could write a 2k inline section without inserting additional padding, just go ahead and do it. If not, then write the inode "flat plain" and store the data in a block. We might come up with some sort of a more dynamic approach weighing the "amount of padding we'd require" against the "amount of space we'd waste by shoving the data into a block", with a heavy preference for avoiding additional padding in the inode area, but this is again starting to sound a bit too complicated for my tastes.

We might also say more static things like "we always inline things less than 128 (or 256) bytes, even if we have to insert padding", knowing that the amount of padding we'd have to insert will be small.

Another way we could keep inodes compact is to "share" large xattrs even if they're unique. And we could also make these decisions dynamically based on alignment and our ability to write the inode into a single block without padding.
I suspect that there's again not too much benefit to be had here, though.
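The alignment-driven inlining heuristic above can be sketched as a single fit check. The block size and the rule "inline only if the inode plus its tail data lands in the current block without padding" are assumptions drawn from the discussion, not libcomposefs's actual policy:

```python
# Sketch: inline the tail data only when inode + data fit in the current
# block with no padding; otherwise write the inode "flat plain" and put
# the data in its own block. (4096-byte blocks assumed.)
BLOCK_SIZE = 4096

def should_inline(pos, inode_size, data_size):
    """pos is the current write offset in the inode area."""
    end = pos + inode_size + data_size
    # Everything from the inode start must land inside one block.
    return (pos // BLOCK_SIZE) == ((end - 1) // BLOCK_SIZE)

# A 2k inline section at the start of a block needs no padding: inline it.
should_inline(0, 64, 2048)                   # True
# Near the end of a block the same data would straddle: store it flat.
should_inline(BLOCK_SIZE - 128, 64, 2048)    # False
```

A more dynamic variant would compare the padding cost against the block-waste cost rather than using this binary rule, as the comment above suggests.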
Let's assume 1.0 is released, and we discover something like a notable performance issue with the 1.0 format. Or maybe it's actually broken in an important corner case on big-endian (s390x) - something like that.
Say this is important enough to do a 1.1.
I think the way this would need to work is basically that we add support for e.g.
--format=1.1
to the CLI/API, and then we generate both digests. We need to think through and verify that a scenario like this would work:
Right?
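One way to picture the dual-digest transition is below. Everything here is hypothetical: compute_digest and its fmt parameter are invented stand-ins (a real implementation would regenerate the erofs image in the requested format and hash that), and the sketch only illustrates a client pinning both expected digests and accepting an image that matches either during the 1.0 to 1.1 window:

```python
import hashlib

def compute_digest(image_bytes, fmt):
    # Stand-in for "build the image in format fmt, then hash it";
    # here we just salt the hash with the format string.
    return hashlib.sha256(fmt.encode() + image_bytes).hexdigest()

def verify(image_bytes, expected):
    """Accept the image if it matches any pinned format's digest."""
    return any(compute_digest(image_bytes, fmt) == digest
               for fmt, digest in expected.items())

image = b"...erofs image..."
expected = {"1.0": compute_digest(image, "1.0"),
            "1.1": compute_digest(image, "1.1")}
verify(image, expected)   # matches the 1.0 digest, so it is accepted
```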