Game out a plan for a 1.1 format #198

cgwalters · 2023-09-22T12:09:57Z

Let's assume 1.0 is released, and we discover something like a notable performance issue with the 1.0 format. Or maybe it's actually broken in an important corner case on big-endian (s390x) - something like that.

Say this is important enough to do a 1.1.

I think the way this would need to work is basically we add support for e.g. --format=1.1 to the CLI/API - and then we generate both digests.

We need to think through and verify a scenario like this would work:

Build server is updated with composefs 1.1 support
Client system (e.g. ostree/container tooling/RAUC/whatever) only has 1.0 support
Client fetches metadata (OCI image, whatever) that has both digests
Client ignores the 1.1 digest, synthesizes a 1.0 format, and successfully verifies it against the 1.0 digest
Client later updates to tooling (podman/ostree/RAUC) that supports 1.1
Client synthesizes a 1.1 EROFS, and verifies using that digest

Right?
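
The fallback in the scenario above can be sketched roughly as follows. This is an illustration under assumptions, not the real tooling: the metadata is modeled as a mapping from format version string to expected digest, `synthesize` is a hypothetical stand-in for real EROFS image generation, and plain SHA-256 stands in for whatever digest (e.g. fs-verity) is actually published.

```python
import hashlib


def synthesize(tree: bytes, fmt: str) -> bytes:
    """Hypothetical stand-in for generating an image in the given format."""
    return fmt.encode() + b"\0" + tree


def digest(image: bytes) -> str:
    # Assumption: SHA-256 of the image bytes stands in for the real digest.
    return hashlib.sha256(image).hexdigest()


def verify(tree: bytes, published: dict, supported: list) -> str:
    """Pick the newest format both sides know, rebuild it, check its digest."""
    common = [f for f in supported if f in published]
    if not common:
        raise ValueError("no mutually supported format")
    # Compare versions numerically, so that "1.10" would sort after "1.9".
    fmt = max(common, key=lambda f: tuple(int(p) for p in f.split(".")))
    if digest(synthesize(tree, fmt)) != published[fmt]:
        raise ValueError(f"digest mismatch for format {fmt}")
    return fmt


tree = b"example file tree"
# A build server with 1.1 support publishes both digests.
published = {f: digest(synthesize(tree, f)) for f in ("1.0", "1.1")}
```

With this model, `verify(tree, published, ["1.0"])` ignores the 1.1 digest and verifies against 1.0, and the same call with `["1.0", "1.1"]` after the client update verifies against 1.1.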


cgwalters · 2023-09-22T12:13:06Z

It may actually be worth adding a stub 1.1 format now that has a trivial change as a hidden option just to really test things out.

alexlarsson · 2023-09-22T14:10:41Z

There is already a uint32_t version in struct lcfs_write_options_s for this particular reason. But you're right, we should plumb it to the CLI and actually test it.
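
Plumbing that version field to a CLI could look something like the sketch below. Everything here is an assumption for illustration: the `--format` flag is the one floated earlier in this thread, the program name is made up, and the mapping of "1.0" to version 0 and "1.1" to version 1 is hypothetical rather than taken from libcomposefs.

```python
import argparse

# Assumption: hypothetical mapping from user-facing format name to the
# integer that would end up in the writer options' version field.
FORMAT_VERSIONS = {"1.0": 0, "1.1": 1}


def parse_format(argv):
    """Parse a --format option and return the numeric version to write."""
    parser = argparse.ArgumentParser(prog="mkcomposefs-sketch")
    parser.add_argument(
        "--format",
        choices=sorted(FORMAT_VERSIONS),
        default="1.0",  # keep emitting the old format unless asked otherwise
        help="image format version to generate",
    )
    args = parser.parse_args(argv)
    return FORMAT_VERSIONS[args.format]
```

Defaulting to the old format keeps existing digests stable until a caller explicitly opts in to the new one.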

alexlarsson · 2023-09-22T14:15:35Z

If we wanted a stupid optional feature we could have one that skips the 00-ff whiteouts in the image. That means it's only going to work well (i.e. the basedir would not be visible) with kernels that have data-only overlayfs layers, but for those it would be more efficient.

allisonkarlitskaya · 2024-10-15T20:26:09Z

Other potential wishlist item for a trivial change to make things more efficient: more aggressive list of xattr prefixes. We should really have "prefixes" for the complete length of all of the overlayfs xattrs we output.

alexlarsson · 2024-10-16T09:53:41Z

The use of custom prefixes would be nice, but it does bump up the kernel requirements to 6.4.

allisonkarlitskaya · 2024-12-17T08:11:26Z

Having implemented a second erofs writer, this is something like my list of proposed changes for composefs erofs v1.1 file format:

- no compact inodes
  - store a 1970 superblock mtime; or
  - use the root dir mtime; or
  - use the newest file?
- set the 32-bit compatible ino field equal to the 64-bit nid value
- regular files and symlinks can only be inlined
  - limit these to 2k (or 1k)
  - we use blocks only for directories
- considering dropping our inlining limit to only cover really small directories (to help keep inodes small and compacted together)
- fix the size calculation for directories to report the actual size of the last block (when not inlining) instead of rounding up to the block size multiple
- very clearly specify inode alignment-adjustment algorithm for ensuring that inline data doesn't cross block boundaries
- don't treat symlinks differently for alignment
- use chunk format 31 for all sparse files
- don't use a start offset for xattrs
- don't try to calculate start offset for inodes (it's always 0)
- don't store the whiteouts for 00-ff (requires data layer)
- consider outright banning the presence of (0, 0) character devices
  - the escaping is extremely annoying (affecting all parent directories)
- use custom xattr prefixes for all of our xattrs
  - metacopy
  - redirect
  - whiteout-related (still needed for c 0 0 files)
  - selinux
  - possibly also ones we see from the user?
- change inode order:
  - directory and then all children, but depth first
  - current order is all children first and then descend
  - ideally we'd write directories after children but this is complicated by root_nid needing to be small. we could special-case it.
    - erofs is considering adding an extension for larger root nid
- pin down sort order for shared xattrs
  - prefix index, then suffix, then value
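
As one concrete reading of the last item, the shared xattr ordering could be pinned down as a plain tuple sort. This is a sketch under assumptions: the `SharedXattr` type is hypothetical, suffixes and values are compared byte-wise, and the prefix index numbers in the example (4 for `trusted.`, 6 for `security.`) are illustrative.

```python
from typing import NamedTuple


class SharedXattr(NamedTuple):
    prefix_index: int  # numeric index of the name prefix
    suffix: bytes      # remainder of the name after the prefix
    value: bytes       # raw xattr value


def sort_shared_xattrs(xattrs):
    """Order by prefix index, then suffix, then value (byte-wise)."""
    return sorted(xattrs, key=lambda x: (x.prefix_index, x.suffix, x.value))


example = [
    SharedXattr(6, b"selinux", b"system_u:object_r:etc_t:s0"),
    SharedXattr(4, b"overlay.redirect", b"/b"),
    SharedXattr(4, b"overlay.metacopy", b""),
    SharedXattr(4, b"overlay.redirect", b"/a"),
]
ordered = sort_shared_xattrs(example)
```

A total order like this matters because two independent writers must emit byte-identical images for the digests to match.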

cgwalters · 2024-12-17T20:20:12Z

Thanks, that's a good list!

> no compact inodes

Did you mean no extended inodes?

allisonkarlitskaya · 2024-12-18T08:48:13Z

> Did you mean no extended inodes?

No. Compact inodes don't have an mtime field, which means we need extended inodes. If you write a compact inode then the mtime is equal to the mtime set in the superblock, which means that we basically get to write a single compact inode in the general case*, and the rest of them will be extended. It just seems like it's not worth the trouble.

\* The general case being that all files have different mtimes. If we have a container image where a large chunk of the files have exactly the same mtime, all of those could be written as compact. libcomposefs looks at the oldest file in the system and uses that.

@hsiangkao is looking at adding a way to put mtime into compact inodes as a 32-bit relative offset to the value stored in the superblock (ie: the superblock time becomes an epoch). That would let you capture a moderately-sized range of values of mtimes that are close together (which is likely to cover a lot of cases we see in practice) instead of it being an all-or-nothing affair. I don't expect this feature to land in the kernel soon enough for us to be able to use it any time soon, though.
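
To make the difference concrete, here is a small sketch of which inodes could be compact under each rule. It only models the mtime condition and ignores every other requirement for a compact inode; the function names are made up, and treating the proposed field as an unsigned 32-bit offset in seconds is an assumption, since the feature is not specified in this thread.

```python
def superblock_time(mtimes):
    # As described above, libcomposefs picks the oldest mtime in the tree.
    return min(mtimes)


def compact_today(mtime, sb_time):
    # A compact inode has no mtime field, so it reads back as the
    # superblock time; only an exact match can be stored that way.
    return mtime == sb_time


def compact_with_offset(mtime, sb_time):
    # Assumption: the proposed field is an unsigned 32-bit number of
    # seconds relative to the superblock time (used as an epoch).
    return 0 <= mtime - sb_time < 2**32


mtimes = [1_700_000_000, 1_700_000_000, 1_700_086_400, 1_700_000_000 + 2**32]
sb = superblock_time(mtimes)
today = [compact_today(m, sb) for m in mtimes]
proposed = [compact_with_offset(m, sb) for m in mtimes]
```

In this example only the two files sharing the oldest mtime qualify today, while the relative-offset scheme would also cover the file that is one day newer.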

hsiangkao · 2024-12-18T09:01:16Z

> the general case being that all files have different mtime. In case we have a container image where a large chunk of the files have exactly the same mtime, all of those could be written as compact. libcomposefs looks at the oldest file in the system and uses that.
>
> @hsiangkao is looking at adding a way to put mtime into compact inodes as a 32-bit relative offset to the value stored in the superblock (ie: the superblock time becomes an epoch). That would let you capture a moderately-sized range of values of mtimes that are close together (which is likely to cover a lot of cases we see in practice) instead of it being an all-or-nothing affair. I don't expect this feature to land in the kernel soon enough for us to be able to use it any time soon, though.

Yes, currently the EROFS core on-disk format is still the same as the initial version. I'm considering gathering all new ideas and requirements to refine a new revised on-disk format in a completely compatible way (and there shouldn't be any major change.)

But I prefer to land these on-disk changes in exactly one kernel version (in other words, to avoid scattering changes across several versions, which is bad for all mkfs implementations). I think I will sort them out in 2025. I will invite everyone who is interested to review these changes, so that we get a nicer solution for all use cases.

allisonkarlitskaya · 2024-12-18T09:03:54Z

It occurs to me that the current order used by libcomposefs is harder to implement but probably has performance benefits. Having all of the inodes present in one directory always immediately adjacent to each other (and therefore likely sharing only one or a few blocks) is probably nice for the ls -l case. Doing a depth-first approach would result in substantially more scattering. I think I'd probably rescind this recommended change.
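
The difference between the two orders can be seen on a toy tree. This is only a sketch: the tree is a nested dict with `None` for non-directories, siblings are visited in sorted name order (an assumption), and the function names are made up.

```python
def children_then_descend(tree):
    """All children of a directory are emitted together, then we descend."""
    order = ["/"]

    def visit(prefix, directory):
        names = sorted(directory)
        order.extend(prefix + n for n in names)      # siblings stay adjacent
        for n in names:
            if directory[n] is not None:             # only directories recurse
                visit(prefix + n + "/", directory[n])

    visit("/", tree)
    return order


def depth_first(tree):
    """Each subdirectory is fully emitted before its next sibling."""
    order = ["/"]

    def visit(prefix, directory):
        for n in sorted(directory):
            order.append(prefix + n)
            if directory[n] is not None:
                visit(prefix + n + "/", directory[n])

    visit("/", tree)
    return order


tree = {"bin": {"ls": None, "sh": None}, "etc": {"passwd": None}, "motd": None}
```

In the first order `/bin`, `/etc` and `/motd` are neighbours, which is what helps the `ls -l /` case; in the depth-first order the contents of `/bin` and `/etc` sit between them.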

Another proposal in terms of keeping inodes tightly packed, though (after some IRC conversation with @hsiangkao): it might be nice to substantially decrease the amount of inlining we do and then try our hardest to make sure that we always fit complete inodes into blocks. This means that fstat() of an O_PATH would always be a single-page operation. The only place we might get into trouble is with a large amount/size of xattrs, but we could treat that case as degenerate.

We might also try to take a more holistic approach to allocating inodes within a single directory so that they all fit into a single page. This is getting into substantially more complicated territory, though, so it might make sense to take a pass on it. As it is, the current ordering that libcomposefs employs is already pretty good.

We could also make inlining dependent on the alignment that we find ourselves in when we go to write the inode. For example: if we see that we could write a 2k inline section without inserting additional padding, just go ahead and do it. If not, then write the inode "flat plain" and store the data in a block. We might come up with some sort of a more dynamic approach for "amount of padding we'd require" vs "amount of space we'd waste by shoving the data into a block" with a heavy preference for avoiding additional padding in the inode area, but this is again starting to sound a bit too complicated for my tastes. We might also say more static things like "we always inline things less than 128 (or 256) bytes, even if we have to insert padding", knowing that the amount of padding we'd have to insert will be small.
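
A rough sketch of such a rule, combining the alignment check with the static small-data exception, might look like this. The model is deliberately simplified and rests on assumptions: a 4096-byte block size, inline data placed directly after the inode (with `inode_size` covering the inode plus its xattrs), only the inline data region checked against block boundaries, and made-up names and layout labels.

```python
BLOCK_SIZE = 4096     # assumed block size
INLINE_LIMIT = 2048   # the "2k" inline section from the text
ALWAYS_INLINE = 128   # below this we inline even if padding is needed


def padding_needed(offset, inode_size, data_len, block_size=BLOCK_SIZE):
    """Padding before the inode so its inline data stays within one block."""
    start = offset + inode_size   # first byte of inline data
    end = start + data_len        # one past the last byte of inline data
    if data_len == 0 or start // block_size == (end - 1) // block_size:
        return 0
    # Simplest fix: move the inode to the start of the next block.
    return -offset % block_size


def choose_layout(offset, inode_size, data_len):
    if data_len > INLINE_LIMIT:
        return "block"
    if padding_needed(offset, inode_size, data_len) == 0:
        return "inline"
    if data_len < ALWAYS_INLINE:
        return "inline-with-padding"
    return "block"
```

For instance, a 64-byte inode at offset 4000 with 100 bytes of data would need 96 bytes of padding and still be inlined, while the same inode at offset 3000 with 2000 bytes of data would fall back to a block.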

Another way we could keep inodes compact is to "share" large xattrs even if they're unique. And we could also make these decisions dynamically based on alignment and our ability to write the inode into a single block without padding. I suspect that there's again not too much benefit to be had here, though.

cgwalters added this to the 1.0 milestone Sep 22, 2023

cgwalters mentioned this issue May 28, 2024

Info drop internal xattrs #288

Draft

allisonkarlitskaya mentioned this issue Dec 9, 2024

Add an internal erofs writer implementation containers/composefs-rs#56

Open
