Game out a plan for a 1.1 format #198

Open
cgwalters opened this issue Sep 22, 2023 · 10 comments
@cgwalters
Contributor

Let's assume 1.0 is released, and we discover something like a notable performance issue with the 1.0 format. Or maybe it's actually broken in an important corner case on big-endian (s390x) - something like that.

Say this is important enough to do a 1.1.

I think the way this would need to work is basically we add support for e.g. --format=1.1 to the CLI/API - and then we generate both digests.

We need to think through and verify a scenario like this would work:

  • Build server is updated with composefs 1.1 support
  • Client system (e.g. ostree/container tooling/RAUC/whatever) only has 1.0 support
  • Client fetches metadata (OCI image, whatever) that has both digests
  • Client ignores the 1.1 digest, synthesizes a 1.0 format, and successfully verifies it against the 1.0 digest
  • Client later updates to tooling (podman/ostree/RAUC) that supports 1.1
  • Client synthesizes a 1.1 EROFS, and verifies using that digest

Right?
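
For illustration, the client-side step in this scenario could be sketched roughly as below. This is a hypothetical sketch, not composefs API: the function name `pick_digest`, the use of version strings as dictionary keys, and the placeholder digest values are all assumptions made for the example.

```python
from typing import Mapping, Set, Tuple


def pick_digest(advertised: Mapping[str, str], supported: Set[str]) -> Tuple[str, str]:
    """Return (format_version, digest) for the newest format both sides know.

    `advertised` maps a format version (e.g. "1.0") to the digest published
    in the image metadata; `supported` is what the local tooling can synthesize.
    Both shapes are assumptions for this sketch.
    """
    def version_key(version: str) -> Tuple[int, ...]:
        # Compare "1.10" > "1.9" numerically rather than as strings.
        return tuple(int(part) for part in version.split("."))

    usable = [v for v in advertised if v in supported]
    if not usable:
        raise ValueError("no mutually supported composefs format")
    best = max(usable, key=version_key)
    return best, advertised[best]


metadata = {"1.0": "digest-for-1.0", "1.1": "digest-for-1.1"}  # placeholder values
old_client = pick_digest(metadata, {"1.0"})          # unknown 1.1 entry is ignored
new_client = pick_digest(metadata, {"1.0", "1.1"})   # prefers the newer format
```

The important property is that a 1.0-only client never has to understand the 1.1 entry; it only has to skip it.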

@cgwalters cgwalters added this to the 1.0 milestone Sep 22, 2023
@cgwalters
Contributor Author

It may actually be worth adding a stub 1.1 format now that has a trivial change as a hidden option just to really test things out.

@alexlarsson
Collaborator

There is already a uint32_t version in struct lcfs_write_options_s for this particular reason. But you're right, we should plumb it to the CLI and actually test it.

@alexlarsson
Collaborator

If we wanted a stupid optional feature we could have one that skips the 00-ff whiteouts in the image. That means it's only going to work well (i.e. the basedir would not be visible) with kernels that have data-only overlayfs layers, but for those it would be more efficient.
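
As a rough illustration of what would be skipped, the 00-ff entries are the 256 two-hex-digit names. The sketch below only enumerates those names; it assumes lower-case hex and uses a hypothetical helper name, and it does not show how the whiteouts are actually written into the image.

```python
from typing import List


def basedir_whiteout_names() -> List[str]:
    """Names "00" through "ff" (assumed lower-case, two hex digits each)."""
    return [f"{i:02x}" for i in range(256)]


names = basedir_whiteout_names()
```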

@allisonkarlitskaya
Collaborator

Other potential wishlist item for a trivial change to make things more efficient: more aggressive list of xattr prefixes. We should really have "prefixes" for the complete length of all of the overlayfs xattrs we output.

@alexlarsson
Collaborator

The use of custom prefixes would be nice, but it does bump up the kernel requirements to 6.4.

@allisonkarlitskaya
Collaborator

allisonkarlitskaya commented Dec 17, 2024

Having implemented a second erofs writer, this is something like my list of proposed changes for composefs erofs v1.1 file format:

  • no compact inodes
    • store a 1970 superblock mtime; or
    • use the root dir mtime; or
    • use the newest file?
  • set the 32-bit compatible ino field equal to the 64-bit nid value
  • regular files and symlinks can only be inlined
    • limit these to 2k (or 1k)
    • we use blocks only for directories
  • consider dropping our inlining limit to only cover really small directories (to help keep inodes small and compacted together)
  • fix the size calculation for directories to report the actual size of the last block (when not inlining) instead of rounding up to the block size multiple
  • very clearly specify inode alignment-adjustment algorithm for ensuring that inline data doesn't cross block boundaries
  • don't treat symlinks differently for alignment
  • use chunk format 31 for all sparse files
  • don't use a start offset for xattrs
  • don't try to calculate start offset for inodes (it's always 0)
  • don't store the whiteouts for 00-ff (requires data layer)
  • consider outright banning the presence of (0, 0) character devices
    • the escaping is extremely annoying (affecting all parent directories)
  • use custom xattr prefixes for all of our xattrs
    • metacopy
    • redirect
    • whiteout-related (still needed for c 0 0 files)
    • selinux
    • possibly also ones we see from the user?
  • change inode order:
    • directory and then all children, but depth first
    • current order is all children first and then descend
    • ideally we'd write directories after children but this is complicated by root_nid needing to be small. we could special-case it.
      • erofs is considering adding an extension for larger root nid
  • pin down sort order for shared xattrs
    • prefix index, then suffix, then value
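
As one example of pinning an item from this list down, the proposed shared-xattr sort order (prefix index, then suffix, then value) could be expressed as a plain tuple sort. This is a sketch under assumptions: the `SharedXattr` type is hypothetical, the prefix index numbers in the sample data are illustrative only, and suffix and value are compared bytewise.

```python
from typing import List, NamedTuple


class SharedXattr(NamedTuple):
    prefix_index: int   # numeric index of the xattr name prefix
    suffix: bytes       # remainder of the name after the prefix
    value: bytes        # xattr value


def sort_shared_xattrs(xattrs: List[SharedXattr]) -> List[SharedXattr]:
    # Prefix index first, then suffix, then value; bytes compare bytewise.
    return sorted(xattrs, key=lambda x: (x.prefix_index, x.suffix, x.value))


sample = [
    SharedXattr(6, b"selinux", b"ctx"),
    SharedXattr(4, b"overlay.redirect", b"/a"),
    SharedXattr(4, b"overlay.metacopy", b"x"),
    SharedXattr(4, b"overlay.metacopy", b""),
]
ordered = sort_shared_xattrs(sample)
```

A fully specified key like this matters because two writers must produce byte-identical images for the digests to match.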

@cgwalters
Contributor Author

Thanks, that's a good list!

no compact inodes

Did you mean no extended inodes?

@allisonkarlitskaya
Collaborator

Did you mean no extended inodes?

No. Compact inodes don't have an mtime field, which means we need extended inodes. If you write a compact inode then the mtime is equal to the mtime set in the superblock, which means that we basically get to write a single compact inode in the general case*, and the rest of them will be extended. It just seems like it's not worth the trouble.

* The general case being that all files have different mtimes. In case we have a container image where a large chunk of the files have exactly the same mtime, all of those could be written as compact. libcomposefs looks at the oldest file in the system and uses that.
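
To make the trade-off concrete, here is a small sketch that counts how many inodes could be compact under the rule described above. It assumes the superblock mtime is taken from the oldest file and that mtimes are plain integer seconds; it ignores every other constraint on compact inodes, and the function name is hypothetical.

```python
from typing import List, Optional, Tuple


def compact_candidates(mtimes: List[int]) -> Tuple[Optional[int], int]:
    """Return (superblock_mtime, number of inodes whose mtime matches it)."""
    if not mtimes:
        return None, 0
    superblock_mtime = min(mtimes)  # oldest file, as described above
    matching = sum(1 for m in mtimes if m == superblock_mtime)
    return superblock_mtime, matching


all_different = compact_candidates([1700000003, 1700000001, 1700000002])
mostly_same = compact_candidates([1700000000, 1700000000, 1700000000, 1700000500])
```

With all-different mtimes only one inode qualifies; with a shared build timestamp most of them do.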

@hsiangkao is looking at adding a way to put mtime into compact inodes as a 32-bit relative offset to the value stored in the superblock (ie: the superblock time becomes an epoch). That would let you capture a moderately-sized range of values of mtimes that are close together (which is likely to cover a lot of cases we see in practice) instead of it being an all-or-nothing affair. I don't expect this feature to land in the kernel soon enough for us to be able to use it any time soon, though.
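
A sketch of how such an epoch-relative encoding might behave is below. The details are assumptions, since the feature is not specified here: the offset is modeled as an unsigned 32-bit count of seconds added to the superblock time, and the helper name is hypothetical.

```python
OFFSET_BITS = 32  # assumed width of the relative offset


def fits_relative_mtime(mtime: int, epoch: int) -> bool:
    """True if `mtime` is representable as epoch + unsigned 32-bit offset.

    Signedness and units (whole seconds) are assumptions for this sketch.
    """
    offset = mtime - epoch
    return 0 <= offset < (1 << OFFSET_BITS)


epoch = 1700000000
same_second = fits_relative_mtime(epoch, epoch)
a_year_later = fits_relative_mtime(epoch + 365 * 24 * 3600, epoch)
before_epoch = fits_relative_mtime(epoch - 1, epoch)
```

Under these assumptions, any set of files whose mtimes all fall within 2**32 seconds after the oldest one could use the compact form.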

@hsiangkao
Contributor

hsiangkao commented Dec 18, 2024

  • the general case being that all files have different mtime. In case we have a container image where a large chunk of the files have exactly the same mtime, all of those could be written as compact. libcomposefs looks at the oldest file in the system and uses that.

@hsiangkao is looking at adding a way to put mtime into compact inodes as a 32-bit relative offset to the value stored in the superblock (ie: the superblock time becomes an epoch). That would let you capture a moderately-sized range of values of mtimes that are close together (which is likely to cover a lot of cases we see in practice) instead of it being an all-or-nothing affair. I don't expect this feature to land in the kernel soon enough for us to be able to use it any time soon, though.

Yes, currently the EROFS core on-disk format is still the same as the initial version. I'm considering gathering all new ideas and requirements to refine a new revised on-disk format in a completely compatible way (and there shouldn't be any major change.)

But I would prefer to land these on-disk changes in a single kernel version (in other words, avoid scattering changes across several versions, which is bad for all mkfs implementations). I think I will sort them out in 2025, and I will invite everyone who is interested to review these changes so that we get a nicer solution for all use cases.

@allisonkarlitskaya
Collaborator

allisonkarlitskaya commented Dec 18, 2024

It occurs to me that the current order used by libcomposefs is harder to implement but probably has performance benefits. Having all of the inodes present in one directory always immediately adjacent to each other (and therefore likely sharing only one or a few blocks) is probably nice for the ls -l case. Doing a depth-first approach would result in substantially more scattering. I think I'd probably rescind this recommended change.
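
The difference between the two orders can be seen on a toy tree. This is only a sketch: the tree is a nested dict (a dict is a directory, `None` is a file), and the "children first, then descend" order is modeled with a breadth-first queue, which is one reading of the description and not necessarily what libcomposefs does internally.

```python
from collections import deque
from typing import Dict, List, Optional

Tree = Dict[str, Optional[dict]]


def children_first_order(tree: Tree) -> List[str]:
    """Emit all children of a directory together, then descend."""
    order = ["/"]
    queue = deque([("/", tree)])
    while queue:
        path, node = queue.popleft()
        for name, child in sorted(node.items()):
            child_path = path.rstrip("/") + "/" + name
            order.append(child_path)
            if isinstance(child, dict):
                queue.append((child_path, child))
    return order


def depth_first_order(tree: Tree) -> List[str]:
    """Emit a directory, then recurse into each child immediately."""
    order: List[str] = []

    def visit(path: str, node: Optional[dict]) -> None:
        order.append(path)
        if isinstance(node, dict):
            for name, child in sorted(node.items()):
                visit(path.rstrip("/") + "/" + name, child)

    visit("/", tree)
    return order


tree = {"a": {"x": None, "y": None}, "b": None, "c": {"z": None}}
grouped = children_first_order(tree)
scattered = depth_first_order(tree)
```

In the first order `/a`, `/b` and `/c` are adjacent; in the depth-first order the whole subtree of `/a` sits between `/a` and `/b`, which is the scattering mentioned above.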

Another proposal in terms of keeping inodes tightly packed, though (after some IRC conversation with @hsiangkao): it might be nice to substantially decrease the amount of inlining we do and then try our hardest to make sure that we always fit complete inodes into blocks. This means that fstat() of an O_PATH would always be a single-page operation. The only place we might get into trouble is with a large amount/size of xattrs, but we could treat that case as degenerate.

We might also try to take a more holistic approach to allocating inodes within a single directory so that they all fit into a single page. This is getting into substantially more complicated territory, though, so it might make sense to take a pass on it. As it is, the current ordering that libcomposefs employs is already pretty good.

We could also make inlining dependent on the alignment that we find ourselves in when we go to write the inode. For example: if we see that we could write a 2k inline section without inserting additional padding, just go ahead and do it. If not, then write the inode "flat plain" and store the data in a block. We might come up with some sort of a more dynamic approach for "amount of padding we'd require" vs "amount of space we'd waste by shoving the data into a block" with a heavy preference to avoiding additional padding in the inode area, but this is again starting to sound a bit too complicated for my tastes. We might also say more static things like "we always inline things less than 128 (or 256) bytes, even if we have to insert padding", knowing that the amount of padding we'd have to insert will be small.
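
One possible shape for that decision is sketched below. It is a simplified model and not a specification: the 4096-byte block size, the 2k inline cap, the 128-byte "always inline" threshold, and all names are assumptions, and the inline data is assumed to start immediately after the inode (including its xattrs).

```python
BLOCK_SIZE = 4096           # assumed block size
MAX_INLINE = 2048           # assumed upper limit for inline data
ALWAYS_INLINE_BELOW = 128   # assumed "small enough to pad for" threshold


def crosses_block(start: int, length: int) -> bool:
    """True if [start, start + length) spans more than one block."""
    if length == 0:
        return False
    return start // BLOCK_SIZE != (start + length - 1) // BLOCK_SIZE


def choose_layout(pos: int, inode_size: int, data_len: int) -> str:
    """Decide how to store data for an inode that would be written at `pos`."""
    if data_len > MAX_INLINE:
        return "flat-plain"              # too large to inline at all
    data_start = pos + inode_size
    if not crosses_block(data_start, data_len):
        return "inline"                  # fits here with no extra padding
    if data_len < ALWAYS_INLINE_BELOW:
        return "inline-with-padding"     # small: accept a little padding
    return "flat-plain"                  # otherwise move data to a block


fits = choose_layout(pos=0, inode_size=64, data_len=2000)
spills = choose_layout(pos=3000, inode_size=64, data_len=2000)
tiny = choose_layout(pos=4000, inode_size=64, data_len=100)
```

The static threshold keeps the rule deterministic, which matters more here than squeezing out the last few bytes.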

Another way we could keep inodes compact is to "share" large xattrs even if they're unique. And we could also make these decisions dynamically based on alignment and our ability to write the inode into a single block without padding. I suspect that there's again not too much benefit to be had here, though.
