From bb17881e30f49a9fa12403f52f7ae258a978c6da Mon Sep 17 00:00:00 2001
From: Ed Morley <501702+edmorley@users.noreply.github.com>
Date: Fri, 15 Mar 2024 12:49:06 +0000
Subject: [PATCH] Significantly reduce size of ext3 `.img` assets (#263)

Our base images are built as Docker images and, on release, published to
Docker Hub. However, the Heroku platform currently doesn't use those images
directly. Instead, during release, `ext3`-formatted `.img` files are generated
from each Docker image, which are gzipped and uploaded to S3. At runtime these
are then mounted as a loopback device. For more background on this, see:
https://github.com/heroku/base-images/pull/42#issuecomment-250837783

Previously each `.img` file was created at a fixed size of 2400 MiB, thanks to
the `bs=100M count=24` arguments to `dd` (24 x 100 MiB blocks):
https://manpages.ubuntu.com/manpages/jammy/en/man1/dd.1.html

However, this is significantly oversized - for example, the Heroku-20 run
image's utilisation is only 29%:

```
$ df --human-readable /tmp/heroku-20
Filesystem      Size  Used Avail Use% Mounted on
/dev/loop3      2.3G  654M  1.6G  29% /tmp/heroku-20
```

This 1.6 GiB of free space is not required, since the image will be mounted as
read-only at runtime (the app's own storage lives in separate mounts).

At first glance this over-sizing might not seem like an issue: `dd` was
invoked with `if=/dev/zero`, so the empty space is zeroed out and therefore
compresses very well (the Heroku-22 run image gzips down to 216 MiB) - meaning
bytes over the wire and S3 storage costs are not impacted.

However, on the runtime hosts these images have to be stored/used
uncompressed - and in large quantity, due to the number of permutations of
stack versions/variants we've accumulated over time (`cedar{,-14}`,
`heroku-{16,18,20,22,24}{,-build}`).

In addition, during a base image release, the Common Runtime hosts have to
store both the old and new releases on disk side by side until old dynos
cycle - meaning the high water mark for disk usage is doubled for each non-EOL
stack.

With the recent addition of Heroku-24 to staging (which increased the storage
requirements high water mark by 9.4 GiB due to the above), this resulted in
disk space exhaustion on one partition of some of the single-dyno dedicated
instance types:
https://salesforce-internal.slack.com/archives/C01R6FJ738U/p1710236577625989

I did some research to check whether there was a specific reason for the
over-sizing, and found that the current `bs=100M count=24` arguments date back
to 2011:
https://github.com/heroku/capturestack/commit/8821890894a7521791e81e8bf8f6ab2b31c93c8e

The 2400 MiB figure seems to have been picked fairly arbitrarily - to roughly
fit the larger images at that time, with some additional headroom. I also
doubt disk usage was a concern back then, since there weren't yet the
single-dyno instance types (which have less allocated storage than the
multi-tenant instances) or the 12x stack versions/variants we've accumulated
since.

As such, rather than increase the allocated EBS storage fleet-wide to support
the Heroku-24 rollout, we can offset the increase for Heroku-24 (and in fact
reduce overall storage requirements significantly) by instead dynamically
sizing the `.img` files - basing their size on that of the base image contents
they hold.
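Concretely, the sizing logic ends up looking roughly like this - a condensed
sketch of the two scripts changed below (see the diff for the full comments;
`DOCKER_IMAGE` and `IMG` are assumed to already be set, as in
`capture-docker-stack`):

```
# Measure the uncompressed contents of the Docker image, in MiB.
DOCKER_IMAGE_SIZE_IN_MB=$(docker run --rm --user root "${DOCKER_IMAGE}" du -sx --block-size=M / | cut -d 'M' -f 1)

# Add ~7% to cover ext3 filesystem overhead plus a few MB of free space headroom.
IMG_SIZE_IN_MB=$((DOCKER_IMAGE_SIZE_IN_MB * 107 / 100))

# Allocate the file at exactly that size, then format it as ext3.
fallocate --length "${IMG_SIZE_IN_MB}MiB" "${IMG}"
mkfs -t ext3 -m 1 "${IMG}"
```

For example, the 654M used by the Heroku-20 run image above works out to a
~699 MiB allocation, which after ext3 overhead leaves only a few MB free -
hence the ~99% utilisation figures in the tables below.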
To do this I've chosen to create the `.img` file at an appropriate size
up-front, rather than try to shrink it afterwards, since the process of
shrinking would be fairly involved (eg: https://superuser.com/a/1771500),
require a lot more research/testing, and only gain us a couple of MiB of
additional savings. The `.img` file format will also eventually be sunset with
the move to CNBs / OCI images instead of slugs.

I've also added the printing of disk utilisation during the `.img` generation
process, which allows us to see the changes in image size:

### Before

```
Filesystem      Size  Used Avail Use% Mounted on
/dev/loop3      2.3G  654M  1.6G  29% /tmp/heroku-20
/dev/loop3      2.3G  1.5G  770M  67% /tmp/heroku-20-build
/dev/loop3      2.3G  661M  1.6G  30% /tmp/heroku-22
/dev/loop3      2.3G  1.1G  1.2G  46% /tmp/heroku-22-build
/dev/loop3      2.3G  669M  1.6G  30% /tmp/heroku-24
/dev/loop3      2.3G  1.2G  1.1G  51% /tmp/heroku-24-build

Total: 14400 MiB
```

### After

```
Filesystem      Size  Used Avail Use% Mounted on
/dev/loop3      670M  654M  8.7M  99% /tmp/heroku-20
/dev/loop3      1.6G  1.5G   23M  99% /tmp/heroku-20-build
/dev/loop3      678M  660M   11M  99% /tmp/heroku-22
/dev/loop3      1.1G  1.1G  6.8M 100% /tmp/heroku-22-build
/dev/loop3      686M  669M   10M  99% /tmp/heroku-24
/dev/loop3      1.2G  1.2G   11M 100% /tmp/heroku-24-build

Total: 6027 MiB
```

Across those 6 actively updated (non-EOL) stack variants we save 8.2 GiB,
which translates to a 16.4 GiB reduction in the high-water-mark storage
requirements for every Common Runtime instance in the fleet, and an 8.2 GiB
reduction for every Private Spaces runtime node (these receive updates via the
AMI, so don't store double the images during new releases).

There is also potentially another ~6.5 GiB of savings to be had from repacking
the `.img` files for the last release of each of the 6 EOL stack
versions/variants; however, since those stacks are no longer built/released,
that would need a more involved repacking approach. (Plus, since those stacks
aren't updated, they don't double their storage requirements on Common Runtime
during releases, so the realised overall reduction would be smaller.)
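As a local sanity check, a generated `.img` can be loop-mounted read-only
(mirroring what the runtime hosts do) and its utilisation inspected - a rough
sketch, using hypothetical paths:

```
# Loop-mount a generated image read-only and check its utilisation.
# The image path and mount point here are hypothetical.
mkdir -p /tmp/heroku-24
mount -o loop,ro /tmp/heroku-24.img /tmp/heroku-24
df --human-readable /tmp/heroku-24  # expect ~99% Use% with a few MB Avail
umount /tmp/heroku-24
```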
Docs for the various related tools:
https://manpages.ubuntu.com/manpages/jammy/en/man1/du.1.html
https://manpages.ubuntu.com/manpages/jammy/en/man1/df.1.html
https://manpages.ubuntu.com/manpages/jammy/en/man1/dd.1.html
https://manpages.ubuntu.com/manpages/jammy/en/man1/fallocate.1.html

GUS-W-15245261.
---
 tools/bin/capture-docker-stack  |  8 +++++++-
 tools/bin/make-filesystem-image | 17 ++++++++++++++++-
 2 files changed, 23 insertions(+), 2 deletions(-)

diff --git a/tools/bin/capture-docker-stack b/tools/bin/capture-docker-stack
index 16b0a5c0..8df9d7df 100755
--- a/tools/bin/capture-docker-stack
+++ b/tools/bin/capture-docker-stack
@@ -11,6 +11,11 @@ STACK_VERSION=$(echo "${STACK}" | cut -d '-' -f 2-)
 DOCKER_IMAGE=heroku/$STACK_NAME:$STACK_VERSION
 DOCKER_IMAGE_VERSION=$(docker inspect "${DOCKER_IMAGE}" | jq .[].Id | cut -d ':' -f 2 | cut -b 1-12)
 
+# Using `du` rather than the `Size` attribute from Docker inspect, since the latter appears to:
+# - Under-report usage slightly when using the overlay2 storage driver
+# - Be the compressed image size (instead of uncompressed) when using the containerd snapshotter
+# The `--user root` is required since the images for newer stacks default to a non-root user.
+DOCKER_IMAGE_SIZE_IN_MB=$(docker run --rm --platform linux/amd64 --user root "${DOCKER_IMAGE}" du -sx --block-size=M / | cut -d 'M' -f 1)
 IMG_BASE=${STACK_NAME}64-$STACK_VERSION-$DOCKER_IMAGE_VERSION
 IMG=/tmp/$IMG_BASE.img
 
@@ -23,7 +28,7 @@ IMG_PKG_VERSIONS=/tmp/$IMG_BASE.pkg.versions
 display "Starting capture for ${STACK} ${DOCKER_IMAGE_VERSION} at $(date)"
 
 display "Creating image file ${IMG}"
-make-filesystem-image "${IMG}" |& indent
+make-filesystem-image "${IMG}" "${DOCKER_IMAGE_SIZE_IN_MB}" |& indent
 
 display "Mounting image ${IMG_MNT}"
 mount-filesystem-image "${IMG}" "${IMG_MNT}" |& indent
@@ -35,6 +40,7 @@ display "Modifying image directories and files"
 install-heroku-files "${IMG_MNT}" |& indent
 
 display "Unmounting image"
+df --human-readable "${IMG_MNT}" |& indent
 umount "${IMG_MNT}" |& indent
 
 display "SHA256ing and gzipping image"

diff --git a/tools/bin/make-filesystem-image b/tools/bin/make-filesystem-image
index d3ad7802..932d77b0 100755
--- a/tools/bin/make-filesystem-image
+++ b/tools/bin/make-filesystem-image
@@ -3,8 +3,23 @@ set -euo pipefail
 
 IMG="$1"
+DOCKER_IMAGE_SIZE_IN_MB="$2"
+
+# We have to pick a fixed size in advance for the .img file we create, so base it on the size
+# of the original Docker image to avoid either wasting space or having the later tar extraction
+# step fail with out of disk space errors. The image will be mounted read-only at runtime, so
+# does not need free space for app files (separate mounts are used for those). The multiplier
+# here is to account for the 5-6% loss of usable space due to ext3 filesystem overhead, as well
+# as to ensure a few MB additional free space headroom.
+IMG_SIZE_IN_MB=$((DOCKER_IMAGE_SIZE_IN_MB * 107 / 100))
 
 mkdir -p "$(dirname "$IMG")"
-dd if=/dev/zero of="$IMG" bs=100M count=24
+
+# Create an empty file of the specified size.
+# Using `fallocate` instead of `dd` since it's faster, simpler for this use-case, and doesn't
+# suffer from `dd`'s non-determinism when attempting to copy an exact number of bytes:
+# https://unix.stackexchange.com/a/121888
+fallocate --length "${IMG_SIZE_IN_MB}MiB" "${IMG}"
+
 mkfs -t ext3 -m 1 "$IMG"
 tune2fs -c 0 -i 0 "$IMG"