Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

run e2e test on tmpfs #22533

Merged
merged 4 commits into from
May 13, 2024
Merged

run e2e test on tmpfs #22533

merged 4 commits into from
May 13, 2024

Conversation

Luap99
Copy link
Member

@Luap99 Luap99 commented Apr 29, 2024

Follow up to commit eaf60c7, lets see how bad things are going to break.

Does this PR introduce a user-facing change?

None

@openshift-ci openshift-ci bot added release-note-none do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Apr 29, 2024
Copy link

Ephemeral COPR build failed. @containers/packit-build please check.

@edsantiago
Copy link
Member

Error logs are misleading. Actual error seems to be in BeforeEtc:

  # podman [options] load -q -i /var/tmp/registry.fedoraproject.org-fedora-toolbox-36.tar
           Error: payload does not match any of the supported image formats:
            * oci: open /var/tmp/registry.fedoraproject.org-fedora-toolbox-36.tar/index.json: not a directory
            * oci-archive: loading index: open /var/tmp/container_images_oci1129938589/index.json: no such file or directory
            * docker-archive: writing blob: adding layer with blob "sha256:5067dfc06cd159373bab350ebcb8986dd729e64324592b6f28bbcdb6fc5efda0": processing tar file(write /usr/lib64/dri/nouveau_drv_video.so: no space left on device): exit status 1
            * dir: open /var/tmp/registry.fedoraproject.org-fedora-toolbox-36.tar/manifest.json: not a directory

@Luap99
Copy link
Member Author

Luap99 commented Apr 29, 2024

Error logs are misleading. Actual error seems to be in BeforeEtc:

  # podman [options] load -q -i /var/tmp/registry.fedoraproject.org-fedora-toolbox-36.tar
           Error: payload does not match any of the supported image formats:
            * oci: open /var/tmp/registry.fedoraproject.org-fedora-toolbox-36.tar/index.json: not a directory
            * oci-archive: loading index: open /var/tmp/container_images_oci1129938589/index.json: no such file or directory
            * docker-archive: writing blob: adding layer with blob "sha256:5067dfc06cd159373bab350ebcb8986dd729e64324592b6f28bbcdb6fc5efda0": processing tar file(write /usr/lib64/dri/nouveau_drv_video.so: no space left on device): exit status 1
            * dir: open /var/tmp/registry.fedoraproject.org-fedora-toolbox-36.tar/manifest.json: not a directory

Yeah I think the first step is to remove the big images we use, the image dir is already over 1.1G so we should get this down as first step. Maybe we can get away with 4 Gb RAM then.
Also why the hell did this even run if the setup failed?! That needs fixing too.

@edsantiago
Copy link
Member

This maybe?

diff --git a/test/e2e/common_test.go b/test/e2e/common_test.go
index bc1f73b82..f617c3a58 100644
--- a/test/e2e/common_test.go
+++ b/test/e2e/common_test.go
@@ -1027,6 +1027,7 @@ func (p *PodmanTestIntegration) RestoreArtifactToCache(image string) error {
 		p.Root = p.ImageCacheDir
 		restore := p.PodmanNoEvents([]string{"load", "-q", "-i", tarball})
 		restore.WaitWithDefaultTimeout()
+		Expect(restore).To(ExitCleanly())
 	}
 	return nil
 }

@Luap99
Copy link
Member Author

Luap99 commented Apr 29, 2024

This maybe?

diff --git a/test/e2e/common_test.go b/test/e2e/common_test.go
index bc1f73b82..f617c3a58 100644
--- a/test/e2e/common_test.go
+++ b/test/e2e/common_test.go
@@ -1027,6 +1027,7 @@ func (p *PodmanTestIntegration) RestoreArtifactToCache(image string) error {
 		p.Root = p.ImageCacheDir
 		restore := p.PodmanNoEvents([]string{"load", "-q", "-i", tarball})
 		restore.WaitWithDefaultTimeout()
+		Expect(restore).To(ExitCleanly())
 	}
 	return nil
 }

yes

@Luap99
Copy link
Member Author

Luap99 commented Apr 29, 2024

Q: why did debian pass?
A: debian VM images do not have tmpfs mounted at /tmp

@edsantiago
Copy link
Member

A: debian VM images do not have tmpfs mounted at /tmp

Yeah... I noticed that while playing with my fedora tmpfs changes. AFAICT that is the Debian default, not something we do in VM creation, so I left it as-is.

@Luap99
Copy link
Member Author

Luap99 commented Apr 29, 2024

A: debian VM images do not have tmpfs mounted at /tmp

Yeah... I noticed that while playing with my fedora tmpfs changes. AFAICT that is the Debian default, not something we do in VM creation, so I left it as-is.

filled containers/automation_images#350 to track this, but first let's see how much of a difference this really makes here first

@edsantiago
Copy link
Member

I think you're onto something. Ballpark shows tests running in ~30m, down from ~40

type distro user DB local remote container
int rawhide root 27:37 29:44
int rawhide rootless 28:55
int fedora-39 root 27:42 28:59 !31:15
int fedora-39 rootless 27:02
int fedora-38 root boltdb 38:25 35:19 !29:37
int fedora-38 rootless boltdb !28:17
int debian-13 root 32:49 31:52
int debian-13 rootless 29:21
sys rawhide root 01:04 01:13
sys rawhide rootless 01:07

@Luap99
Copy link
Member Author

Luap99 commented Apr 29, 2024

Yeah that looks like a solid start, not sure how accurate the time report is
https://api.cirrus-ci.com/v1/artifact/task/4837578803773440/runner_stats/int%20podman%20fedora-39%20root%20host%20sqlite-runner_stats.log
https://api.cirrus-ci.com/v1/artifact/task/5963478710616064/runner_stats/int%20podman%20fedora-38%20root%20host%20boltdb-runner_stats.log

We clearly got the system time down a lot which signals IO is a problem, also we still did not max out the CPU per the stats reporting in the cirrus UI so this surprises me a bit because locally this goes full out and I have close to 100% CPU usage on all cores. But maybe the cirrus graph is not counting everything?

@Luap99 Luap99 force-pushed the e2e-tmp-ci branch 2 times, most recently from bd1b2c0 to 59a60bf Compare May 2, 2024 16:32
@Luap99
Copy link
Member Author

Luap99 commented May 2, 2024

@edsantiago What do you think about the TMPDIR change? No idea if it will pass but it gives us a random dir each time so it should not allow anyone to hard code /tmp/something paths. It is always on /tmp though so unless we manually mount some new tmpfs to a random root dir I don't think there is a better way.

@edsantiago
Copy link
Member

I like it.

@edsantiago
Copy link
Member

Oops: containerized

Don't re-push yet though; I'm really curious to see how this'll work out

@Luap99
Copy link
Member Author

Luap99 commented May 2, 2024

Oops: containerized

Don't re-push yet though; I'm really curious to see how this'll work out

Which is perfect as it shows my check actually works, and yeah I let it run and will continue tomorrow.

@edsantiago
Copy link
Member

I can't reproduce. Doesn't seem to be AVC. Could it be that the new remount is adding noexec? (SWAG. I haven't looked into it deeply enough and am unlikely to do so today)

@Luap99
Copy link
Member Author

Luap99 commented May 3, 2024

I can't reproduce. Doesn't seem to be AVC. Could it be that the new remount is adding noexec? (SWAG. I haven't looked into it deeply enough and am unlikely to do so today)

Which of the failures are you talking about? There just to many to make sense of this. /tmp does not have noexec set, if it would be then no container would be able to run in the e2e tests.

@Luap99
Copy link
Member Author

Luap99 commented May 3, 2024

type distro user DB local remote container
int rawhide root 28:12 26:42
int rawhide rootless 27:36
int fedora-39 root 28:07 27:59 27:15
int fedora-39 rootless 27:04
int fedora-38 root boltdb 29:11 30:03 26:52
int fedora-38 rootless boltdb 29:46
int debian-13 root 29:36 28:29
int debian-13 rootless !29:14

Seems to be a bit under 30m now, looking at other PRs they seem to be in the range of 35-40+ mins so I think it is safe to say this is a noticeable change.

I will try to drop the toolbox change as I think it also helps a bit and I only want to see the tmpfs diff.

Luap99 added a commit to Luap99/libpod that referenced this pull request May 3, 2024
The image is way to big (over 800MB) that slows tests down as we always
have to pull this, the tests itself are also super slow due the
entrypoint logic that we don't care about. We should be testing for
features needed and not specific tools.

I think the current changes should have a similar coverage in terms of
podman features, it no longer tests toolbox but IMO this never was a
task for podman CI tests.

The main driver for this is to make the tests run entirely based on
tmpfs and this image is just to much[1].

[1] containers#22533

Signed-off-by: Paul Holzinger <[email protected]>
@Luap99
Copy link
Member Author

Luap99 commented May 3, 2024

[+1075s] not ok 285 [125] podman export, alter tarball, re-import
         # tags: distro-integration
         # (in test file test/system/[125-import.bats, line 89](https://github.com/containers/podman/blob/9fa6d0d5cc375790a143f33817303c32b0846be4/test/system/125-import.bats#L89))
         #   `tar -C $PODMAN_TMPDIR -rf $PODMAN_TMPDIR/$b_cnt.tar tmp/testfile2' failed with status 2
         #
<+     > # # podman rm -t 0 --all --force --ignore
         #
<+046ms> # # podman ps --all --external --format {{.ID}} {{.Names}}
         #
<+051ms> # # podman images --all --format {{.Repository}}:{{.Tag}} {{.ID}}
<+045ms> # quay.io/libpod/testimage:20240123 1f6acd4c4a1d
         #
<+608ms> # # podman build -t before_change_img /tmp/CI_GsPZ/podman_bats.HLYIBo
<+485ms> # STEP 1/3: FROM quay.io/libpod/testimage:20240123
         # STEP 2/3: ADD testfile1 /tmp
         # --> a9813c7c65da
         # STEP 3/3: WORKDIR /tmp
         # COMMIT before_change_img
         # --> ef8a0675b6e9
         # Successfully tagged localhost/before_change_img:latest
         # ef8a0675b6e9ad6e9a85d112ec86f3753290f88b763f592381b0fd406950567d
         #
<+009ms> # # podman create --name before_change_cnt before_change_img
<+085ms> # cb630d35167fa2d230f93076774d4fda2f2747f63f3fb07d3a52a25b65fe2287
         #
<+009ms> # # podman export -o /tmp/CI_GsPZ/podman_bats.HLYIBo/before_change_cnt.tar before_change_cnt
         #
<+159ms> # # podman rm -t 0 -f before_change_cnt
<+075ms> # before_change_cnt
         # # tar --delete -f (tmpdir)/before_change_cnt.tar tmp/testfile1
         # # tar -C (tmpdir) -rf (tmpdir)/before_change_cnt.tar tmp/testfile2
         # tar: Skipping to next header
         # tar: Skipping to next header
         # tar: Exiting with failure status due to previous errors

I suspect we hit another tar bug now in debian, maybe the same as #19407??


Other issue:

rm: cannot remove '/tmp/CI_0FhD/buildah3968082470/mnt': Permission denied

Seems to be the removal of the new TMPDIR I create, not seen on every run but seem to be rootless only so I guess it could be due missing podman unshare and leaked files without different owners...
But the real issue should be we are leaking temporarily buildah files which is really not good.

@cevich
Copy link
Member

cevich commented May 8, 2024

Yes sure that is the next step but one step at the time, for now lets get the tmpfs change done.

Yep, np. Was just concerned if maybe there were some failures due to OOM (I haven't looked).

Copy link

Cockpit tests failed for commit 385c4931653d1a66536f6b7c05597dbf584de333. @martinpitt, @jelly, @mvollmer please check.

Luap99 added 4 commits May 13, 2024 17:26
Follow up to commit eaf60c7, with the toolbox image removal it is
possible to run all tests from tmpfs.

Signed-off-by: Paul Holzinger <[email protected]>
First, setup a custom TMPDIR to ensure we have no special assumptions
about hard coded paths. Second, make sure it is actually on a tmpfs so
we can catch regressions in the VM setup immediately.

Signed-off-by: Paul Holzinger <[email protected]>
This reverts commit 02b8fd7.
The new CI images should have a apparmor workaround.

Fixes containers#22625

Signed-off-by: Paul Holzinger <[email protected]>
Copy link

Ephemeral COPR build failed. @containers/packit-build please check.

@Luap99 Luap99 changed the title WIP: run e2e test on tmpfs run e2e test on tmpfs May 13, 2024
@Luap99 Luap99 marked this pull request as ready for review May 13, 2024 15:50
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 13, 2024
@Luap99
Copy link
Member Author

Luap99 commented May 13, 2024

This is good to review now, I think tests are going to pass this time around
@cevich @edsantiago @containers/podman-maintainers PTAL

Copy link

Cockpit tests failed for commit 9233864. @martinpitt, @jelly, @mvollmer please check.

Copy link
Member

@edsantiago edsantiago left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice and clean. LGTM with one suggestion if you need to re-push for other reasons.

Keeping my fingers crossed about the new pasta.

export TMPDIR
fstype=$(findmnt -n -o FSTYPE --target $TMPDIR)
if [[ "$fstype" != "tmpfs" ]]; then
die "The CI test TMPDIR is not on a tmpfs mount, we need tmpfs to make the tests faster"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not worth a re-push, but: in case of failure, this will be very difficult to debug. A more helpful message might be something like

    die "The CI test TMPDIR ($TMPDIR) fs type is '$fstype'; it should be 'tmpfs' (PR #22533)"

Copy link
Contributor

openshift-ci bot commented May 13, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: edsantiago, Luap99

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@edsantiago
Copy link
Member

Timing results:

type distro user DB local remote container
int rawhide root 31:02 26:26
int rawhide rootless 29:28
int fedora-40 root 28:04 26:47 26:28
int fedora-40 rootless 27:42
int fedora-39 root boltdb 28:11 30:25 28:29
int fedora-39 rootless boltdb 29:47
int debian-13 root 29:23 28:56
int debian-13 rootless 25:56

Seems to shave 2-5 minutes on podman local, maybe 5-8 remote.

@rhatdan
Copy link
Member

rhatdan commented May 13, 2024

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label May 13, 2024
@openshift-merge-bot openshift-merge-bot bot merged commit c9808e7 into containers:main May 13, 2024
88 of 91 checks passed
@Luap99 Luap99 deleted the e2e-tmp-ci branch May 13, 2024 18:32
@stale-locking-app stale-locking-app bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Aug 12, 2024
@stale-locking-app stale-locking-app bot locked as resolved and limited conversation to collaborators Aug 12, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. release-note-none
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants