w/ GitLab Registry: "invalid status code from registry 404" #17999

vrothberg · 2023-03-31T11:10:38Z

Discussed in #16842

^{Originally posted by 2xB October 6, 2022}

Is this a BUG REPORT or FEATURE REQUEST? (leave only one on its own line)

/kind bug

Description

When pushing to a GitLab Container Registry, I sometimes randomly get Error: trying to reuse blob sha256:... at destination: Requesting bearer token: invalid status code from registry 404 (Not Found). Weirdly enough, when I upload multiple images from the same base image, I get this error message once for every image I try to push, the second push for each image works again.

Steps to reproduce the issue:

Have a GitLab project with a GitLab Runner that can build and push Podman images (e.g. via Shell executor or custom executor)
Use it very often
Sometimes (rarely) get this error during podman push ...

Describe the results you received:

Error: trying to reuse blob sha256:... at destination: Requesting bearer token: invalid status code from registry 404 (Not Found)

Describe the results you expected:

Successfully pushing the image to the GitLab Container Registry

Additional information you deem important (e.g. issue happens only occasionally):

I have a full log of podman --log-level=debug push ... the time it fails. I probably can't post the full log, but if there's something to check in that log, please tell!

Output of podman version:

Client:       Podman Engine
Version:      4.2.1
API Version:  4.2.1
Go Version:   go1.18.5
Built:        Wed Sep  7 19:58:19 2022
OS/Arch:      linux/amd64

Output of podman info:

host:
  arch: amd64
  buildahVersion: 1.27.0
  cgroupControllers:
  - cpuset
  - cpu
  - cpuacct
  - blkio
  - memory
  - devices
  - freezer
  - net_cls
  - perf_event
  - net_prio
  - hugetlb
  - pids
  - rdma
  cgroupManager: cgroupfs
  cgroupVersion: v1
  conmon:
    package: conmon-2.1.4-2.fc36.x86_64
    path: /usr/bin/conmon
    version: 'conmon version 2.1.4, commit: '
  cpuUtilization:
    idlePercent: 94.36
    systemPercent: 0.85
    userPercent: 4.78
  cpus: 6
  distribution:
    distribution: fedora
    variant: container
    version: "36"
  eventLogger: file
  hostname: 04e8b959c867
  idMappings:
    gidmap: null
    uidmap: null
  kernel: 5.4.0-126-generic
  linkmode: dynamic
  logDriver: k8s-file
  memFree: 2666459136
  memTotal: 4914487296
  networkBackend: netavark
  ociRuntime:
    name: crun
    package: crun-1.6-2.fc36.x86_64
    path: /usr/bin/crun
    version: |-
      crun version 1.6
      commit: 18cf2efbb8feb2b2f20e316520e0fd0b6c41ef4d
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +YAJL
  os: linux
  remoteSocket:
    path: /run/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: false
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: false
  serviceIsRemote: false
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: slirp4netns-1.2.0-0.2.beta.0.fc36.x86_64
    version: |-
      slirp4netns version 1.2.0-beta.0
      commit: 477db14a24ff1a3de3a705e51ca2c4c1fe3dda64
      libslirp: 4.6.1
      SLIRP_CONFIG_VERSION_MAX: 3
      libseccomp: 2.5.3
  swapFree: 4243320832
  swapTotal: 4294963200
  uptime: 26h 8m 44.00s (Approximately 1.08 days)
plugins:
  authorization: null
  log:
  - k8s-file
  - none
  - passthrough
  - journald
  network:
  - bridge
  - macvlan
  volume:
  - local
registries:
  search:
  - registry.fedoraproject.org
  - registry.access.redhat.com
  - docker.io
  - quay.io
store:
  configFile: /etc/containers/storage.conf
  containerStore:
    number: 0
    paused: 0
    running: 0
    stopped: 0
  graphDriverName: overlay
  graphOptions:
    overlay.imagestore: /var/lib/shared
    overlay.mount_program:
      Executable: /usr/bin/fuse-overlayfs
      Package: fuse-overlayfs-1.9-1.fc36.x86_64
      Version: |-
        fusermount3 version: 3.10.5
        fuse-overlayfs: version 1.9
        FUSE library version 3.10.5
        using FUSE kernel interface version 7.31
    overlay.mountopt: nodev,fsync=0
  graphRoot: /var/lib/containers/storage
  graphRootAllocated: 40501673984
  graphRootUsed: 16835969024
  graphStatus:
    Backing Filesystem: extfs
    Native Overlay Diff: "false"
    Supports d_type: "true"
    Using metacopy: "false"
  imageCopyTmpDir: /tmp/custom-executor1517200462
  imageStore:
    number: 57
  runRoot: /run/containers/storage
  volumePath: /var/lib/containers/storage/volumes
version:
  APIVersion: 4.2.1
  Built: 1662580699
  BuiltTime: Wed Sep  7 19:58:19 2022
  GitCommit: ""
  GoVersion: go1.18.5
  Os: linux
  OsArch: linux/amd64
  Version: 4.2.1

Package info (e.g. output of rpm -q podman or apt list podman or brew info podman):

podman/stable from quay.io, run inside another Podman inside a VirtualBox

Have you tested with the latest version of Podman and have you checked the Podman Troubleshooting Guide? (https://github.com/containers/podman/blob/main/troubleshooting.md)

Yes

Additional environment details (AWS, VirtualBox, physical, etc.):

podman/stable from quay.io, run inside another Podman inside a VirtualBox
```</div>

The text was updated successfully, but these errors were encountered:

vrothberg · 2023-03-31T11:12:29Z

Relevant quote from @mtrmac in the discussion:

404 when trying to authenticate? Ugh. The ActiveRecord log above is an explanation for how that 404 happens, but it’s still a server-side bug if it reports “this does not exist” and not “server unavailable”, or whatever the cause is…

Right now this error case does not return a Go-detectable error. That can be fixed, of course.

Automatically retrying on a 404… really seems like a bad papering over of a clear problem; if we were building a list of things to never retry on, a 404 would definitely be on it.

I guess accepting the idea of “cosmic-rays-induced failures” and retrying on pretty much everything (except where the case is known to be local and unfixable, like out of disk space) in c/common/pkg/retry might make sense — as long as we only retry once?

mtrmac · 2023-04-01T08:41:47Z

I guess accepting the idea of “cosmic-rays-induced failures” and retrying on pretty much everything

I think it would also be useful to be quite loud about it (warning-level logs at least).

andrewleech · 2023-04-02T00:48:55Z

I'm planning on raising a matching issue of Gitlab's issue tracker when I'm back in the office next week to get their input.

It's worth noting though that site-wide on gitlab, any time there's an auth / permission restriction (eg trying to view a private project you don't have access to) the server responds with 404. I think this is a valid response choice from a security perspective, in that the server can't "leak" the existence of a resource you don't have permission to view.

That might be why we get a 404 here, if it's some kind of rate limiting on the auth endpoint?

andrewleech · 2023-04-02T23:57:08Z

I've raised https://gitlab.com/gitlab-org/gitlab/-/issues/404326 to request some feedback from the gitlab side.

mtrmac · 2023-04-03T13:21:08Z

It's worth noting though that site-wide on gitlab, any time there's an auth / permission restriction (eg trying to view a private project you don't have access to) the server responds with 404.

That concept of pretending something doesn’t exist is fine; but the server still needs to follow other constraints of the protocol.

E.g. that other ticket shows that the server returns a 404 on GET /jwt/auth?…. I read that as claiming not that the parameters refer to an inconsistent repo, but that /jwt/auth itself doesn’t exist, and that doesn’t really make sense.

What is the client to do with that information? We could, reasonably, and consistently with that concept, set up this operation to treat 404 (“does not exist”) the same as 403 (unauthorized) — but 403 is explicitly one of the results where we don’t retry because it wouldn’t help and it could hurt (by locking the user out due to repeated authentication failures).

If the server is encountering some kind of outage / downtime / overload, I think it should indicate an outage / downtime / overload; that reveals nothing about the authentication rules to the client, but it allows the client to make a desirable retry decision.

2xB · 2023-04-05T08:25:48Z

I don't think I have seen this error again after reporting it, although it did occur multiple times before. The only thing I can imagine is that since then, I've regularly updated both GitLab and Podman to the latest version. Potentially it makes sense to track with which Podman and GitLab versions people see this issue.

@mtrmac Regarding I think it would also be useful to be quite loud about it (warning-level logs at least).: For the original bug report I have such a log. As mentioned in the initial bug report, I probably can't post the full log, but if there's something to check in that log, please tell!

mtrmac · 2023-04-05T12:33:58Z

Historically, there was https://gitlab.com/gitlab-org/gitlab/-/issues/215715 , where a repo is expected to be auto-created on a first upload, but if it is accessed while it is being auto-created (in this case, with parallel uploads of multiple layers), some of the attempts may fail. I’m not at all sure this is related in any way, but it is a bit suggestive in that maybe failures encountered while setting up a process might not show up afterwards.

The “quite loud warning-level logs” idea was suggested as something to include in the future auto-retry behavior of Podman, so that there is a trace of a thing failing. It wouldn’t directly help with this situation, especially if the failures no longer occur.

In this case, I think the activerecord::RecordNotFound logs in the GitLab ticket are most important data to allow a code fix.

2xB · 2023-04-05T23:00:33Z

@mtrmac This is an amazing find. I never considered it but yes, when I reported this issue, that was the first push to that image repository and after deleting an image repository and pushing to it again, I again see this issue. A reason why I never considered it is that this is not happening at the beginning of the push transaction, but rather at the end. In any case I can therefore confirm that also for me this happens when repositories are auto-created.

bmaupin · 2023-04-25T18:56:22Z

This happens to me almost every time when pushing to a new image registry. The first push fails, but GitLab creates the registry successfully, and the second push works fine.

I can't reproduce it with docker at all, only buildah. Maybe docker has some kind of built-in retry logic?

bmaupin · 2023-04-25T20:28:06Z

My workaround is to just try the push twice. So in my .gitlab-ci-yml, I replaced

buildah push --format=v2s2 "$CI_APPLICATION_REPOSITORY:$CI_APPLICATION_TAG"

with

buildah push --format=v2s2 "$CI_APPLICATION_REPOSITORY:$CI_APPLICATION_TAG" || sleep 10; buildah push --format=v2s2 "$CI_APPLICATION_REPOSITORY:$CI_APPLICATION_TAG"

mhio · 2023-04-26T04:18:51Z

Maybe docker has some kind of built-in retry logic?

I haven't seen Docker attempt 6 simultaneous auth requests for the JWT bearer token prior to the image push.
If the root cause is a race condition in gitlabs active record, it's hard to trigger that when there is no race.

A buildah login first might work, instead of the push attempting to populate the bearer token .

bmaupin · 2023-04-26T13:40:17Z

A buildah login first might work, instead of the push attempting to populate the bearer token .

I'm pushing to an authenticated registry so I'm already doing a buildah login. That doesn't seem to fix the issue.

I assumed the 404 was because the registry wasn't found (because it wasn't created yet), not because the auth actually failed.

mtrmac · 2023-04-26T16:21:00Z

Requesting bearer token: invalid status code from registry 404 (Not Found)

Technically, the authentication/authorization step is failing. But that doesn’t rule out the possibility that a single authentication request would have succeeded, while a series of concurrent ones triggers a failure.

The code could, possibly, track authentication requests in flight, and wait for an existing one to succeed instead of starting a parallel one. I don’t immediately know whether it would avoid this problem; it’s anyway the polite thing to do, against servers that could be rate-limiting authentication attempts (especially if the user provided incorrect credentials). It might even be a slight performance improvement.

mhio · 2023-04-27T02:34:44Z

I'm pushing to an authenticated registry so I'm already doing a buildah login. That doesn't seem to fix the issue.

I assumed the 404 was because the registry wasn't found (because it wasn't created yet), not because the auth actually failed.

Interesting, I had only seen a 404 in the batch of JWT auth requests to gitlab, not in the requests to the registry.

github-actions · 2023-05-28T00:07:06Z

A friendly reminder that this issue had no activity for 30 days.

andrewleech · 2023-05-28T00:21:44Z

Fyi the ticket I raised at gitlab for this issue has been triaged as severity: major so hopefully we'll get some investigating from their end soon

There was also a potential workaround posted that involves building to a different tag name, then retagging to desired tag, though it doesn't make much sense to me as I wouldn't think it'd result in any change to registry api calls:
https://gitlab.com/gitlab-org/gitlab/-/issues/404326#note_1401274540

mtrmac · 2023-05-29T21:24:02Z

Technically, the authentication/authorization step is failing. But that doesn’t rule out the possibility that a single authentication request would have succeeded, while a series of concurrent ones triggers a failure.

The code could, possibly, track authentication requests in flight, and wait for an existing one to succeed instead of starting a parallel one. I don’t immediately know whether it would avoid this problem; it’s anyway the polite thing to do, against servers that could be rate-limiting authentication attempts (especially if the user provided incorrect credentials). It might even be a slight performance improvement.

I have a draft implementation in containers/image#1968 . Could someone who can reliably reproduce this GitLab failure test a Podman build with that change?

bcg62 · 2023-07-07T15:57:56Z

@mtrmac I can reproduce this, if you or someone has a build available i'll give it a try.

andrewleech · 2023-10-03T03:58:56Z

If anyone has ideas / information regarding reproducing the issue (even just describing your CI setup that does reproduce it) gitlab are trying to investigate it here: https://gitlab.com/gitlab-org/gitlab/-/issues/404326#note_1587264776

vrothberg added the kind/bug Categorizes issue or PR as related to a bug. label Mar 31, 2023

github-actions bot added the stale-issue label May 28, 2023

rhatdan removed the stale-issue label May 29, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

w/ GitLab Registry: "invalid status code from registry 404" #17999

w/ GitLab Registry: "invalid status code from registry 404" #17999

vrothberg commented Mar 31, 2023

vrothberg commented Mar 31, 2023

mtrmac commented Apr 1, 2023

andrewleech commented Apr 2, 2023

andrewleech commented Apr 2, 2023

mtrmac commented Apr 3, 2023 •

edited

Loading

2xB commented Apr 5, 2023

mtrmac commented Apr 5, 2023

2xB commented Apr 5, 2023

bmaupin commented Apr 25, 2023

bmaupin commented Apr 25, 2023

mhio commented Apr 26, 2023

bmaupin commented Apr 26, 2023

mtrmac commented Apr 26, 2023

mhio commented Apr 27, 2023

github-actions bot commented May 28, 2023

andrewleech commented May 28, 2023

mtrmac commented May 29, 2023

bcg62 commented Jul 7, 2023

andrewleech commented Oct 3, 2023

w/ GitLab Registry: "invalid status code from registry 404" #17999

w/ GitLab Registry: "invalid status code from registry 404" #17999

Comments

vrothberg commented Mar 31, 2023

Discussed in #16842

vrothberg commented Mar 31, 2023

mtrmac commented Apr 1, 2023

andrewleech commented Apr 2, 2023

andrewleech commented Apr 2, 2023

mtrmac commented Apr 3, 2023 • edited Loading

2xB commented Apr 5, 2023

mtrmac commented Apr 5, 2023

2xB commented Apr 5, 2023

bmaupin commented Apr 25, 2023

bmaupin commented Apr 25, 2023

mhio commented Apr 26, 2023

bmaupin commented Apr 26, 2023

mtrmac commented Apr 26, 2023

mhio commented Apr 27, 2023

github-actions bot commented May 28, 2023

andrewleech commented May 28, 2023

mtrmac commented May 29, 2023

bcg62 commented Jul 7, 2023

andrewleech commented Oct 3, 2023

mtrmac commented Apr 3, 2023 •

edited

Loading