-
I am facing the same situation (1 dropped out of 5) for …
-
Also having the same problem here, with the Actions Runner Controller. I suspect the API is having issues (the managed runners are behaving similarly poorly) but the status page is not really reflecting reality.
-
I am also having exactly the same problem. Can someone please advise?
-
Any updates? This issue is blocking a lot of things for us.
-
Experiencing the same thing. Restarting the runner service doesn't change anything; recreating the runner fixes the issue, but it's a hassle if the runners keep going idle and have to be recreated often. Still don't know the cause.
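For reference, "recreating" the runner here means unregistering it and registering it fresh. A minimal sketch, assuming an install at ~/actions-runner running as a service; <org> and <REG_TOKEN> are illustrative placeholders for your organization and a fresh registration token from the org's runner settings:

```bash
# Minimal sketch of recreating a self-hosted runner.
# Assumes ~/actions-runner installed as a service; <org> and <REG_TOKEN>
# are placeholders for your organization and a fresh registration token.
cd ~/actions-runner
sudo ./svc.sh stop
sudo ./svc.sh uninstall
./config.sh remove --token <REG_TOKEN>

# Re-register with the same URL, then reinstall and start the service.
./config.sh --url https://github.com/<org> --token <REG_TOKEN>
sudo ./svc.sh install
sudo ./svc.sh start
```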
-
I think it was working fine until version 2.304; there were no issues with that version. GitHub forces the version update even when it has not been tested properly.
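If the auto-update is the suspect, config.sh has a --disableupdate flag that registers the runner with self-updates turned off; note that GitHub still expects runners to stay within a supported version window, so this is a stopgap rather than a fix. A sketch, assuming a fresh registration with illustrative placeholders:

```bash
# Register a runner with automatic self-updates disabled.
# <org> and <REG_TOKEN> are placeholders.
./config.sh --url https://github.com/<org> \
            --token <REG_TOKEN> \
            --disableupdate
```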
This comment was marked as off-topic.
-
Recreating a runner makes the other runners pick up jobs automatically. This is very difficult to understand.
-
2 runners started to pick up jobs now. So instead of 4 out of 8 not working, it improved to 2 out of 8 not working, but I didn't change anything; GitHub is probably working on a fix. The runners are updated to …
-
Same error here, after (auto-)updating from 2.315 to 2.316. Reinstalling the runners (Current runner version: '2.316.0') works, and now they are processing jobs again.
-
@jcahigal But it doesn't last long; the runner will fail to pick up jobs again the next time you run a workflow.
-
Do you guys use these self-hosted runners most of the time? |
-
I have 105 self-hosted runners, most running continuously every day. I had two waves of runners going idle, probably related to when they updated. All seems to be in order after the second round of recreating the runners; all runners are currently working as expected.
-
Having this on 2.317.0: sometime around late June our runners just stopped picking up jobs. The repo is private, all config.sh checks pass, and removing and re-adding the runner did not help.
-
I feel that the confusion arises from the workflow trying to determine which runner to choose, as they all have the same labels and belong to the same group. So I suggest adding an additional label (like a tag) to distinguish between similar runners.
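One way to do that is to pass an extra, unique label at registration time and target it from the workflow. A sketch with an illustrative builder-01 label and placeholder org/token:

```bash
# Register a runner with an extra distinguishing label; a job can then
# target this specific machine with: runs-on: [self-hosted, builder-01]
./config.sh --url https://github.com/<org> \
            --token <REG_TOKEN> \
            --labels builder-01
```

(The default self-hosted label is still applied automatically, so jobs that only specify runs-on: self-hosted keep working.)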
-
Topic Area: Bug
Problem
I have 8 self-hosted runners that are configured identically. 4 of them stopped working yesterday (22 Apr) around 08:44 UTC.
When a workflow runs, 4 of the runners actively pick up jobs; the other 4 stay idle as if there were no jobs. Those 4 runners are not offline: they are marked as online, connected, and idle, yet they pick up nothing. The expected behaviour is that all 8 of them pick up jobs.
It is not related to any concurrency limit; it is consistently those specific 4 runners not picking up jobs, across different repos in the organization.
What it looks like
The runners page in the organization, 4 active, 4 idle (screenshot).
But at the same time, there are plenty of checks waiting for a runner to pick them up (screenshot).
About the runners
- All on version `2.315.0`
- All labeled `self-hosted`
- All in the same runner group, `Default`
I couldn't figure out why 4 of them work while the other 4 suddenly stopped.
What I have tried to fix it
- `sudo ./svc.sh stop` and then `sudo ./svc.sh start`. This marks the service as offline and then idle again, but it still doesn't pick up jobs.
- `sudo reboot`; similarly, the service goes offline and then comes back online, but still doesn't pick up jobs.
- `./run.sh --check --url <org_url> --pat <my_pat>`; all checks passed.
- Checking the runner logs in the `_diag` directory: only this error since it stopped working.
- Checking the last worker logs in the `_diag` directory: nothing special.
- Checking `journalctl`: no errors logged, as if no new job is ever requested.
- Checking their versions: `2.315.0`, all at the latest version.
- Docker is installed and active (`sudo systemctl is-active docker.service`).

Has anyone faced this before? Any clue what else I can check? Could this be a bug on GitHub's side? Thanks in advance!
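For anyone triaging the same symptom, a minimal log-checking sketch, assuming the runner lives in ~/actions-runner and was installed as a systemd service (svc.sh names the unit actions.runner.<org>.<runner-name>); the date below is illustrative:

```bash
# Scan the listener logs in _diag for errors around the incident.
cd ~/actions-runner
grep -ri "error" _diag/Runner_*.log | tail -n 20

# Check the service; journalctl accepts a glob for the unit name.
sudo ./svc.sh status
journalctl -u 'actions.runner.*' --since "2024-04-22" --no-pager | tail -n 50
```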