
Add a net health recovery service to qemu machines #21262

Merged 1 commit into containers:main on Jan 17, 2024

Conversation

@n1hility (Member) commented Jan 16, 2024

There is a network stability issue in qemu + virtio, affecting some users after long periods of usage, which can lead to suspended queue delivery. Until the issue is resolved, add a temporary recovery service that restarts networking when host communication becomes inoperable. Only qemu-based machines on macOS activate this service, as the issue is understood to be qemu-specific.

Works around issue in #20639

How to verify:

export CONTAINERS_MACHINE_PROVIDER=qemu
podman machine rm
podman machine init
podman machine start
podman machine ssh
# wait at least 2 minutes until the service becomes active, then take down the network to simulate a failure
ifconfig enp0s1 down
# After a minute networking should resume and the next command prompt should appear
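
Roughly, the recovery service runs a script along these lines, reconstructed from the snippets quoted in the review threads below (the exact timeouts and details in the merged version may differ):

#!/bin/bash
# Verifies that the gvproxy gateway is reachable and bounces the NIC when it is not.
# This is a temporary workaround for a known rare qemu/virtio issue that affects some systems.

sleep 120 # allow time for network setup on initial boot
while true; do
    sleep 30
    curl -s -o /dev/null --connect-timeout 10 http://192.168.127.1/health
    if [ "$?" != "0" ]; then
        echo "bouncing nic due to loss of connectivity with host"
        ifconfig enp0s1 down
        ifconfig enp0s1 up
    fi
done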

Does this PR introduce a user-facing change?

Add a net recovery service to detect and recover from an inoperable host networking issue experienced by some macOS qemu users when the VM runs for long periods of time

openshift-ci bot added the release-note and approved labels on Jan 16, 2024
@n1hility (Member, Author):

FYI @benoitf

// cc @baude @ashley-cui

@n1hility changed the title from "Add a net health recovery service to Qemu machines" to "Add a net health recovery service to qemu machines" on Jan 16, 2024
sleep 120 # allow time for network setup on initial boot
while true; do
sleep 30
curl -s -o /dev/null --connect-timeout 10 http://192.168.127.1/health
Member:

Where does this IP Address come from?

Member Author:

This is the gvproxy gateway address used by the guest. In addition to routing, gvproxy runs a built-in HTTP server on this address for management of port forwards:

https://github.com/containers/gvisor-tap-vsock/blob/8912b782e96b60da1455bf711eb620d893affa4a/cmd/gvproxy/main.go#L51
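
To check it by hand from inside the machine, something like this works (any HTTP response, even a 404, means gvproxy answered; only curl's exit code matters):

podman machine ssh
curl -s -o /dev/null --connect-timeout 10 http://192.168.127.1/health
echo $?   # 0 while the host connection is healthy; non-zero once connectivity is lost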

sleep 30
curl -s -o /dev/null --connect-timeout 10 http://192.168.127.1/health
if [ "$?" != "0" ]; then
echo "bouncing nic due to loss of connectivity with host"
Contributor:

Where does this line get reported to? I'd like to see whether it occurred in my podman machine.

Member Author:

The unit file sends stdout and stderr to the system journal, so the output can be found using journalctl.
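
For example, from inside the machine (assuming the syslog identifier matches the script name, as in the journal excerpts later in this thread):

journalctl | grep -i bouncing                     # plain text search, as used later in this thread
journalctl -t net-health-recovery.sh --no-pager   # filter by the syslog identifier seen in the log lines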

@ashley-cui (Member) commented Jan 16, 2024

LGTM

Probably needs a head nod from @baude


sleep 120 # allow time for network setup on initial boot
while true; do
sleep 30
Contributor:

Have you considered using a systemd timer which runs every 30 seconds?

Member Author:

Right, yeah. The only reason it's done this way is to avoid the log noise that timers generate, which @rhatdan was warning about. With a timer as frequent as this, it would be a lot of noise.

Contributor:

Ah logging, makes sense!
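
For comparison, the timer-based alternative would look roughly like this hypothetical sketch (invented unit names; not what the PR implements), paired with a net-health-check.service that runs a single curl probe per activation. Every 30-second activation starts and stops that service, and it is those start/stop entries that produce the journal noise mentioned above:

# net-health-check.timer (hypothetical alternative)
[Unit]
Description=Run the net health check every 30 seconds

[Timer]
OnBootSec=2min
OnUnitActiveSec=30s
AccuracySec=1s

[Install]
WantedBy=timers.target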

# is lost. This is a temporary workaround for a known rare qemu/virtio issue
# that affects some systems

sleep 120 # allow time for network setup on initial boot
Contributor:

Is this sleep still needed with the systemd unit which has recoveryUnit.Add("Unit", "After", "sshd.socket sshd.service") ?

Member Author:

sshd only has an After= on network.target, which is only quasi-reliable:

"network.target has very little meaning during start-up. It only indicates that the network management stack is up after it has been reached. Whether any network interfaces are already configured when it is reached is undefined [snip]"

Since we only seem to see the problem with long-running VMs, my thinking was that it's better to just wait a bit longer in the script than to disrupt or delay boot (e.g. by using something like network-online.target).

Contributor:

yeah, I agree sleeping for 2 minutes (or even 5 minutes or 1 hour or ... :) is no big deal given when the bug happens. I just wondered.

func GetNetRecoveryUnitFile() *parser.UnitFile {
recoveryUnit := parser.NewUnitFile()
recoveryUnit.Add("Unit", "Description", "Verifies health of network and recovers if necessary")
recoveryUnit.Add("Unit", "After", "sshd.socket sshd.service")
Contributor:

Fwiw, I'm not sure sshd.service is required here?

Member Author:

sshd.socket and sshd.service are mutually exclusive alternates (our FCOS images are currently using sshd.service). Our other units are declared as After= both (my assumption is that this is to stay compatible if the base images switch to the inet socket approach in the future), so I'm just mirroring that pattern here.

Contributor:

Ah ok, I assumed the FCOS images used sshd.socket already.
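
Pieced together from the quoted snippet, the generated unit presumably renders to something along these lines (the [Service] and [Install] details below are assumptions for illustration; only the [Unit] lines come from the code above):

[Unit]
Description=Verifies health of network and recovers if necessary
After=sshd.socket sshd.service

[Service]
# Placeholder path; the actual script location is defined elsewhere in the PR
ExecStart=/usr/local/bin/net-health-recovery.sh

[Install]
# Assumption; the install target is not shown in the quoted snippet
WantedBy=multi-user.target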

@benoitf (Contributor) commented Jan 16, 2024

I'm using the patch since this morning

journalctl | grep bouncing
Jan 16 15:08:10 localhost.localdomain net-health-recovery.sh[1963]: bouncing nic due to loss of connectivity with host
Jan 16 15:09:21 localhost.localdomain net-health-recovery.sh[1963]: bouncing nic due to loss of connectivity with host
Jan 16 15:15:36 localhost.localdomain net-health-recovery.sh[1963]: bouncing nic due to loss of connectivity with host

I already hit the bug but my podman machine is still reachable

@cfergeau (Contributor):

/lgtm

openshift-ci bot commented Jan 16, 2024

@cfergeau: changing LGTM is restricted to collaborators

In response to this:

/lgtm

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@benoitf (Contributor) commented Jan 16, 2024

I think it's restarting the network interface when my computer goes into sleep mode, and not only in the case of the bug.

@cfergeau (Contributor):

I already hit the bug but my podman machine is still reachable

You hit a condition where curl -s -o /dev/null --connect-timeout 10 http://192.168.127.1/health failed (apparently 3 times in less than 10 minutes). I don't think we know whether #20639 happens if and only if this condition is true, or whether this condition can sometimes be true without #20639 happening.

Actually it would be (somewhat) interesting to try this change without the ifconfig enp0s1 down; ifconfig enp0s1 up part, to see if false positives show up in the log (i.e. a log entry appears but the network is still up, even with the workaround removed from the script).
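
A minimal way to run that experiment, assuming the script matches the earlier sketch: keep the check and the log line but comment out the bounce, so false positives only show up in the journal and the interface is left alone:

curl -s -o /dev/null --connect-timeout 10 http://192.168.127.1/health
if [ "$?" != "0" ]; then
    echo "health check failed (logging only, nic left untouched)"
    # ifconfig enp0s1 down
    # ifconfig enp0s1 up
fi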

@mheon (Member) commented Jan 16, 2024

@baude @rhatdan PTAL

@n1hility (Member, Author):

I already hit the bug but my podman machine is still reachable

You hit a condition where curl -s -o /dev/null --connect-timeout 10 http://192.168.127.1/health failed (apparently 3 times in less than 10 minutes). I don't think we know whether #20639 happens if and only if this condition is true, or whether this condition can sometimes be true without #20639 happening.

Actually it would be (somewhat) interesting to try this change without the ifconfig enp0s1 down; ifconfig enp0s1 up part, to see if false positives show up in the log (i.e. a log entry appears but the network is still up, even with the workaround removed from the script).

Good idea. If we see lots of spurious events like this, I could modify the script to use a retry to reduce the number of bounce events.
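
A sketch of what such a retry could look like, e.g. requiring a few consecutive failures before bouncing (the threshold and variable names are illustrative, not from the PR):

failures=0
while true; do
    sleep 30
    if curl -s -o /dev/null --connect-timeout 10 http://192.168.127.1/health; then
        failures=0
        continue
    fi
    failures=$((failures + 1))
    if [ "$failures" -ge 3 ]; then
        echo "bouncing nic due to loss of connectivity with host"
        ifconfig enp0s1 down
        ifconfig enp0s1 up
        failures=0
    fi
done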

@benoitf (Contributor) commented Jan 16, 2024

Do we need a special version of gvproxy?

I'm using the one in the installer of podman v4.8.3
and /health always returns a 404 page:

curl --connect-timeout 10 http://192.168.127.1/health
404 page not found

@n1hility (Member, Author):

Do we need a special version of gvproxy?

I'm using the one in the installer of podman v4.8.3 and /health always returns a 404 page

There is no special version needed. The way curl is used here, the HTTP result code doesn't matter; it's just verifying that the request/reply happened. The URL suffix is just a placeholder for identification in any logging on the gvproxy side.
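
To illustrate: curl exits 0 on any HTTP response at all (including a 404) and only exits non-zero when the connection itself fails, unless -f/--fail is passed, which this check deliberately does not use:

curl -s -o /dev/null --connect-timeout 10 http://192.168.127.1/health
echo $?    # 0: gvproxy replied, even though the body is "404 page not found"

curl -sf -o /dev/null --connect-timeout 10 http://192.168.127.1/health
echo $?    # 22: -f/--fail turns HTTP 4xx/5xx into an error, which is not what this check wants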

@benoitf (Contributor) commented Jan 16, 2024

I'll try tomorrow without the ifconfig down / ifconfig up

but it seems I had a lot of reports today

Jan 16 15:08:10 localhost.localdomain net-health-recovery.sh[1963]: bouncing nic due to loss of connectivity with host
Jan 16 15:09:21 localhost.localdomain net-health-recovery.sh[1963]: bouncing nic due to loss of connectivity with host
Jan 16 15:15:36 localhost.localdomain net-health-recovery.sh[1963]: bouncing nic due to loss of connectivity with host
Jan 16 15:27:59 localhost.localdomain net-health-recovery.sh[1963]: bouncing nic due to loss of connectivity with host
Jan 16 15:30:11 localhost.localdomain net-health-recovery.sh[1963]: bouncing nic due to loss of connectivity with host
Jan 16 15:44:28 localhost.localdomain net-health-recovery.sh[1963]: bouncing nic due to loss of connectivity with host
Jan 16 17:15:59 localhost.localdomain net-health-recovery.sh[1963]: bouncing nic due to loss of connectivity with host
Jan 16 17:19:41 localhost.localdomain net-health-recovery.sh[1963]: bouncing nic due to loss of connectivity with host
Jan 16 17:45:11 localhost.localdomain net-health-recovery.sh[1963]: bouncing nic due to loss of connectivity with host
Jan 16 17:55:37 localhost.localdomain net-health-recovery.sh[1963]: bouncing nic due to loss of connectivity with host
Jan 16 17:58:52 localhost.localdomain net-health-recovery.sh[1963]: bouncing nic due to loss of connectivity with host
Jan 16 18:03:07 localhost.localdomain net-health-recovery.sh[1963]: bouncing nic due to loss of connectivity with host

@n1hility (Member, Author):

I'll try tomorrow without the ifconfig down / ifconfig up

but it seems I had a lot of reports today

@benoitf Interesting. I am curious what you see. I tried a bunch of scenarios on my system and was not able to get this to occur from sleeps, although I am also not seeing the underlying qemu issue.

I just pushed up a replacement that ups the timeout. If your research shows false positives, can you try with the update?

@n1hility (Member, Author):

/hold

(waiting until we wrap up the testing / verification from @benoitf )

openshift-ci bot added the do-not-merge/hold label on Jan 16, 2024
@benoitf (Contributor) commented Jan 16, 2024

I've updated my podman CLI with your updated code; I will run the changes overnight.

@baude (Member) commented Jan 16, 2024

I will merge this ... DO NOT MERGE. @benoitf let us know Wednesday-ish and I can get it in.

Commit message:

There is a network stability issue in qemu + virtio, affecting
some users after long periods of usage, which can lead to
suspended queue delivery. Until the issue is resolved, add a
temporary recovery service which restarts networking when host
communication becomes inoperable.

[NO NEW TESTS NEEDED]

Signed-off-by: Jason T. Greene <[email protected]>
@n1hility (Member, Author) commented Jan 16, 2024

Updated the PR to only apply to darwin qemu builds. (In discussing with @baude, we decided that even though the underlying qemu/virtio issue may not be mac-specific, it's probably better to keep this narrowed to Mac until we see reports elsewhere.)

@benoitf (Contributor) commented Jan 17, 2024

With the new patch, I didn't get any traces in the journal and my machine is still working, so I think I didn't get false positives. But I wasn't yet able to reach the 'blocking state', so the ifconfig down/up wasn't triggered either.

@gbraad (Member) commented Jan 17, 2024

The addition of /health was a request to make the use of this more obvious. If this happens a lot, there is not much we can do except wait for a fix from the Qemu+virtio teams to resolve the actual issue; this is what @benoitf referred to as the 'bug'.

In short:

I wasn't yet able to reach the 'blocking state'

Means it is 'resolved' for you, right?

@gbraad (Member) commented Jan 17, 2024

/lgtm
/approve

openshift-ci bot commented Jan 17, 2024

@gbraad: changing LGTM is restricted to collaborators

In response to this:

/lgtm
/approve


openshift-ci bot commented Jan 17, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: gbraad, n1hility

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@benoitf (Contributor) commented Jan 17, 2024

In short:
I wasn't yet able to reach the 'blocking state'
Means it is 'resolved' for you, right?

Well, the problem is that since I haven't yet reproduced the blocking state, the script didn't do the ifconfig down/up (there is no bouncing log in journalctl).

So I wouldn't say it's 'resolved', just that I didn't see the 'potential false positives' like yesterday, when they were occurring in sequences.

@benoitf (Contributor) commented Jan 17, 2024

The bouncing trace occurred:

Jan 17 13:30:36 localhost.localdomain net-health-recovery.sh[1976]: bouncing nic due to loss of connectivity with host
Jan 17 13:45:51 localhost.localdomain net-health-recovery.sh[1976]: bouncing nic due to loss of connectivity with host

I would say it now works better than yesterday's patch.

I tried to do a lot of networking/heavy load in the VM
[screenshot]

/lgtm

@baude OK to merge on my side

openshift-ci bot added the lgtm label on Jan 17, 2024
@mheon (Member) commented Jan 17, 2024

/hold cancel
/lgtm

openshift-ci bot removed the do-not-merge/hold label on Jan 17, 2024
@mheon (Member) commented Jan 17, 2024

/cherry-pick v4.9

@openshift-cherrypick-robot (Collaborator):

@mheon: once the present PR merges, I will cherry-pick it on top of v4.9 in a new PR and assign it to you.

In response to this:

/cherry-pick v4.9


openshift-merge-bot merged commit e293ca8 into containers:main on Jan 17, 2024
91 of 92 checks passed
@openshift-cherrypick-robot (Collaborator):

@mheon: #21262 failed to apply on top of branch "v4.9":

Applying: Add a net health recovery service to Qemu machines
Using index info to reconstruct a base tree...
A	pkg/machine/ignition/ignition.go
M	pkg/machine/qemu/machine.go
M	pkg/machine/qemu/options_linux.go
Falling back to patching base and 3-way merge...
Auto-merging pkg/machine/qemu/options_linux.go
Auto-merging pkg/machine/qemu/machine.go
CONFLICT (content): Merge conflict in pkg/machine/qemu/machine.go
Auto-merging pkg/machine/ignition.go
CONFLICT (content): Merge conflict in pkg/machine/ignition.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 Add a net health recovery service to Qemu machines
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

In response to this:

/cherry-pick v4.9


@benoitf (Contributor) commented Jan 17, 2024

At some point the network was stuck; I ran a ping command and we can see that after a while it comes back as the patch bounces the network interface 👍

[screenshot]

@n1hility (Member, Author):

@benoitf Excellent, thank you so much for the thorough testing on this one and last week!

@benoitf (Contributor) commented Jan 17, 2024

@mheon it looks like the automatic cherry-pick didn't work smoothly for the v4.9 branch.

Will it be in time for 4.9.0, or will it be part of 4.9.1?

@mheon (Member) commented Jan 17, 2024

We're having vendoring issues with 4.9 right now that have delayed the release - so it ought to be part of 4.9.0. ETA on that is hopefully this afternoon, but really depends on how difficult those vendoring issues prove to be.

@n1hility (Member, Author):

I'll quickly back port this
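
A rough sketch of that manual backport (remote and branch names are placeholders; if e293ca8 is a merge commit, -m 1 picks the mainline parent, otherwise cherry-pick the PR's single commit directly):

git fetch upstream
git checkout -b backport-21262-v4.9 upstream/v4.9
git cherry-pick -x -m 1 e293ca8
# resolve the conflicts in pkg/machine/ignition.go and pkg/machine/qemu/machine.go, then
git cherry-pick --continue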

@slemeur commented Jan 17, 2024

Thanks for the fix!

github-actions bot added the locked - please file new issue/PR label on Apr 17, 2024
github-actions bot locked as resolved and limited conversation to collaborators on Apr 17, 2024
Labels: approved, lgtm, locked - please file new issue/PR, machine, podman-desktop, release-note
10 participants