Add a net health recovery service to qemu machines #21262
Conversation
FYI @benoitf // cc @baude @ashley-cui
pkg/machine/ignition/ignition.go
Outdated
sleep 120 # allow time for network setup on initial boot
while true; do
sleep 30
curl -s -o /dev/null --connect-timeout 10 http://192.168.127.1/health
Where does this IP address come from?
This is the gvproxy gateway address used by the guest. In addition to routing, gvproxy runs a built-in HTTP server on this address for managing port forwards.
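For illustration, a hedged way to confirm the gateway responds from inside the guest; the machine name below is the default and is an assumption, not taken from this PR:

# Run from the host; assumes the default machine name. Only reachability matters,
# so the printed HTTP status code is informational.
podman machine ssh podman-machine-default \
  'curl -s -o /dev/null -w "%{http_code}\n" --connect-timeout 10 http://192.168.127.1/health'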
sleep 30
curl -s -o /dev/null --connect-timeout 10 http://192.168.127.1/health
if [ "$?" != "0" ]; then
echo "bouncing nic due to loss of connectivity with host"
Where is this line reported to? I'd like to see whether it occurred on my podman machine.
The unit file sends stdout & stderr to the system journal, so it can be found using journalctl.
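As a hedged sketch of what that lookup could look like (the unit name below is a placeholder, not the actual name used by this PR):

# Inside the guest (e.g. via "podman machine ssh"); substitute the real unit name.
journalctl -u net-health-recovery.service
# Or search the whole journal for the recovery message:
journalctl | grep -i "bouncing nic"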
LGTM. Probably needs a head nod from @baude
sleep 120 # allow time for network setup on initial boot
while true; do
sleep 30
Have you considered using a systemd timer which runs every 30 seconds?
Right, yeah, the only reason it's done this way is to avoid the journal noise that timers generate, which @rhatdan was warning about. With a timer firing this frequently it would be a lot of noise.
Ah logging, makes sense!
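For context, a rough sketch of what the timer-based alternative discussed above might look like; the unit names and intervals are assumptions. Every 30-second activation would write start/finish records to the journal, which is the noise being avoided:

# Hypothetical one-shot check driven by a timer instead of a sleep loop.
cat > /etc/systemd/system/net-health-check.service <<'EOF'
[Unit]
Description=One-shot network health check
[Service]
Type=oneshot
ExecStart=/usr/bin/curl -s -o /dev/null --connect-timeout 10 http://192.168.127.1/health
EOF

cat > /etc/systemd/system/net-health-check.timer <<'EOF'
[Unit]
Description=Run the network health check every 30 seconds
[Timer]
OnBootSec=120
OnUnitActiveSec=30
[Install]
WantedBy=timers.target
EOF

systemctl daemon-reload && systemctl enable --now net-health-check.timer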
# is lost. This is a temporary workaround for a known rare qemu/virtio issue
# that affects some systems
sleep 120 # allow time for network setup on initial boot |
Is this sleep still needed given the systemd unit has recoveryUnit.Add("Unit", "After", "sshd.socket sshd.service")?
sshd only has an After= on network.target, which is quasi-reliable:
"network.target has very little meaning during start-up. It only indicates that the network management stack is up after it has been reached. Whether any network interfaces are already configured when it is reached is undefined [snip]"
Since we only seem to see the problem with long-running VMs, my thinking was that it was better to just wait a bit longer in the script rather than disrupting/delaying boot (e.g. by using something like network-online.target).
yeah, I agree sleeping for 2 minutes (or even 5 minutes or 1 hour or ... :) is no big deal given when the bug happens. I just wondered.
func GetNetRecoveryUnitFile() *parser.UnitFile {
	recoveryUnit := parser.NewUnitFile()
	recoveryUnit.Add("Unit", "Description", "Verifies health of network and recovers if necessary")
	recoveryUnit.Add("Unit", "After", "sshd.socket sshd.service")
Fwiw, I'm not sure sshd.service is required here?
sshd.socket and sshd.service are mutually exclusive alternates (currently our FCOS images use sshd.service). Our other units are declared as After= both (my assumption is this is to stay compatible if the base images switch to the inet socket approach in the future), so this just mirrors that pattern.
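For reference, a quick hedged way to see which of the two alternates a given image is actually using:

# Run inside the guest; exactly one of these is expected to report "active".
systemctl is-active sshd.service sshd.socket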
Ah ok, I assumed the FCOS images used sshd.socket already.
I've been using the patch since this morning. I already hit the bug, but my podman machine is still reachable.
/lgtm
@cfergeau: changing LGTM is restricted to collaborators.
I think it's restarting the network interface when my computer goes to sleep, not only in the case of the bug.
You hit a condition when … Actually it would be (somewhat) interesting to try this change without …
Good idea. If we see lots of spurious events like this, I could modify this to use a retry to reduce the bounce events.
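A rough sketch of what such a retry could look like; this is not part of the PR, and the three-failure threshold and the eth0 interface name are illustrative assumptions:

failures=0
while true; do
  sleep 30
  if curl -s -o /dev/null --connect-timeout 10 http://192.168.127.1/health; then
    failures=0
  else
    failures=$((failures + 1))
  fi
  # Only bounce the NIC after several consecutive failures, to filter out
  # transient blips (e.g. waking from sleep).
  if [ "$failures" -ge 3 ]; then
    echo "bouncing nic due to loss of connectivity with host"
    ifconfig eth0 down && ifconfig eth0 up
    failures=0
  fi
done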
Do we need a special version of gvproxy? I'm using the one from the podman v4.8.3 installer.
No special version is needed. The way curl is used here, the HTTP result code doesn't matter; it's just verifying that the request/reply happened. The URL suffix is just a placeholder so it can be identified in any logging on the gvproxy side.
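To make that concrete, a small hedged illustration from inside the guest: only curl's exit status is consulted, so even a 404 from the gateway counts as healthy, while a timeout or connection failure does not.

curl -s -o /dev/null --connect-timeout 10 http://192.168.127.1/health
# 0 = request/reply completed (any HTTP status); non-zero (e.g. 7 or 28) = gateway unreachable.
echo "curl exit status: $?"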
I'll try tomorrow without the …, but it seems I had a lot of reports today.
Force-pushed from b278121 to 28a759b
@benoitf Interesting. I am curious what you see. I tried a bunch of scenarios on my system and was not able to get this to occur from sleeps, although I also am not seeing the underlying qemu issue. I just pushed up a replacement that ups the timeout. If your research shows false positives, can you try with the update?
/hold (waiting until we wrap up the testing/verification from @benoitf)
I've updated my podman CLI with your updated code; I will run the changes overnight.
I will merge this ... DO NOT MERGE. @benoitf let us know Wednesday-ish and I can get it in.
There is a network stability issue in qemu + virtio, affecting some users after long periods of usage, which can lead to suspended queue delivery. Until the issue is resolved, add a temporary recovery service which restarts networking when host communication becomes inoperable. [NO NEW TESTS NEEDED] Signed-off-by: Jason T. Greene <[email protected]>
Force-pushed from 28a759b to 79fad91
Updated the PR to only apply to darwin qemu builds. (In discussing with @baude, even though the underlying qemu/virtio issue may not be mac-specific, we decided it's probably better to keep this narrowed to Mac until we see reports elsewhere.)
With the new patch I didn't get any traces in the journal and my machine is still working, so I think I didn't get false positives, but I wasn't yet able to reach the 'blocking state' (so the ifconfig down/up wasn't triggered either).
The addition of … In short: it means it is 'resolved' for you, right?
/lgtm
@gbraad: changing LGTM is restricted to collaborators.
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: gbraad, n1hility.
Well, the problem is that since I haven't yet reproduced the blocking state, the script didn't do the ifconfig down/up (there is no bouncing log in journalctl). So I wouldn't say it's 'resolved', just that I didn't see the 'potential false positives' like yesterday, where they were occurring in sequences.
The bouncing trace occurred.
I would say it now works better than yesterday's patch. I tried to do a lot of networking/heavy load in the VM. /lgtm @baude OK to merge on my side.
/hold cancel
/cherry-pick v4.9
@mheon: once the present PR merges, I will cherry-pick it on top of v4.9 in a new PR and assign it to you.
Merged e293ca8 into containers:main
@mheon: #21262 failed to apply on top of branch "v4.9":
@benoitf Excellent, thank you so much for the thorough testing on this one and last week!
@mheon It looks like the automatic cherry-pick didn't work smoothly for the 4.9 branch. Will it be in time for 4.9.0? (Or will it be part of 4.9.1?)
We're having vendoring issues with 4.9 right now that have delayed the release, so it ought to be part of 4.9.0. ETA on that is hopefully this afternoon, but it really depends on how difficult those vendoring issues prove to be.
I'll quickly backport this.
Thanks for the fix!
There is a network stability issue in qemu + virtio, affecting some users after long periods of usage, which can lead to suspended queue delivery. Until the issue is resolved, add a temporary recovery service which restarts networking when host communication becomes inoperable. Only qemu-based machines on Mac activate this service, as the issue is understood to be qemu-specific.
Works around issue in #20639
How to verify:
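A hedged sketch of verification, based on the discussion above (the machine name and unit name are assumptions):

# On macOS with the qemu provider, create and start a machine from a build of this branch.
podman machine init --now netcheck-test

# Check that the recovery service was installed and is running inside the guest
# (substitute the unit name this PR actually installs).
podman machine ssh netcheck-test 'systemctl status net-health-recovery.service'

# After the machine has been running for a while, look for recovery events.
podman machine ssh netcheck-test 'journalctl | grep -i "bouncing nic"'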
Does this PR introduce a user-facing change?