-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
pasta, parallel: test command timeout #23471
Comments
One of the socat processes is not exiting causing the wait in bash to hang until we SIGKILL the entire container
So it seems to wait for more io I think? |
I added pcap captures to pasta to see what is send
this shows a test with three senders/receivers but in the pcap file we can see that only for two of the three our packages were seen by pasta
For port 5040 there were no packages which could mean the packages were never send or dropped on the host side somehow |
Even more weird
Two tests using the same port (5520) at the same time and pasta is not failing to bind it with EADDRINUSE? |
Oh, oops, setting the
and exactly the same in another terminal. If I drop
I didn't expect that. On TCP it fails as expected. Actually, we realised that for unrelated reasons, and started exploiting this behaviour (but not for UDP port forwarding), which is now For port forwarding sockets, we should either disable REUSEADDR (@dgibson, any concern with that?) or explicitly check in another way if ports are conflicting. |
I think that explains the hangs as well, all the ADDRINUSE errors were seen with tcp so I assume if that happens in the udp case it can hang there because another socket with the same port got the traffic causing the hang as the socat in the container never got the "null-eof" udp package to terminate. I am not familiar with REUSEADDR but reading the docs
And udp sockets do not count as listening sockets I guess and thus not cause a bind failure. |
Oh, I didn't see those.
Right. This is actually an issue in the tests I would say, even though we should fix pasta as well.
Not so much, in the Linux kernel the listening state for UDP is represented exactly in the same way as for TCP. In general, UDP can have "listening" sockets, in kernel terms.
The INADDR_ANY clause is misleading, because, for TCP (and that's the reason why we use SO_REUSEADDR, so that users can forward ports differently based on different addresses), you get, without SO_REUSEADDR:
and:
but with SO_REUSEADDR, that succeeds. My interpretation of the second sentence there (with the addition of "same" before "address") is that it describes the behaviour with SO_REUSEADDR, but the third one describes the behaviour without it. We should eventually send a patch for that, now that we have a test program it should be doable. Back to pasta: if we disable SO_REUSEADDR for UDP, we lose the ability to do exactly what I described above: bind UDP ports using different (local) addresses so that we can handle traffic differently. I guess we need to find another way to check if there's another process binding to the same exact (local) address and port. The |
Looking this up somewhere just sounds racy to me and more complicated. IMO either bind fails or not.
But isn't this only true if you want to bind 0.0.0.0 and a specific ip at the same there. You do not need SO_REUSEADDR to bind 127.0.0.1 and 127.0.0.2 for example. We do not use SO_REUSEADDR when binding ports as root. I don't find a mention of REUSEADDR in slirp4netns or our rootlessport forwarder either and AFAIK we never had any user ask for such behavior so I don't think this important to use REUSEADDR at all. I could see that there are cases for it but shouldn't these be opt-in then? The fact that SO_REUSEADDR with udp behaves like that seems really odd to me. Isn't that what SO_REUSEPORT is supposed to allow? |
For sure it's more complicated, but it doesn't need to be racy, for example if we bind() and then check that there's a single entry in
Correct, but see #14425. Sure, we also use SO_BINDTODEVICE there, but if I recall correctly we can't anyway bind twice without SO_REUSEADDR. And regardless of SO_BINDTODEVICE, one might anyway want to use that without an interface. Perhaps some Podman users who needed #14425 are actually using that feature by binding without an interface given that we haven't added it to Podman's options, yet.
Sure, it can be a configuration option, but for TCP, it's harmless, so that would be a rather useless configuration option. We could also decide to make it optional just for UDP. On the other hand, what happens if another process uses SO_REUSEADDR? From a quick glance at this beauty in net/ipv4/udp.c:
I would say that it doesn't matter who uses SO_REUSEADDR. We might succeed to bind() to a UDP port, and yet get absolutely no traffic there, because another process bound to it with SO_REUSEADDR. We should test this I guess. But in that case, does it make a difference, if we fail? We're not really guaranteeing anything in either case.
Yes, same to me.
Yes, definitely. We can also try to fix that in the kernel if we can show there's no reason to use SO_REUSEADDR for that and that the risk for https://xkcd.com/1172/ is reasonably low. |
Fair enough
Well easy enough to test, if I bind first without SO_REUSEADDR then the other process cannot bind even when reuseaddr is set so that works like it should.
So overall I would say you should drop SO_REUSEADDR from udp sockets IMO or well fix the kernel if that is even possible without breaking things? |
Hah, thanks, I forgot that socat can set that too.
Most likely one of the two, or both, yes. Should I also take care of fixing the tests, or is this covered somewhere else/by somebody else? |
Which part? Making the test port assignment parallel safe is certainly for us to do and not you, once that is parallel safe and we know we don't get port conflicts then we see how many pasta issues are actually left... |
The part of not picking a port that's bound as free, I guess I added/changed that function, but I think I've seen another issue with a comment of yours about it. |
20b3f5e which is all WIP in the parallel system test PR, I have not checked the math but clearly it isn't working so it leave that for @edsantiago to fix for now. Before that we never had any parallel runs so the function didn't had to be parallel safe so that was totally fine. |
Ok with the port issue being fixed I no longer see the quick/easy udp hangs (#23488 (comment)) However there are still hangs observed but via TCP
I will try to instrument the test with package captures to see what is going on there |
We talked about this with @dgibson and there's a potential problem with disabling SO_REUSEADDR for UDP sockets, albeit unlikely to happen. If a UDP port is forwarded to the container, we'll bind() it with SO_REUSEADDR at start-up, with a wildcard address or with an address given by the user. If, later, a process in the container uses the same UDP port as source, for outbound traffic, as we try to preserve source ports, we'll bind that same port with SO_REUSEADDR with the appropriate, specific address, which succeeds. Not only that: we actually connect() that socket, which is a stronger indication, used by the kernel, to deliver matching traffic to that specific socket, instead of the one configured for port forwarding. At this point, the traffic for the flow initiated by the container is properly routed over that flow, and any other traffic going to that same port is forwarded as configured instead. If we drop SO_REUSEADDR, we'll fail to bind() the socket for the outbound flow, and we can then choose to either use another port as source, or drop packets hoping that the process will pick another port as source. In both cases, we'll not be able to preserve the source port. This is not ideal because the port forwarding configuration will determine whether we are able to preserve source ports. This case is unlikely to happen because the Linux kernel should anyway pick ephemeral ports as source ports, and those shouldn't be forwarded, typically. However, there's no common definition of ephemeral ports: while IANA recommends the 49152–65535 range, Linux will, by default, use the 32768–60999 range instead. If we decide to stop trying to preserve source ports, that's not an issue anymore, but some users and protocols might rely on that behaviour. All in all, my favourite option at this point would be to keep SO_REUSEADDR with a note in the man page explaining that we won't detect "bind" conflicts with UDP ports. Let me know if it's fine by you. Note by the way that, unlike what happens with TCP, the conflict might even be temporary, because we can succeed binding to a port with the same address as bound by an existing process, and we'll actually start getting packets once this other process terminates. If that's the intended operation in a given use case, reporting a conflict might actually be a nuisance. |
I don't like this. A podman users expects us to report a conflict. A user should not be allowed to bind the same port twice by default.
The port logic was fixed. There should be no conflicts in the tests anymore but we are still seeing hangs even when using unique ports. Actually I just got a hang with udp too and I really don't think the current port logic is buggy and gives use the same port twice 7c3d294 Something else is wrong either with pasta or the tests itself. I am still trying to properly setup package captures on the host. |
I get this, I don't like it either, but what do you propose to do if the source port of an outbound flow clashes with a forwarded port? We can:
All of these look much worse than the alternative to me: 1. might break UDP connectivity altogether, 2. might break some protocols and use cases, and it's rather unpredictable and subtle, 3. might also even more protocols and use cases, while being less subtle. Another possibility I was thinking about: try to bind() without SO_REUSEADDR. If that fails, we report that and exit. If that succeeds, we close the socket, open another one, and bind with SO_REUSEADDR. It's slightly racy as well (even though we don't know if there's no race without SO_REUSEADDR anyway, between two processes calling bind() at just around the same time, because the socket is not really bound anyway), and it might also break use cases similar to #14425, but at least we won't have an issue with preserving source ports if they conflict with forwarded ports. Yet another alternative could be to try and "fix" this in the kernel, first. If that fails, we'll evaluate again what to do in pasta. If we can fix this in the kernel, that would be the best outcome as we'll then be able to report true conflicts only, but it's probably going to take a significant amount of time.
Ouch. What might be more likely to happen in a parallelised setup is that the server isn't ready as we start the client. There's no way to detect that socat is ready, except for checking procfs entries, so right now we have those delays (which I still have to replace with something else in order to speed up tests). To see if that's the case, you can try to increase those delays even further, I guess. Eventually, I plan to replace them with a loop on /proc/net/udp{,v6} or something of that sort. |
What happens today if the source port is already bound on the host and you fail to bind? 1 or 2?
I am not sure if feasible but is there a 4) use the socket bound for the port forwarding so send outgoing package? Of course that would only make sense if the port mapping is 1 to 1 for the port numbers, mapping something else like I think 3 is definitely the worst option. There are certainly protocols that need the port information preserved to work. But also I agree with your other points.
This sounds like a good compromise to me.
Ok but wouldn't fixing this in the kernel mean you get broken in the ways described above if the bind call now all of the sudden fails? |
Heh, no, we can't, because outbound sockets need to be connect()ed so that we know which flow they belong to (without an extra lookup table, at least), see the theory of operation for reply sockets.
Okay, good... we can also keep this one as fallback plan if it turns out we can't fix the kernel.
No, because all the kernel should reject, with SO_REUSEPORT, is two bind() calls on the same port, same exact address, of non-connected sockets. We never do that in pasta (within a single process), and I suppose nobody intentionally does it either (they would use SO_REUSEPORT if they wanted to distribute packets between threads). |
Oops, I missed this question. We do 1. for "spliced" (local) sockets. For the general (non-local) case, I'm actually lost regarding the new behaviour... @dgibson? |
Sorry, finally catching up on all the updates here. Replying to assorted points.
Still (1); in
Well.. we could maybe do this (my earlier designs for UDP flow handling did something like it). It's a pain to implement though, because we need a whole extra hash table to look up if we have an existing socket bound to a suitable address for outgoing packets. Plus, it introduces a different nasty edge case: the existing socket could be bound to a wildcard address, but we have a specific source address we're supposed to be using for our outgoing packets (e.g. because of
Actually we could do that. If you have a port forward on a specific address, you could still get an unrelated flow trying to use the same source port and source address for an outgiong send. So with the kernel change this would again fail. |
For the short to medium term, I'm really inclined towards option (1). Yes it has an ugly edge case, but that doesn't do anything that we couldn't already hit if the guest's source port conflicts with some unrelated process on the host. We could implement (2) or (3) on top of that in future, if it seems necessary. Regarding (3), I'm not that sure that preserving the source port is as big a win as it seems. It's true that there are UDP protocols that require specific source ports - however, a number of those also have non-standard requirements about how the addresses are used that already won't fit into the existing flow model. Doing (3) for "normal" UDP flows, doesn't preclude us from implementing special case handling for specific protocols. |
Well, but the reply socket is connect()ed in that case, the "listening" socket isn't, and we could keep that distinction in the kernel fix.
That's not such a big difference for the typical usage of pasta with Podman, probably, but it makes a big difference for common usages of passt and KubeVirt where all non-ephemeral ports in a pod are forwarded by passt (and no other ports are bound by other processes in that network namespace). There's another thing we break in any case if we drop SO_REUSEADDR: the possibility of forwarding the same port while binding to two different addresses.
On the other hand, we probably won't want to add special handling for specific protocols (application-level gateways or suchlike), it's quite hard to get those right and they are usually one of the main sources of security issues (just look at Slirp). If we strive to keep network transparency wherever possible, it's quite likely that we won't ever need ALGs (except for the minimal DNS forwarding implementation we already have). Do you see any problem with this alternative by the way:
? |
Nope. Because you have to
Ah, hrm. Yeah, I forgot about
Well, only the case where one of the binds is a wildcard and the other is not. And arguably in that case reporting the conflict makes sense.
Yeah, I guess that's true. I wasn't thinking of ALGs per se, but of special flow types which have different rules about port allocation. "Spliced" connections are already kind of an example of this.
Theoretically racy, but I guess it's probably ok. When we implement dynamic port forwarding we could get the reversed problem: a temporary flow could block binding a new forwarding port. |
Ouch, right.
I wouldn't be so sure that the non-SO_REUSEADDR case is not racy, by the way.
Well, in that case we'll have an actual conflict and I guess it's good that we block that. In any case, I filed https://bugs.passt.top/show_bug.cgi?id=92 for this, so that we don't spam this ticket (which is about the test cases) too much. |
A race between what and what are you thinking of?
I don't see that that conflict is any more "actual" than the ones we'd block with option (1). In this case the reply socket is already
Ok. |
Between bind() and bind() to the same port, same address, without SO_REUSEADDR, or with SO_REUSEADDR on one socket only. If the reason why we can bind to the same port and address with SO_REUSEADDR is that the port isn't really "bound" in every sense, then we can't exclude there are other issues in that sense, I guess.
Hmm, right. By the way, I missed one obvious, important fact: we pre-open the procfs entries with information about bound ports, so we could always access them. It's not practical to read out the whole file every time we create a "reply" socket, but we don't need any additional check on that path. We could simply read the file whenever we bind a UDP port for inbound forwarding. How does that sound? |
Still not really seeing the scenario you're getting at. Without dynamic forwarding the "listening" binds always go first, so there's no self-race. There's a race between us opening a reply socket and some other process opening a conflicting one without REUSEADDR, but that's outside our control anyway. With dynamic forwarding there is certainly a race, but no more so AFAICT than with the proposal below.
Not really sure what you mean by that.
I guess the key difference from the open without REUSEADDR, then open with, is that we don't have a window where our initial !REUSEADDR binding could mess with other things. It does also mean that we check some extra information in the /proc file for more nuance. I think it works. |
I have patch locally based on #23275 to convert the test to explicit port bound checks in the container instead of the sleep delay approach currently. In the non parallel test case it is already much much faster in total test time for the file. I have the parallel case running in a loop now to see if this fixes the cause of the hangs here as well. |
Do not rely on an arbitrary delat in order to ensure the port was bound in the container. Insted this approach checks if the port is bound in the netns and only then starts the client. This speeds up the entire test file by 50% but more importanly in parallel testing it solves hangs as the timeout there was unreliable. Fixes containers#23471 Signed-off-by: Paul Holzinger <[email protected]>
Do not rely on an arbitrary delay in order to ensure the port was bound in the container. Instead this approach checks if the port is bound in the netns and only then starts the client. This speeds up the entire test file by 50% but more importantly in parallel testing it solves hangs as the timeout there was unreliable. Fixes containers#23471 Signed-off-by: Paul Holzinger <[email protected]>
Seeing this occasionally in #23275. The actual test name varies. The common aspect is:
The text was updated successfully, but these errors were encountered: