
[relay client] choose nearest relay based on latency #2952

Open · wants to merge 5 commits into main

Conversation

mohamed-essam
Contributor

Describe your changes

Change picker to choose nearest relay based on connection latency.

Issue ticket number and link

#2950

Checklist

  • Is it a bug fix
  • Is it a typo/documentation fix
  • Is it a feature enhancement
  • Is it a refactor
  • Created tests that fail without the change (if possible)
  • Extended the README / documentation, if necessary

@pappz
Contributor

pappz commented Nov 26, 2024

Hi @mohamed-essam,
First of all, thank you for your PR. Your observation regarding the scheduler is a good point.

When we designed this feature, we focused on achieving the fastest possible connection time. In a corner case, your logic flips this requirement and blocks the connection for longer than the defined timeout. Imagine the server list has 8 entries and two of them are really slow, one somewhere in positions 1-7 and one at position 8. In that case, the code will block for up to 2*30 sec.

For your scenario, we suggest using only one domain with the Geo DNS service.

It is hard to write good code around the scheduler, but we can give it a try. I think your solution could work well because client.Connect() does not start new goroutines, so the scheduler luckily does not trigger context switches. But if we develop something inside the Client struct that breaks this rule, your solution will again report invalid latency numbers. So in a nutshell, you are measuring the running time of client.Connect() and not just the network latency. And by the way, I am working on a task that will start new parallel connections inside the Client code :/ . In the ideal case, we should extend our network protocol with the elapsed time in the response (somehow).

But back to your PR. What do you think: after this line, could you check the length of the channel, and if it contains more than one item, drain the channel and compare the elapsed times of those results, without waiting for any new connections?

@mohamed-essam
Contributor Author

Hello @pappz,

Thanks for clarifying the requirement; I hadn't considered this case in my change. I think it might be possible to do something that mixes both approaches: in most cases, delaying the initial connection by a small amount (say, 200ms) could yield a much faster relay from the get-go.

As for Geo DNS, my problem with that solution is that it relies only on geographic-distance latency and does not take into account (for example) relay server load.

I'm working on a commit now that should address both the corner case of slow servers (by putting an upper limit on how long we wait after the first successful connection; I think a value between 200-500ms should be enough to let Go scheduling delays and odd networking resolve) and the measurement of latency (by using httptrace inside ws.go Dial()).

Please let me know what you think of that previous approach even if I haven't committed it yet 🙏

@pappz
Contributor

pappz commented Nov 28, 2024

Measuring the time in deeper layers looks promising. Good idea. I will take a look and we will see.

I am still thinking about the length of the result channel on a 1-core system, as you mentioned in the ticket. Can I ask you for a test? Could you print the length of the result channel every time you read a result from it?

cr := <-resultChan
log.Debugf("resultChan len: %d", len(resultChan))

If multiple results are queued, then we do not need to wait any longer. In the end, we may not choose the fastest server, but we definitely will not choose the slowest. If this approach shows unexpected results, then the time window sounds like a good way to go.

@mohamed-essam
Contributor Author

mohamed-essam commented Nov 29, 2024

I don't have access to a single-core machine today, so as a workaround I executed netbird -F with CPU affinity set to force it onto a single CPU.
I added one more debug line before receiving from the channel (to see whether a connection is sometimes already ready without the processConnResults goroutine halting), and that seems to be the case:

2024-11-29T13:38:58+02:00 WARN client/picker.go:99: beforeSelect:len(resultChan)=0
2024-11-29T13:38:59+02:00 WARN client/picker.go:110: afterSelect:len(resultChan)=0
2024-11-29T13:38:59+02:00 WARN client/picker.go:99: beforeSelect:len(resultChan)=1
2024-11-29T13:38:59+02:00 WARN client/picker.go:110: afterSelect:len(resultChan)=0
2024-11-29T13:38:59+02:00 WARN client/picker.go:99: beforeSelect:len(resultChan)=1
2024-11-29T13:38:59+02:00 WARN client/picker.go:110: afterSelect:len(resultChan)=0

In this case (3 relays), there were 2 times where a connection was ready immediately after the last connection was processed.

This also seemed to depend on whether the CPU was loaded: under load, the channel was more likely to be empty both before and after the select statement.

One more thing I noticed: when the CPU was free, the connections almost always arrived (roughly) sorted by latency, but when the CPU was loaded, they were more likely to arrive in the order in which they were configured in management. I'm assuming this is because the first item in the array is likely to be connected to first if the CPU load is high enough that the process waits a bit before dialing the next relay.

All in all, the slight time delay seems to always choose the lowest-latency relay in both cases.

@mgarces linked an issue Nov 29, 2024 that may be closed by this pull request
@pappz
Contributor

pappz commented Dec 4, 2024

Thank you for sharing your experience.
I would like to draw your attention to this work. What do you think: is net/http/httptrace usable with QUIC too? If so, I don't see any other blocking issues on this topic.

@mohamed-essam
Contributor Author

That's some nice work on the QUIC protocol, but given that it uses the quic-go package, I see no hint anywhere that it supports net/http/httptrace.

However, it seems like it already implements its own version of it in logging.ConnectionTracer.

After some digging (given the lack of documentation in that package), I found that the equivalent callbacks we would need are

  • StartedConnection (called right before anything is sent)
  • ChoseALPN (called right after the initial handshake completes, which is the closest analogue to TTFB in httptrace)

If the QUIC relay is merged before this PR, I could test the above myself. If you can test the ConnectionTracer on your side and share the difference between the httptrace and ConnectionTracer values, I believe that would provide helpful insight.


sonarqubecloud bot commented Dec 5, 2024

Successfully merging this pull request may close these issues.

[Relay] chosen home relay server is random