-
Notifications
You must be signed in to change notification settings - Fork 14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Configuration creation with configurable replica count and wait time #159
base: master
Are you sure you want to change the base?
Conversation
This fixes various problems with error reporting when running tests with goroutines. Unfortunately, it still seems to fail for Style1 for any replica count, but Style2 does not fail even for replica counts up to 180.
To avoid that the process create to many connections (open files) each test should stop their servers.
This still contains the old versions that don't work, but I've disabled them. Things should be cleaned up now that the new version (seems to) work. Have been able to start 200 servers.
Previously there was no way to know whether or not a goroutine had in fact been started; henceforth, the errChan would be closed by the defer since the createReplicas() call would reach the return before all goroutines had the chance to send their err, causing a panic. In this new version, we wait for all goroutines to start before we check for errors on the channel.
tests/all2all/alltoallconfig_test.go
Outdated
nodeMap[replica.address] = replica.id | ||
} | ||
for _, replica := range replicas { | ||
replica.createConfiguration(nodeMap) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The error returned by this method is not checked.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interestingly, this seems to bring back the bug we are fighting with here. If we run the code with:
$ go test -v -run TestAllToAllConfiguration -replicas 10
=== RUN TestAllToAllConfiguration
--- PASS: TestAllToAllConfiguration (1.06s)
vs
go test -v -run TestAllToAllConfiguration -replicas 11
=== RUN TestAllToAllConfiguration
alltoallconfig_test.go:33: could not create configuration: connection failed for addr: 127.0.0.1:65204: dialing node failed: context deadline exceeded
--- FAIL: TestAllToAllConfiguration (1.16s)
And this is consistently working for up to 10 replicas, but not for 11 or above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added TestGrpcDial
to try to understand better what's going on. This version fails consistently with 84 replicas (but not with fewer on my machine that is):
go test -v -run TestGrpc -replicas 84
=== RUN TestGrpcDial
alltoallconfig_test.go:133: context deadline exceeded
--- FAIL: TestGrpcDial (2.05s)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't reproduce this on my machine
❯ ./all2all.test -test.v -test.run="TestGrpcDial" -replicas=10000
=== RUN TestGrpcDial
--- PASS: TestGrpcDial (4.84s)
PASS
It only fails for me when the kernel runs out of ports 😅
❯ ./all2all.test -test.v -test.run="TestGrpcDial" -replicas=50000
=== RUN TestGrpcDial
alltoallconfig_test.go:123: listen tcp 127.0.0.1:0: bind: address already in use
--- FAIL: TestGrpcDial (9.77s)
FAIL
Just wanted to check if this issue still persists (it does), but thought I just make a few tweaks while I was at it.
Tests to form configuration with a configurable number of replicas and wait time.