Parallelized cluster hosts discovering + async config cache update #4077
base: master
Conversation
Fixes #4074

This PR makes the cluster hosts discovery more robust: when cluster.host is DOWN, it probes all other hosts in parallel and returns the response from the fastest one. Additionally, this PR makes the call to the cluster config cache async when updating a cluster.
@Michal-Leszczynski Please take a look at the PR.
@karol-kokoszka It looks like tests are failing.
Yes, some of them are failing and I have no clue yet why.
And I'm not able to reproduce it locally, using the same setup as was used in the failing test.
Looks like ensuring that all nodes are UP in the test that was failing previously helped. @Michal-Leszczynski please review.
}

// then: validate that call still takes less than 5 seconds
if err := callValidateHostConnectivityWithTimeout(ctx, s, 30*time.Second, testCluster); !errors.Is(err, tc.result) {
Why do we have 5 seconds in the comment, and 30 seconds hard coded? Shouldn't we use the timeout here instead?
Yeah, that's a good question.
I was a bit too optimistic with 5 seconds for a cluster with an unresponsive coordinator host.
The first call to the coordinator already exhausts the 5-second budget.
My tests showed that 11 seconds is the minimum to pass.
Let me change the expectation to 15 seconds. That's still OK.
Did you investigate why it takes 11 seconds and not 5 seconds + epsilon?
@@ -547,6 +553,36 @@ func tryStartAgent(t *testing.T, hosts []string) {
	}
}

func ensureNodesAreUP(t *testing.T, hosts []string, timeout time.Duration) error {
Can't we just create a client with all nodes and get their status from there?
If it's important to get how each node self-reports its status, instead of the statuses perceived by a single node, we could either create a per-node client or use forceHost in status requests.
It would allow us to skip manual output parsing and make things easier in general.
We could use the client.
But if someone called blockScyllaAPI before, then it will never return the status found by the gossiper, as the call will be blocked.
Let's leave `nodetool status` to do this check.
> But if someone called blockScyllaAPI before, then it will never return the status found by the gossiper, as the call will be blocked.

It should return after the timeout, right? If we want it to be quick, we can set the timeout to a low value.
It's not only about the timeout, but about the actual response.
If the API is blocked I won't know which nodes are UP/DN, but I do know it when I use nodetool status.
Okay, but getting back to the beginning - we have a test that stops a node and blocks its API. At the end of this test we want to start the node and unblock its API. We also want to wait for those changes to really set in, so that they don't affect other tests by accident.
- Why can't we simply create a client to all of the nodes and send some per-host requests (e.g. `client.Snapshots`) to all of them sequentially? A node needs to be both up and responsive on the API for the request to succeed. We can do it sequentially, because querying the endpoint doesn't do any real work - it just checks responsiveness - so making it parallel does not give a visible time improvement, but it adds complexity to the code. If you prefer to do it in parallel, we should use the already implemented and tested functions for that (e.g. `parallel.Run`).
- Shouldn't we add this waiting step to the test that stops nodes? I would say that it's easier to remember to run this function at the end of the test that modifies node state than at the beginning of every test that requires nodes to be in the correct state (so basically all of the tests). In this PR you encountered this problem only for a single test, but that's because of the given sequential order of test execution, which we shouldn't rely on.
		err := WaitForNodeUP(h, timeout)
		if err != nil {
			t.Log(err)
		}
	}(host)
}

go func() {
	wg.Wait()
	close(done)
}()

select {
case <-done:
	return nil
So the error (that could come from the fact that a node is down) is ignored (except for being logged), so the whole function can return nil even when nodes are down?
Yes, the error is just logged. It comes from the fact that WaitForNodeUP may only return an error when it times out.
But the logic is blurry, I agree. The function will not report that a node is UP when it's DOWN, but it will just return a timeout error.
Anyway, let me try to make the logic a bit easier to read and understand.
> The function will not report that a node is UP when it's DOWN, but it will just return a timeout error.

But the timeout error in this case is synonymous with the node not being in the UN state, right?
And then it's ignored, which means that this function has a race between the timeout set within its body and the timeout set in the per-node check. So it could return no error even when all nodes are down and the UN check timed out.
I agree that there is a race...
I removed the last select, so it's going to wait just on wg.Wait().
pkg/testutils/exec.go (outdated)

func WaitForNodeUP(h string, timeout time.Duration) error {
	nodeIsReady := make(chan struct{})
	done := make(chan struct{})
	go func() {
		defer close(nodeIsReady)
		for {
			select {
			case <-done:
				return
			case <-time.After(time.Second):
So we wait 1 second even before the first check? This could be problematic with small timeouts like 1 second.
You are right. Fixed
@karol-kokoszka I resolved some already-fixed comments.