Parallelized cluster hosts discovery + async config cache update #4077

Open · wants to merge 10 commits into master
Conversation

@karol-kokoszka (Collaborator)

Fixes #4074

This PR makes the cluster hosts discovery more robust: when cluster.host is DOWN, it probes all other hosts in parallel and returns the response from the fastest one. Additionally, it makes the call to the cluster config cache asynchronous when updating a cluster.
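For illustration, a minimal Go sketch of the "probe all hosts, take the fastest" pattern described above. Everything here is hypothetical (`discoverAliveHost` and `probe` are placeholder names, not the PR's actual API):

```go
package cluster

import (
	"context"
	"errors"
	"sync"
)

// discoverAliveHost probes all known hosts in parallel and returns the
// first one that responds successfully. probe stands in for the real
// connectivity check (e.g. an agent ping).
func discoverAliveHost(ctx context.Context, hosts []string,
	probe func(context.Context, string) error) (string, error) {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel() // abandon slower probes once a winner is found

	found := make(chan string, len(hosts)) // buffered: probes never block
	var wg sync.WaitGroup
	for _, h := range hosts {
		wg.Add(1)
		go func(h string) {
			defer wg.Done()
			if err := probe(ctx, h); err == nil {
				found <- h
			}
		}(h)
	}
	go func() {
		wg.Wait()
		close(found)
	}()

	if h, ok := <-found; ok {
		return h, nil
	}
	return "", errors.New("no host responded")
}
```

The async config cache part of the PR would, by the same token, amount to moving the cache-update call off the cluster-update request path into its own goroutine.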


Please make sure that:

  • Code is split to commits that address a single change
  • Commit messages are informative
  • Commit titles have module prefix
  • Commit titles have issue nr. suffix

@karol-kokoszka (Collaborator, Author)

@Michal-Leszczynski Please take a look at the PR.

@Michal-Leszczynski (Collaborator)

@karol-kokoszka It looks like tests are failing.

@karol-kokoszka (Collaborator, Author)

Yes, some of them are failing and I have no clue yet why.

@karol-kokoszka (Collaborator, Author)

And I'm not able to reproduce it locally, using the same setup as the failing test. It simply passes locally.
I'm starting to think it's related to the environment on GitHub Actions.

@karol-kokoszka (Collaborator, Author)

It looks like ensuring that all nodes are UP in the previously failing test helped.
We start and stop services using supervisorctl, as we run against the Docker environment, and a service start doesn't mean that the node is UP.
I added a simple utility function that ensures the nodes are UP.

@Michal-Leszczynski please review.

pkg/service/cluster/service_integration_test.go (three review threads, outdated, resolved)
}

// then: validate that call still takes less than 5 seconds
if err := callValidateHostConnectivityWithTimeout(ctx, s, 30*time.Second, testCluster); !errors.Is(err, tc.result) {
@Michal-Leszczynski (Collaborator)

Why do we have 5 seconds in the comment, and 30 seconds hard coded? Shouldn't we use the timeout here instead?

@karol-kokoszka (Collaborator, Author)

Yeah, that's a good question.
I was a bit too optimistic with 5 seconds for a cluster with an unresponsive coordinator host.

The first call to the coordinator already exhausts the 5-second budget.
My tests showed that 11 seconds is the minimum to pass.

Let me change the expectation to 15 seconds. That's still OK.

@Michal-Leszczynski (Collaborator)

Did you investigate why it takes 11 seconds and not 5 seconds + epsilon?

pkg/service/cluster/service_integration_test.go (outdated, resolved)
pkg/service/cluster/service.go (resolved)
@@ -547,6 +553,36 @@ func tryStartAgent(t *testing.T, hosts []string) {
}
}

func ensureNodesAreUP(t *testing.T, hosts []string, timeout time.Duration) error {
@Michal-Leszczynski (Collaborator)

Can't we just create a client with all nodes and get their status from there?
If it's important to get how each node self-reports its status, instead of the statuses perceived by a single node, we could either create a per-node client or use forceHost in status requests.
That would allow us to skip manual output parsing and make things easier in general.

@karol-kokoszka (Collaborator, Author)

We could use the client.
But if someone called blockScyllaAPI before, it will never return the status found by the gossiper, as the call will be blocked.
Let's leave "nodetool status" to do this check.
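For context, a minimal sketch of a nodetool-based check (a hypothetical helper; the exact flags and output layout are assumptions, not the PR's code). A node counts as UP when its row in the `nodetool status` output carries the UN (Up/Normal) marker:

```go
package tests

import (
	"os/exec"
	"strings"
)

// nodeIsUN reports whether the given host shows up as "UN" (Up/Normal)
// in "nodetool status" output.
func nodeIsUN(host string) (bool, error) {
	out, err := exec.Command("nodetool", "-h", host, "status").CombinedOutput()
	if err != nil {
		return false, err
	}
	for _, line := range strings.Split(string(out), "\n") {
		// Status rows look like: "UN  192.168.100.11  256.94 KB  256  ..."
		fields := strings.Fields(line)
		if len(fields) >= 2 && fields[0] == "UN" && fields[1] == host {
			return true, nil
		}
	}
	return false, nil
}
```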

@Michal-Leszczynski (Collaborator)

> But if someone called blockScyllaAPI before, it will never return the status found by the gossiper, as the call will be blocked.

It should return after the timeout, right? If we want it to be quick, we can set the timeout to a low value.

@karol-kokoszka (Collaborator, Author)

It's not only about the timeout, but about the actual response.
If the API is blocked, I won't know which nodes are UP/DN, but I do know it when I use nodetool status.

@Michal-Leszczynski (Collaborator)

Okay, but getting back to the beginning - we have a test that stops a node and blocks its API. At the end of this test we want to start the node and unblock its API. We also want to wait for those changes to really set in, so that they don't affect other tests by accident.

  1. Why can't we simply create a client to all of the nodes and send some per-host requests (e.g. client.Snapshots) to all of them sequentially? A node needs to be both up and responsive on the API for the request to succeed. We can do it sequentially, because querying the endpoint doesn't do any real work - it just checks responsiveness - so doing it in parallel provides no visible time improvement while adding complexity to the code. If you prefer to do it in parallel, we should use already implemented and tested functions for that (e.g. parallel.Run). See the sketch after this list.
  2. Shouldn't we add this waiting step to the test that stops nodes? It's easier to remember to run this function at the end of the test that modifies node state than at the beginning of every test that requires nodes to be in the correct state (so basically all of the tests). In this PR you encountered this problem only for a single test, but that's because of the current sequential order of test execution, which we shouldn't rely on.
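A minimal sketch of the sequential variant from point 1. All names here are hypothetical (`probeHost` stands in for any cheap per-host request, such as a snapshot listing forced to a given host); it is not the PR's actual code:

```go
package tests

import (
	"context"
	"fmt"
	"time"
)

// waitForNodesResponsive sequentially waits until every node answers a
// cheap per-host API request. A node must be both up and responsive on
// its API for the probe to succeed.
func waitForNodesResponsive(ctx context.Context, hosts []string,
	probeHost func(context.Context, string) error, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for _, h := range hosts {
		for {
			if err := probeHost(ctx, h); err == nil {
				break // node is up and its API answers
			}
			if time.Now().After(deadline) {
				return fmt.Errorf("node %s not responsive within %s", h, timeout)
			}
			time.Sleep(time.Second)
		}
	}
	return nil
}
```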

Comment on lines 566 to 580
			err := WaitForNodeUP(h, timeout)
			if err != nil {
				t.Log(err)
			}
		}(host)
	}

	go func() {
		wg.Wait()
		close(done)
	}()

	select {
	case <-done:
		return nil
@Michal-Leszczynski (Collaborator)

So the error (that could come from the fact that a node is down) is ignored (except for being logged), so the whole function can return nil even when nodes are down?

@karol-kokoszka (Collaborator, Author)

Yes, the error is just logged. It comes from the fact that WaitForNodeUP may only return an error when it times out.
But the logic is blurry, I agree. Even though the function will not report that a node is UP when it's DOWN, it will return a timeout error.

Anyway, let me try to make the logic a bit easier to read and understand.

@Michal-Leszczynski (Collaborator)

> Even though the function will not report that a node is UP when it's DOWN, it will return a timeout error.

But the timeout error in this case is synonymous with the node not being in the UN state, right?
And then it's ignored - which means this function has a race between the timeout set within its body and the timeout set in the per-node check. So it could return no error even when all nodes are down and the UN check timed out.

@karol-kokoszka (Collaborator, Author)

I agree that there is a race...
I'm removing the last select, so the function will wait just on wg.Wait(). A sketch of the resulting shape is below.
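A sketch of what the reworked function could look like (an assumed shape, not the final code): wait only on the WaitGroup and join the per-node errors, so a DOWN node can no longer be silently swallowed. errors.Join requires Go 1.20+; WaitForNodeUP is the helper from this file:

```go
package tests

import (
	"errors"
	"sync"
	"testing"
	"time"
)

// ensureNodesAreUP waits for every host to report UP and returns the
// joined per-node errors; the per-node timeout is the only deadline.
func ensureNodesAreUP(t *testing.T, hosts []string, timeout time.Duration) error {
	t.Helper()

	var (
		mu   sync.Mutex
		errs error
		wg   sync.WaitGroup
	)
	for _, host := range hosts {
		wg.Add(1)
		go func(h string) {
			defer wg.Done()
			if err := WaitForNodeUP(h, timeout); err != nil {
				mu.Lock()
				errs = errors.Join(errs, err)
				mu.Unlock()
			}
		}(host)
	}
	wg.Wait() // no extra select racing a second timeout
	return errs
}
```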

Comment on lines 109 to 118
func WaitForNodeUP(h string, timeout time.Duration) error {
	nodeIsReady := make(chan struct{})
	done := make(chan struct{})
	go func() {
		defer close(nodeIsReady)
		for {
			select {
			case <-done:
				return
			case <-time.After(time.Second):
@Michal-Leszczynski (Collaborator)

So we wait 1 second even before the first check? This could be problematic with small timeouts like 1 second.

@karol-kokoszka (Collaborator, Author)

You are right. Fixed.
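For illustration, a sketch of the fix in its assumed form: run the first check immediately, then poll once per second until the deadline. checkNodeUN is a hypothetical stand-in for the nodetool-status check:

```go
package tests

import (
	"fmt"
	"time"
)

// WaitForNodeUP polls until the node reports UN or the timeout elapses.
// The first check runs with no initial delay, so small timeouts work.
func WaitForNodeUP(h string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		if checkNodeUN(h) { // hypothetical helper; immediate first check
			return nil
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("node %s is not UP after %s", h, timeout)
		}
		time.Sleep(time.Second)
	}
}
```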

pkg/service/healthcheck/service_integration_test.go (outdated, resolved)
@Michal-Leszczynski (Collaborator)

@karol-kokoszka I resolved some already-fixed comments.
The comments regarding checking node state in tests (com1, com2, com3) are tied to this conversation.
When all comments are resolved, I will re-review this PR and hopefully we will be able to merge it!

Successfully merging this pull request may close: Cluster update API call may timeout when one of the known hosts is down (#4074).