Parallelized cluster hosts discovery + async config cache update #4077

Open · wants to merge 10 commits into master
Conversation

@karol-kokoszka (Collaborator)

Fixes #4074

This PR makes the cluster hosts discovery more robust: when cluster.host is DOWN, it probes all other hosts in parallel and returns the response from the fastest one. Additionally, it makes the call to the cluster config cache asynchronous when updating a cluster.
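For illustration, a minimal Go sketch of the "probe all hosts, take the fastest" pattern described above. Everything here is hypothetical (`discoverAliveHost` and `probe` are placeholder names, not the PR's actual API):

```go
package cluster

import (
	"context"
	"errors"
	"sync"
)

// discoverAliveHost probes all known hosts in parallel and returns the
// first one that responds successfully. probe stands in for the real
// connectivity check (e.g. an agent ping).
func discoverAliveHost(ctx context.Context, hosts []string,
	probe func(context.Context, string) error) (string, error) {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel() // abandon slower probes once a winner is found

	found := make(chan string, len(hosts)) // buffered: probes never block
	var wg sync.WaitGroup
	for _, h := range hosts {
		wg.Add(1)
		go func(h string) {
			defer wg.Done()
			if err := probe(ctx, h); err == nil {
				found <- h
			}
		}(h)
	}
	go func() {
		wg.Wait()
		close(found)
	}()

	if h, ok := <-found; ok {
		return h, nil
	}
	return "", errors.New("no host responded")
}
```

The async config cache part of the PR would, by the same token, amount to moving the cache-update call off the cluster-update request path into its own goroutine.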


Please make sure that:

  • Code is split to commits that address a single change
  • Commit messages are informative
  • Commit titles have module prefix
  • Commit titles have issue nr. suffix

@karol-kokoszka (Collaborator, Author)

@Michal-Leszczynski Please take a look at the PR.

@Michal-Leszczynski (Collaborator)

@karol-kokoszka It looks like tests are failing.

@karol-kokoszka (Collaborator, Author)

Yes, some of them are failing and I have no clue yet why.

@karol-kokoszka (Collaborator, Author)

And I'm not able to reproduce it locally, using the same setup as the failing test. It simply passes locally.
I'm starting to think it's related to the environment on GitHub Actions.

@karol-kokoszka (Collaborator, Author)

It looks like ensuring that all nodes are UP in the previously failing test helped.
We start and stop services using supervisorctl, as we run against the Docker environment, and a service start doesn't mean that the node is UP.
I added a simple utility function that ensures the nodes are UP.

@Michal-Leszczynski please review.

pkg/service/cluster/service_integration_test.go (three review threads, outdated, resolved)
}

// then: validate that call still takes less than 5 seconds
if err := callValidateHostConnectivityWithTimeout(ctx, s, 30*time.Second, testCluster); !errors.Is(err, tc.result) {
@Michal-Leszczynski (Collaborator)

Why do we have 5 seconds in the comment, and 30 seconds hard coded? Shouldn't we use the timeout here instead?

@karol-kokoszka (Collaborator, Author)

Yeah, that's a good question.
I was a bit too optimistic with 5 seconds for a cluster with an unresponsive coordinator host.

The first call to the coordinator already exhausts the 5-second budget.
My tests showed that 11 seconds is the minimum to pass.

Let me change the expectation to 15 seconds. That's still OK.

@Michal-Leszczynski (Collaborator)

Did you investigate why it takes 11 seconds and not 5 seconds + epsilon?

pkg/service/cluster/service_integration_test.go (outdated, resolved)
pkg/service/cluster/service.go (resolved)
@@ -547,6 +553,36 @@ func tryStartAgent(t *testing.T, hosts []string) {
}
}

func ensureNodesAreUP(t *testing.T, hosts []string, timeout time.Duration) error {
@Michal-Leszczynski (Collaborator)

Can't we just create a client with all nodes and get their status from there?
If it's important to get how each node self-reports its status, instead of the statuses perceived by a single node, we could either create a per-node client or use forceHost in status requests.
That would allow us to skip manual output parsing and make things easier in general.

@karol-kokoszka (Collaborator, Author)

We could use the client.
But if someone called blockScyllaAPI before, it will never return the status found by the gossiper, as the call will be blocked.
Let's leave "nodetool status" to do this check.
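For context, a minimal sketch of a nodetool-based check (a hypothetical helper; the exact flags and output layout are assumptions, not the PR's code). A node counts as UP when its row in the `nodetool status` output carries the UN (Up/Normal) marker:

```go
package tests

import (
	"os/exec"
	"strings"
)

// nodeIsUN reports whether the given host shows up as "UN" (Up/Normal)
// in "nodetool status" output.
func nodeIsUN(host string) (bool, error) {
	out, err := exec.Command("nodetool", "-h", host, "status").CombinedOutput()
	if err != nil {
		return false, err
	}
	for _, line := range strings.Split(string(out), "\n") {
		// Status rows look like: "UN  192.168.100.11  256.94 KB  256  ..."
		fields := strings.Fields(line)
		if len(fields) >= 2 && fields[0] == "UN" && fields[1] == host {
			return true, nil
		}
	}
	return false, nil
}
```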

@Michal-Leszczynski (Collaborator)

> But if someone called blockScyllaAPI before, it will never return the status found by the gossiper, as the call will be blocked.

It should return after the timeout, right? If we want it to be quick, we can set the timeout to a low value.

@karol-kokoszka (Collaborator, Author)

It's not only about the timeout, but about the actual response.
If the API is blocked, I won't know which nodes are UP/DN, but I do know it when I use nodetool status.

@Michal-Leszczynski (Collaborator)

Okay, but getting back to the beginning - we have a test that stops a node and blocks its API. At the end of this test we want to start the node and unblock its API. We also want to wait for those changes to really set in, so that they don't affect other tests by accident.

  1. Why can't we simply create a client to all of the nodes and send some per-host requests (e.g. client.Snapshots) to all of them sequentially? A node needs to be both up and responsive on the API for the request to succeed. We can do it sequentially, because querying the endpoint doesn't do any real work - it just checks responsiveness - so doing it in parallel provides no visible time improvement while adding complexity to the code. If you prefer to do it in parallel, we should use already implemented and tested functions for that (e.g. parallel.Run). See the sketch after this list.
  2. Shouldn't we add this waiting step to the test that stops nodes? It's easier to remember to run this function at the end of the test that modifies node state than at the beginning of every test that requires nodes to be in the correct state (so basically all of the tests). In this PR you encountered this problem only for a single test, but that's because of the current sequential order of test execution, which we shouldn't rely on.
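A minimal sketch of the sequential variant from point 1. All names here are hypothetical (`probeHost` stands in for any cheap per-host request, such as a snapshot listing forced to a given host); it is not the PR's actual code:

```go
package tests

import (
	"context"
	"fmt"
	"time"
)

// waitForNodesResponsive sequentially waits until every node answers a
// cheap per-host API request. A node must be both up and responsive on
// its API for the probe to succeed.
func waitForNodesResponsive(ctx context.Context, hosts []string,
	probeHost func(context.Context, string) error, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for _, h := range hosts {
		for {
			if err := probeHost(ctx, h); err == nil {
				break // node is up and its API answers
			}
			if time.Now().After(deadline) {
				return fmt.Errorf("node %s not responsive within %s", h, timeout)
			}
			time.Sleep(time.Second)
		}
	}
	return nil
}
```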

Comment on lines 566 to 580
			err := WaitForNodeUP(h, timeout)
			if err != nil {
				t.Log(err)
			}
		}(host)
	}

	go func() {
		wg.Wait()
		close(done)
	}()

	select {
	case <-done:
		return nil
@Michal-Leszczynski (Collaborator)

So the error (that could come from the fact that a node is down) is ignored (except for being logged), so the whole function can return nil even when nodes are down?

@karol-kokoszka (Collaborator, Author)

Yes, the error is just logged. It comes from the fact that WaitForNodeUP may only return an error when it times out.
But the logic is blurry, I agree. Even though the function will not report that a node is UP when it's DOWN, it will return a timeout error.

Anyway, let me try to make the logic a bit easier to read and understand.

@Michal-Leszczynski (Collaborator)

> Even though the function will not report that a node is UP when it's DOWN, it will return a timeout error.

But the timeout error in this case is synonymous with the node not being in the UN state, right?
And then it's ignored - which means this function has a race between the timeout set within its body and the timeout set in the per-node check. So it could return no error even when all nodes are down and the UN check timed out.

@karol-kokoszka (Collaborator, Author)

I agree that there is a race...
I'm removing the last select, so the function will wait just on wg.Wait(). A sketch of the resulting shape is below.
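A sketch of what the reworked function could look like (an assumed shape, not the final code): wait only on the WaitGroup and join the per-node errors, so a DOWN node can no longer be silently swallowed. errors.Join requires Go 1.20+; WaitForNodeUP is the helper from this file:

```go
package tests

import (
	"errors"
	"sync"
	"testing"
	"time"
)

// ensureNodesAreUP waits for every host to report UP and returns the
// joined per-node errors; the per-node timeout is the only deadline.
func ensureNodesAreUP(t *testing.T, hosts []string, timeout time.Duration) error {
	t.Helper()

	var (
		mu   sync.Mutex
		errs error
		wg   sync.WaitGroup
	)
	for _, host := range hosts {
		wg.Add(1)
		go func(h string) {
			defer wg.Done()
			if err := WaitForNodeUP(h, timeout); err != nil {
				mu.Lock()
				errs = errors.Join(errs, err)
				mu.Unlock()
			}
		}(host)
	}
	wg.Wait() // no extra select racing a second timeout
	return errs
}
```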

Comment on lines 109 to 118
func WaitForNodeUP(h string, timeout time.Duration) error {
	nodeIsReady := make(chan struct{})
	done := make(chan struct{})
	go func() {
		defer close(nodeIsReady)
		for {
			select {
			case <-done:
				return
			case <-time.After(time.Second):
@Michal-Leszczynski (Collaborator)

So we wait 1 second even before the first check? This could be problematic with small timeouts like 1 second.

@karol-kokoszka (Collaborator, Author)

You are right. Fixed.
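For illustration, a sketch of the fix in its assumed form: run the first check immediately, then poll once per second until the deadline. checkNodeUN is a hypothetical stand-in for the nodetool-status check:

```go
package tests

import (
	"fmt"
	"time"
)

// WaitForNodeUP polls until the node reports UN or the timeout elapses.
// The first check runs with no initial delay, so small timeouts work.
func WaitForNodeUP(h string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		if checkNodeUN(h) { // hypothetical helper; immediate first check
			return nil
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("node %s is not UP after %s", h, timeout)
		}
		time.Sleep(time.Second)
	}
}
```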

pkg/service/healthcheck/service_integration_test.go (outdated, resolved)
@Michal-Leszczynski (Collaborator)

@karol-kokoszka I resolved some already-fixed comments.
The comments regarding checking node state in tests (com1, com2, com3) are tied to this conversation.
When all comments are resolved, I will re-review this PR and hopefully we will be able to merge it!

Successfully merging this pull request may close: Cluster update API call may timeout when one of the known hosts is down (#4074).