add leader election retry #66

Merged: 37 commits merged into main from fix-leader-election-retry on Feb 7, 2024

Conversation

samos123 (Contributor) commented Feb 3, 2024

  • adds a test that ensures the apiserver going away for 20 seconds forces a successful retry of leader election

For reference, see the Kong ingress controller, which had to implement the same thing: Kong/kubernetes-ingress-controller#578

Fixes #60

* adds a test that ensures the apiserver going away for 20 seconds would force a successful retry of leader election
samos123 requested a review from nstogner on February 3, 2024 17:12
samos123 force-pushed the fix-leader-election-retry branch from f144193 to e365dd7 on February 3, 2024 23:06
samos123 changed the title from "fix leader election retry" to "add leader election retry" on Feb 3, 2024
}
}
log.Printf("Starting leader election process. RetryCount: %v", retryCount+1)
leaderelection.RunOrDie(ctx, le.config)
Contributor:

Is the idea that RunOrDie eventually exits if it loses connection to the API Server?

Contributor Author (samos123):

That's exactly what seems to end up happening. This Kong PR has more details: Kong/kubernetes-ingress-controller#578

samos123 (Contributor Author) commented Feb 4, 2024:

I double-confirmed this by checking the logs and seeing how often it had to retry (re-run RunOrDie) while the apiserver was down for ~2 minutes.
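
For reference, here is a minimal, self-contained sketch of the retry loop being discussed. The Election struct, the backoff bounds, and the backoffID value are assumptions for illustration; they are not copied verbatim from this PR.

package leader

import (
	"context"
	"log"
	"time"

	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/util/flowcontrol"
)

// Election wraps a prepared client-go leader election config (simplified for this sketch).
type Election struct {
	config leaderelection.LeaderElectionConfig
}

// Start loops forever: RunOrDie blocks while this replica tries to acquire or
// holds the lease, and returns once leadership is lost (e.g. the API server
// becomes unreachable), after which we back off and run it again.
func (le *Election) Start(ctx context.Context) error {
	backoff := flowcontrol.NewBackOff(1*time.Second, 15*time.Second) // assumed bounds
	const backoffID = "leader-election"
	retryCount := 0
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		default:
			if retryCount > 0 {
				backoff.Next(backoffID, backoff.Clock.Now())
				delay := backoff.Get(backoffID)
				log.Printf("Leader election failed, retrying in %v. RetryCount: %v", delay, retryCount+1)
				// Note: a bare <-time.After ignores ctx cancellation; the review
				// discussion below suggests moving this wait into a select with ctx.Done().
				<-time.After(delay)
			}
			log.Printf("Starting leader election process. RetryCount: %v", retryCount+1)
			leaderelection.RunOrDie(ctx, le.config)
			retryCount++
		}
	}
}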

samos123 (Contributor Author) commented Feb 4, 2024

I verified it also works when the connection is lost for a longer duration:

E0204 05:05:39.433070       1 leaderelection.go:332] error retrieving resource lock default/lingo.substratus.ai: client rate limiter Wait returned an error: context deadline exceeded
I0204 05:05:39.433094       1 leaderelection.go:285] failed to renew lease default/lingo.substratus.ai: timed out waiting for the condition
2024/02/04 05:05:39 Stopped leading
2024/02/04 05:05:39 Leader election failed, retrying in 1s. RetryCount: 4
2024/02/04 05:05:40 Starting leader election process. RetryCount: 4
I0204 05:05:40.433273       1 leaderelection.go:250] attempting to acquire leader lease default/lingo.substratus.ai...
E0204 05:05:40.433864       1 leaderelection.go:332] error retrieving resource lock default/lingo.substratus.ai: Get "https://10.96.0.1:443/apis/coordination.k8s.io/v1/namespaces/default/leases/lingo.substratus.ai": dial tcp 10.96.0.1:443: connect: connection refused
2024/02/04 05:05:42 Not leader, doing nothing
E0204 05:05:42.888069       1 leaderelection.go:332] error retrieving resource lock default/lingo.substratus.ai: Get "https://10.96.0.1:443/apis/coordination.k8s.io/v1/namespaces/default/leases/lingo.substratus.ai": dial tcp 10.96.0.1:443: connect: connection refused
2024/02/04 05:05:45 Not leader, doing nothing
E0204 05:05:45.217642       1 leaderelection.go:332] error retrieving resource lock default/lingo.substratus.ai: Get "https://10.96.0.1:443/apis/coordination.k8s.io/v1/namespaces/default/leases/lingo.substratus.ai": dial tcp 10.96.0.1:443: connect: connection refused
2024/02/04 05:05:48 Not leader, doing nothing
E0204 05:05:49.291477       1 leaderelection.go:332] error retrieving resource lock default/lingo.substratus.ai: Get "https://10.96.0.1:443/apis/coordination.k8s.io/v1/namespaces/default/leases/lingo.substratus.ai": dial tcp 10.96.0.1:443: connect: connection refused
2024/02/04 05:05:51 Not leader, doing nothing
E0204 05:05:51.652288       1 leaderelection.go:332] error retrieving resource lock default/lingo.substratus.ai: Get "https://10.96.0.1:443/apis/coordination.k8s.io/v1/namespaces/default/leases/lingo.substratus.ai": dial tcp 10.96.0.1:443: connect: connection refused
E0204 05:05:53.781887       1 leaderelection.go:332] error retrieving resource lock default/lingo.substratus.ai: Get "https://10.96.0.1:443/apis/coordination.k8s.io/v1/namespaces/default/leases/lingo.substratus.ai": dial tcp 10.96.0.1:443: connect: connection refused
2024/02/04 05:05:54 Not leader, doing nothing
2024/02/04 05:05:57 Not leader, doing nothing
E0204 05:05:57.501245       1 leaderelection.go:332] error retrieving resource lock default/lingo.substratus.ai: Get "https://10.96.0.1:443/apis/coordination.k8s.io/v1/namespaces/default/leases/lingo.substratus.ai": dial tcp 10.96.0.1:443: connect: connection refused
2024/02/04 05:06:00 Not leader, doing nothing
I0204 05:06:01.659294       1 leaderelection.go:260] successfully acquired lease default/lingo.substratus.ai
2024/02/04 05:06:01 Started leading
2024/02/04 05:06:03 Calculating scales for all
2024/02/04 05:06:03 Leader active requests for: stapi-minilm-l6-v2: 0
2024/02/04 05:06:03 Leader active requests for: kubernetes: 0

samos123 requested a review from nstogner on February 4, 2024 05:15
nstogner (Contributor) left a comment:

Cool way of testing this in e2e tests.

return
default:
if retryCount > 0 {
backoff.Next(backoffID, backoff.Clock.Now())
Contributor:

Neat, I didn't know about this library.

backoff.Next(backoffID, backoff.Clock.Now())
delay := backoff.Get(backoffID)
log.Printf("Leader election failed, retrying in %v. RetryCount: %v", delay, retryCount+1)
<-time.After(delay)
Contributor:

time.After does not respect context cancellation. I put together what I think would be a better way to go about this loop here as an example: https://github.com/substratusai/lingo/compare/example-leader-election-loop ... I played around with this implementation and it exits immediately on context cancellation while still accomplishing expo-backoff (go run ./hack/retry ... ctrl-c). This is done by moving time.After into the select cases.

Contributor Author (samos123):

Originally I had this in a select, but I removed it to simplify things since I thought the higher-level select was good enough. I will move it back in; makes sense. Thanks for validating this with a quick experiment!
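
To illustrate the suggestion, here is a small, self-contained sketch of a context-aware wait. The helper name waitWithContext and the signal handling are illustrative assumptions, not code from the linked example branch.

package main

import (
	"context"
	"fmt"
	"os/signal"
	"syscall"
	"time"
)

// waitWithContext sleeps for delay but returns early if ctx is canceled,
// unlike a bare <-time.After(delay), which always waits the full delay.
func waitWithContext(ctx context.Context, delay time.Duration) error {
	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-time.After(delay):
		return nil
	}
}

func main() {
	// Cancel the context on ctrl-c / SIGTERM, mirroring the manual experiment described above.
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGINT, syscall.SIGTERM)
	defer stop()

	if err := waitWithContext(ctx, 10*time.Second); err != nil {
		fmt.Println("wait aborted:", err)
		return
	}
	fmt.Println("waited the full delay")
}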

sleep 120
docker exec ${KIND_NODE} iptables -D INPUT -p tcp --dport 6443 -j DROP

until kubectl get pods; do
Contributor:

What is until kubectl get pods; do doing?

Contributor Author (samos123):

Even after the iptables rule is removed, it sometimes takes a while for Kubernetes to recover. This until loop waits until kubectl get pods starts working again, i.e. the script only continues once kubectl get pods returns exit code 0.

done

# rerun kubectl logs because previous one got killed when apiserver was down
kubectl logs --tail=500 -f deployment/lingo &
Contributor:

I think you might want --tail=-1 if you are trying to get all logs since the restart.

Contributor Author (samos123):

I didn't want all the logs, which may already have 1000+ entries, only the last 500. I tried the last 100 and it wasn't enough, but the last 500 was more than enough.

echo "Waiting for deployment to scale down back to 0 within 2 minutes"
# Verify that leader election works by forcing a 120 second apiserver outage
KIND_NODE=$(kind get nodes --name=substratus-test)
docker exec ${KIND_NODE} iptables -I INPUT -p tcp --dport 6443 -j DROP
Contributor:

Is 6443 the port of the API Server running on kind?

Contributor Author (samos123):

Yep, and credit to ChatGPT for coming up with the high-level idea of simply blocking the traffic.

k describe svc kubernetes 
Name:              kubernetes
Namespace:         default
Labels:            component=apiserver
                   provider=kubernetes
Annotations:       <none>
Selector:          <none>
Type:              ClusterIP
IP Family Policy:  SingleStack
IP Families:       IPv4
IP:                10.96.0.1
IPs:               10.96.0.1
Port:              https  443/TCP
TargetPort:        6443/TCP
Endpoints:         172.18.0.2:6443
Session Affinity:  None
Events:            <none>

Utilize select when waiting for backoff to handle the scenario where the context is cancelled while waiting.

Add returning of error so any errors are propagated in the log files
samos123 requested a review from nstogner on February 4, 2024 20:20
nstogner (Contributor) left a comment:

LGTM!

alpe (Contributor) left a comment:

Good work 🏄
I added some minor comments

err := le.Start(ctx)
if err != nil {
setupLog.Error(err, "starting leader election")
os.Exit(1)
Contributor:

When the context is canceled on a graceful shutdown (SIGTERM), Start() would return the "cancel" error, which results in an Exit(1). This error handling strategy is used in other places in this file as well. Instead, we could wrap the context at L110 with ctx, cancel := context.WithCancel(ctrl.SetupSignalHandler()) and call cancel() instead. WDYT?

Contributor:

As the error already comes from the context, it may be simpler to do nothing and let the goroutine complete.

Contributor Author (samos123):

I didn't fully understand how to change it. Could you elaborate? Are you saying instead of calling os.Exit() just call cancel()?

I didn't see any other example of using cancel on the context in main.go. I tried to follow the same pattern as the other start functions.

Contributor:

The only error returned from Start() now comes from the context, thanks to your for loop. Calling cancel on the context again is not needed. I would not return an error from Start() and would let the goroutine complete; the main thread should handle the shutdown. I should have deleted my first comment, sorry for the confusion.
To elaborate for a different use case where other errors can happen, this is an example:

	ctx, cancel := context.WithCancel(ctrl.SetupSignalHandler())
	go func() {
		// in subroutine
		if err := doSomething(ctx); err != nil {
			// todo: log error
			cancel()
		}
	}()
	// main thread
	if err := doSomethingElse(ctx); err != nil {
		// todo: log error
		os.Exit(1)
	}

Please note that the error is not returned to the main thread.

Contributor Author (samos123):

Thanks for that example! I think I get it now. To me it feels cleaner to exit directly instead of cancelling a context that may later cause an exit. I favor being explicit and taking the exit path as soon as it makes sense.

@@ -79,7 +79,7 @@ func (s *scaler) compareScales(current, desired int32) {
s.desiredScale = desired
}

if s.currentScale == -1 || s.desiredScale == -1 {
if s.currentScale == -1 || s.desiredScale == -1 || desired == -1 {
Contributor:

good catch 👍

Contributor Author (samos123):

It doesn't actually solve issue #67, so I might revert this.

@@ -35,11 +36,11 @@ func NewElection(clientset kubernetes.Interface, id, namespace string) *Election
RetryPeriod: 2 * time.Second,
Callbacks: leaderelection.LeaderCallbacks{
OnStartedLeading: func(ctx context.Context) {
log.Println("Started leading")
log.Printf("%v started leading", id)
Contributor:

nit: %v does the job, but it is more common to use %s for a string or %q for a quoted string

Contributor Author (samos123):

I didn't know about %q; that might actually be nice here.
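
For illustration, a quick sketch of the difference between the verbs; the id value here is hypothetical.

package main

import "fmt"

func main() {
	id := "lingo-7d9f"                     // hypothetical leader ID
	fmt.Printf("%v started leading\n", id) // lingo-7d9f started leading
	fmt.Printf("%s started leading\n", id) // lingo-7d9f started leading
	fmt.Printf("%q started leading\n", id) // "lingo-7d9f" started leading
}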

pkg/leader/election.go (review thread resolved)
tests/e2e/test.sh (outdated, review thread resolved)
samos123 force-pushed the fix-leader-election-retry branch from 68f7c53 to 610ba31 on February 6, 2024 06:53
samos123 (Contributor Author) commented Feb 6, 2024

Any ideas why it's not scaling back to 0? https://github.com/substratusai/lingo/actions/runs/7795813038/job/21259370590?pr=66

samos123 (Contributor Author) commented Feb 6, 2024

I am giving up on fixing #67 in this PR. I think fixing #67 will first require a rewrite of, and unit tests for, the scaler, since I can't seem to get the hang of the current logic.

samos123 force-pushed the fix-leader-election-retry branch 2 times, most recently from 64fecb4 to 7310411 on February 7, 2024 04:17
samos123 merged commit 30e555d into main on Feb 7, 2024
7 of 8 checks passed
samos123 deleted the fix-leader-election-retry branch on February 7, 2024 04:34

Successfully merging this pull request may close these issues.

Lingo not leader even though there is only 1 replica after long pod uptime
3 participants