add leader election retry #66

Merged: 37 commits merged into main from fix-leader-election-retry on Feb 7, 2024

Conversation

samos123 (Contributor) commented Feb 3, 2024

  • adds a test that ensures the apiserver going away for 20 seconds forces a successful retry of leader election

For reference, see the Kong ingress controller, which had to implement the same thing: Kong/kubernetes-ingress-controller#578

Fixes #60

* adds a test that ensures the apiserver going away for 20 seconds would force a successful retry of leader election
samos123 requested a review from nstogner on February 3, 2024 17:12
samos123 force-pushed the fix-leader-election-retry branch from f144193 to e365dd7 on February 3, 2024 23:06
samos123 changed the title from "fix leader election retry" to "add leader election retry" on Feb 3, 2024
}
}
log.Printf("Starting leader election process. RetryCount: %v", retryCount+1)
leaderelection.RunOrDie(ctx, le.config)
Contributor:

Is the idea that RunOrDie eventually exits if it loses connection to the API Server?

Contributor Author (samos123):

That's exactly what seems to end up happening. This Kong PR has more details: Kong/kubernetes-ingress-controller#578

samos123 (Contributor Author) commented Feb 4, 2024:

I double-confirmed this by checking the logs and seeing how often it had to retry (re-run RunOrDie) while the apiserver was down for ~2 minutes.
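
For reference, here is a minimal, self-contained sketch of the retry loop being discussed. The Election struct, the backoff bounds, and the backoffID value are assumptions for illustration; they are not copied verbatim from this PR.

package leader

import (
	"context"
	"log"
	"time"

	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/util/flowcontrol"
)

// Election wraps a prepared client-go leader election config (simplified for this sketch).
type Election struct {
	config leaderelection.LeaderElectionConfig
}

// Start loops forever: RunOrDie blocks while this replica tries to acquire or
// holds the lease, and returns once leadership is lost (e.g. the API server
// becomes unreachable), after which we back off and run it again.
func (le *Election) Start(ctx context.Context) error {
	backoff := flowcontrol.NewBackOff(1*time.Second, 15*time.Second) // assumed bounds
	const backoffID = "leader-election"
	retryCount := 0
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		default:
			if retryCount > 0 {
				backoff.Next(backoffID, backoff.Clock.Now())
				delay := backoff.Get(backoffID)
				log.Printf("Leader election failed, retrying in %v. RetryCount: %v", delay, retryCount+1)
				// Note: a bare <-time.After ignores ctx cancellation; the review
				// discussion below suggests moving this wait into a select with ctx.Done().
				<-time.After(delay)
			}
			log.Printf("Starting leader election process. RetryCount: %v", retryCount+1)
			leaderelection.RunOrDie(ctx, le.config)
			retryCount++
		}
	}
}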

samos123 (Contributor Author) commented Feb 4, 2024

I verified it also works when the connection is lost for a longer duration:

E0204 05:05:39.433070       1 leaderelection.go:332] error retrieving resource lock default/lingo.substratus.ai: client rate limiter Wait returned an error: context deadline exceeded
I0204 05:05:39.433094       1 leaderelection.go:285] failed to renew lease default/lingo.substratus.ai: timed out waiting for the condition
2024/02/04 05:05:39 Stopped leading
2024/02/04 05:05:39 Leader election failed, retrying in 1s. RetryCount: 4
2024/02/04 05:05:40 Starting leader election process. RetryCount: 4
I0204 05:05:40.433273       1 leaderelection.go:250] attempting to acquire leader lease default/lingo.substratus.ai...
E0204 05:05:40.433864       1 leaderelection.go:332] error retrieving resource lock default/lingo.substratus.ai: Get "https://10.96.0.1:443/apis/coordination.k8s.io/v1/namespaces/default/leases/lingo.substratus.ai": dial tcp 10.96.0.1:443: connect: connection refused
2024/02/04 05:05:42 Not leader, doing nothing
E0204 05:05:42.888069       1 leaderelection.go:332] error retrieving resource lock default/lingo.substratus.ai: Get "https://10.96.0.1:443/apis/coordination.k8s.io/v1/namespaces/default/leases/lingo.substratus.ai": dial tcp 10.96.0.1:443: connect: connection refused
2024/02/04 05:05:45 Not leader, doing nothing
E0204 05:05:45.217642       1 leaderelection.go:332] error retrieving resource lock default/lingo.substratus.ai: Get "https://10.96.0.1:443/apis/coordination.k8s.io/v1/namespaces/default/leases/lingo.substratus.ai": dial tcp 10.96.0.1:443: connect: connection refused
2024/02/04 05:05:48 Not leader, doing nothing
E0204 05:05:49.291477       1 leaderelection.go:332] error retrieving resource lock default/lingo.substratus.ai: Get "https://10.96.0.1:443/apis/coordination.k8s.io/v1/namespaces/default/leases/lingo.substratus.ai": dial tcp 10.96.0.1:443: connect: connection refused
2024/02/04 05:05:51 Not leader, doing nothing
E0204 05:05:51.652288       1 leaderelection.go:332] error retrieving resource lock default/lingo.substratus.ai: Get "https://10.96.0.1:443/apis/coordination.k8s.io/v1/namespaces/default/leases/lingo.substratus.ai": dial tcp 10.96.0.1:443: connect: connection refused
E0204 05:05:53.781887       1 leaderelection.go:332] error retrieving resource lock default/lingo.substratus.ai: Get "https://10.96.0.1:443/apis/coordination.k8s.io/v1/namespaces/default/leases/lingo.substratus.ai": dial tcp 10.96.0.1:443: connect: connection refused
2024/02/04 05:05:54 Not leader, doing nothing
2024/02/04 05:05:57 Not leader, doing nothing
E0204 05:05:57.501245       1 leaderelection.go:332] error retrieving resource lock default/lingo.substratus.ai: Get "https://10.96.0.1:443/apis/coordination.k8s.io/v1/namespaces/default/leases/lingo.substratus.ai": dial tcp 10.96.0.1:443: connect: connection refused
2024/02/04 05:06:00 Not leader, doing nothing
I0204 05:06:01.659294       1 leaderelection.go:260] successfully acquired lease default/lingo.substratus.ai
2024/02/04 05:06:01 Started leading
2024/02/04 05:06:03 Calculating scales for all
2024/02/04 05:06:03 Leader active requests for: stapi-minilm-l6-v2: 0
2024/02/04 05:06:03 Leader active requests for: kubernetes: 0

samos123 requested a review from nstogner on February 4, 2024 05:15
nstogner (Contributor) left a comment:

Cool way of testing this in e2e tests.

return
default:
if retryCount > 0 {
backoff.Next(backoffID, backoff.Clock.Now())
Contributor:

Neat, I didn't know about this library.

backoff.Next(backoffID, backoff.Clock.Now())
delay := backoff.Get(backoffID)
log.Printf("Leader election failed, retrying in %v. RetryCount: %v", delay, retryCount+1)
<-time.After(delay)
Contributor:

time.After does not respect context cancellation. I put together what I think would be a better way to go about this loop here as an example: https://github.com/substratusai/lingo/compare/example-leader-election-loop ... I played around with this implementation and it exits immediately on context cancellation while still accomplishing expo-backoff (go run ./hack/retry ... ctrl-c). This is done by moving time.After into the select cases.

Contributor Author (samos123):

Originally I had this in a select, but I removed it to simplify things since I thought the higher-level select was good enough. I will move it back in; makes sense. Thanks for validating this with a quick experiment!
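
To illustrate the suggestion, here is a small, self-contained sketch of a context-aware wait. The helper name waitWithContext and the signal handling are illustrative assumptions, not code from the linked example branch.

package main

import (
	"context"
	"fmt"
	"os/signal"
	"syscall"
	"time"
)

// waitWithContext sleeps for delay but returns early if ctx is canceled,
// unlike a bare <-time.After(delay), which always waits the full delay.
func waitWithContext(ctx context.Context, delay time.Duration) error {
	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-time.After(delay):
		return nil
	}
}

func main() {
	// Cancel the context on ctrl-c / SIGTERM, mirroring the manual experiment described above.
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGINT, syscall.SIGTERM)
	defer stop()

	if err := waitWithContext(ctx, 10*time.Second); err != nil {
		fmt.Println("wait aborted:", err)
		return
	}
	fmt.Println("waited the full delay")
}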

sleep 120
docker exec ${KIND_NODE} iptables -D INPUT -p tcp --dport 6443 -j DROP

until kubectl get pods; do
Contributor:

What is until kubectl get pods; do doing?

Contributor Author (samos123):

Even after the iptables rule is removed, it sometimes takes a while for Kubernetes to recover. This until loop waits until kubectl get pods starts working again, i.e. the script only continues once kubectl get pods returns exit code 0.

done

# rerun kubectl logs because previous one got killed when apiserver was down
kubectl logs --tail=500 -f deployment/lingo &
Contributor:

I think you might want --tail=-1 if you are trying to get all logs since the restart.

Contributor Author (samos123):

I didn't want all the logs, which may already have 1000+ entries, only the last 500. I tried the last 100 and it wasn't enough, but the last 500 was more than enough.

echo "Waiting for deployment to scale down back to 0 within 2 minutes"
# Verify that leader election works by forcing a 120 second apiserver outage
KIND_NODE=$(kind get nodes --name=substratus-test)
docker exec ${KIND_NODE} iptables -I INPUT -p tcp --dport 6443 -j DROP
Contributor:

Is 6443 the port of the API Server running on kind?

Contributor Author (samos123):

Yep, and credit to ChatGPT for coming up with the high-level idea of simply blocking the traffic.

k describe svc kubernetes 
Name:              kubernetes
Namespace:         default
Labels:            component=apiserver
                   provider=kubernetes
Annotations:       <none>
Selector:          <none>
Type:              ClusterIP
IP Family Policy:  SingleStack
IP Families:       IPv4
IP:                10.96.0.1
IPs:               10.96.0.1
Port:              https  443/TCP
TargetPort:        6443/TCP
Endpoints:         172.18.0.2:6443
Session Affinity:  None
Events:            <none>

Utilize select when waiting for backoff to handle the scenario where the context is cancelled while waiting.

Add returning of error so any errors are propagated in the log files
samos123 requested a review from nstogner on February 4, 2024 20:20
nstogner (Contributor) left a comment:

LGTM!

alpe (Contributor) left a comment:

Good work 🏄
I added some minor comments

err := le.Start(ctx)
if err != nil {
setupLog.Error(err, "starting leader election")
os.Exit(1)
Contributor:

When the context is canceled on a graceful shutdown (SIGTERM), Start() would return the "cancel" error, which results in an Exit(1). This error handling strategy is used in other places in this file as well. Instead, we could wrap the context at L110 with ctx, cancel := context.WithCancel(ctrl.SetupSignalHandler()) and call cancel() instead. WDYT?

Contributor:

As the error already comes from the context, it may be simpler to do nothing and let the goroutine complete.

Contributor Author (samos123):

I didn't fully understand how to change it. Could you elaborate? Are you saying instead of calling os.Exit() just call cancel()?

I didn't see any other example of using cancel on the context in main.go. I tried to follow the same pattern as the other start functions.

Contributor:

The only error returned from Start() now comes from the context, thanks to your for loop. Calling cancel on the context again is not needed. I would not return an error from Start() and would let the goroutine complete; the main thread should handle the shutdown. I should have deleted my first comment, sorry for the confusion.
To elaborate for a different use case where other errors can happen, this is an example:

	ctx, cancel := context.WithCancel(ctrl.SetupSignalHandler())
	go func() {
		// in subroutine
		if err := doSomething(ctx); err != nil {
			// todo: log error
			cancel()
		}
	}()
	// main thread
	if err := doSomethingElse(ctx); err != nil {
		// todo: log error
		os.Exit(1)
	}

Please note that the error is not returned to the main thread.

Contributor Author (samos123):

Thanks for that example! I think I get it now. To me it feels cleaner to exit directly instead of cancelling a context that may later cause an exit. I favor being explicit and taking the exit path as soon as it makes sense.

@@ -79,7 +79,7 @@ func (s *scaler) compareScales(current, desired int32) {
s.desiredScale = desired
}

if s.currentScale == -1 || s.desiredScale == -1 {
if s.currentScale == -1 || s.desiredScale == -1 || desired == -1 {
Contributor:

good catch 👍

Contributor Author (samos123):

It doesn't actually solve issue #67, so I might revert this.

@@ -35,11 +36,11 @@ func NewElection(clientset kubernetes.Interface, id, namespace string) *Election
RetryPeriod: 2 * time.Second,
Callbacks: leaderelection.LeaderCallbacks{
OnStartedLeading: func(ctx context.Context) {
log.Println("Started leading")
log.Printf("%v started leading", id)
Contributor:

nit: %v does the job, but it is more common to use %s for a string or %q for a quoted string

Contributor Author (samos123):

I didn't know about %q; that might actually be nice here.
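
For illustration, a quick sketch of the difference between the verbs; the id value here is hypothetical.

package main

import "fmt"

func main() {
	id := "lingo-7d9f"                     // hypothetical leader ID
	fmt.Printf("%v started leading\n", id) // lingo-7d9f started leading
	fmt.Printf("%s started leading\n", id) // lingo-7d9f started leading
	fmt.Printf("%q started leading\n", id) // "lingo-7d9f" started leading
}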

pkg/leader/election.go (review thread resolved)
tests/e2e/test.sh (outdated, review thread resolved)
samos123 force-pushed the fix-leader-election-retry branch from 68f7c53 to 610ba31 on February 6, 2024 06:53
samos123 (Contributor Author) commented Feb 6, 2024

Any ideas why it's not scaling back to 0? https://github.com/substratusai/lingo/actions/runs/7795813038/job/21259370590?pr=66

samos123 (Contributor Author) commented Feb 6, 2024

I am giving up on fixing #67 in this PR. I think fixing #67 will first require a rewrite of, and unit tests for, the scaler, since I can't seem to get the hang of the current logic.

samos123 force-pushed the fix-leader-election-retry branch 2 times, most recently from 64fecb4 to 7310411 on February 7, 2024 04:17
samos123 merged commit 30e555d into main on Feb 7, 2024
7 of 8 checks passed
samos123 deleted the fix-leader-election-retry branch on February 7, 2024 04:34

Successfully merging this pull request may close these issues.

Lingo not leader even though there is only 1 replica after long pod uptime
3 participants