
add leader election retry #66

Merged Feb 7, 2024 · 37 commits. Changes shown below are from 1 commit.

Commits
75b2fd8
fix leader election retry
samos123 Feb 3, 2024
9a049f1
increase sleep from 20 to 30
samos123 Feb 3, 2024
d8fdf7f
add single replica e2e tests
samos123 Feb 3, 2024
23f6904
add more descriptive name to e2e replica test
samos123 Feb 3, 2024
fd8725a
stream all logs of lingo
samos123 Feb 3, 2024
e365dd7
add retry to leader election process
samos123 Feb 3, 2024
4b2acd8
increase apiserver unavailability from 30s to 60s
samos123 Feb 3, 2024
c149556
recreate context if context deadline exceeded
samos123 Feb 4, 2024
1dbb678
ensure apiserver outage is 2 minutes
samos123 Feb 4, 2024
ba46c40
add log to indicate context deadline exceeded
samos123 Feb 4, 2024
3f1d48d
kubectl sometimes returns errors when apiserver went away for too long
samos123 Feb 4, 2024
7339da4
make wait for backoff blocking
samos123 Feb 4, 2024
228920b
fix logging in e2e test after apiserver went down
samos123 Feb 4, 2024
2405e3b
remove unneeded check for context deadline exceeds
samos123 Feb 4, 2024
83c81ab
wait for apiserver to be ready
samos123 Feb 4, 2024
c711ae6
Add more logs in e2e test
samos123 Feb 4, 2024
b8fc6b7
address PR comments
samos123 Feb 4, 2024
8236150
simplify tests and run in parallel
samos123 Feb 4, 2024
c63fd0e
improve GHA job names
samos123 Feb 5, 2024
e7cedfa
simplify leader election retry
samos123 Feb 5, 2024
19a89af
increase wait time for scale back to 0 in e2e
samos123 Feb 5, 2024
6ad6446
fix #67 flapping scale from 0 to 1 to 0 to 1
samos123 Feb 5, 2024
38ff2ad
add hostname to leader log messages
samos123 Feb 5, 2024
d4a3947
maybe this fixes #67
samos123 Feb 5, 2024
b056bce
fix PR comment, thanks Alex!
samos123 Feb 6, 2024
ba9d1e0
improve string formatting
samos123 Feb 6, 2024
985831e
sleep for 20 sec after apiserver outage
samos123 Feb 6, 2024
610ba31
wait wasn't long enough
samos123 Feb 6, 2024
b7bd4cf
revert fix for #67 because it breaks scale down to 0
samos123 Feb 6, 2024
8bea0bf
fix #67 only the leader should scale
samos123 Feb 6, 2024
9552e9e
simplify fix for #67 and unit tests
samos123 Feb 6, 2024
d720074
print lingo logs of all replicas on failure
samos123 Feb 6, 2024
fae4bab
in some cases state is incorrect so just scale to desired scale
samos123 Feb 6, 2024
af888e6
Revert "in some cases state is incorrect so just scale to desired scale"
samos123 Feb 6, 2024
89b8a6e
Revert "simplify fix for #67 and unit tests"
samos123 Feb 6, 2024
3f6bab5
Revert "fix #67 only the leader should scale"
samos123 Feb 6, 2024
7310411
remove broken test
samos123 Feb 7, 2024
25 changes: 24 additions & 1 deletion pkg/leader/election.go
@@ -10,6 +10,7 @@ import (
 	"k8s.io/client-go/kubernetes"
 	"k8s.io/client-go/tools/leaderelection"
 	"k8s.io/client-go/tools/leaderelection/resourcelock"
+	"k8s.io/client-go/util/flowcontrol"
 )

 func NewElection(clientset kubernetes.Interface, id, namespace string) *Election {

@@ -63,5 +64,27 @@ type Election struct {
 }

 func (le *Election) Start(ctx context.Context) {
-	leaderelection.RunOrDie(ctx, le.config)
+	backoff := flowcontrol.NewBackOff(1*time.Second, 15*time.Second)
+	const backoffID = "lingo-leader-election"
+	retryCount := 0
+	for {
+		select {
+		case <-ctx.Done():
+			return
+		default:
+			if retryCount > 0 {
+				backoff.Next(backoffID, backoff.Clock.Now())
Contributor commented:
Neat, I didn't know about this library.
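For readers unfamiliar with it, here is a minimal standalone sketch of how client-go's flowcontrol.Backoff behaves with the same 1s/15s parameters used in this diff; the ID string and loop are illustrative, and the default constructor adds no jitter:

package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/flowcontrol"
)

func main() {
	backoff := flowcontrol.NewBackOff(1*time.Second, 15*time.Second)
	const id = "demo" // per-ID backoff entry; any stable string works

	for i := 1; i <= 6; i++ {
		// Next initializes the entry on the first call, then doubles the
		// delay on each subsequent call, capping it at the 15s maximum.
		backoff.Next(id, backoff.Clock.Now())
		fmt.Printf("attempt %d: next delay %v\n", i, backoff.Get(id))
	}
	// Prints delays of 1s, 2s, 4s, 8s, 15s, 15s.
}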

+				delay := backoff.Get(backoffID)
+				log.Printf("Leader election failed, retrying in %v. RetryCount: %v", delay, retryCount+1)
+				select {
+				case <-time.After(delay):
+				case <-ctx.Done():
+					return
+				}
+			}
+			log.Printf("Starting leader election process. RetryCount: %v", retryCount+1)
+			leaderelection.RunOrDie(ctx, le.config)
Contributor commented:
Is the idea that RunOrDie eventually exits if it loses connection to the API Server?

samos123 (author) replied:
That's exactly what seems to end up happening. This Kong PR has more details: Kong/kubernetes-ingress-controller#578

samos123 (author) replied on Feb 4, 2024:
I double-confirmed this by checking the logs and seeing how often it had to retry (re-run RunOrDie) while the apiserver was down for ~2 minutes.

samos123 marked this conversation as resolved.
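Since the NewElection body is collapsed above, here is a hedged sketch (illustrative names and durations, not the PR's actual config) of the general shape of what RunOrDie is given. RunOrDie blocks while acquiring or holding the lease and returns once the lease cannot be renewed within RenewDeadline, e.g. during an apiserver outage, which is exactly what the retry loop above relies on:

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func runElection(ctx context.Context, clientset kubernetes.Interface, id, namespace string) {
	lock := &resourcelock.LeaseLock{
		LeaseMeta:  metav1.ObjectMeta{Name: "lingo", Namespace: namespace}, // lease name is illustrative
		Client:     clientset.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: id},
	}
	leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 15 * time.Second,
		RenewDeadline: 10 * time.Second, // renewal failing for this long makes RunOrDie return
		RetryPeriod:   2 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) { /* leader-only work starts here */ },
			OnStoppedLeading: func() { /* lease lost; RunOrDie returns shortly after */ },
		},
	})
}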
+			retryCount++
+		}
+	}
 }
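Since Start now blocks in a retry loop, a caller would typically run it in a goroutine with a cancellable context. A hypothetical wiring sketch, not taken from this PR (the hostname identity and namespace variable are assumptions; imports of context, os, and os/signal assumed):

ctx, stop := signal.NotifyContext(context.Background(), os.Interrupt)
defer stop()

hostname, _ := os.Hostname() // assumed identity for the lock
election := leader.NewElection(clientset, hostname, namespace)
go election.Start(ctx) // re-runs RunOrDie with backoff until ctx is cancelled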