add leader election retry #66
@@ -10,6 +10,7 @@ import (
 	"k8s.io/client-go/kubernetes"
 	"k8s.io/client-go/tools/leaderelection"
 	"k8s.io/client-go/tools/leaderelection/resourcelock"
+	"k8s.io/client-go/util/flowcontrol"
 )

 func NewElection(clientset kubernetes.Interface, id, namespace string) *Election {
@@ -63,5 +64,23 @@ type Election struct {
 }

 func (le *Election) Start(ctx context.Context) {
-	leaderelection.RunOrDie(ctx, le.config)
+	backoff := flowcontrol.NewBackOff(1*time.Second, 15*time.Second)
+	const backoffID = "lingo-leader-election"
+	retryCount := 0
+	for {
+		select {
+		case <-ctx.Done():
+			return
+		default:
+			if retryCount > 0 {
+				backoff.Next(backoffID, backoff.Clock.Now())
+				delay := backoff.Get(backoffID)
+				log.Printf("Leader election failed, retrying in %v. RetryCount: %v", delay, retryCount+1)
+				<-time.After(delay)
Review thread:
- Author: Originally I had this in a select, but in favor of simplifying I removed it since I thought the higher-level select was good enough. I will move it back in, makes sense, thanks for validating this with a quick experiment!
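A minimal sketch of the cancellable wait described in the reply above: select on both the context and the backoff timer so a shutdown does not sit through the delay. The `waitOrDone` helper and package name are hypothetical, added here only for illustration; they are not part of this diff.

```go
package leaderexample

import (
	"context"
	"time"
)

// waitOrDone is a hypothetical helper sketching the suggested change:
// block for the backoff delay, but bail out early if ctx is cancelled.
// It returns false when the context ended before the delay elapsed.
func waitOrDone(ctx context.Context, delay time.Duration) bool {
	select {
	case <-ctx.Done():
		return false // stop retrying; the election loop should return
	case <-time.After(delay):
		return true // backoff elapsed, safe to re-run leader election
	}
}
```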
+			}
+			log.Printf("Starting leader election process. RetryCount: %v", retryCount+1)
+			leaderelection.RunOrDie(ctx, le.config)
Review thread:
- Reviewer: Is the idea that RunOrDie eventually exits if it loses connection to the API Server?
- Author: That's exactly what seems to end up happening. This Kong PR has more details: Kong/kubernetes-ingress-controller#578
- Author: I double-confirmed this by checking the logs and seeing how often it had to retry (re-run RunOrDie) when the apiserver is down for ~2 minutes.
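For context on why the loop can simply call RunOrDie again: RunOrDie blocks while the candidate is acquiring or holding the lease and returns once the lease can no longer be renewed, for example when the apiserver is unreachable past RenewDeadline. Below is a hedged sketch of a typical client-go leader-election setup; the lock name, lease durations, wrapper function, and package name are illustrative assumptions, since lingo's actual config is built in NewElection and is not part of this hunk.

```go
package leaderexample

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

// runElection is a sketch, not lingo's real code: it wires up a Lease-based
// lock and runs leader election until the lease is lost or ctx is cancelled.
func runElection(ctx context.Context, clientset kubernetes.Interface, id, namespace string) {
	lock := &resourcelock.LeaseLock{
		LeaseMeta:  metav1.ObjectMeta{Name: "lingo-leader", Namespace: namespace},
		Client:     clientset.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: id},
	}
	leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 15 * time.Second,
		RenewDeadline: 10 * time.Second,
		RetryPeriod:   2 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				// this replica is now the leader; start the work that
				// only one replica should do
			},
			OnStoppedLeading: func() {
				// the lease could not be renewed (e.g. apiserver unreachable
				// past RenewDeadline); RunOrDie returns right after this,
				// which is what the retry loop in this PR relies on
			},
		},
	})
}
```

Each time RunOrDie returns in such a setup, the new for loop grows the backoff and re-enters the election.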
+			retryCount++
+		}
+	}
 }
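A note on the backoff behavior used in Start: with NewBackOff(1s, 15s) the per-ID delay doubles on each Next call until it is clamped at the 15-second maximum, and the entry expires after enough idle time (about twice the maximum backoff), so a long healthy RunOrDie run effectively resets the delay before the next failure. A small hedged sketch of that progression (standalone illustration, not part of the diff; the package name and demoBackoff helper are assumptions):

```go
package leaderexample

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/flowcontrol"
)

// demoBackoff shows the delay progression of the Backoff used in Start:
// 1s, 2s, 4s, 8s, then clamped at 15s for further consecutive failures.
func demoBackoff() {
	backoff := flowcontrol.NewBackOff(1*time.Second, 15*time.Second)
	const id = "lingo-leader-election"
	for i := 1; i <= 6; i++ {
		backoff.Next(id, backoff.Clock.Now())
		fmt.Printf("retry %d would wait %v\n", i, backoff.Get(id))
	}
}
```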
@@ -6,6 +6,7 @@ set -xe
 HOST=127.0.0.1
 PORT=30080
 BASE_URL="http://$HOST:$PORT/v1"
+REPLICAS=${REPLICAS:-3}

 if kind get clusters | grep -q substratus-test; then
@@ -42,6 +43,9 @@ if ! kubectl get deployment lingo; then
   skaffold run
 fi

+kubectl patch deployment lingo --patch "{\"spec\": {\"replicas\": $REPLICAS}}"
+
 kubectl logs -f deployment/lingo &

 kubectl wait --for=condition=available --timeout=30s deployment/lingo
@@ -89,18 +93,32 @@ if [ "$replicas" -eq 1 ]; then | |
exit 1 | ||
fi | ||
|
||
echo "Waiting for deployment to scale down back to 0 within 2 minutes" | ||
# Verify that leader election works by forcing a 120 second apiserver outage | ||
KIND_NODE=$(kind get nodes --name=substratus-test) | ||
docker exec ${KIND_NODE} iptables -I INPUT -p tcp --dport 6443 -j DROP | ||
Review thread:
- Reviewer: Is …
- Author: Yep, and credit to ChatGPT for coming up with the high-level idea to simply block the traffic.
+sleep 120
+docker exec ${KIND_NODE} iptables -D INPUT -p tcp --dport 6443 -j DROP
+
+until kubectl get pods; do
Review thread:
- Reviewer: What is …
- Author: Even after the iptables rule is removed, it sometimes takes some time for Kubernetes to recover. This until statement waits until kubectl get pods starts working again, i.e. it only continues once kubectl get pods returns exit code 0.
echo "Waiting for apiserver to be back up" | ||
sleep 1 | ||
done | ||
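The until loop above is the shell-level version of "block until the apiserver answers again". For illustration only, here is a hedged Go sketch of the same check using client-go's discovery client; the waitForAPIServer helper and package name are assumptions and are not part of this PR.

```go
package leaderexample

import (
	"context"
	"log"
	"time"

	"k8s.io/client-go/kubernetes"
)

// waitForAPIServer polls the apiserver until a simple /version request
// succeeds or the context is cancelled, mirroring the script's until loop.
func waitForAPIServer(ctx context.Context, clientset kubernetes.Interface) error {
	for {
		if _, err := clientset.Discovery().ServerVersion(); err == nil {
			return nil // apiserver is reachable again
		}
		log.Println("Waiting for apiserver to be back up")
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(1 * time.Second):
		}
	}
}
```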
+
+# rerun kubectl logs because previous one got killed when apiserver was down
+kubectl logs --tail=500 -f deployment/lingo &
Review thread:
- Reviewer: I think you might want …
- Author: I didn't want to get all the logs, which may already have 1000+ entries, only the last 500. I tried it: the last 100 wasn't enough, but the last 500 was more than enough.

+echo "Waiting for deployment to scale down back to 0 within ~1 minute"
 for i in {1..15}; do
   if [ "$i" -eq 15 ]; then
     echo "Test failed: Expected 0 replica after not having requests for more than 1 minute, got $replicas"
     exit 1
   fi
-  replicas=$(kubectl get deployment stapi-minilm-l6-v2 -o jsonpath='{.spec.replicas}')
+  replicas=$(kubectl get deployment stapi-minilm-l6-v2 -o jsonpath='{.spec.replicas}' || true)
   if [ "$replicas" -eq 0 ]; then
     echo "Test passed: Expected 0 replica after not having requests for more than 1 minute"
     break
   fi
-  sleep 8
+  sleep 6
 done

 echo "Patching stapi deployment to sleep on startup"
Review thread:
- Reviewer: Neat, I didn't know about this library.