Multi-master support for K3s Ansible playbook? #32
This role does this:
Would love to have this supported by Rancher out of the box. Any plans on implementing this feature in this repository?
Now that internal HA is implemented, even if only experimental, does this playbook automatically install and set up a cluster to utilise this feature already, or if not, can it be updated to support this?
I've implemented the base mechanism in PR #97. It sets up HA with the embedded etcd database mode.
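For context, this is roughly what an embedded-etcd HA bootstrap looks like at the k3s level; a hedged sketch of the commands the playbook would be automating, with hostnames and the token as placeholders rather than values taken from the PR:

```bash
# First master: initialise the embedded etcd cluster (k3s v1.19+).
curl -sfL https://get.k3s.io | sh -s - server --cluster-init

# Additional masters: join through any running master, using the token the
# first master wrote to /var/lib/rancher/k3s/server/node-token.
curl -sfL https://get.k3s.io | K3S_TOKEN=<token-from-first-master> \
  sh -s - server --server https://<first-master>:6443
```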
I'd personally love to, but my pitiful single-node cluster is clearly not going to be much help! I'd love to know if someone else can confirm it though. This would be so awesome.
Maybe it's time to create a proper Vagrantfile to test it, or finish the work in #52 to test it inside Docker.
Would love this as well.
How is this going? I see the PR looks pretty ready to go. Did it work nicely in testing?
Actually it's working pretty well; I use it to deploy my cluster at home. The only thing missing is a variable that stores the load-balanced endpoint that slaves need to use to talk to the API server. I will work on this tomorrow; this has been in a stalled state for far too long.
Sounds brilliant.
@St0rmingBr4in just getting around to giving this a go (as my new RPi 4s arrived in the mail!) and I notice you put a comment in your commit stating
Which I'm not sure I fully understand. In the official K3s docs it appears adding a highly available second and third master is just a matter of joining them to the first, with no mention of a highly available API endpoint. Is there something I'm missing?
For HA to work, you need an IP living on a load balancer (or some other front-facing IP solution, like round-robin DNS, which is not recommended, just an example), so that if node 1 fails, the "HA IP" remains stable and working. If node 1 is also the primary IP, when it fails there is no way to talk to the others at that IP.
Is this something that might need to be added to those K3s docs to make it more clear? I can't seem to find mention of this being a concern (as it seems like a pretty big one). I'm nowhere near talented enough to go digging through their code, but I just thought there might be some sort of quorum system where all the nodes designated as masters might share their IPs, and use heartbeats to figure out if one is down. Am I way off base thinking that? My concern is that going to the point of using a load balancer might introduce yet another single point of failure, unless I then run the load balancer in HA and so on.
Hi, there are actually multiple ways to configure masters and slaves in HA mode. Here I am implementing the "External, no internal" endpoint type as described in kubespray, since it is the easiest to set up. This setup requires the user to have an external load balancer running so that slaves and kubectl users have an HA way of talking to the apiserver. In the future we might also want to give this playbook the ability to set up an HAProxy or nginx running on each slave, ensuring we can talk to the apiserver through an HA endpoint by talking to localhost instead (the "Local LB" endpoint type as described in kubespray).
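As a rough illustration of the "External, no internal" idea (the hostname below is a placeholder for whatever load balancer or VIP the user provides): agents and kubectl point at one stable endpoint, and you can sanity-check that it stays reachable while a master is down.

```bash
# Check the load-balanced apiserver endpoint; even an HTTP 401/403 response
# proves the endpoint is reachable through the LB.
curl -sk https://k3s-api.example.com:6443/healthz

# Agents then register against the stable endpoint rather than a single
# master's IP, e.g.:
#   k3s agent --server https://k3s-api.example.com:6443 --token <token>
```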
Ah I see, so it's the workers who need a load-balanced endpoint. If hypothetically you just had three masters and no workers, no load-balanced endpoint would be required, right?
Right.
Cool, thanks for that. Sorry for the noob questions; hopefully if someone else is as confused they'll be able to find this clarification.
@St0rmingBr4in I gave it a try with your PR #97, with three master nodes and no workers. I can't get past the following error:
And it tries all 20 times and fails. As far as I can tell the k3s.service doesn't get started, and therefore k3s isn't running on the first master node. I then tried running the master branch playbook first and then the PR version (to see if that might leave the service running and take over) but it unsurprisingly failed. I tried and reset everything a few times to be sure, and it seems to consistently fail unfortunately.
There is a typo that slipped in.
Unfortunately still getting the same issue. When I run the main branch the service starts successfully, but when I run PR #97 I get the following error when running
Edit: and to clarify, I made sure it was the version with the typo fixed.
It's normal: for the first initialisation of the cluster, this change launches k3s in the k3s-init service. Could you send the logs of each of the k3s-init services? Also, if you already set up a non-HA cluster on the nodes, you need to run the reset playbook so that you start from a clean state.
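One way to pull those logs from every master in one go, assuming the same inventory the playbook uses (group name and inventory path are placeholders):

```bash
# Grab the last 200 lines of the k3s-init unit from each master.
ansible master -i inventory/hosts.ini -b \
  -m shell -a "journalctl -u k3s-init --no-pager | tail -n 200"
```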
Absolutely, I'm definitely running a reset each time. Just to confirm, would that be
Yes, you can also use
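(The exact commands referenced above were trimmed out of this thread; as an assumption-laden sketch, the cleanup usually looks something like the following, using the repository's reset playbook or the scripts the k3s installer drops on each node.)

```bash
# Reset every node via the playbook (inventory path is a placeholder).
ansible-playbook reset.yml -i inventory/hosts.ini

# Or directly on a node:
/usr/local/bin/k3s-uninstall.sh        # on servers
/usr/local/bin/k3s-agent-uninstall.sh  # on agents
```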
I've attached the last 200 lines of each node's output. 1, 2 and 3 are in the order they appear in the hosts file. I initially thought the issue might have been the use of FQDNs in the hosts.ini, so I reverted to IPs, and it does the same thing. Let me know if you need more logs.
I'm also using FQDNs, so that is not the problem. I just retested on Debian Buster with v1.20.4+k3s1 and v1.19.5+k3s1, and I can confirm it works perfectly for me. Maybe it would help to have the full logs. Do you have time to give this another try? If you want, we could also take a look at this together.
@St0rmingBr4in a clean install of Ubuntu gives me this error (the 20 retries and fail) on each master node I am trying to join, with 0 worker nodes (if that can help)
@mattthhdp Does k3s work using a one-node cluster? The error you are getting seems unrelated. The error in question is defined here: https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/kuberuntime/kuberuntime_gc.go#L360. Reading kubernetes/kubernetes#63336 gives a bit more information. Is containerd working on your machine?
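A few hedged checks for that, run on the failing node (k3s bundles crictl, which talks to its embedded containerd):

```bash
systemctl status k3s-init          # is the init unit actually running?
sudo k3s crictl info               # does the embedded containerd answer?
sudo k3s crictl ps -a              # were any containers created at all?
journalctl -u k3s-init --no-pager | grep -i containerd | tail -n 50
```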
So with only 1 node (rancher-01):
In the Ansible VM I got
and after a few seconds on rancher-01 I got
and finally, with the command journalctl -ef -u k3s-init, if that can help. In /etc/rancher/k3s/k3s.yaml:
Hope you don't need anything else. Edit:
@mattthhdp the issue you are having is not related to PR #97, since it also does not work using a one-node cluster. I think your error is related to this one: k3d-io/k3d#110
@St0rmingBr4in any idea why the rancher script
I am not really sure. Did you run the reset playbook or run the uninstall script between runs?
@St0rmingBr4in I did. I'm a little short on time today. I will try to recreate my 3 VMs (clean Ubuntu install without any packages) and test again tomorrow.
So, with 3 clean Ubuntu VMs: while the script is in the "waiting 20 times for nodes to join the cluster" step, I cannot issue kubectl commands.
@mattthhdp I had similar error messages and came to this thread. There were a couple of things wrong in my case that might also apply to you. First, I ran a reset on the cluster, but one node didn't reset all of the way and kept trying to connect to the master node. I ran ansible-playbook reset.yml on the old/bad node that was flooding the logs with handshake errors. Example of handshake error:
Once it was uninstalled, the logs cleaned up a bit, but I still had the permission-denied/TLS error you were getting as well (near the end of the log). I then manually chown'ed the file it was complaining about on the master node, and was able to access the cluster again. I am new to Ansible, and not sure where this change would go within the playbook. I am also new to k3s, so I am not sure if this is a good or bad idea; however, it unblocked me.
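For anyone landing here, this is roughly what that workaround amounts to; the exact file is an assumption (k3s writes its admin kubeconfig to /etc/rancher/k3s/k3s.yaml, readable only by root by default):

```bash
# Make the kubeconfig readable by the current user (the manual chown above).
sudo chown "$USER" /etc/rancher/k3s/k3s.yaml

# A less invasive alternative that avoids changing ownership:
sudo k3s kubectl get nodes
```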
Super late response. I never got a chance to retry this; I've started relying on the cluster in single-master form and started running things "in production" (not really in production, just in my home). But back when I was trying this, I stumbled upon some info suggesting my issue may have been related to running the OS on SD cards, which etcd finds too slow. Does that sound like it could be the issue?
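For reference, etcd is very sensitive to fsync latency, and the etcd tuning guides suggest an fio check along these lines (a hedged sketch; run it against the filesystem k3s actually stores its data on, and treat the directory name as a placeholder):

```bash
# Run from a directory on the disk that holds /var/lib/rancher.
mkdir -p fio-test
fio --rw=write --ioengine=sync --fdatasync=1 \
    --directory=fio-test --size=22m --bs=2300 --name=etcd-disk-check
# Look at the fdatasync percentiles in the output; etcd generally wants
# the 99th percentile under roughly 10ms. SD cards often miss this by a lot.
rm -rf fio-test
```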
Hello, new to k3s and Ansible here. I set up a cluster with 3 server and 2 agent nodes. If the primary server (the one where cluster-init is executed) goes down, the agents can still communicate with the two other remaining server nodes. Here are the logs from the agents:
I think agent nodes have some kind of load balancer with health checks for all server nodes and automatic failover. My question is: is this expected behavior? If so, do we still need an external LB for the API server?
@yehtetmaungmaung K3s comes with a service load balancer; see https://docs.k3s.io/networking#service-load-balancer for information. But what this does is expose ingress traffic on every node. So say you have nginx running in pods on node A, listening on port 80. What the K3s-provided balancer does is allow you to send nginx traffic to port 80 on node B, and it will automatically route that traffic to the pods on node A. It basically fulfills a role that many cloud-provided LBs (GKE, EKS) offer. What this isn't is an external load balancer: it does not load balance ingress traffic across nodes, and it doesn't replace the functionality of providing a single registration point for servers as described in https://docs.k3s.io/datastore/cluster-loadbalancer
This is something I think we might be able to get configured in the Ansible playbook, but I didn't see (at a glance at least) whether it was something supported by this playbook yet; namely, a multi-master configuration with an external database: High Availability with an External DB.
In this playbook's case, maybe it would delegate the task of configuring an external database cluster to the user (e.g. use a separate Ansible playbook that builds an RDS cluster in Amazon, or a separate two- or three-node DB cluster on some other bare-metal servers alongside the K3s cluster), but then how could we make it so this playbook supports the multi-master configuration described in the docs page linked above?
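For what it's worth, a hedged sketch of the server-side flag such support would end up driving: k3s accepts an external datastore via --datastore-endpoint, so each master would point at the user-provided database (the DSN and token below are placeholders):

```bash
# Every master joins the same external datastore instead of embedded etcd.
curl -sfL https://get.k3s.io | sh -s - server \
  --token <shared-token> \
  --datastore-endpoint="mysql://user:pass@tcp(db.example.com:3306)/k3s"
```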