Support HA mode with embedded DB #97

Closed
wants to merge 9 commits into from

Collaborator

@St0rmingBr4in St0rmingBr4in commented Oct 28, 2020

This enables initializing a cluster in HA mode with an embedded DB.
https://rancher.com/docs/k3s/latest/en/installation/ha-embedded/

When multiple masters are specified in the master group, k3s-ansible will add the necessary flags during the initialization phase (i.e. --cluster-init and --server)
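As a rough illustration of the flag selection described above, here is a minimal Python sketch. It is not the playbook's actual template logic: the helper name `server_init_args`, the example host names, and the default port 6443 are assumptions made for the example, and the single-master branch reflects one reading of "when multiple masters are specified".

```python
from typing import List


def server_init_args(masters: List[str], host: str, port: int = 6443) -> List[str]:
    """Return the extra `k3s server` flags for `host` during initialization.

    Hypothetical helper: the first host in the master group bootstraps the
    embedded datastore, and every other master joins it.
    """
    if host not in masters:
        raise ValueError(f"{host!r} is not in the master group")
    if len(masters) == 1:
        # Assumption: a single master needs no HA flags.
        return []
    if host == masters[0]:
        # First master creates the embedded cluster.
        return ["--cluster-init"]
    # Remaining masters join through the first master.
    return ["--server", f"https://{masters[0]}:{port}"]
```

With three hypothetical masters `m0`, `m1`, `m2`, only `m0` would get `--cluster-init`, while `m1` and `m2` would get `--server https://m0:6443`.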

For the embedded HA mode to work the k3s version must be >= v1.19.1
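The version requirement could be checked before a run with something like the following Python sketch. The helper names are hypothetical, and the parser assumes version strings of the form `v1.19.1+k3s1`; it ignores any suffix after the patch number, so pre-release tags are not distinguished from the final release.

```python
import re
from typing import Tuple

# Minimum version stated in this PR description.
MIN_EMBEDDED_HA = (1, 19, 1)


def parse_k3s_version(version: str) -> Tuple[int, int, int]:
    """Extract (major, minor, patch) from a string such as 'v1.19.1+k3s1'."""
    match = re.match(r"^v?(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized k3s version: {version!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return major, minor, patch


def supports_embedded_ha(version: str) -> bool:
    """Return True if `version` is at least v1.19.1."""
    return parse_k3s_version(version) >= MIN_EMBEDDED_HA
```

Comparing integer tuples avoids the pitfall of comparing version strings lexically, where "1.9" would sort after "1.19".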

Closes #32

@St0rmingBr4in
Collaborator Author

Right now, nodes register through a non-HA endpoint. Either this playbook needs to create an HA endpoint, as kubespray does when loadbalancer_apiserver_localhost is used, or the user must provide an external load balancer endpoint. https://github.com/kubernetes-sigs/kubespray/blob/master/docs/ha-mode.md

@St0rmingBr4in St0rmingBr4in force-pushed the k3s-ha branch 2 times, most recently from ffa42b8 to 50ee597 Compare December 12, 2020 21:27
@St0rmingBr4in St0rmingBr4in changed the title Support Fedora & Support HA mode with embedded DB Support HA mode with embedded DB Feb 14, 2021
@St0rmingBr4in St0rmingBr4in force-pushed the k3s-ha branch 5 times, most recently from fd61649 to b2a9f7c Compare February 14, 2021 21:38
@GideonStowell

Is there anything in particular blocking this from being merged? I am finding that we need the HA functionality in our deployments. Let me know if there is anything I can do to help.

@St0rmingBr4in
Collaborator Author

> Is there anything in particular blocking this from being merged? I am finding that we need the HA functionality in our deployments. Let me know if there is anything I can do to help.

I think it's ready to merge, but if you want to try it first I would be very interested in your feedback :)

@GideonStowell

I can give it a go. I invoke it as a dependency in an Ansible role, so I won't be testing it as is, but it could still be useful to see.

@St0rmingBr4in
Collaborator Author

Okay 👌 Let me know how it goes :)

edenreich added a commit to edenreich/k3s-ansible that referenced this pull request Feb 23, 2021
change variable name(tbm after k3s-io#97)
@narkaTee

narkaTee commented Mar 5, 2021

I tested the current version and it works fine when running the first time.
Resetting the cluster with reset.yml and rerunning the playbook also works fine.

It does break when re-running the playbook after the cluster has been successfully set up.
The "Verify that all nodes actually joined" task fails:

TASK [k3s/master : Verify that all nodes actually joined] ***********************************************
fatal: [master-0]: FAILED! => {"msg": "The conditional check 'nodes.rc == 0 and ((nodes.stdout | from_json)['items'] | json_query('[*].metadata.labels.\"node-role.kubernetes.io/master\"') | count) == (groups['master'] | length)' failed. The error was: Expecting value: line 1 column 1 (char 0)"}
fatal: [master-2]: FAILED! => {"msg": "The conditional check 'nodes.rc == 0 and ((nodes.stdout | from_json)['items'] | json_query('[*].metadata.labels.\"node-role.kubernetes.io/master\"') | count) == (groups['master'] | length)' failed. The error was: Expecting value: line 1 column 1 (char 0)"}
fatal: [master-1]: FAILED! => {"msg": "The conditional check 'nodes.rc == 0 and ((nodes.stdout | from_json)['items'] | json_query('[*].metadata.labels.\"node-role.kubernetes.io/master\"') | count) == (groups['master'] | length)' failed. The error was: Expecting value: line 1 column 1 (char 0)"}

I think the error is caused by the node(s) without the node-role.kubernetes.io/master label.
Any idea how to properly debug the until expression?
I tried the JMESPath expression on https://jmespath.org/ with the output from the API, but the website prints the correct result.
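The message "Expecting value: line 1 column 1 (char 0)" is what Python's JSON decoder reports when it is handed an empty string (or text that does not start with valid JSON). One possibility, which is an assumption and not something confirmed in this thread, is that nodes.stdout held no JSON when from_json ran. The sketch below reproduces the message and shows a guarded version of the same count in plain Python; `count_master_nodes` and the sample data are hypothetical.

```python
import json


def count_master_nodes(stdout: str) -> int:
    """Count nodes carrying the node-role.kubernetes.io/master label.

    Returns 0 instead of raising when stdout is empty or not valid JSON.
    """
    try:
        items = json.loads(stdout)["items"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return 0
    label = "node-role.kubernetes.io/master"
    return sum(
        1 for node in items
        if label in node.get("metadata", {}).get("labels", {})
    )


# Reproduce the decoder message seen in the failed task.
try:
    json.loads("")
except json.JSONDecodeError as err:
    message = str(err)

# Hypothetical API output: one labeled master and one unlabeled node.
sample = json.dumps({"items": [
    {"metadata": {"labels": {"node-role.kubernetes.io/master": "true"}}},
    {"metadata": {"labels": {}}},
]})
```

If this reading is right, a node without the label would only lower the count, while an empty stdout would raise before the count is ever evaluated.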

@St0rmingBr4in
Collaborator Author

Yes, the playbook tries to verify that all masters joined the cluster. I suspect that each of them is creating its own one-node cluster, but I don't know why. I've tried it inside Vagrant VMs, but I don't yet understand where the problem is.

@St0rmingBr4in St0rmingBr4in requested a review from itwars March 14, 2021 12:48
@St0rmingBr4in
Collaborator Author

@itwars I added you as a reviewer. I think the change is ready to merge now; if you could take a look I would greatly appreciate it.

@mattthhdp

On 2 fresh Ubuntu hosts (for testing purposes with the embedded DB), I hit the 20 retries:

 fatal: [rancher-02]: FAILED! => {"attempts": 20, "changed": false, "cmd": ["k3s", "kubectl", "get", "nodes", "-l", "node-role.kubernetes.io/master=true", "-o=jsonpath={.items[*].metadata.name}"], "delta": "0:00:00.174167", "end": "2021-03-16 23:18:28.384937", "rc": 0, "start": "2021-03-16 23:18:28.210770", "stderr": "", "stderr_lines": [], "stdout": "rancher-02", "stdout_lines": ["rancher-02"]}
fatal: [rancher-01]: FAILED! => {"attempts": 20, "changed": false, "cmd": ["k3s", "kubectl", "get", "nodes", "-l", "node-role.kubernetes.io/master=true", "-o=jsonpath={.items[*].metadata.name}"], "delta": "0:00:00.171841", "end": "2021-03-16 23:18:28.568724", "rc": 0, "start": "2021-03-16 23:18:28.396883", "stderr": "", "stderr_lines": [], "stdout": "rancher-01", "stdout_lines": ["rancher-01"]}

Any logs or tests you want me to try?
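In the output above, the command succeeds (rc 0) on both hosts, but each host lists only itself as a master. That would be consistent with each server having formed its own one-node cluster, as suspected earlier in this thread. A minimal Python sketch of that comparison follows; the helper names are hypothetical, and the check is an illustration, not the playbook's actual until condition.

```python
from typing import Dict, List


def masters_seen(stdout: str) -> List[str]:
    """Split the space-separated node names printed by the jsonpath query."""
    return stdout.split()


def all_joined(stdout: str, expected: List[str]) -> bool:
    """Return True when the reported masters match the expected group."""
    return sorted(masters_seen(stdout)) == sorted(expected)


# stdout values taken from the two failed hosts above.
reported: Dict[str, str] = {
    "rancher-01": "rancher-01",
    "rancher-02": "rancher-02",
}
expected = ["rancher-01", "rancher-02"]
results = {host: all_joined(out, expected) for host, out in reported.items()}
```

Here `results` is False for both hosts, since neither sees the other master.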

@St0rmingBr4in
Collaborator Author

St0rmingBr4in commented Mar 18, 2021

> Any logs or tests you want me to try?

To debug, you need the logs of the k3s-init service. Running journalctl -ef -u k3s-init while the playbook is running can show why it is failing. Also, after a failed run, it is a good idea to run the reset playbook to start fresh.

I will add a quick message to state how to debug errors if the verify task fails.

@mattthhdp

I'm cross-posting the result here from #32.
Sorry for my bad English, I'm trying my best! I hope everything is clear.

Mar 17 23:15:12 Rancher-01 k3s[9900]: I0317 23:15:12.325203 9900 reconciler.go:319] Volume detached for volume "helm-traefik-token-fccln" (UniqueName: "kubernetes.io/secret/14abc0b9-dd4c-4068-832f-2e0c14dfebf2-helm-traefik-token-fccln") on node "rancher-01" DevicePath ""
Mar 17 23:15:12 Rancher-01 k3s[9900]: I0317 23:15:12.325209 9900 reconciler.go:319] Volume detached for volume "values" (UniqueName: "kubernetes.io/configmap/14abc0b9-dd4c-4068-832f-2e0c14dfebf2-values") on node "rancher-01" DevicePath ""
Mar 17 23:15:13 Rancher-01 k3s[9900]: W0317 23:15:13.044943 9900 pod_container_deletor.go:79] Container "2cded66bbca92e6097862fe9aa41fbc37b2b9ca6714d7032c6cf809b48eea7b5" not found in pod's containers
Mar 17 23:15:46 Rancher-01 k3s[9900]: E0317 23:15:46.740228 9900 remote_runtime.go:332] ContainerStatus "9ee5cb4ef55ce4ec5eccd4daa370c3fae32f9a272be3a8bf2f69bef601935113" from runtime service failed: rpc error: code = NotFound desc = an error occurred when try to find container "9ee5cb4ef55ce4ec5eccd4daa370c3fae32f9a272be3a8bf2f69bef601935113": not found
Mar 17 23:15:46 Rancher-01 k3s[9900]: I0317 23:15:46.740251 9900 kuberuntime_gc.go:360] Error getting ContainerStatus for containerID "9ee5cb4ef55ce4ec5eccd4daa370c3fae32f9a272be3a8bf2f69bef601935113": rpc error: code = NotFound desc = an error occurred when try to find container "9ee5cb4ef55ce4ec5eccd4daa370c3fae32f9a272be3a8bf2f69bef601935113": not found
Mar 17 23:15:46 Rancher-01 k3s[9900]: E0317 23:15:46.740607 9900 remote_runtime.go:332] ContainerStatus "31dfe6d9433d6514f0ae493c96bf676766d50d662f06756a3a3bcbe442299c92" from runtime service failed: rpc error: code = NotFound desc = an error occurred when try to find container "31dfe6d9433d6514f0ae493c96bf676766d50d662f06756a3a3bcbe442299c92": not found
Mar 17 23:15:46 Rancher-01 k3s[9900]: I0317 23:15:46.740624 9900 kuberuntime_gc.go:360] Error getting ContainerStatus for containerID "31dfe6d9433d6514f0ae493c96bf676766d50d662f06756a3a3bcbe442299c92": rpc error: code = NotFound desc = an error occurred when try to find container "31dfe6d9433d6514f0ae493c96bf676766d50d662f06756a3a3bcbe442299c92": not found
Mar 17 23:15:46 Rancher-01 k3s[9900]: E0317 23:15:46.740862 9900 remote_runtime.go:332] ContainerStatus "2b5aacb88408e7fccb2d20ea5329940868c84c1b7001e776305db8921783e270" from runtime service failed: rpc error: code = NotFound desc = an error occurred when try to find container "2b5aacb88408e7fccb2d20ea5329940868c84c1b7001e776305db8921783e270": not found
Mar 17 23:15:46 Rancher-01 k3s[9900]: I0317 23:15:46.740875 9900 kuberuntime_gc.go:360] Error getting ContainerStatus for containerID "2b5aacb88408e7fccb2d20ea5329940868c84c1b7001e776305db8921783e270": rpc error: code = NotFound desc = an error occurred when try to find container "2b5aacb88408e7fccb2d20ea5329940868c84c1b7001e776305db8921783e270": not found
Mar 17 23:15:46 Rancher-01 k3s[9900]: E0317 23:15:46.741108 9900 remote_runtime.go:332] ContainerStatus "c106eef3ec36f50eacb77657bc585c0220470e85fc3af835197c7d8b2b6f155e" from runtime service failed: rpc error: code = NotFound desc = an error occurred when try to find container "c106eef3ec36f50eacb77657bc585c0220470e85fc3af835197c7d8b2b6f155e": not found
Mar 17 23:15:46 Rancher-01 k3s[9900]: I0317 23:15:46.741121 9900 kuberuntime_gc.go:360] Error getting ContainerStatus for containerID "c106eef3ec36f50eacb77657bc585c0220470e85fc3af835197c7d8b2b6f155e": rpc error: code = NotFound desc = an error occurred when try to find container "c106eef3ec36f50eacb77657bc585c0220470e85fc3af835197c7d8b2b6f155e": not found
Mar 17 23:15:46 Rancher-01 k3s[9900]: E0317 23:15:46.741329 9900 remote_runtime.go:332] ContainerStatus "1e4c82f83883e53eb46661ae88af4d1e7178b5b092c9f139064fec8f956c9534" from runtime service failed: rpc error: code = NotFound desc = an error occurred when try to find container "1e4c82f83883e53eb46661ae88af4d1e7178b5b092c9f139064fec8f956c9534": not found
Mar 17 23:15:46 Rancher-01 k3s[9900]: I0317 23:15:46.741350 9900 kuberuntime_gc.go:360] Error getting ContainerStatus for containerID "1e4c82f83883e53eb46661ae88af4d1e7178b5b092c9f139064fec8f956c9534": rpc error: code = NotFound desc = an error occurred when try to find container "1e4c82f83883e53eb46661ae88af4d1e7178b5b092c9f139064fec8f956c9534": not found
Mar 17 23:15:46 Rancher-01 k3s[9900]: E0317 23:15:46.741590 9900 remote_runtime.go:332] ContainerStatus "0ddb717776d77e61a0f748e49024b20a9d80f86ac837aadb20c8233c7582514d" from runtime service failed: rpc error: code = NotFound desc = an error occurred when try to find container "0ddb717776d77e61a0f748e49024b20a9d80f86ac837aadb20c8233c7582514d": not found
Mar 17 23:15:46 Rancher-01 k3s[9900]: I0317 23:15:46.741610 9900 kuberuntime_gc.go:360] Error getting ContainerStatus for containerID "0ddb717776d77e61a0f748e49024b20a9d80f86ac837aadb20c8233c7582514d": rpc error: code = NotFound desc = an error occurred when try to find container "0ddb717776d77e61a0f748e49024b20a9d80f86ac837aadb20c8233c7582514d": not found
Mar 17 23:15:46 Rancher-01 k3s[9900]: E0317 23:15:46.741889 9900 remote_runtime.go:332] ContainerStatus "172ce2cb8a31d71725f8e92713ff8a4b51706c8f90ad60f24aa0e2fc3cb06b1d" from runtime service failed: rpc error: code = NotFound desc = an error occurred when try to find container "172ce2cb8a31d71725f8e92713ff8a4b51706c8f90ad60f24aa0e2fc3cb06b1d": not found
Mar 17 23:15:46 Rancher-01 k3s[9900]: I0317 23:15:46.741904 9900 kuberuntime_gc.go:360] Error getting ContainerStatus for containerID "172ce2cb8a31d71725f8e92713ff8a4b51706c8f90ad60f24aa0e2fc3cb06b1d": rpc error: code = NotFound desc = an error occurred when try to find container "172ce2cb8a31d71725f8e92713ff8a4b51706c8f90ad60f24aa0e2fc3cb06b1d": not found
Mar 17 23:18:05 Rancher-01 systemd[1]: Stopping /usr/local/bin/k3s server --cluster-init...
Mar 17 23:18:05 Rancher-01 k3s[9900]: I0317 23:18:05.436838 9900 network_policy_controller.go:157] Shutting down network policies full sync goroutine
Mar 17 23:18:05 Rancher-01 k3s[9900]: {"level":"warn","ts":"2021-03-17T23:18:05.444Z","caller":"grpclog/grpclog.go:60","msg":"grpc: addrConn.createTransport failed to connect to {/run/k3s/containerd/containerd.sock 0 }. Err :connection error: desc = "transport: Error while dialing dial unix /run/k3s/containerd/containerd.sock: connect: no such file or directory". Reconnecting..."}

@St0rmingBr4in
Collaborator Author

@mattthhdp Normally k3s-ansible should be able to entirely replace the https://get.k3s.io script.

Tell me if I'm wrong but to sum up:

  • k3s-ansible does not work on your VMs (both with and without the multi-master support changes).
  • Running the https://get.k3s.io script works to provision a one-node cluster.
  • You have not yet tried running a multi-master cluster with the https://get.k3s.io script.

If that's true, then I think we should create another issue to keep track of your problem. Since k3s-ansible does not work even without the changes in this PR, we should fix your problem separately from this PR. Would you mind creating a new issue summarizing what you did, with the full logs?


@orzen orzen left a comment


I did not realize that this playbook was limited to only one master until I had been working with it for a couple of weeks.

I would be happy to see this patch being merged!

cmd: "systemd-run -p RestartSec=2 \
-p Restart=on-failure \
--unit=k3s-init \
k3s server {{ server_init_args }}"

Doesn't k3s server require {{ server_init_args }} in order to connect to the "cluster master"? If so, wouldn't it be more suitable to add {{ server_init_args }} to k3s/master/template/k3s.service.j2?


No, the {{ server_init_args }} are only used for the initial startup to create the embedded etcd cluster. After that the startup works the same as always.
Take a look at the defaults file and the rancher docs for reference.

At least that's my understanding of this.


I thought it was a bit unclear; however, a k3s maintainer clarified that these are only required for the initial setup.

@jon-stumpf
Contributor

According to this article, CentOS 7 can now run systemd 231. Has anyone tried this?

@3deep5me

3deep5me commented Apr 9, 2022

Hi guys,

any news here? I am new to GitHub and stuff, but it seems that @itwars has to approve the PR?
Are there any chances of merging the feature into master soon?
I didn't see the branch, so I spent several hours trying to modify the playbooks to do an install with an HA control plane.

@vasanthaganeshk

Hello guys, any plans to merge this?

@bennnnnnnn

Also hoping for this to get merged

@kingphil

kingphil commented Jun 17, 2022

I have a workaround for this: run the playbook twice, with the first invocation limited to master[0].

Additionally, leverage inventory/$CLUSTER/group_vars/master (example below using MySQL as external database):
extra_server_args: '--token $TOKEN --datastore-endpoint "mysql://$USER:$PASS@tcp($HOST:3306)/$CLUSTER"'

Run 1: ansible-playbook site.yml --limit master[0]
Run 2: ansible-playbook site.yml
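The two runs above could be scripted along these lines. This is a minimal Python sketch, assuming the playbook is site.yml and the inventory group is named master as in the workaround; `workaround_commands` is a hypothetical helper that only builds the argument lists and does not execute anything.

```python
from typing import List


def workaround_commands(playbook: str = "site.yml",
                        group: str = "master") -> List[List[str]]:
    """Build the two ansible-playbook invocations of the workaround.

    Run 1 is limited to the first host of the group, run 2 covers all hosts.
    """
    first_master = f"{group}[0]"
    return [
        ["ansible-playbook", playbook, "--limit", first_master],
        ["ansible-playbook", playbook],
    ]


commands = workaround_commands()
```

Each list can be handed to `subprocess.run(cmd, check=True)` in order. Passing the arguments as a list rather than a shell string also keeps the shell from treating `[0]` as a glob pattern.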

@3deep5me

Is it maybe an option to not support CentOS 7?
I don't know the facts, but I think that if you create a new cluster in 2022, you normally do not use a distro that will be unsupported in 2024.
I think the project will benefit more from the HA feature than from supporting a distro from 2014.

@bryopsida

Does this just need to be re-tested with centos7 and synced to be integrated? Is there anything I can help with?

{% if ansible_host == hostvars[groups['master'][0]]['ansible_host'] | default(groups['master'][0]) %}
--cluster-init
{% else %}
--server https://{{ hostvars[groups['master'][0]]['ansible_host'] | default(groups['master'][0]) }}:6443

Similar to the above, wouldn't this be equivalent to apiserver_endpoint?

Contributor


The apiserver_endpoint could be something other than a master host. It can be a load balancer, VIP, or something similar. You'll want to use that endpoint once the cluster is running, but to bootstrap the cluster you just connect to any existing master host.
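The distinction drawn here can be sketched in Python as follows. The function name `registration_url` and the `cluster_running` flag are hypothetical; `apiserver_endpoint` and `apiserver_port` mirror the variable names used in this PR, and 6443 is the port shown in the template excerpt above.

```python
def registration_url(first_master: str,
                     apiserver_endpoint: str,
                     apiserver_port: int = 6443,
                     cluster_running: bool = True) -> str:
    """Pick the URL a node should connect to.

    While bootstrapping, connect directly to an existing master. Once the
    cluster is running, use the (possibly load-balanced) API server endpoint.
    """
    if cluster_running:
        return f"https://{apiserver_endpoint}:{apiserver_port}"
    # Bootstrap phase: the endpoint may not be serving yet.
    return f"https://{first_master}:6443"
```

For example, with a hypothetical VIP of 10.0.0.100 on port 8443, bootstrap traffic would still go to the first master on 6443, while later registrations would use https://10.0.0.100:8443.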

Comment on lines 22 to 23
args:
warn: false # The ansible systemd module does not support transient units

The warn arg was removed in ansible-core 2.14.

@jrstarke

@St0rmingBr4in are you still looking to merge this, or is this PR effectively dead at this point?

@St0rmingBr4in
Collaborator Author

I don't have the time to do it, if someone wants to open a new PR I'll gladly review it and merge it.

St0rmingBr4in and others added 8 commits November 7, 2023 14:03
This enables initializing a cluster in HA mode with an embedded DB.
https://rancher.com/docs/k3s/latest/en/installation/ha-embedded/

When multiple masters are specified in the master group, k3s-ansible will add
the necessary flags during the initialization phase.
(i.e. --cluster-init and --server)

For the embedded HA mode to work the k3s version must be >= v1.19.1

Signed-off-by: Julien DOCHE <[email protected]>
This replaces the `master_ip` var by `apiserver_endpoint` for genericity. The
init service is deployed only when k3s.service is not present on the machine to
ensure idempotence.

Signed-off-by: Julien DOCHE <[email protected]>
Signed-off-by: Julien DOCHE <[email protected]>
Allows specifying an alternate port value for the loadbalanced
apiserver endpoint using a new 'apiserver_port' variable. This is
required if port 6443 is already in use on the loadbalancer.

Signed-off-by: Brian Brookman <[email protected]>
Signed-off-by: Brian Brookman <[email protected]>
* Remove unsupported command module "warn" parameter

As of ansible-core 2.14.0, the builtin command module no longer
has a "warn" parameter. Using it causes a fatal error that stops
playbook completion.

Signed-off-by: Brian Brookman <[email protected]>
Signed-off-by: Derek Nola <[email protected]>

* Relax yamllint rules on spaces inside braces

Signed-off-by: Derek Nola <[email protected]>

* Fix lint

Signed-off-by: Derek Nola <[email protected]>

---------

Signed-off-by: Brian Brookman <[email protected]>
Signed-off-by: Derek Nola <[email protected]>
Co-authored-by: Julien DOCHE <[email protected]>
Co-authored-by: Derek Nola <[email protected]>
@dereknola
Member

Closed in favor of #210

@dereknola dereknola closed this Nov 8, 2023
@dereknola dereknola deleted the k3s-ha branch November 15, 2023 17:20
Development

Successfully merging this pull request may close these issues.

Multi-master support for K3s Ansible playbook?