[Feature Request] Implement LBaaS in Yaook #719

Open
2 tasks
anjastrunk opened this issue Aug 29, 2024 · 16 comments
Assignees: kitsudaiki
Labels: enhancement (New feature or request), SCS-VP10 (Related to tender lot SCS-VP10)

Comments

@anjastrunk
Contributor

anjastrunk commented Aug 29, 2024

Yaook, as a further implementation of the SCS standards, does not yet support a standards-conformant load balancer. We have to provide one. The only requirement is to offer an OpenStack-conformant endpoint to the user; what happens behind the scenes does not matter.

Tasks:

This issue is related to #587, which standardizes mandatory and recommended IaaS services; LBaaS should be part of it.

@anjastrunk added the enhancement (New feature or request) and SCS-VP10 (Related to tender lot SCS-VP10) labels on Aug 29, 2024
@markus-hentsch
Contributor

Evaluate options for LBaaS

FTR, here is one of the main problems that prevented integration of Octavia in Yaook so far: https://storyboard.openstack.org/#!/story/2007370#comment-153426

In Yaook, all database instances run behind HAProxy instances. According to the linked issue, this seems to lead to severe problems with Octavia in production.

We should at least consider improving Octavia and/or its integration, as re-implementing the whole Octavia LBaaS v2 API on top of a different LB framework would be no easy feat either.

We should get in touch with @horazont and check whether any issues other than the one mentioned above were observed with Octavia that would also need to be addressed.

@markus-hentsch
Contributor

I had a discussion with @horazont about this:

  • the upstream issue report [1] suggests that the issue is a race condition between a) the Octavia API instructing its workers via RPC and b) MariaDB syncing the API's database write to the other replicas, while a worker's read gets scheduled through HAProxy to a DB replica that has not yet received the sync
    • however, @horazont said he is not convinced that this actually is the problem, since in Yaook the HAProxy instances are configured to always schedule the DB queries to the first DB replica (a minimal sketch of such a configuration follows below this list)
    • we should reproduce and analyze the issue
  • there seems to be an OVN backend driver [2] for Octavia; we should have a look at that one too
  • creating an Octavia alternative with full API compatibility is a huge task; I think we should first try all other options for getting Octavia to work correctly in Yaook
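
For illustration, here is a minimal sketch of the kind of HAProxy configuration meant above, pinning all MariaDB traffic to the first replica and only failing over when it goes down. Host names, addresses and timeouts are made up for the example and are not taken from the actual Yaook deployment:

```
# haproxy.cfg (illustrative sketch, hypothetical names and addresses)
listen mariadb
    bind 0.0.0.0:3306
    mode tcp
    option tcpka
    timeout client 3600s
    timeout server 3600s
    # all queries go to db-0; db-1 and db-2 only take over if db-0 fails its health check
    server db-0 10.0.0.10:3306 check
    server db-1 10.0.0.11:3306 check backup
    server db-2 10.0.0.12:3306 check backup
```

If the queries really are pinned this way, a worker should never read from a replica that lags behind the API's write, which is why reproducing the issue is the logical first step before blaming the database layer.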

Footnotes

  1. https://storyboard.openstack.org/#!/story/2007370#comment-153426

  2. https://docs.openstack.org/ovn-octavia-provider/latest/admin/driver.html

@markus-hentsch
Contributor

While discussing the topic in a small kickoff with @kgube, @josephineSei and @kitsudaiki, we identified the following tasks:

  • Get in touch with the relevant CSPs and check whether the SCS reference implementation ever experienced issues like the one mentioned above [1].
  • Research which subset of the Octavia API is actually used and strictly needed by the KaaS part of the SCS reference implementation.
  • Identify all possible use cases that the Octavia API offers and how each can be tested.
  • Implement an Octavia operator prototype to integrate Octavia in Yaook.
  • Test the Octavia integration in Yaook, try to reproduce the original issue [1] and find a fix for it.

Note that aside from the last point most of these tasks are independent and can be addressed in parallel.

Footnotes

  1. https://storyboard.openstack.org/#!/story/2007370#comment-153426

@berendt
Contributor

berendt commented Oct 1, 2024

@markus-hentsch In Kolla (used by OSISM by default to deploy OpenStack) there are two ways to access the MariaDB Galera cluster: HAProxy or ProxySQL. With both, all nodes in a cluster access the database through the same node, namely the node that holds the primary IP address managed by Keepalived. If Keepalived is not used and the database is accessed some other way, possibly not via a single node, I think Galera still ensures that the information is identical on all nodes, because Galera implements a multi-master cluster.
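
For context, a minimal Keepalived sketch of the setup described above, where one node holds the primary (virtual) IP address and all database clients connect through it; interface name, router ID, priority and address are hypothetical placeholders:

```
# keepalived.conf (illustrative sketch)
vrrp_instance galera_vip {
    state BACKUP            # all nodes start as BACKUP; the highest priority wins the election
    interface eth0
    virtual_router_id 51
    priority 100
    advert_int 1
    virtual_ipaddress {
        192.168.16.254/24   # primary IP that HAProxy/ProxySQL and thus all DB clients use
    }
}
```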

@kitsudaiki

Started with a first prototypical octavia-operator for YAOOK. Reactivated the old issue on GitLab regarding the Octavia integration ( https://gitlab.com/yaook/operator/-/issues/186 ) and created a new branch for an octavia-operator ( https://gitlab.com/yaook/operator/-/tree/feature/add-octavia-operator ) as well as an Octavia docker image for YAOOK ( https://gitlab.com/yaook/images/octavia/-/tree/feature/initial-version ) for the implementation.

@berendt
Contributor

berendt commented Oct 2, 2024

Started with a first prototypical octavia-operator for YAOOK. Reactivated the old issue on GitLab regarding the Octavia integration ( https://gitlab.com/yaook/operator/-/issues/186 ) and created a new branch for an octavia-operator ( https://gitlab.com/yaook/operator/-/tree/feature/add-octavia-operator ) as well as an Octavia docker image for YAOOK ( https://gitlab.com/yaook/images/octavia/-/tree/feature/initial-version ) for the implementation.

For Amphora images you can use https://github.com/osism/openstack-octavia-amphora-image. I will add 2024.2 images later today.

@josephineSei
Contributor

I was looking through the Octavia documentation today and tried to identify use cases and features.

Different drivers / possible other choices:

First, it can be noted that the VM or container running the load balancer itself (the amphora) can be substituted by other options (see: https://docs.openstack.org/octavia/latest/admin/providers/index.html):

  1. Amphora: the reference driver from the Octavia project
  2. A10 Networks OpenStack Octavia Driver: for Thunder, vThunder and AX Series Appliances
  3. F5 Networks Provider Driver for OpenStack Octavia by SAP SE
  4. OVN Octavia Provider Driver
  5. Radware Provider Driver for OpenStack Octavia
  6. VMware NSX

Of all these drivers, only OVN is compared to Amphora in several feature matrices here. Looking through this document, it can be seen that there is a huge gap, mainly around everything needed for Layer 7 load balancing, which OVN does not support.

Example Use Cases

The Octavia guides for basic and Layer 7 load balancing give many examples of use cases.

These can be very coarsely divided into the following (a minimal code sketch of the basic case follows below this list):

  • Load balancers for each of UDP, TCP, HTTP and HTTPS
  • Applying health monitors to your load balancer
  • Applying TLS termination
  • Applying Layer 7 load balancing rules, including authentication, redirecting requests with invalid certificates, and using cookies
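
To make the basic use case concrete, here is a minimal openstacksdk sketch (not taken from the Octavia cookbooks verbatim; cloud name, subnet IDs and member addresses are placeholders) that creates an HTTP load balancer with a round-robin pool, two members and a health monitor:

```python
import openstack

# placeholders: adapt cloud name, subnet IDs and member addresses to your deployment
conn = openstack.connect(cloud="my-cloud")

lb = conn.load_balancer.create_load_balancer(
    name="demo-lb", vip_subnet_id="<vip-subnet-id>")
conn.load_balancer.wait_for_load_balancer(lb.id)

listener = conn.load_balancer.create_listener(
    name="demo-listener", protocol="HTTP", protocol_port=80,
    load_balancer_id=lb.id)
conn.load_balancer.wait_for_load_balancer(lb.id)

pool = conn.load_balancer.create_pool(
    name="demo-pool", protocol="HTTP", lb_algorithm="ROUND_ROBIN",
    listener_id=listener.id)
conn.load_balancer.wait_for_load_balancer(lb.id)

# two backend members, e.g. two web server VMs behind the load balancer
for address in ("10.0.0.11", "10.0.0.12"):
    conn.load_balancer.create_member(
        pool, address=address, protocol_port=80,
        subnet_id="<member-subnet-id>")
    conn.load_balancer.wait_for_load_balancer(lb.id)

# HTTP health monitor probing "/" on every member every 5 seconds
conn.load_balancer.create_health_monitor(
    pool_id=pool.id, type="HTTP", delay=5, timeout=5, max_retries=3,
    url_path="/")
```

The TCP/UDP cases only differ in the listener and pool protocol; the Layer 7 cases additionally need L7 policies and rules on the listener, which is exactly the part the OVN provider does not cover.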

Problem: TLS

To make use of TLS termination or re-encryption, a deployment with a working key manager is needed.
For the SCS project this means we cannot rely on these features, because we do not mandate a key manager within a deployment.

Further Features

There are other features that come with Octavia that are mostly useful for operators:

  • Amphora Log Offloading (via syslog over the lbaas-management network)
  • API Auditing (using Oslo messaging notifier -> could be routed to e.g. a log file)
  • API Health Monitoring
  • Octavia Flavors (predefined sets of provider configuration options // defined per provider driver)
  • Amphora Failover Circuit Breaker (threshold for failovers to prevent mass failovers)
  • using SR-IOV ports for Amphorae (increasing performance)

@markus-hentsch
Contributor

This might be important concerning Octavia with OVN backend: osism/issues#959

@garloff
Member

garloff commented Oct 9, 2024

To make our Cluster Stacks (or other Cluster API solutions) work without special hacks, we need LBaaSv2 load balancers in two places:
(1) In front of kube-api (created by capo)
(2) In front of a deployed ingress controller (or gateway) (created by OCCM)
Neither requires TLS termination to work.
(TLS termination is a feature often desired by users of VM-based workloads, so you may still consider the option to have it.)

I have been trying to use the OVN provider instead of amphorae in Cluster-API-Provider (KaaS-v1), because it is much more resource-efficient and reliable and also allows seeing the client IPs. (In general, I'm more convinced of the design of doing L3 load balancing right at the network level and leaving the L7 complexity to some other place.)

Historic information is here, updated by
https://github.com/SovereignCloudStack/k8s-cluster-api-provider/blob/main/Release-Notes-R6.md#ovn-lb
I don't remember the status of configuring OVN provider LB for cluster-stacks; I know it was on the list of things to do.
Maybe @jschoone or @chess-knight or @lindenb1 can comment on it.
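
For reference, selecting the OVN provider for OCCM-created load balancers happens in the [LoadBalancer] section of the cloud provider config; a minimal sketch (the values are placeholders, not the Cluster Stacks defaults):

```
# cloud.conf for the openstack-cloud-controller-manager (illustrative sketch)
[LoadBalancer]
# use the OVN Octavia provider instead of amphorae
lb-provider=ovn
# the OVN provider only supports SOURCE_IP_PORT as balancing algorithm
lb-method=SOURCE_IP_PORT
# external network used for the floating IP in front of the LB (placeholder ID)
floating-network-id=<public-network-uuid>
```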

@kitsudaiki kitsudaiki self-assigned this Oct 10, 2024
@kitsudaiki

Current state of the implementation of the octavia-operator for YAOOK:

  • database and message queue for Octavia are up and running
  • all Octavia services are up and running, each service within its own pod
  • octavia-api is reachable and responds to openstack-client requests (see the sketch below)
  • configuration of the load balancer management network and the Octavia-specific certificate configuration are still in progress, so the Octavia configuration is not complete at the moment
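
A quick way to check that the API really answers client requests is to list the announced provider drivers and the (initially empty) set of load balancers via openstacksdk; the cloud name below is a placeholder:

```python
import openstack

conn = openstack.connect(cloud="yaook-test")  # placeholder cloud name

# provider drivers announced by the Octavia API (e.g. "amphora")
for provider in conn.load_balancer.providers():
    print(provider.name, "-", provider.description)

# should print an empty list on a fresh deployment
print(list(conn.load_balancer.load_balancers()))
```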

@kitsudaiki

kitsudaiki commented Oct 18, 2024

Status update on the octavia-operator in YAOOK:

In addition to the network configuration and the certificates still missing last week, some other problems appeared while testing, which were fixed thanks to debugging support by @markus-hentsch. The current prototype of the octavia-operator with amphora works. The setup was successfully tested by creating a load balancer with two VMs behind it and accessing both VMs round-robin via the floating IP bound to the load balancer.

The current state is only a first prototype and not ready to be merged into the main branch. Still open tasks:

  • clean up the code (remove debug leftovers and so on)
  • write documentation
  • automate certificate creation (this was done manually for the test because of the passphrase Octavia requires for the keys)
  • find a cleaner solution for the traffic between the amphora load-balancer VM and the health manager
  • write unit and integration tests

@kgube
Contributor

kgube commented Oct 21, 2024

Regarding the SCS KaaS requirements for load balancing: there is a draft for a DR (though it might be changed to a standard) that requires the Service type LoadBalancer.
This is only L3/L4 load balancing, which could well be provided by Octavia's OVN backend.

There is also an octavia ingress controller, which uses the LBaaS L7 capabilities to implement the Ingress API.
We currently don't require an Ingress controller to be created in KaaS clusters, but even if that changes, there are Pod-based ingress controllers that work behind a Service of type LoadBalancer and do not require external L7 support.
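
To illustrate what the DR requires on the Kubernetes side, this is the kind of object OCCM reconciles into an Octavia (amphora or OVN) load balancer; names and ports are arbitrary examples:

```yaml
# example Service that OCCM turns into an Octavia load balancer
apiVersion: v1
kind: Service
metadata:
  name: demo-lb
spec:
  type: LoadBalancer
  selector:
    app: demo
  ports:
    - name: http
      protocol: TCP
      port: 80          # port exposed on the Octavia VIP / floating IP
      targetPort: 8080  # port of the backing pods
```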

@anjastrunk anjastrunk moved this from Backlog to Doing in Sovereign Cloud Stack Oct 28, 2024
@kitsudaiki

Restructured the code and added basic documentation. Created the merge request in draft state for now to get the CI pipeline checks green: https://gitlab.com/yaook/operator/-/merge_requests/2679 , but unit and integration tests still have to be done.

@kitsudaiki

While trying to add a second provider network to my test deployment, which is supposed to serve as the load-balancer management network for further tests, I accidentally broke my OVN setup with an invalid configuration, which took me some time to fix.

@markus-hentsch
Contributor

While trying to add a second provider network to my test deployment, which is supposed to serve as the load-balancer management network for further tests, I accidentally broke my OVN setup with an invalid configuration, which took me some time to fix.

I removed the Nova and Neutron layer from the test deployment again to completely wipe the networking stuff, cleaned up any remnants and redeployed both services. It should work again now.

@kitsudaiki

Added unit and integration tests for the new octavia-operator in YAOOK and, after also testing with a dedicated load-balancer provider network, marked the merge requests for review:

(the Octavia image has to be merged first in order to set valid image tags, so the operator merge request is still in draft mode at the moment)
