diff --git a/2018/002-salt-velum-integration.md b/2018/002-salt-velum-integration.md
new file mode 100644
index 0000000..8b1c1b8
--- /dev/null
+++ b/2018/002-salt-velum-integration.md
@@ -0,0 +1,349 @@
# Salt and Velum integration

| Field | Value |
|:-------|:-----------|
| Status | Draft |
| Date | 2018-06-27 |

## Introduction

Salt and Velum need to integrate in order to trigger actions and notify users
about the outcome of those actions.

### Problem description

Salt sometimes cannot insert events into the database because they are too big
to fit in a single transaction. This leads to Velum missing those events and
never reacting to them (never marking nodes as successfully bootstrapped, for
example).

Debugging is also cumbersome because of the large number of events present in
the database, most of which are of no interest for debugging purposes.

### Proposed change

Instead of having salt interact with MariaDB directly, salt will notify Velum
through a dedicated API, and only the interesting events will be reported.

## Detailed RFC

Examples of actions that need to be triggered from within Velum include (but
are not limited to):

* Launch the bootstrap orchestration that creates the cluster for the first time
* Node management
  * Node addition
  * Node removal
* Modify settings and apply those settings to the cluster
* Upgrade the cluster

All these actions correspond to salt orchestrations. Velum will trigger the
orchestrations with the right parameters using the `salt-api` component. After
that, we need some way to read the result of those orchestrations, because
launching an orchestration is asynchronous and we cannot block Velum until the
orchestration result is available.

The result of these orchestrations is what allows Velum to perform certain
changes on the local database, for example:

* When a node is removed
  * Create a new `Orchestration` row in the database
  * Mark the node to be removed as `pending` highstate in the database
  * Launch the `removal` orchestration using `salt-api`
  * If the orchestration is successful
    * Read the parameters of the launched orchestration and remove that node
      from the database
  * Otherwise
    * Read the parameters of the launched orchestration and mark that node as
      failed in the database
  * Update the `Orchestration` row

* When a node is added
  * Set the new node role using `salt-api` (e.g. `kube-master`)
  * Create a new `Orchestration` row in the database
  * Mark the new node as `pending` highstate in the database
  * Launch the `kubernetes` orchestration using `salt-api`
  * If the orchestration is successful
    * Mark all `pending` highstate nodes as `applied` in the database
  * Otherwise
    * Mark all `pending` highstate nodes as `failed` in the database
  * Update the `Orchestration` row

Velum is an application that saves its state in a MariaDB database living on
the admin node.

### Problem description (Detailed)

There are different components to integrate between Velum and Salt:

* Pillar override
* Event notification

#### Pillar override

The pillar override is what allows us to deploy the Kubernetes cluster with the
user-provided settings. Our salt states read these pillar settings, and some of
those pillars are overridden with the settings provided by the user.

The pillar override has historically (versions 1 and 2) been stored as a table
in the MariaDB database used by Velum.
Velum would then write certain attributes as rows, which allowed us to
[override default pillar values](https://github.com/kubic-project/salt/blob/9832af2af236beb44603996cfc8af26cee9f7d26/config/master.d/50-returner.conf#L10-L111)
when deploying and maintaining a cluster.

This database override soon fell short in functionality. We couldn't override
pillar values with nested structures, and we also had a limit on the depth of
nested keys that we could override.

This problem was solved in v3 by the inclusion of a
[pillar API served by Velum](https://github.com/kubic-project/salt/blob/0e80af55460b9135fa333eeae28d8750e8f75039/config/master.d/50-master.conf#L38-L41).
Since this API serves structured content, it's trivial for us to represent the
pillar override however we want, and we moved away from the flat table override
and from the coupling between salt and MariaDB in this regard.

#### Event notification

The event notification is the way Velum gets notified about the outcome of a
relevant task that was triggered in salt (e.g. an orchestration result, a
highstate event, a new minion asking for authorization, or a new minion that
has been accepted by the `salt-master`).

In all versions we save the salt events in the database, [in another dedicated
table](https://github.com/kubic-project/salt/blob/0e80af55460b9135fa333eeae28d8750e8f75039/config/master.d/50-returner.conf).

The current solution is an event processor running in an infinite loop that
analyzes [all events in the database](https://github.com/kubic-project/velum/blob/e3ffd14b3f7c9a4baeed4109ea5161022d94ae63/app/models/salt_event.rb#L37-L48)
and runs some [salt handlers](https://github.com/kubic-project/velum/tree/e3ffd14b3f7c9a4baeed4109ea5161022d94ae63/app/models/salt_handler)
depending on what was found, and with what parameters.

Experience has taught us that there are issues with this approach:

1. We are saving a huge number of uninteresting events in the database, making
   debugging cumbersome.

   1. Salt generates events for every action, and only a few of those events
      are really of interest to Velum.

1. Retrieving the supportconfig from the admin node in particular is memory-
   and IO-demanding because of the huge number of uninteresting events. It's
   not surprising to see this process dying with an OOM.

1. Really big events will fail to be added to the database. There are lots of
   ways to tweak MariaDB, but certain limits will always be hit when the
   cluster grows, or if the salt output grows bigger for the same cluster
   size. This has a truly bad consequence: we might miss orchestration results
   (usually the biggest outputs), which leads to the infamous "nodes keep
   spinning forever".

The debugging situation could have been improved by setting
`EVENT_RETURN_WHITELIST` or `EVENT_RETURN_BLACKLIST` in the `salt-master`
configuration to reduce noise in our database. However, this would still not
help with missing events (the events that we have seen failing to be inserted
are usually the important ones, like orchestration results).

Another issue is the tight coupling between salt and Velum's storage. Both
have RW access to the database; even if they have different per-table
permissions and different usernames and passwords, it is not a good design to
let salt access the database directly.

The event notification is the only reason why we cannot completely decouple
salt from MariaDB today.
If we change the way Velum reads events, we can make salt avoid accessing
MariaDB completely.

### Proposed change (Detailed)

The current state is the following:

```
+-----------+     +-----------+     +------------+
|           |     |           |     |            |
|   Velum   |---->+  MariaDB  +<----|    Salt    |
|           |     |           |     |            |
+-----------+     +-----------+     +------------+
```

The ideal solution would be to specify a custom returner when calling the
orchestration, but this is not possible yet: [1](https://github.com/saltstack/salt/issues/44610),
[2](https://docs.saltstack.com/en/latest/ref/netapi/all/salt.netapi.rest_cherrypy.html#run).

The schematic proposal is the following:

```
+-----------+     +-----------+     +------------+
|           |     |           | (1) |            |
|  MariaDB  +<----|   Velum   +<----|    Salt    |
|           |     |           |     |            |
+-----------+     +-----------+     +------------+

(1) Salt will use reactors to select the interesting events (e.g.
`salt/run/*/ret` or `salt/job/*/ret/*`). A reactor will trigger an HTTP query
to a specific Velum API based on the event that generated it, targeting
different endpoints.
```

In order to remove the coupling between Salt and MariaDB, the following events
need to be taken into account so that we keep the full functionality we have
today by going through all events in the database:

* `salt/run/*/new`: orchestration triggers.
  * Filter by `data["fun"] == "runner.state.orchestrate"` to only report the
    orchestration triggers.
* `salt/run/*/ret`: orchestration results.
  * Filter by `data["fun"] == "runner.state.orchestrate"` to only report the
    orchestration results.
* `salt/job/*/ret/*`: highstate on a minion.
  * Filter by `data["fun"] == "state.highstate"` to only report the highstate
    events.
* `salt/auth`: a minion is asking to authenticate itself (needs to be accepted).
  * This will be used to create "Pending acceptance" minions.
* `minion_start`: a minion is available (has been accepted).
  * This will be used to create the `Minion` objects in the database if that
    minion is not yet present.
* `salt/presence/change`: (optional) could be used to detect down/up minions.

Making `salt` notify an API instead of writing directly to storage is
interesting because it allows us to add extra logic to those endpoints, with
`Velum` being the final decision maker on what to do with them:

* It can modify the database to update certain attributes, or create, modify
  or delete certain rows.

* It can notify the user about certain important actions: send an email if a
  machine has been down for more than 10 minutes and is not in an upgrade
  process.

* It can write all important events in a given format to a "Notifications"
  store (e.g. a new table) that can later be shown to the user when logging
  in.
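
As a concrete illustration of the reactor wiring described in (1) above, here
is a minimal, hedged sketch based on the POC approach of calling the
`http.query` runner. The reactor file paths, the Velum endpoint URL and the
payload shape are placeholder assumptions, and the exact reactor syntax
(`args`-style arguments) depends on the Salt version:

```yaml
# Sketch of the salt-master reactor mapping (e.g. a file under
# /etc/salt/master.d/): route the interesting event tags to reactor files.
reactor:
  - 'salt/run/*/ret':
    - /srv/reactor/orchestration_result.sls
  - 'salt/auth':
    - /srv/reactor/minion_pending.sls
  - 'minion_start':
    - /srv/reactor/minion_started.sls
```

```yaml
# Sketch of /srv/reactor/orchestration_result.sls: forward only orchestration
# results to a hypothetical Velum endpoint, ignoring other runner returns.
{% if data.get('fun') == 'runner.state.orchestrate' %}
notify_velum_orchestration_result:
  runner.http.query:
    - args:
        url: https://velum.infra.caasp.local/internal-api/v1/salt_events  # placeholder URL
        method: POST
        header_dict:
          Content-Type: application/json
        data: '{"jid": "{{ data["jid"] }}", "tag": "{{ tag }}", "success": {{ "true" if data.get("success") else "false" }}}'
{% endif %}
```

A similar reactor bound to `salt/job/*/ret/*` would filter on
`data["fun"] == "state.highstate"` and target a different endpoint, as listed
above.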

This design also fits well with a future `checkpointing` procedure, in which
`salt` can notify `Velum` when it reaches certain points in the
orchestrations, providing more information to the user about the orchestration
progress:

* When bootstrapping, for example:
  * `etcd has been set up`
  * `all kubernetes masters have been set up`
  * `all kubernetes workers have been set up`
  * `all internal services have been deployed in the cluster`
  * `bootstrap is complete`

The reason this design fits well with such a future `checkpointing` procedure
is that the orchestrations might simply send custom event tags when they reach
certain points, and those events can be connected to reactors in the very same
way that we are connecting the builtin event tags.

#### Missing events

With the proposed changes it's still possible to miss events if Velum is down
when salt tries to perform the HTTP request to the API. However, this RFC also
includes a solution in this regard.

The reactors can trigger any salt module logic. For example, the current POC
for the reactor uses `runner.http.query` to notify Velum's API. This runner
has no special logic to handle the case when it cannot perform the HTTP
request. However, it's possible to implement our own runner that does. If we
implement a `runner.velum_http.query` that is a wrapper around
`runner.http.query`, we could:

* Save the event in the FS, or in the `salt-master` cache.
* Perform the HTTP query using the `http.query` runner.
  * If it succeeds, we remove the event from the FS or the cache.
  * If it fails, we keep it.

Aside from this, we can schedule a task that runs, for example, every N
minutes, checks the FS or the cache for saved events older than N minutes, and
runs the queries one by one, applying the same logic: removing them after they
have succeeded.

This would solve the problem of missing events completely.

#### Event size

Since one of the issues with the current solution is that we are saving the
whole output of orchestrations in the database, and that is leading to
different kinds of issues, this RFC also proposes changes in this regard.

The callbacks that salt performs against Velum's HTTP endpoints will include
minimal information for each event, e.g.:

* For an orchestration event
  * The `jid`
  * The event tag
  * Explicit pillar information (if any -- e.g. on a removal orchestration, it
    includes the target)
  * The orchestration name
  * The timestamps (`started_at`, ...)
  * Result status (did it succeed?)

This is the information that will be saved in the database: basic operational
information that lets us know which processes the cluster went through and
what the outcome of those events was.

Also, if we are interested in specific information about the event itself, the
reactor could extract that piece of information from the whole output and send
it to the Velum API.

Aside from this, the `salt-master` will keep the whole orchestration output on
the FS, with a name similar to `jid.txt`. Since Velum will keep the whole
orchestration history, when debugging an issue we can inspect the latest
orchestrations (or walk them in order), determine which one we are interested
in, and read the corresponding `jid.txt` file, which will be present as a
plain text file on the admin node's filesystem.
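
To make the size reduction concrete, here is a hedged sketch of the kind of
trimmed payload a reactor could POST to Velum for a removal orchestration
result. The field names and the location of the full output file are
illustrative assumptions, not a final contract:

```yaml
# Illustrative summary payload for an orchestration result (field names are
# assumptions):
jid: "20180627093045123456"
tag: "salt/run/20180627093045123456/ret"
orchestration: "removal"
pillar:                     # explicit pillar only, e.g. the removal target
  target: "worker0.example.net"
started_at: "2018-06-27T09:30:45Z"
finished_at: "2018-06-27T09:41:02Z"
success: true
# The full orchestration return would stay on the salt-master filesystem,
# e.g. as 20180627093045123456.txt (exact location to be decided).
```

Velum would persist only a summary of this shape, while the complete output
remains on the admin node for debugging.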

It would be interesting to have a job that cleans up old orchestration files.

The supportconfig for the admin node would also include this folder along with
the rest of the files.

### Current POC

The current POC for reactors can be seen here:

* [salt](https://github.com/kubic-project/salt/pull/209)
* [velum](https://github.com/kubic-project/velum/pull/299)
* [caasp-container-manifests](https://github.com/kubic-project/caasp-container-manifests/pull/100)
* [automation](https://github.com/kubic-project/automation/pull/117)

Please note that this POC does not include the logic described in the [Missing
events](#missing-events) section, as it just uses `runner.http.query` to
notify Velum's API.

### Dependencies

This is a purely internal change, and it should not have any impact on other
components and/or teams.

### Concerns and Unresolved Questions

No known concerns or unresolved questions.

## Alternatives

An alternative to the missing events problem was explored in the past: writing
a reconciliation loop in Velum that would keep everything up to date. However,
some problems arose when exploring this solution. Special logic needs to be
maintained to detect specific cases, for example:

* Detect missing minions
  * List minions using `salt-api`
  * If a minion is missing from the database, create it
* Detect missing highstates
* Detect missing orchestration results
  * Especially critical: querying the `salt-api` afterwards would not return
    the whole orchestration output, so we would not know its full result.

While the reconciliation loop that was included in v3 to detect the update
status of machines is handy for that purpose, it's not that easy to maintain
parallel logic that performs the same actions as the API endpoints, with
slightly different behavior and all the side effects that can bring.

This is why the solution presented in the [Missing events](#missing-events)
section looks superior to a reconciliation loop for detecting missing events:
salt will make sure that those events are really delivered to Velum (waiting
for a successful HTTP response code); otherwise, salt will be the one retrying
until Velum has received the new events.

## Revision History

| Date | Comment |
|:-----------|:--------------|
| 2018-06-27 | Initial Draft |

diff --git a/rfc-template.md b/rfc-template.md
index f034336..bf24d31 100644
--- a/rfc-template.md
+++ b/rfc-template.md
@@ -1,4 +1,4 @@
- Title
+# Title
 
 | Field | Value |
 |:-------|:-----------|
@@ -23,7 +23,12 @@
 change once this change is implemented.
 
 In this section of the document the target audience is the dev team. Upon
 reading this section each engineer should have a rather clear picture of what
-needs to be done in order to implement the described feature.
+needs to be done in order to implement the described feature.
+
+### Problem description (Detailed)
+
+This section goes into more detail about the Problem description from the
+introduction, if it applies.
 
 ### Proposed change (Detailed)