Add Velum-Salt integration RFC #3

2018/002-salt-velum-integration.md

# Salt and Velum integration

| Field | Value |
|:-------|:-----------|
| Status | Draft |
| Date | 2018-06-27 |

## Introduction

Salt and Velum need to integrate in order to trigger actions and notify users
about the outcome of those actions.

### Problem description

Salt sometimes cannot insert some events into the database because they are too
big to fit in a single transaction. This leads to Velum missing these events
and never reacting to them (never marking nodes as successfully bootstrapped,
for example).

Debugging is also cumbersome because of the large number of events present in
the database, most of which are of no interest for debugging purposes.

### Proposed change

Instead of having salt interact with MariaDB directly, it will notify Velum
through a dedicated API; only the interesting events will be reported.

## Detailed RFC

Examples of actions that need to be triggered from within Velum include (but
are not limited to):

* Launch bootstrap orchestration that creates the cluster for the first time
* Node management
  * Node addition
  * Node removal
* Modify settings and apply those settings to the cluster
* Upgrade cluster

All these actions correspond to salt orchestrations. Velum will trigger the
orchestrations with the right parameters using the `salt-api` component. After
that, we need some way to read the result of those orchestrations, because
launching an orchestration is an asynchronous operation and we cannot block
Velum until the orchestration result is available.
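
As a minimal sketch, this is roughly how an orchestration could be launched
through `salt-api` (the `rest_cherrypy` interface). The endpoint, credentials
and orchestration name below are assumptions, not the actual Velum
implementation:

```python
import requests

SALT_API = "https://salt-api:8000"  # assumed salt-api (rest_cherrypy) endpoint

# Authenticate against salt-api to obtain a session token.
login = requests.post(
    "{0}/login".format(SALT_API),
    json={"username": "saltapi", "password": "secret", "eauth": "pam"},
)
token = login.json()["return"][0]["token"]

# Launch the orchestration asynchronously: the call returns immediately with a
# jid, and the result will reach Velum later through the reactor -> API path.
response = requests.post(
    SALT_API,
    headers={"X-Auth-Token": token, "Accept": "application/json"},
    json=[{
        "client": "runner_async",
        "fun": "state.orchestrate",
        "mods": "orch.kubernetes",  # hypothetical orchestration name
    }],
)
print(response.json())  # typically contains the jid and the run's event tag
```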

The results of these orchestrations are what allow Velum to perform certain
changes on the local database, for example:

* When a node is removed
  * Create a new `Orchestration` row in the database
  * Mark the node to be removed as `pending` highstate in the database
  * Launch the `removal` orchestration using `salt-api`
  * If the orchestration is successful
    * Read the parameters of the launched orchestration and remove that node
      from the database
  * Otherwise
    * Read the parameters of the launched orchestration and mark that node as
      failed in the database
  * Update the `Orchestration` row

* When a node is added
  * Set the new node role using `salt-api` (e.g. `kube-master`)
  * Create a new `Orchestration` row in the database
  * Mark the new node as `pending` highstate in the database
  * Launch the `kubernetes` orchestration using `salt-api`
  * If the orchestration is successful
    * Mark all `pending` highstate nodes as `applied` in the database
  * Otherwise
    * Mark all `pending` highstate nodes as `failed` in the database
  * Update the `Orchestration` row

Velum is an application that saves its state in a MariaDB database running on
the admin node.

### Problem description (Detailed)

There are different components to integrate between Velum and Salt:

* Pillar override
* Event notification

#### Pillar override

The pillar override is what allows us to deploy the Kubernetes cluster with the
user-provided settings. Our salt states read these pillar settings, and some of
those pillars are overridden with the values provided by the user.

The pillar override has historically (versions 1 and 2) been implemented as a
table in the MariaDB database used by Velum. Velum would write certain
attributes as rows, which allowed us to [override default pillar values](https://github.com/kubic-project/salt/blob/9832af2af236beb44603996cfc8af26cee9f7d26/config/master.d/50-returner.conf#L10-L111) when deploying
and maintaining a cluster.

This database override soon fell short in functionality: we couldn't override
pillar values with nested structures, and we also had a limit on the depth of
nested keys that we could override.

This problem was solved in v3 by the inclusion of a [pillar API served by Velum](https://github.com/kubic-project/salt/blob/0e80af55460b9135fa333eeae28d8750e8f75039/config/master.d/50-master.conf#L38-L41).
Since this API serves structured content, it's trivial for us to represent the
pillar override as we want, and we moved away from the flat table override and
from the coupling between salt and MariaDB in this regard.

#### Event notification

The event notification is the way Velum gets notified about the outcome of a
relevant task that was triggered in salt (e.g. an orchestration result, a
highstate event, a new minion asking for authorization, or a new minion being
accepted by the `salt-master`).

In all versions we save the salt events in the database, [in another dedicated
table](https://github.com/kubic-project/salt/blob/0e80af55460b9135fa333eeae28d8750e8f75039/config/master.d/50-returner.conf).

The current solution is to have an event processor in an infinite loop that
analyzes [all events in the database](https://github.com/kubic-project/velum/blob/e3ffd14b3f7c9a4baeed4109ea5161022d94ae63/app/models/salt_event.rb#L37-L48)
and runs the appropriate [salt handlers](https://github.com/kubic-project/velum/tree/e3ffd14b3f7c9a4baeed4109ea5161022d94ae63/app/models/salt_handler)
depending on what was found and with what parameters.

Experience has taught us that there are issues with this approach:

1. We are saving a huge number of uninteresting events in the database,
   making debugging cumbersome.

1. Salt generates events for every action, and only a few of those events are
   really of interest to Velum.

1. Retrieving the supportconfig from the admin node in particular is memory-
   and IO-demanding because of the huge number of uninteresting events. It's
   not surprising to see this process dying with an OOM.

1. Really big events will fail to be inserted into the database. There are
   lots of ways to tweak MariaDB, but there are certain limits that will
   always be hit when the cluster grows, or if the salt output grows bigger
   for the same cluster size. This has a truly bad consequence: we might miss
   orchestration results (usually the biggest outputs), which leads to the
   infamous "nodes keep spinning forever".

The debugging situation could have been improved by setting
`EVENT_RETURN_WHITELIST` or `EVENT_RETURN_BLACKLIST` in the `salt-master`
configuration to avoid noise in our database. However, this would still not
help with missing events (the events that we have seen failing to be inserted
have usually been the important ones, like orchestration results).

Another issue is the tight coupling between salt and Velum's storage. Both have
RW access to the database; even if they have different per-table permissions
and different usernames and passwords, it's not a good design to allow salt to
access the database directly.

The event notification is the only reason why we cannot completely decouple
salt from MariaDB today. If we change the way Velum reads events, salt could
avoid accessing MariaDB completely.

### Proposed change (Detailed)

The current state is the following:

```
+-----------+ +-----------+ +------------+
| | | | | |
| Velum |---->+ MariaDB +<----| Salt |
| | | | | |
+-----------+ +-----------+ +------------+
```

The ideal solution would be to specify a custom returner when calling the
orchestration, but this is not possible yet: [1](https://github.com/saltstack/salt/issues/44610),
[2](https://docs.saltstack.com/en/latest/ref/netapi/all/salt.netapi.rest_cherrypy.html#run).

The schematic proposal is the following:

```
+-----------+ +-----------+ +------------+
| | | | (1) | |
| MariaDB +<----| Velum +<----| Salt |
| | | | | |
+-----------+ +-----------+ +------------+

(1) Salt will use reactors to filter out the interesting events (e.g.
`salt/run/*/ret` or `salt/job/*/ret/*`). A reactor will trigger an HTTP query
to a specific Velum API based on the event that generated it, targeting
different endpoints.
```

In order to remove the coupling between Salt and MariaDB, the following events
need to be taken into account to keep the full functionality that we have today
by going through all events in the database (see the reactor sketch after this
list):

* `salt/run/*/new`: orchestration triggers.
  * Filtered by `data["fun"] == "runner.state.orchestrate"` so that only
    orchestration triggers are reported.
* `salt/run/*/ret`: orchestration results.
  * Filtered by `data["fun"] == "runner.state.orchestrate"` so that only
    orchestration results are reported.
* `salt/job/*/ret/*`: highstate on a minion.
  * Filtered by `data["fun"] == "state.highstate"` so that only highstate
    events are reported.
* `salt/auth`: a minion is asking to authenticate itself (needs to be accepted).
  * This will be used to create "Pending acceptance" minions.
* `minion_start`: a minion is available (has been accepted).
  * This will be used to create the `Minion` objects in the database if that
    minion is not yet present.
* `salt/presence/change`: (optional) could be used to detect minions going down
  or coming back up.
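
As an illustration (not the actual POC code), the `salt-master` configuration
could map these tags to reactor SLS files, and each reactor could filter on
`data["fun"]` before notifying Velum. The file paths, the Velum endpoint and
the payload below are assumptions, and the exact reactor calling convention
varies slightly between salt releases:

```yaml
# /etc/salt/master.d/reactor.conf (assumed path): map event tags to reactors.
reactor:
  - 'salt/run/*/ret':
    - /srv/reactor/orchestration-result.sls
  - 'salt/auth':
    - /srv/reactor/minion-auth.sls
```

```yaml
# /srv/reactor/orchestration-result.sls (assumed path). `tag` and `data` are
# injected by the reactor system for the event being processed.
{% if data.get('fun') == 'runner.state.orchestrate' %}
notify_velum_orchestration_result:
  runner.http.query:
    - args:
      - url: https://velum.internal/internal-api/v1/orchestrations
      - method: POST
      - header_dict:
          Content-Type: application/json
      - data: '{{ {"jid": data["jid"], "success": data.get("success", False)} | json }}'
{% endif %}
```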

Making `salt` notify an API instead of writing directly to storage is
interesting because it allows us to add extra logic to those endpoints, with
`Velum` being the final decision maker on what to do about them:

* It can modify the database to update certain attributes, or create, modify or
  delete certain rows.

* It can notify the user about certain important events: for example, send an
  email if a machine has been down for more than 10 minutes and is not being
  upgraded.

* It can write all important events in a given format to a "Notifications"
  storage (e.g. a new table), which can later be shown to the user when logging
  in.

This also fits well with a future `checkpointing` procedure, in which `salt`
can notify `Velum` when it reaches certain points in the orchestrations,
providing more information to the user about the orchestration progress:

* When bootstrapping, for example:
  * `etcd has been set-up`
  * `all kubernetes masters have been set up`
  * `all kubernetes workers have been set up`
  * `all internal services have been deployed in the cluster`
  * `bootstrap is complete`

The reason why this design fits well with this possible future `checkpointing`
procedure is that the orchestrations can simply send custom event tags when
they reach certain points, and those events can be connected to reactors in the
very same way that we connect the built-in event tags, as sketched below.
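
For instance, a bootstrap orchestration could fire a hypothetical custom tag
once etcd is ready, using the `salt.runner` state to call the `event.send`
runner; a reactor mapped to that tag would then forward it to Velum just like
the built-in tags. The tag, state id and requisite below are made up:

```yaml
# Hypothetical checkpoint step inside the bootstrap orchestration SLS.
notify_etcd_checkpoint:
  salt.runner:
    - name: event.send
    - tag: caasp/bootstrap/etcd-ready
    - require:
      - salt: etcd_setup    # assumed id of the step that sets up etcd
```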

#### Missing events

With the proposed changes it's still possible to miss events if Velum is down
when salt tries to perform the HTTP request to the API. However, this RFC also
proposes a solution for this case.

Reactors can trigger any salt module logic; for example, the current POC for
the reactor uses `runner.http.query` to notify Velum's API. This runner has no
special logic to handle the case where it cannot perform the HTTP request, but
it's possible to implement our own runner that does. If we implement a
`runner.velum_http.query` that is a wrapper around `runner.http.query`, we
could (see the sketch after this list):

* Save the event in the FS, or in the `salt-master` cache.
* Perform the HTTP query using the `http.query` runner.
  * If it succeeds, remove the event from the FS or the cache.
  * If it fails, keep it.
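
A minimal sketch of such a `velum_http` runner module, assuming a hypothetical
cache directory and using `salt.utils.http.query` (which is what the
`http.query` runner wraps underneath): the event is persisted before the
request is attempted and removed only once Velum acknowledges it, so a
scheduled job can later replay anything left behind.

```python
import json
import os
import time
import uuid

import salt.utils.http

CACHE_DIR = "/var/cache/salt/master/velum-events"  # assumed location


def query(url, data=None, **kwargs):
    """Wrapper around an HTTP query that never silently drops an event."""
    if not os.path.isdir(CACHE_DIR):
        os.makedirs(CACHE_DIR)

    # 1. Persist the event on the FS (it could also go to the salt-master cache).
    event_file = os.path.join(
        CACHE_DIR, "{0}-{1}.json".format(int(time.time()), uuid.uuid4().hex)
    )
    with open(event_file, "w") as stored:
        json.dump({"url": url, "data": data, "kwargs": kwargs}, stored)

    # 2. Perform the HTTP query against Velum's API.
    result = salt.utils.http.query(url, data=data, status=True, **kwargs)

    # 3. Remove the persisted copy only if Velum acknowledged the event;
    #    otherwise keep it so it can be retried later.
    if result.get("status") in (200, 201):
        os.remove(event_file)
    return result
```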

Aside from this, we can schedule a job that runs, for example, every N minutes,
checks the FS or the cache for persisted events older than N minutes, and runs
the queries one by one, applying the same logic: removing each event after its
query has succeeded.

This would solve the problem of missing events completely.

#### Event size

One of the issues with the current solution is that we are saving the whole
output of orchestrations in the database, which leads to different kinds of
problems, so this RFC also proposes changes in this regard.

The callbacks that salt performs against Velum's HTTP endpoints will include
minimal information for each event, e.g.:

* For an orchestration event
  * The `jid`
  * The event tag
  * Explicit pillar information (if any -- e.g. a removal orchestration
    includes the target)
  * The orchestration name
  * The timestamps (`started_at`, ...)
  * Result status (did it succeed?)
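
For illustration, the body of such a callback for an orchestration result could
look roughly like the following; the field names are hypothetical, not the
POC's actual schema:

```python
# Hypothetical minimal payload for an orchestration result callback.
orchestration_event = {
    "jid": "20180627101501123456",
    "tag": "salt/run/20180627101501123456/ret",
    "orchestration": "orch.kubernetes",
    "pillar": {"target": "worker3.example.com"},  # explicit pillar, if any
    "started_at": "2018-06-27T10:15:01Z",
    "finished_at": "2018-06-27T10:32:40Z",
    "success": True,
}
```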

This is the information that will be saved in the database: basic operational
information that lets us know which processes the cluster went through and what
the outcome of those events was.

Also, if we are interested in specific information about the event itself, it
would be possible to extract more information from the whole output in the
reactor and send that piece of information to the Velum API.

Aside from this, the `salt-master` will keep the whole orchestration output on
the FS, with a name similar to `jid.txt`. Since Velum will save the whole
orchestration history, we can inspect the latest orchestrations (or the
orchestrations in order), determine which one we are interested in, and read
that specific `jid.txt` file, which would be present as a plain text file on
the admin node's host filesystem, when debugging an issue.

It would be interesting to have a job that cleans up old orchestration files.

The supportconfig for the admin node would also include this folder with the
rest of the files.

### Current POC

The current POC for reactors can be seen here:

* [salt](https://github.com/kubic-project/salt/pull/209)
* [velum](https://github.com/kubic-project/velum/pull/299)
* [caasp-container-manifests](https://github.com/kubic-project/caasp-container-manifests/pull/100)
* [automation](https://github.com/kubic-project/automation/pull/117)

Please note that this POC does not include the logic described in the [Missing
events](#missing-events) section, as it just uses `runner.http.query` to notify
Velum's API.

### Dependencies

This is a purely internal change, and it should not have any impact on other
components and/or teams.

### Concerns and Unresolved Questions

No known concerns or unresolved questions.

## Alternatives

An alternative approach to the missing events problem was explored in the past:
writing a reconciliation loop in Velum that would keep everything up to date.
However, some problems arose when exploring this solution. Special logic needs
to be maintained to detect specific cases, for example:

* Detect missing minions
  * List minions using `salt-api`
  * If a minion is missing from the database, create it
* Detect missing highstates
* Detect missing orchestration results
  * Especially critical: querying the `salt-api` afterwards would not return
    the whole orchestration output, so the full result of the orchestration
    would not be known.

While the reconciliation loop that was included in v3 to detect the update
status of machines is handy for that purpose, it's not easy to maintain
parallel logic that performs the same actions as the API endpoints with
slightly different behavior, along with all the side effects that can bring.

This is why the solution presented in the [Missing events](#missing-events)
section looks superior to a reconciliation loop for detecting missing events:
salt will just make sure that those events are actually delivered to Velum
(waiting for a successful HTTP response code); otherwise, it will be salt that
keeps retrying until Velum has seen the new events.

## Revision History

| Date | Comment |
|:-----------|:--------------|
| 2018-06-27 | Initial Draft |
rfc-template.md

# Title

| Field | Value |
|:-------|:-----------|

In this section of the document the target audience is the dev team. Upon
reading this section each engineer should have a rather clear picture of what
needs to be done in order to implement the described feature.

### Problem description (Detailed)

This section goes into more detail on the Problem description from the
introduction, if it applies.

### Proposed change (Detailed)
