BIOS/Firmware concept #117

Open
wants to merge 6 commits into
base: main
updated
Signed-off-by: Artem Bortnikov <artem.bortnikov@telekom.com>
aobort committed Oct 7, 2024
commit 6cd847d3611de6d5d1a66f877bf80eb81c9376c3
Binary file added docs/assets/serverbios-controller-flow.jpg
Binary file added docs/assets/serverfirmware-controller-flow.jpg
270 changes: 161 additions & 109 deletions docs/proposals/01-firmware-update.md
@@ -19,23 +19,24 @@ reviewers:
- [Non-Goals](#non-goals)
- [Proposal](#proposal)
- [Custom resources](#custom-resources)
- [ServerBIOS](#serverbios)
- [ServerFirmware](#serverfirmware)
- [AvailableFirmware](#availablefirmware)
- [Firmware operator](#firmware-operator)
- [Controllers](#controllers)
- [configuration](#configuration)
- [server-controller](#server-firmware-controller)
- [webhooks](#admission-webhooks)
- [serverbios-controller](#serverbios-controller)
- [serverfirmware-controller](#serverfirmware-controller)
- [Alternatives](#alternatives)

## Summary

Linked issue: [#99 BIOS/Firmware Update](https://github.com/ironcore-dev/metal-operator/issues/99)
PoC implementation: [#138 PoC: BIOS version & settings management](https://github.com/ironcore-dev/metal-operator/pull/138) (includes only )

The following is a concept of a solution to the listed problems with BIOS/firmware updates of hardware servers.
The following sections walk through:

- Kubernetes API types, which represent servers' firmware state;
- Kubernetes operator, which reconciles these API types;
- Kubernetes controllers, which reconcile these API types;

Throughout this document, the words used to define the significance of particular requirements are capitalized:

@@ -47,10 +48,8 @@ Throughout this document, the words are used to define and the significance of p

Throughout this document, the following terminology is used:

- `firmware operator`: the application running as a workload in Kubernetes cluster, interacting with Kubernetes API. It reconciles custom resources (hereafter CR) related to servers' firmware update workflow;
- `update job`: the execution item, which runs concrete implementation of the BIOS/firmware update routine on target hardware server;
- `scan job`: the execution item, which runs concrete implementation for scanning of the firmware installed on target hardware server;
- `update strategy`: the path chosen to apply updates, e.g.: pre-built boot image with updates, docker image with baked updates, vendor-specific CLI tool, etc.;
- `controller`: the unit that watches a particular Kubernetes resource and executes reconciliation logic;
- `job` or `job executor`: the execution item that runs a concrete implementation of a specific task on the target hardware server. It MAY be vendor-specific;

The approach described below separates the vendor-agnostic common workflow from the concrete update job implementations, which might be vendor-specific.

@@ -67,8 +66,7 @@ The following list gives general design goals for BIOS/Firmware updates:

- the solution SHOULD be vendor-agnostic aside from concrete scan/update job implementation;
- the solution SHOULD allow automated maintenance of the hardware servers' firmware lifecycle;
- the solution MUST be extensible by the possibility of using plugins for update strategy;
- the solution MUST be extensible by the possibility of adding vendor-specific update job implementations;
- the solution MUST be extensible with vendor-specific job implementations;
- the solution SHOULD be as kubernetes-native as possible;

### Non-Goals
@@ -77,33 +75,77 @@ The following list gives general design goals for BIOS/Firmware updates:

### Custom resources

The following CRs aim to represent the current state of a particular server and available firmware versions for a particular manufacturer-model:
The following CRs aim to represent the current state of a particular server:

- [ServerBIOS](#serverbios)
- [ServerFirmware](#serverfirmware)
- [AvailableFirmware](#availablefirmware)

All the following CRs MUST be cluster-scoped.
All these CRs MUST be cluster-scoped.

#### ServerBIOS

`ServerBIOS` CR represents the desired BIOS version and settings of a specific hardware server.
The `.spec` of this type contains:

- the reference to the `Server` object;
- desired BIOS version and BIOS settings;
- the duration in minutes after which the information listed in the object's `.status` is considered outdated;

The `.status` of this type contains:

- information about the BIOS version and settings which are actually applied;
- the timestamp when this information was updated;
- a reference to the running scan/update job if any.

```yaml
apiVersion: metal.ironcore.dev/v1alpha1
kind: ServerBIOS
metadata:
  name: foo
spec:
  scanPeriodMinutes: 30
  serverRef:
    name: bar
  bios:
    # Review comment (Member): Do we maybe want to incorporate a vendor/manufacturer
    # in this struct as well? I know the Server via the ref should have this
    # information, but it might make sense to have it here as well. Wdyt?
    #
    # Reply (Contributor Author): Since the update service will request the server
    # object anyway, at least to get the related BMC for access type and credentials,
    # I think it's not necessary to store manufacturer and model in this resource.
    # But I have no strong opinion on that, since it would not affect anything.
    version: 1.0.0
    settings: {}
status:
  lastScanTime: 01-01-2001 00:00:00
  bios:
    version: 0.1.0
    settings: {}
  runningJob:
    name: foobar
    namespace: default
```

The target `Server` object MUST also contain a reference to the `ServerBIOS` object.
The `.status.bios.settings` map MUST contain only keys that exist in the `.spec.bios.settings` map.
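
One way to satisfy the key constraint above is to filter the scanned settings through the keys of `.spec.bios.settings` before writing the status. The sketch below only illustrates that idea: the function name `project_settings` and the sample setting names are hypothetical and are not part of the proposal.

```python
from __future__ import annotations


def project_settings(desired: dict[str, str], scanned: dict[str, str]) -> dict[str, str]:
    """Keep only the scanned BIOS settings whose keys appear in the desired settings."""
    return {key: value for key, value in scanned.items() if key in desired}


# Hypothetical setting names, used only for illustration.
desired = {"BootMode": "Uefi", "HyperThreading": "Enabled"}
scanned = {"BootMode": "Bios", "HyperThreading": "Enabled", "FanProfile": "Quiet"}

status_settings = project_settings(desired, scanned)
print(status_settings)
```

With this projection, the settings comparison in the controller reduces to a plain equality check between the two maps.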

#### ServerFirmware

`ServerFirmware` CR represents the desired state of a specific hardware server.
The `.spec` of this type references the `Server` object, reflects its `.status.bios` field into `.spec.bios` field and contains the list of firmwares desired to be installed.
The `.status` of this type contains information about the BIOS/firmware versions which are actually installed on the server.
Aside from that `.spec` contains the scan threshold and the `.status` contains last scan operation timestamp.
These two fields required to make decision whether the scanning for installed firmware is required or not.
The `ServerFirmware` object SHOULD be created along with corresponding `Server` object and MUST be unique across the cluster.
The `.spec` of this type contains:

- the reference to the `Server` object;
- the list of firmwares desired to be installed;
- the duration in minutes after which the information listed in the object's `.status` is considered outdated;

The `.status` of this type contains:

- information about the firmware versions which are actually installed on the server;
- the timestamp when this information was updated;
- a reference to the running scan/update job if any.

```yaml
apiVersion: metal.ironcore.dev/v1alpha1
kind: ServerFirmware
metadata:
name: foo
spec:
scanThreshold: 30m
scanPeriodMinutes: 30
serverRef:
name: foo
bios:
version: 1.0.0
name: bar
firmwares:
- name: ssd
manufacturer: ACME Corp.
@@ -113,119 +155,129 @@ spec:
version: 2.0.0
status:
lastScanTime: 01-01-2001 01:00:00
bios:
version: 1.0.0
firmwares:
- name: ssd
manufacturer: ACME Corp.
version: 1.0.0
- name: nic
manufacturer: Intel
version: 2.0.0
runningJob:
name: foobar
namespace: default
```

#### AvailableFirmware
### Controllers

`AvailableFirmware` CR represents available firmware versions for a specific manufacturer-model.
The `.spec` of this type contains
- [configuration](#configuration)
- [serverbios-controller](#serverbios-controller) (reconciles `ServerBIOS` CR)
- [serverfirmware-controller](#serverfirmware-controller) (reconciles `ServerFirmware` CR)

- manufacturer
- model
- the desired number of versions to store
- the list of firmwares and their versions available for specified manufacturer-model pair
#### Configuration

Each entry represents the name of individual firmware and the list of available versions sorted in ascending order.
The maximum length of this list MUST NOT exceed the value defined in `.spec.versionsHistory`.
In case of automated objects creation, the `AvailableFirmware` object SHOULD be created as soon as a new manufacturer-model pair was discovered
The `AvailableFirmware` object MUST be unique across the cluster basing on manufacturer-model pair.
The solution MUST provide a flexible yet transparent way to configure job runners. The minimal configuration:

```yaml
apiVersion: metal.ironcore.dev/v1alpha1
kind: AvailableFirmware
metadata:
name: baz
spec:
manufacturer: Lenovo
model: 7x21
versionsHistory: 3
bios:
versions: [1.0.0]
firmwares:
- name: ssd
manufacturer: ACME Corp.
version: [1.0.0, 1.1.0, 1.2.0]
- name: nic
manufacturer: Intel
version: [1.5.0, 1.7.0, 2.0.0]
status: {}
```
- MUST include container image to be run as job executor;
- SHOULD include specific `ServiceAccount` reference to be used by job executor to get and update cluster resources;
- SHOULD include specific namespace in which job executors will run;
- MAY specify where to get updated versions to install;
- MAY include reference to specific configuration;
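
The requirement levels in the list above could be enforced roughly as follows. This is a minimal sketch under stated assumptions: the class `JobRunnerConfig`, its field names, and the example image reference are hypothetical, and the proposal does not prescribe this shape. MUST violations are reported as errors and missing SHOULD items as warnings.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class JobRunnerConfig:
    """Hypothetical job runner configuration; all field names are assumptions."""

    image: str = ""                    # MUST: container image run as job executor
    service_account: str = ""          # SHOULD: ServiceAccount used by the job executor
    namespace: str = ""                # SHOULD: namespace in which job executors run
    updates_source: str | None = None  # MAY: where to get updated versions to install
    config_ref: str | None = None      # MAY: reference to specific configuration

    def validate(self) -> tuple[list[str], list[str]]:
        """Return (errors, warnings): MUST violations are errors, SHOULD gaps are warnings."""
        errors: list[str] = []
        warnings: list[str] = []
        if not self.image:
            errors.append("image is required")
        if not self.service_account:
            warnings.append("service_account is not set")
        if not self.namespace:
            warnings.append("namespace is not set")
        return errors, warnings


# Example image reference is illustrative only.
errors, warnings = JobRunnerConfig(image="registry.example.com/bios-job:latest").validate()
print(errors, warnings)
```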

### Firmware operator
There are a number of approaches that can be used to provide the configuration.

This is an application that watches and reconciles CRs listed in the previous section.
It consists of the following controllers:
##### Command-line arguments

- [server-firmware-controller](#server-firmware-controller) (reconciles `ServerFirmware` CR)
The necessary configuration is provided through command-line arguments at controller start.

#### Configuration
PROS:

Operator's configuration:
CONS:
- a controller restart is required to change the configuration;
- complex configuration makes the command-line arguments unwieldy;

- MUST contain update strategy, i.e.:
- "BootFromImage", server boots from prepared boot image with update tool;
- "RedFish", updates are installed remotely using redfish API;
- etc.;
##### ConfigMap

Update strategy entries MUST be mutual exclusive;
- Update strategy entry MUST contain mapping for vendor and boot image, mapping for vendor and job executor image, etc., depending on strategy;
- MAY contain source of the bios/firmware updates;
The necessary configuration is provided through a native Kubernetes `ConfigMap`.
The `ConfigMap` SHOULD be referenced using a command-line argument.

#### server-firmware-controller
PROS:
- reading the configuration right before use allows reconfiguration without a controller restart;

This controller reconciles `ServerFirmware` CR.
When an object of this kind is being reconciled, the controller MUST invoke a scan job in case `.status.lastScanTime` exceeds the `.spec.scanThreshold`.
Scan job MUST update corresponding `ServerFirmware` object's `.status` with installed firmware versions.
After the object becomes updated, the controller computes the difference between desired state defined in object's `.spec` and actual state reflected in object's `.status`.
If there is discrepancy observed between these two states, then `server-firmware-controller` MUST set **"Maintenance"** state for target server and invoke an update job.
After invoking any of the mentioned job types, `server-firmware-controller` MUST stop reconciliation by returning an empty result and an error if any, otherwise empty result and nil value.
Invoked jobs depend on chosen update strategy and its configuration provided to operator.

Reconciliation workflow when scan required:

```mermaid
sequenceDiagram
request ->>+reconciler: start reconciliation
reconciler ->>+scan-phase: check scan time
scan-phase ->>+invoke-job: scan time exceeded threshold
invoke-job ->>+job: run scan job
invoke-job -->>exit: stop reconciliation
job ->>-request: scan job completed and updates object
```
CONS:
- the schemaless nature of `ConfigMap` data requires additional validation;
- end-users are forced to create `ConfigMap` objects with a specific data format;

Reconciliation workflow when update required:

```mermaid
sequenceDiagram
request ->>+reconciler: start reconciliation
reconciler ->>+scan-phase: check scan time
scan-phase ->>-reconciler: scan time within threshold
reconciler ->>+update-phase: compare spec and status
update-phase ->>+invoke-job: discrepancy observed
invoke-job ->>+job: run update job
invoke-job -->>exit: stop reconciliation
job ->>-request: update job completed and updates object
```
##### CustomResource

The necessary configuration is provided through a custom resource.

PROS:
- reading the configuration right before use allows reconfiguration without a controller restart;
- validation is easy to implement;
- built-in Kubernetes mechanisms, like label selectors, can be used for mapping between configuration and server BIOS/firmware objects;

CONS:
- necessity to maintain API versions;
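
The label-selector mapping mentioned in the PROS could work along these lines. The sketch assumes equality-based `matchLabels` semantics only; the label keys, configuration names, and the helper `select_config` are hypothetical, and a real controller would rely on the selector support of its Kubernetes client library instead.

```python
from __future__ import annotations


def selector_matches(match_labels: dict[str, str], labels: dict[str, str]) -> bool:
    """Equality-based matching: every selector pair must be present in the labels.

    An empty selector matches every object.
    """
    return all(labels.get(key) == value for key, value in match_labels.items())


def select_config(configs: list[dict], object_labels: dict[str, str]) -> dict | None:
    """Return the first configuration whose selector matches the object's labels."""
    for config in configs:
        if selector_matches(config["matchLabels"], object_labels):
            return config
    return None


# Hypothetical configurations: one vendor-specific, one catch-all.
configs = [
    {"name": "lenovo-jobs", "matchLabels": {"manufacturer": "Lenovo"}},
    {"name": "default-jobs", "matchLabels": {}},
]

chosen = select_config(configs, {"manufacturer": "Lenovo", "model": "7x21"})
print(chosen["name"])
```

Ordering the list from most to least specific lets a catch-all configuration act as a fallback; that ordering rule is a design choice of this sketch, not something the proposal defines.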

#### Admission webhooks
#### serverbios-controller

Firmware operator SHOULD implement validating webhooks for provided CRs.
Webhook for `AvailableFirmware` MUST validate:
This controller reconciles `ServerBIOS` CR.
When a `ServerBIOS` object is being reconciled, the controller MUST invoke a scan job if the time elapsed since `.status.lastScanTime` exceeds `.spec.scanPeriodMinutes`.
When the `Job` object is created, the controller MUST update the `ServerBIOS` object's `.status.runningJob` field with a reference to the created job.
The scan job MUST update the corresponding `ServerBIOS` object's status on completion:

- on CREATE that objects to be created are unique across the cluster;
- update `.status.bios.version` field
- update `.status.bios.settings` field
- update `.status.lastScanTime` field
- remove reference to job in `.status.runningJob` field
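
The scan decision described above can be sketched as a small predicate. The function name `scan_required` is hypothetical, and treating a missing `.status.lastScanTime` as "scan needed" is an assumption of this sketch rather than a rule stated in the proposal.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def scan_required(last_scan_time: Optional[datetime], scan_period_minutes: int, now: datetime) -> bool:
    """Return True when the status was never scanned or is older than the scan period."""
    if last_scan_time is None:
        # Assumption: an object without a recorded scan is treated as outdated.
        return True
    return now - last_scan_time > timedelta(minutes=scan_period_minutes)


now = datetime(2001, 1, 1, 1, 0, tzinfo=timezone.utc)
fresh = datetime(2001, 1, 1, 0, 45, tzinfo=timezone.utc)  # scanned 15 minutes ago
stale = datetime(2001, 1, 1, 0, 0, tzinfo=timezone.utc)   # scanned 60 minutes ago

print(scan_required(fresh, 30, now), scan_required(stale, 30, now))
```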

Webhook for `ServerFirmware` MUST validate:
When the object contains up-to-date information in the `.status.bios` field, the controller MUST check whether the target `Server` is in the "Available" state.
If the server is not in the "Available" state, reconciliation stops.
Otherwise, the controller MUST compare the desired and current BIOS versions stored in the `.spec.bios.version` and `.status.bios.version` fields respectively.
If the BIOS versions do not match, the controller MUST invoke a BIOS version update job.
When the `Job` object is created, the controller MUST update the `ServerBIOS` object's `.status.runningJob` field with a reference to the created job.
The BIOS version update job MUST update the corresponding `ServerBIOS` object's status on completion:

- on CREATE that objects to be created are unique across the cluster;
- on UPDATE that object's spec contains only bios/firmware versions listed in corresponding `AvailableFirmware` object;
- update `.status.bios.version` field
- remove reference to job in `.status.runningJob` field

When a `ServerBIOS` object's desired and current BIOS versions match, the controller MUST compare the desired and current BIOS settings stored in the `.spec.bios.settings` and `.status.bios.settings` fields respectively.
If there is a discrepancy between the desired and current settings, the controller MUST invoke a BIOS settings update job.
When the `Job` object is created, the controller MUST update the `ServerBIOS` object's `.status.runningJob` field with a reference to the created job.
The BIOS settings update job MUST update the corresponding `ServerBIOS` object's status on completion:

Review comment (Contributor): Could we use this to pause upgrades if we find out that firmware has issues and investigation is needed before proceeding with the next servers, or if issues in the cluster are noticed and we want to halt the upgrade process?

Reply (Contributor Author): This is the goal of the proposed server grouping: first run updates on a dedicated test group. If all is OK, then run updates on prod servers.

- update `.status.bios.settings` field
- remove reference to job in `.status.runningJob` field

<details>
<summary>Reconciliation flow diagram</summary>
<img src="../assets/serverbios-controller-flow.jpg" width="50%">
</details>
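
The decision order of the `serverbios-controller` described in this section can be summarized in a small pure function. This is a sketch, not the controller implementation: the `Action` enum, the `BIOS` dataclass, and the `job_running` short-circuit (waiting while `.status.runningJob` is set) are assumptions introduced for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum


class Action(Enum):
    WAIT_FOR_JOB = "wait for the running job"
    SCAN = "invoke scan job"
    STOP = "stop: server is not available"
    UPDATE_VERSION = "invoke BIOS version update job"
    UPDATE_SETTINGS = "invoke BIOS settings update job"
    NONE = "nothing to do"


@dataclass
class BIOS:
    version: str = ""
    settings: dict[str, str] = field(default_factory=dict)


def next_action(spec: BIOS, status: BIOS, scan_is_due: bool,
                server_available: bool, job_running: bool) -> Action:
    """Pick the next step in the order the section describes."""
    if job_running:
        return Action.WAIT_FOR_JOB  # assumption: never start a second job
    if scan_is_due:
        return Action.SCAN
    if not server_available:
        return Action.STOP
    if spec.version != status.version:
        return Action.UPDATE_VERSION
    if spec.settings != status.settings:
        return Action.UPDATE_SETTINGS
    return Action.NONE


# Values taken from the ServerBIOS example: desired 1.0.0, installed 0.1.0.
spec = BIOS(version="1.0.0")
status = BIOS(version="0.1.0")
action = next_action(spec, status, scan_is_due=False, server_available=True, job_running=False)
print(action.value)
```

Keeping the decision free of side effects makes the ordering (scan, availability check, version, settings) easy to unit-test separately from job creation.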

#### serverfirmware-controller

This controller reconciles `ServerFirmware` CR.
When a `ServerFirmware` object is being reconciled, the controller MUST invoke a scan job if the time elapsed since `.status.lastScanTime` exceeds `.spec.scanPeriodMinutes`.
When the `Job` object is created, the controller MUST update the `ServerFirmware` object's `.status.runningJob` field with a reference to the created job.
The scan job MUST update the corresponding `ServerFirmware` object's status on completion:

- update `.status.firmwares` field
- update `.status.lastScanTime` field
- remove reference to job in `.status.runningJob` field

When the object contains up-to-date information in the `.status.firmwares` field, the controller MUST check whether the target `Server` is in the "Available" state.
If the server is not in the "Available" state, reconciliation stops.
Otherwise, the controller MUST compare the desired and current firmware versions stored in the `.spec.firmwares` and `.status.firmwares` fields respectively.
If there is a discrepancy between the desired and current firmware versions, the controller MUST invoke a firmware update job.
When the `Job` object is created, the controller MUST update the `ServerFirmware` object's `.status.runningJob` field with a reference to the created job.
The firmware update job MUST update the corresponding `ServerFirmware` object's status on completion:

- update `.status.firmwares` field
- remove reference to job in `.status.runningJob` field

<details>
<summary>Reconciliation flow diagram</summary>
<img src="../assets/serverfirmware-controller-flow.jpg" width="50%">
</details>
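
The comparison of `.spec.firmwares` against `.status.firmwares` could be sketched as below. Identifying a firmware entry by its `name` and `manufacturer` pair is an assumption of this sketch, the function name `firmware_diff` is hypothetical, and the sample versions are illustrative only.

```python
from __future__ import annotations


def firmware_diff(desired: list[dict[str, str]], installed: list[dict[str, str]]) -> list[dict[str, str]]:
    """Return desired entries whose version differs from, or is missing in, the installed list."""
    installed_versions = {
        (fw["name"], fw["manufacturer"]): fw["version"] for fw in installed
    }
    return [
        fw for fw in desired
        if installed_versions.get((fw["name"], fw["manufacturer"])) != fw["version"]
    ]


# Illustrative data shaped like the ServerFirmware example.
desired = [
    {"name": "ssd", "manufacturer": "ACME Corp.", "version": "1.1.0"},
    {"name": "nic", "manufacturer": "Intel", "version": "2.0.0"},
]
installed = [
    {"name": "ssd", "manufacturer": "ACME Corp.", "version": "1.0.0"},
    {"name": "nic", "manufacturer": "Intel", "version": "2.0.0"},
]

pending = firmware_diff(desired, installed)
print(pending)
```

A non-empty result corresponds to the discrepancy that triggers the firmware update job; an empty list means the desired and installed versions already agree.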

## Alternatives