BIOS/Firmware concept #117
Signed-off-by: Artem Bortnikov <artem.bortnikov@telekom.com>
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -19,23 +19,24 @@ reviewers: | |
- [Non-Goals](#non-goals) | ||
- [Proposal](#proposal) | ||
- [Custom resources](#custom-resources) | ||
- [ServerBIOS](#serverbios) | ||
- [ServerFirmware](#serverfirmware) | ||
- [AvailableFirmware](#availablefirmware) | ||
- [Firmware operator](#firmware-operator) | ||
- [Controllers](#controllers) | ||
- [configuration](#configuration) | ||
- [server-controller](#server-firmware-controller) | ||
- [webhooks](#admission-webhooks) | ||
- [serverbios-controller](#serverbios-controller) | ||
- [serverfirmware-controller](#serverfirmware-controller) | ||
- [Alternatives](#alternatives) | ||
|
||
## Summary | ||
|
||
Linked issue: [#99 BIOS/Firmware Update](https://github.com/ironcore-dev/metal-operator/issues/99) | ||
PoC implementation: [#138 PoC: BIOS version & settings management](https://github.com/ironcore-dev/metal-operator/pull/138) (includes only ) | ||
|
||
The following is a concept of a solution that aims to solve the listed problems regarding BIOS/firmware updates of hardware servers. | ||
The following sections describe: | ||
|
||
- Kubernetes API types, which represent servers' firmware state; | ||
- Kubernetes operator, which reconciles these API types; | ||
- Kubernetes controllers, which reconcile these API types; | ||
|
||
Throughout this document, the words used to define the significance of particular requirements are capitalized: | ||
|
||
|
@@ -47,10 +48,8 @@ Throughout this document, the words are used to define and the significance of p | |
|
||
Throughout this document, the following terminology is used: | ||
|
||
- `firmware operator`: the application running as a workload in a Kubernetes cluster and interacting with the Kubernetes API. It reconciles custom resources (hereafter CR) related to the servers' firmware update workflow; | ||
- `update job`: the execution item that runs a concrete implementation of the BIOS/firmware update routine on the target hardware server; | ||
- `scan job`: the execution item that runs a concrete implementation of scanning the firmware installed on the target hardware server; | ||
- `update strategy`: the path chosen to apply updates, e.g. a pre-built boot image with updates, a Docker image with baked-in updates, a vendor-specific CLI tool, etc.; | ||
- `controller`: the unit that watches a particular Kubernetes resource and executes reconciliation logic; | ||
- `job` or `job executor`: the execution item that runs a concrete implementation of a specific task on the target hardware server. MIGHT be vendor-specific; | ||
|
||
The approach described below separates the vendor-agnostic common workflow from the concrete update job implementations, which might be vendor-specific. | ||
|
||
|
@@ -67,8 +66,7 @@ The following list gives general design goals for BIOS/Firmware updates: | |
|
||
- the solution SHOULD be vendor-agnostic aside from concrete scan/update job implementation; | ||
- the solution SHOULD allow automated lifecycle maintenance of hardware servers' firmware; | ||
- the solution MUST be extensible through plugins for the update strategy; | ||
- the solution MUST be extensible through additional vendor-specific update job implementations; | ||
- the solution MUST be extensible through additional vendor-specific job implementations; | ||
- the solution SHOULD be as kubernetes-native as possible; | ||
|
||
### Non-Goals | ||
|
@@ -77,33 +75,77 @@ The following list gives general design goals for BIOS/Firmware updates: | |
|
||
### Custom resources | ||
|
||
The following CRs aim to represent the current state of a particular server and the available firmware versions for a particular manufacturer-model: | ||
The following CRs aim to represent the current state of a particular server: | ||
|
||
- [ServerBIOS](#serverbios) | ||
- [ServerFirmware](#serverfirmware) | ||
- [AvailableFirmware](#availablefirmware) | ||
|
||
All the following CRs MUST be cluster-scoped. | ||
All these CRs MUST be cluster-scoped. | ||
|
||
#### ServerBIOS | ||
|
||
`ServerBIOS` CR represents the desired BIOS version and settings of a concrete hardware server. | ||
The `.spec` of this type contains: | ||
|
||
- the reference to the `Server` object; | ||
- desired BIOS version and BIOS settings; | ||
- the duration in minutes after which the information listed in the object's `.status` is considered outdated; | ||
|
||
The `.status` of this type contains: | ||
|
||
- information about the BIOS version and settings which are actually applied; | ||
- the timestamp when this information was updated; | ||
- a reference to the running scan/update job if any. | ||
|
||
```yaml | ||
apiVersion: metal.ironcore.dev/v1alpha1 | ||
kind: ServerBIOS | ||
metadata: | ||
name: foo | ||
spec: | ||
scanPeriodMinutes: 30 | ||
serverRef: | ||
name: bar | ||
bios: | ||
version: 1.0.0 | ||
settings: {} | ||
status: | ||
lastScanTime: 01-01-2001 00:00:00 | ||
bios: | ||
version: 0.1.0 | ||
settings: {} | ||
runningJob: | ||
name: foobar | ||
namespace: default | ||
``` | ||
|
||
The target `Server` object MUST also contain a reference to the `ServerBIOS` object. | ||
The `.status.bios.settings` map MUST contain only keys that exist in the `.spec.bios.settings` map. | ||
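
The key restriction on `.status.bios.settings` can be read as a simple filter over whatever a scan reports. The following is a minimal sketch in Python, used here only for illustration; the function name `filter_reported_settings` and the sample setting names are hypothetical and not part of the proposal.

```python
def filter_reported_settings(desired: dict, scanned: dict) -> dict:
    """Keep only scanned BIOS settings whose keys are present in the desired settings."""
    return {key: value for key, value in scanned.items() if key in desired}


# Hypothetical setting names, used only for illustration.
desired = {"BootMode": "UEFI", "SecureBoot": "Enabled"}
scanned = {"BootMode": "Legacy", "SecureBoot": "Enabled", "NumLock": "On"}

# Only keys declared in the desired settings end up in the status map.
status_settings = filter_reported_settings(desired, scanned)
```

With this reading, the status never grows beyond the settings a user explicitly asked to manage, which keeps the spec/status comparison limited to those keys.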
|
||
#### ServerFirmware | ||
|
||
`ServerFirmware` CR represents the desired state of a concrete hardware server. | ||
The `.spec` of this type references the `Server` object, reflects its `.status.bios` field into the `.spec.bios` field and contains the list of firmwares desired to be installed. | ||
The `.status` of this type contains information about the BIOS/firmware versions which are actually installed on the server. | ||
Aside from that, `.spec` contains the scan threshold and `.status` contains the last scan operation timestamp. | ||
These two fields are required to decide whether scanning for installed firmware is needed. | ||
The `ServerFirmware` object SHOULD be created along with the corresponding `Server` object and MUST be unique across the cluster. | ||
The `.spec` of this type contains: | ||
|
||
- the reference to the `Server` object; | ||
- the list of firmwares desired to be installed; | ||
- the duration in minutes after which the information listed in the object's `.status` is considered outdated; | ||
|
||
The `.status` of this type contains: | ||
|
||
- information about the firmware versions which are actually installed on the server; | ||
- the timestamp when this information was updated; | ||
- a reference to the running scan/update job if any. | ||
|
||
```yaml | ||
apiVersion: metal.ironcore.dev/v1alpha1 | ||
kind: ServerFirmware | ||
metadata: | ||
name: foo | ||
spec: | ||
scanThreshold: 30m | ||
scanPeriodMinutes: 30 | ||
serverRef: | ||
name: foo | ||
bios: | ||
version: 1.0.0 | ||
name: bar | ||
firmwares: | ||
- name: ssd | ||
manufacturer: ACME Corp. | ||
|
@@ -113,119 +155,129 @@ spec: | |
version: 2.0.0 | ||
status: | ||
lastScanTime: 01-01-2001 01:00:00 | ||
bios: | ||
version: 1.0.0 | ||
firmwares: | ||
- name: ssd | ||
manufacturer: ACME Corp. | ||
version: 1.0.0 | ||
- name: nic | ||
manufacturer: Intel | ||
version: 2.0.0 | ||
runningJob: | ||
name: foobar | ||
namespace: default | ||
``` | ||
|
||
#### AvailableFirmware | ||
### Controllers | ||
|
||
`AvailableFirmware` CR represents available firmware versions for a specific manufacturer-model. | ||
The `.spec` of this type contains | ||
- [configuration](#configuration) | ||
- [serverbios-controller](#serverbios-controller) (reconciles `ServerBIOS` CR) | ||
- [serverfirmware-controller](#serverfirmware-controller) (reconciles `ServerFirmware` CR) | ||
|
||
- manufacturer | ||
- model | ||
- the desired number of versions to store | ||
- the list of firmwares and their versions available for specified manufacturer-model pair | ||
#### Configuration | ||
|
||
Each entry represents the name of individual firmware and the list of available versions sorted in ascending order. | ||
The maximum length of this list MUST NOT exceed the value defined in `.spec.versionsHistory`. | ||
In case of automated object creation, the `AvailableFirmware` object SHOULD be created as soon as a new manufacturer-model pair is discovered. | ||
The `AvailableFirmware` object MUST be unique across the cluster based on the manufacturer-model pair. | ||
The solution MUST provide a flexible yet transparent way to configure job runners. The minimal configuration: | ||
|
||
```yaml | ||
apiVersion: metal.ironcore.dev/v1alpha1 | ||
kind: AvailableFirmware | ||
metadata: | ||
name: baz | ||
spec: | ||
manufacturer: Lenovo | ||
model: 7x21 | ||
versionsHistory: 3 | ||
bios: | ||
versions: [1.0.0] | ||
firmwares: | ||
- name: ssd | ||
manufacturer: ACME Corp. | ||
version: [1.0.0, 1.1.0, 1.2.0] | ||
- name: nic | ||
manufacturer: Intel | ||
version: [1.5.0, 1.7.0, 2.0.0] | ||
status: {} | ||
``` | ||
- MUST include the container image to be run as the job executor; | ||
- SHOULD include a specific `ServiceAccount` reference to be used by the job executor to get and update cluster resources; | ||
- SHOULD include a specific namespace in which job executors will run; | ||
- MAY specify where to get updated versions to install; | ||
- MAY include a reference to a specific configuration; | ||
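
The MUST/SHOULD/MAY levels of the job runner configuration could be checked as follows. This is a sketch under stated assumptions, written in Python for illustration only: the class `JobRunnerConfig`, its field names, and the example image reference are hypothetical. A missing MUST item is reported as an error and a missing SHOULD item as a warning.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class JobRunnerConfig:
    """Hypothetical shape of the minimal job runner configuration."""
    image: str = ""                          # MUST: container image of the job executor
    service_account: Optional[str] = None    # SHOULD: ServiceAccount used by the job executor
    namespace: Optional[str] = None          # SHOULD: namespace in which job executors run
    firmware_source: Optional[str] = None    # MAY: where to get updated versions
    config_ref: Optional[str] = None         # MAY: reference to a specific configuration

    def validate(self) -> Tuple[List[str], List[str]]:
        """Return (errors, warnings): MUST violations are errors, SHOULD gaps are warnings."""
        errors: List[str] = []
        warnings: List[str] = []
        if not self.image:
            errors.append("image: container image for the job executor is required")
        if not self.service_account:
            warnings.append("service_account: not set")
        if not self.namespace:
            warnings.append("namespace: not set")
        return errors, warnings


# Example image reference is a placeholder, not a real artifact.
cfg = JobRunnerConfig(image="registry.example/job-executor:latest")
errors, warnings = cfg.validate()
```

Whichever delivery mechanism is chosen (command-line arguments, `ConfigMap`, or a custom resource), the same validation rules would apply; only the place where they are enforced changes.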
|
||
### Firmware operator | ||
There are a number of approaches that can be used to provide the configuration. | ||
|
||
This is an application that watches and reconciles CRs listed in the previous section. | ||
It consists of the following controllers: | ||
##### Command-line arguments | ||
|
||
- [server-firmware-controller](#server-firmware-controller) (reconciles `ServerFirmware` CR) | ||
The necessary configuration is provided via command-line arguments at controller start. | ||
|
||
#### Configuration | ||
PROS: | ||
|
||
Operator's configuration: | ||
CONS: | ||
- a controller restart is required to change the configuration; | ||
- a complex configuration will lead to messy command-line arguments; | ||
|
||
- MUST contain update strategy, i.e.: | ||
- "BootFromImage", server boots from prepared boot image with update tool; | ||
- "RedFish", updates are installed remotely using redfish API; | ||
- etc.; | ||
##### ConfigMap | ||
|
||
Update strategy entries MUST be mutually exclusive; | ||
- Update strategy entry MUST contain mapping for vendor and boot image, mapping for vendor and job executor image, etc., depending on strategy; | ||
- MAY contain source of the bios/firmware updates; | ||
The necessary configuration is provided via a native Kubernetes `ConfigMap`. | ||
The `ConfigMap` SHOULD be referenced using a command-line argument. | ||
|
||
#### server-firmware-controller | ||
PROS: | ||
- reading the configuration right before use allows reconfiguration without a controller restart; | ||
|
||
This controller reconciles `ServerFirmware` CR. | ||
When an object of this kind is being reconciled, the controller MUST invoke a scan job if the time elapsed since `.status.lastScanTime` exceeds `.spec.scanThreshold`. | ||
The scan job MUST update the corresponding `ServerFirmware` object's `.status` with the installed firmware versions. | ||
After the object is updated, the controller computes the difference between the desired state defined in the object's `.spec` and the actual state reflected in the object's `.status`. | ||
If a discrepancy is observed between these two states, `server-firmware-controller` MUST set the **"Maintenance"** state for the target server and invoke an update job. | ||
After invoking any of the mentioned job types, `server-firmware-controller` MUST stop reconciliation by returning an empty result together with the error, if any, or an empty result and a nil value otherwise. | ||
The invoked jobs depend on the chosen update strategy and its configuration provided to the operator. | ||
|
||
Reconciliation workflow when scan required: | ||
|
||
```mermaid | ||
sequenceDiagram | ||
request ->>+reconciler: start reconciliation | ||
reconciler ->>+scan-phase: check scan time | ||
scan-phase ->>+invoke-job: scan time exceeded threshold | ||
invoke-job ->>+job: run scan job | ||
invoke-job -->>exit: stop reconciliation | ||
job ->>-request: scan job completed and updates object | ||
``` | ||
CONS: | ||
- schemaless nature of `ConfigMap` data requires additional validation; | ||
- end users are forced to create `ConfigMap`s with a specific data format; | ||
|
||
Reconciliation workflow when update required: | ||
|
||
```mermaid | ||
sequenceDiagram | ||
request ->>+reconciler: start reconciliation | ||
reconciler ->>+scan-phase: check scan time | ||
scan-phase ->>-reconciler: scan time within threshold | ||
reconciler ->>+update-phase: compare spec and status | ||
update-phase ->>+invoke-job: discrepancy observed | ||
invoke-job ->>+job: run update job | ||
invoke-job -->>exit: stop reconciliation | ||
job ->>-request: update job completed and updates object | ||
``` | ||
##### CustomResource | ||
|
||
The necessary configuration is provided via a custom resource. | ||
|
||
PROS: | ||
- reading the configuration right before use allows reconfiguration without a controller restart; | ||
- easy to implement validation; | ||
- built-in Kubernetes mechanisms, like label selectors, can be used to map configuration to server BIOS/firmware objects; | ||
|
||
CONS: | ||
- necessity to maintain API versions; | ||
|
||
#### Admission webhooks | ||
#### serverbios-controller | ||
|
||
Firmware operator SHOULD implement validating webhooks for provided CRs. | ||
Webhook for `AvailableFirmware` MUST validate: | ||
This controller reconciles `ServerBIOS` CR. | ||
When a `ServerBIOS` object is being reconciled, the controller MUST invoke a scan job if the time elapsed since `.status.lastScanTime` exceeds `.spec.scanPeriodMinutes`. | ||
When the `Job` object is created, the controller MUST update the `ServerBIOS` object's `.status.runningJob` field with a reference to the created job. | ||
The scan job MUST update the corresponding `ServerBIOS` object's status on completion: | ||
|
||
- on CREATE that objects to be created are unique across the cluster; | ||
- update `.status.bios.version` field | ||
- update `.status.bios.settings` field | ||
- update `.status.lastScanTime` field | ||
- remove reference to job in `.status.runningJob` field | ||
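
The status changes a completed scan job is expected to make can be sketched as a single function. This is an illustrative Python sketch only: the name `apply_scan_result` is hypothetical, the status is modeled as a plain dictionary mirroring the YAML example, and `lastScanTime` is assumed to be held as a `datetime`.

```python
from datetime import datetime


def apply_scan_result(status: dict, version: str, settings: dict, now: datetime) -> dict:
    """Return a copy of `status` as a completed scan job would leave it."""
    updated = dict(status)
    # Record the scanned BIOS version and settings.
    updated["bios"] = {"version": version, "settings": dict(settings)}
    # Record when the information was refreshed.
    updated["lastScanTime"] = now
    # The reference to the job is removed on completion.
    updated.pop("runningJob", None)
    return updated


before = {
    "lastScanTime": datetime(2001, 1, 1, 0, 0, 0),
    "bios": {"version": "0.1.0", "settings": {}},
    "runningJob": {"name": "foobar", "namespace": "default"},
}
after = apply_scan_result(before, "0.1.0", {}, datetime(2001, 1, 1, 0, 30, 0))
```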
|
||
Webhook for `ServerFirmware` MUST validate: | ||
When an object contains up-to-date information in the `.status.bios` field, the controller MUST check whether the target `Server` is in the "Available" state. | ||
If the server is not in the "Available" state, reconciliation stops. | ||
Otherwise, the controller MUST compare the desired and current BIOS versions stored in the `.spec.bios.version` and `.status.bios.version` fields respectively. | ||
If the BIOS versions do not match, the controller MUST invoke a BIOS version update job. | ||
When the `Job` object is created, the controller MUST update the `ServerBIOS` object's `.status.runningJob` field with a reference to the created job. | ||
The BIOS version update job MUST update the corresponding `ServerBIOS` object's status on completion: | ||
|
||
- on CREATE that objects to be created are unique across the cluster; | ||
- on UPDATE that object's spec contains only bios/firmware versions listed in corresponding `AvailableFirmware` object; | ||
- update `.status.bios.version` field | ||
- remove reference to job in `.status.runningJob` field | ||
|
||
When a `ServerBIOS` object's desired and current BIOS versions match, the controller MUST compare the desired and current BIOS settings stored in the `.spec.bios.settings` and `.status.bios.settings` fields respectively. | ||
If there is a discrepancy between the desired and current settings, the controller MUST invoke a BIOS settings update job. | ||
When the `Job` object is created, the controller MUST update the `ServerBIOS` object's `.status.runningJob` field with a reference to the created job. | ||
The BIOS settings update job MUST update the corresponding `ServerBIOS` object's status on completion: | ||
|
||
Review comment: Could we use this to pause upgrades if we find out that the firmware has issues and investigation is needed before proceeding with the next servers, or if issues in the cluster are noticed and we want to halt the upgrade process? Reply: This is the goal of the proposed server grouping: first run updates on a dedicated test group. If all is OK, then run updates on prod servers. |
||
- update `.status.bios.settings` field | ||
- remove reference to job in `.status.runningJob` field | ||
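
The decision order described for `serverbios-controller` (scan if stale, stop if the server is not "Available", then version, then settings) can be sketched as a pure function. This is a minimal Python sketch for illustration, not the PoC implementation: the `Action` names and `next_bios_action` are hypothetical, objects are modeled as dictionaries mirroring the YAML example, `lastScanTime` is assumed to be a `datetime`, and waiting while `.status.runningJob` is set is an assumption that the text does not state explicitly.

```python
from datetime import datetime, timedelta
from enum import Enum


class Action(Enum):
    WAIT_FOR_JOB = "wait-for-job"
    SCAN = "scan"
    STOP = "stop"
    UPDATE_VERSION = "update-version"
    UPDATE_SETTINGS = "update-settings"
    NONE = "none"


def next_bios_action(spec: dict, status: dict, server_state: str, now: datetime) -> Action:
    """Pick the next step for a ServerBIOS-like object."""
    # Assumption: do nothing while a previously created job is still referenced.
    if status.get("runningJob"):
        return Action.WAIT_FOR_JOB
    # Scan when the status was never filled or is older than the scan period.
    last_scan = status.get("lastScanTime")
    period = timedelta(minutes=spec["scanPeriodMinutes"])
    if last_scan is None or now - last_scan > period:
        return Action.SCAN
    # Updates are only considered for servers in the "Available" state.
    if server_state != "Available":
        return Action.STOP
    desired = spec["bios"]
    current = status.get("bios", {})
    # The version is handled first; settings are compared once versions match.
    if desired["version"] != current.get("version"):
        return Action.UPDATE_VERSION
    current_settings = current.get("settings", {})
    if any(current_settings.get(k) != v for k, v in desired.get("settings", {}).items()):
        return Action.UPDATE_SETTINGS
    return Action.NONE


# Hypothetical objects; the setting name is illustrative.
spec = {"scanPeriodMinutes": 30, "bios": {"version": "1.0.0", "settings": {"BootMode": "UEFI"}}}
fresh = {"lastScanTime": datetime(2001, 1, 1, 0, 45, 0), "bios": {"version": "0.1.0", "settings": {}}}
now = datetime(2001, 1, 1, 1, 0, 0)
action = next_bios_action(spec, fresh, "Available", now)
```

Keeping the decision separate from job creation makes each branch easy to test without a cluster; the controller would then create the corresponding `Job` and record it in `.status.runningJob`.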
|
||
<details> | ||
<summary>Reconciliation flow diagram</summary> | ||
<img src="../assets/serverbios-controller-flow.jpg" width="50%"> | ||
</details> | ||
|
||
#### serverfirmware-controller | ||
|
||
This controller reconciles `ServerFirmware` CR. | ||
When a `ServerFirmware` object is being reconciled, the controller MUST invoke a scan job if the time elapsed since `.status.lastScanTime` exceeds `.spec.scanPeriodMinutes`. | ||
When the `Job` object is created, the controller MUST update the `ServerFirmware` object's `.status.runningJob` field with a reference to the created job. | ||
The scan job MUST update the corresponding `ServerFirmware` object's status on completion: | ||
|
||
- update `.status.firmwares` field | ||
- update `.status.lastScanTime` field | ||
- remove reference to job in `.status.runningJob` field | ||
|
||
When an object contains up-to-date information in the `.status.firmwares` field, the controller MUST check whether the target `Server` is in the "Available" state. | ||
If the server is not in the "Available" state, reconciliation stops. | ||
Otherwise, the controller MUST compare the desired and current firmware versions stored in the `.spec.firmwares` and `.status.firmwares` fields respectively. | ||
If there is a discrepancy between the desired and current firmware versions, the controller MUST invoke a firmware update job. | ||
When the `Job` object is created, the controller MUST update the `ServerFirmware` object's `.status.runningJob` field with a reference to the created job. | ||
The firmware versions update job MUST update the corresponding `ServerFirmware` object's status on completion: | ||
|
||
- update `.status.firmwares` field | ||
- remove reference to job in `.status.runningJob` field | ||
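
The comparison of `.spec.firmwares` against `.status.firmwares` could look like the sketch below. It is written in Python for illustration only; the function name `firmware_diff` is hypothetical, matching entries by the `(name, manufacturer)` pair is an assumption, and the version values are made up for the example.

```python
from typing import Dict, List, Tuple


def firmware_diff(desired: List[dict], installed: List[dict]) -> List[dict]:
    """Return the desired firmware entries whose version differs from the installed one."""
    # Assumption: a firmware entry is identified by its (name, manufacturer) pair.
    installed_versions: Dict[Tuple[str, str], str] = {
        (fw["name"], fw["manufacturer"]): fw["version"] for fw in installed
    }
    return [
        fw for fw in desired
        if installed_versions.get((fw["name"], fw["manufacturer"])) != fw["version"]
    ]


# Illustrative values only.
desired = [
    {"name": "ssd", "manufacturer": "ACME Corp.", "version": "1.1.0"},
    {"name": "nic", "manufacturer": "Intel", "version": "2.0.0"},
]
installed = [
    {"name": "ssd", "manufacturer": "ACME Corp.", "version": "1.0.0"},
    {"name": "nic", "manufacturer": "Intel", "version": "2.0.0"},
]
pending = firmware_diff(desired, installed)
```

A non-empty result corresponds to the discrepancy case in which a firmware update job is invoked; an entry that is desired but not reported as installed is also treated as pending.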
|
||
<details> | ||
<summary>Reconciliation flow diagram</summary> | ||
<img src="../assets/serverfirmware-controller-flow.jpg" width="50%"> | ||
</details> | ||
|
||
## Alternatives |
Review comment: Do we maybe want to incorporate a vendor/manufacturer in this struct as well? I know the `Server` via the ref should have this information, but it might make sense to have it here as well. Wdyt?

Reply: Since the update service will request the server object anyway (at least to get the related BMC for access type and credentials), I think it's not necessary to store manufacturer and model in this resource. But I have no strong opinion on that, since it would not affect anything.