BIOS/Firmware concept #117

aobort · 2024-09-01T19:36:12Z

Proposed Changes

This is the design concept document for the BIOS/firmware update solution.

Fixes #99

Signed-off-by: Artem Bortnikov <[email protected]>

damyan

@aobort Just a small formatting thing I stumbled upon when reading

damyan · 2024-09-02T08:11:19Z

docs/concepts/firmware-update/rfd.md

+#### DiscoveredFirmware
+
+`DiscoveredFirmware` CR represents discovered firmware versions for a specific manufacturer-model.
+The `.spec` of this type contains information about manufacture, concrete model,the desired number of versions to


The .spec of this type contains information about manufacture, concrete model, the desired number of versions to

Yeah, hard wrapping forced by IDE. Hope now it's better.

afritzler · 2024-09-02T08:38:22Z

Thanks @aobort! Without having looked at the content yet: can we change the proposal structure to follow something like https://github.com/gardener/gardener/blob/master/docs/proposals/00-template.md? We are using the same template in the ironcore project as well: https://github.com/ironcore-dev/ironcore/blob/main/docs/proposals/00-template.md.

Nuckal777

Thanks for the proposal. ☀️

Nuckal777 · 2024-09-02T15:13:21Z

docs/proposals/01-firmware-update.md

+  name: bar-group
+spec:
+  manufacturer: Lenovo
+  model: 7x21


How are Servers identified by model? The current api only mentions the manufacturer.

Unfortunately, I do not have an answer. For Lenovo, we can get the server's model from the SKU. For other manufacturers, I have no idea yet; maybe there is an API from which it could be retrieved.
The reasoning behind needing the server's model is that different models could use different hardware and potentially different firmware versions. Hence, to be able to automate the process, we need to deal with it somehow.

Nuckal777 · 2024-09-02T15:14:53Z

docs/proposals/01-firmware-update.md

+    s5 --> [*]
+```
+
+### Update service


Is the update service stateful? If yes, its state should likely be stored in CRDs.

It's supposed to be stateless.

Nuckal777 · 2024-09-02T15:16:35Z

docs/proposals/01-firmware-update.md

+
+Scheduler SHOULD have an embedded job runner component corresponding to the update strategy defined on the update service application start.
+
+#### Job runner


Would it be sufficient, if the scheduler spawns Kubernetes jobs?

Yep, the idea is to leverage Kubernetes jobs. But the application that the job executes would depend on the update strategy, for instance:

- when using the vendor's CLI tool, the app inside the job will execute that tool;
- when using a prepared boot image, the app inside the job will create a boot config and patch the server object to use it;
- etc.

Understood.
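
The strategy-dependent job payload discussed in this thread could be sketched roughly as follows. This is only an illustration in Python: the strategy names `vendor-cli` and `boot-image`, the function names, and the returned step descriptions are all hypothetical and do not come from the proposal.

```python
from typing import Callable, Dict, List


def vendor_cli_steps(server: str) -> List[str]:
    # Hypothetical strategy: the app inside the job runs the vendor's CLI tool.
    return [f"run vendor CLI tool against {server}"]


def boot_image_steps(server: str) -> List[str]:
    # Hypothetical strategy: the app inside the job prepares a boot config
    # and patches the server object to use it.
    return [
        f"create boot configuration for {server}",
        f"patch server object {server} to use the boot configuration",
    ]


# The update strategy is fixed when the update service starts (per the proposal);
# this mapping is an assumed way of selecting the job payload.
STRATEGIES: Dict[str, Callable[[str], List[str]]] = {
    "vendor-cli": vendor_cli_steps,
    "boot-image": boot_image_steps,
}


def plan_job(strategy: str, server: str) -> List[str]:
    """Return the steps the job would execute for the given strategy."""
    try:
        handler = STRATEGIES[strategy]
    except KeyError:
        raise ValueError(f"unknown update strategy: {strategy}") from None
    return handler(server)
```

In a real implementation each step would be executed inside a Kubernetes job; here the steps are plain strings so the dispatch logic stays visible.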

Nuckal777 · 2024-09-04T12:47:08Z

Do you envision the update service's API, scheduler and job runner as independently deployed units? In my opinion, I would mash them into the firmware-operator until scaling issues are actually encountered.

aobort · 2024-09-04T14:47:51Z

> Do you envision the update service's API, scheduler and job runner as independently deployed units? In my opinion, I would mash them into the firmware-operator until scaling issues are actually encountered.

Exactly: they are internal components of one service, like several controllers in one operator.

afritzler

This is how I understand the BIOS and Firmware update process (rough sketch):

graph TD
    subgraph ServerObject
        Server --> MaintenanceState
        Server --> ServerBootConfiguration
    end

    ServerBIOS --> Server
    ServerFirmwares --> Server

    subgraph ExternalUpdate
        ExternalUpdateBIOS --> ServerBIOS
        ExternalUpdateFirmwares --> ServerFirmwares
    end

    subgraph Controllers
        ServerBIOSController -.-> ServerBIOS
        ServerFirmwaresController -.-> ServerFirmwares
    end

    ServerBIOSController -->|Detects Update| MaintenanceState
    ServerFirmwaresController -->|Detects Update| MaintenanceState

    ServerBIOSController -->|Create| ServerBootConfiguration
    ServerFirmwaresController -->|Create| ServerBootConfiguration

    ServerBootConfiguration -->|Boots into update mode| BIOSUpdateProcess
    BIOSUpdateProcess --> ServerBIOS

    ServerBootConfiguration -->|Boots into update mode| FirmwareUpdateProcess
    FirmwareUpdateProcess --> ServerFirmwares

We have two top-level resources, ServerBIOS and ServerFirmwares. An update to one of those resources (e.g. ExternalUpdateBIOS, initiated by a user) will trigger the ServerBIOS or ServerFirmwares reconciler to transition the Server into the Maintenance state. Once this happens, the reconciler will create a ServerBootConfiguration containing an Ignition + OS configuration that boots the Server into an update mode in which the BIOS or firmware is patched. A similar approach is used during the Discovery phase, where we provision the metalprobe agent. Once the update is successful (we still need to define how to determine this), we update the ServerBIOS and/or ServerFirmwares status with the applied versions.

afritzler · 2024-09-06T12:16:19Z

docs/proposals/01-firmware-update.md

+      manufacturer: Intel
+      version: 2.0.0
+status:
+  lastScanTime: 01-01-2001 01:00:00


Can we use a transition condition for this instead of having a dedicated status field?

The idea was not to make scans conditional, but to make sure that a scan is launched if the previous run was long enough ago. Hence, if we rely on the condition's transition timestamp, there is no difference compared with a dedicated field. However, it would require some additional computation of the server's state: in the proposed design, the status is updated by the update server after the scan job reports its results. Therefore, to set a proper condition, the update server would also have to know the desired state and compute the difference between the firmware discovered by the scan job and the desired firmware defined in the object's spec, instead of just updating the status with a timestamp and the firmware.
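
The "launch a scan if the previous run was long enough ago" check can be illustrated with a minimal Python sketch. It assumes the dedicated `lastScanTime` status field and the `scanThreshold` spec field from the quoted diff; the helper name `scan_due` and the use of timezone-aware `datetime` values are assumptions made for this example only.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def scan_due(
    last_scan_time: Optional[datetime],
    scan_threshold: timedelta,
    now: datetime,
) -> bool:
    """Return True when a new scan should be scheduled.

    last_scan_time mirrors status.lastScanTime (None if never scanned),
    scan_threshold mirrors spec.scanThreshold.
    """
    if last_scan_time is None:
        # No scan has been recorded yet, so one is due immediately.
        return True
    return now - last_scan_time >= scan_threshold


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(scan_due(now - timedelta(minutes=45), timedelta(minutes=30), now))
```

With a dedicated field the check needs only the timestamp and the threshold, which is the point made above; deriving it from a condition would additionally require the update server to compute the condition itself.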

afritzler · 2024-09-06T12:18:00Z

docs/proposals/01-firmware-update.md

+metadata:
+  name: foo
+spec:
+  scanThreshold: 30m


Does scanThreshold mean a re-sync period? If so, we might think of a better name here.

Yep, the naming could be better, for sure.

afritzler · 2024-09-06T12:19:16Z

docs/proposals/01-firmware-update.md

+  scanThreshold: 30m
+  serverRef:
+    name: foo
+  bios:


Do we maybe want to incorporate a vendor/manufacturer in this struct as well? I know the Server via the ref should have this information, but it might make sense to have it here as well. Wdyt?

Since the update service will request the Server object anyway, at least to get the related BMC for the access type and credentials, I think it's not necessary to store the manufacturer and model in this resource. But I have no strong opinion on that, since it would not affect anything.

afritzler · 2024-09-06T12:21:17Z

docs/proposals/01-firmware-update.md

+      version: 2.0.0
+```
+
+#### ServerFirmwareGroup


Do we really need a grouping at this point already? Maybe we should start with the Firmware/BIOS version handling on an individual Server level by using the ServerFirmware resource defined above. We could generalize/add a higher level construct later on top.

afritzler · 2024-09-06T12:23:29Z

docs/proposals/01-firmware-update.md

+
+```yaml
+apiVersion: metal.ironcore.dev/v1alpha1
+kind: ServerFirmware


Alternative proposal here: how about splitting this resource into a ServerBios and a ServerFirmwares resource? Does it make sense to update the BIOS independently from the firmware of individual components?

From my perspective, it might make sense only if there are completely different workflows for updating the BIOS and other firmware. Otherwise we'll just duplicate controllers and double the number of CRs.

afritzler · 2024-09-06T12:28:28Z

docs/proposals/01-firmware-update.md

+  updatesNotApplied: 1
+```
+
+#### DiscoveredFirmware


Do we really want to automatically "discover" new versions of a firmware? Or would it be better to have this instructed from the outside? For example: I know my ServerFirmwares are xyz and now I want to upgrade to version zyx, which I would then do by updating the ServerFirmwares CR, and the machinery should ensure that.

I bet we do. At the very least, ops should be able to track new versions and, for instance, check whether any of them relate to critical security issues.

defo89 · 2024-09-09T06:52:35Z

docs/proposals/01-firmware-update.md

+- it MUST NOT allow running several jobs on the same target server simultaneously;
+- it MAY discard incoming update or scan requests if the same jobs targeting the same server are already scheduled;
+- it MAY discard incoming discovery requests if the job targeting the same manufacturer-model pair is already scheduled;
+- it MUST have a mechanism to limit the number of parallel jobs;


Did you consider how this would work together with #76? For example, if we schedule a firmware upgrade, how do we signal this intent to the Workload or Management cluster?

Related to the above, we should get some kind of health status from the Workload cluster when performing an update on a set of servers one by one. For example, when the first server upgrade is finished, we ensure that the cluster is healthy before proceeding with the next server.

The idea is to set the "Maintenance" state on the server where updates are scheduled. As for upgrading a cluster without killing it: this might be implemented in the update scheduler, if there is an API that could help determine whether the current server is a cluster member.
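
The scheduler rules quoted in the diff above (no simultaneous jobs on one server, optional discarding of duplicate requests, a cap on parallel jobs) could be sketched like this. The class below is a hypothetical in-memory illustration in Python: the names `Scheduler`, `submit`, `finish` and the returned strings are assumptions, and it deliberately ignores the cluster-health question raised in this thread.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Scheduler:
    """Hypothetical admission logic for update and scan jobs."""

    max_parallel: int
    running: Dict[str, str] = field(default_factory=dict)  # server -> job kind
    queue: List[Tuple[str, str]] = field(default_factory=list)  # (server, kind)

    def submit(self, server: str, kind: str) -> str:
        # MAY discard: the same job for the same server is already known.
        if (server, kind) in self.queue or self.running.get(server) == kind:
            return "discarded"
        # MUST NOT run several jobs on one server; MUST limit parallel jobs.
        if server in self.running or len(self.running) >= self.max_parallel:
            self.queue.append((server, kind))
            return "queued"
        self.running[server] = kind
        return "started"

    def finish(self, server: str) -> None:
        # Free the server, then start queued jobs in FIFO order while
        # capacity allows and their target server is idle.
        self.running.pop(server, None)
        for job in list(self.queue):
            if len(self.running) >= self.max_parallel:
                break
            target, kind = job
            if target not in self.running:
                self.queue.remove(job)
                self.running[target] = kind
```

A health gate between servers, as requested above, would fit into `finish`: the next queued job would only be started after some cluster health check passes. How that check is obtained is left open here, as it is in the discussion.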

defo89 · 2024-09-09T06:55:30Z

docs/proposals/01-firmware-update.md

+- CancelTask(CancelTaskRequest) CancelTaskResponse;
+  - `CancelTaskRequest` MAY contain the timeout for graceful stop and a flag to force stop.
+  - `CancelTaskResponse` MUST contain the status of the request with error code if any.
+


Could we use this to pause upgrades if we find out that the firmware has issues and investigation is needed before proceeding with the next servers, or if issues in the cluster are noticed and we want to halt the upgrade process?

This is the goal of the proposed server grouping: first run updates on a dedicated test group. If all is OK, then run updates on the prod servers.

afritzler · 2024-09-19T09:20:11Z

From our discussion yesterday I guess it would make sense to split the BIOS from the Firmware update flow. It might actually make sense to also split it API wise. Wdyt?

defo89 · 2024-09-19T11:57:14Z

+1 for splitting the BIOS from the firmware update flow, plus separate CRDs for this. Going further, should we also split out the BMC version update?

afritzler · 2024-09-19T12:56:15Z

> +1 for splitting the BIOS from the firmware update flow, plus separate CRDs for this. Going further, should we also split out the BMC version update?

That is a good point. The BIOS update of the BMC should be addressed as well. Not sure if we should do this in this concept OR just focus here on the Server resource first.

aobort · 2024-09-20T08:18:25Z

> From our discussion yesterday I guess it would make sense to split the BIOS from the Firmware update flow. It might actually make sense to also split it API wise. Wdyt?

Tbh, I do not see any advantage in separating BIOS and firmware. If the BIOS and firmware might be incompatible, it makes sense to keep them in one object and provide some kind of compatibility matrix to make validation convenient. Storing them in separate objects would just create additional load on the API server once the amount of hardware is large enough, plus additional objects to store in etcd.

aobort · 2024-09-27T20:12:53Z

@afritzler @defo89

There is a list of BIOS settings in Server's .spec.BIOS;
There are the current BIOS settings in Server's .status.BIOS.

The current BIOS version, taken from Server's .status.BIOS.version, is reflected in the ServerFirmware object's .spec.bios.version. Now the question: which object should be updated to trigger a BIOS update?

If we update the version in the ServerFirmware object, then we need to update the corresponding Server object's spec, but we cannot, because we also need to specify the BIOS settings. If we update Server's status, it will result in a discrepancy between spec and status.

If we update the BIOS version (with settings) in Server's .spec.bios, which will trigger the BIOS update job, then it seems to make no sense to also store the BIOS version in the ServerFirmware object.

Considering all of the above, I'd suggest to:

1. Introduce a ServerBIOS API type:

apiVersion: ...
kind: ServerBIOS
metadata:
  name: sample
spec:
  serverRef:
    name: compute-1
  versions:
  - version: 1.0.0
    settings: {}  # map of bios settings
  - version: 1.2.0
    settings: {}  # map of bios settings
    currentVersion: true
status:
  version: 1.2.0
  settings: {}  # map of bios settings

2. Replace the BIOS settings in Server's .spec.bios with a reference to the ServerBIOS object.
3. Implement a separate controller which will manage both BIOS version and settings.

All in all, separating only the BIOS update flow from the firmware update flow still seems to make no sense from my point of view. However, separating the whole of BIOS management, including version and settings, from the server management and firmware management flows seems reasonable.
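
To make the suggested ServerBIOS shape more concrete, here is a small Python sketch of how a controller might decide whether a BIOS change is pending. It rests on an assumption that the comment does not spell out: that `currentVersion: true` marks the desired entry in `.spec.versions`. The function names `desired_bios` and `needs_update` are hypothetical, and the object is modelled as a plain dictionary.

```python
from typing import Any, Dict, Optional


def desired_bios(spec: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Return the spec.versions entry flagged with currentVersion.

    Assumption: exactly one entry carries the flag; anything else is
    treated as "no desired version" rather than guessed at.
    """
    marked = [v for v in spec.get("versions", []) if v.get("currentVersion")]
    return marked[0] if len(marked) == 1 else None


def needs_update(obj: Dict[str, Any]) -> bool:
    """Compare the desired version and settings against the status."""
    desired = desired_bios(obj.get("spec", {}))
    if desired is None:
        return False
    status = obj.get("status", {})
    return (
        status.get("version") != desired["version"]
        or status.get("settings", {}) != desired.get("settings", {})
    )


# The sample object from the comment above, expressed as a dictionary.
sample = {
    "spec": {
        "serverRef": {"name": "compute-1"},
        "versions": [
            {"version": "1.0.0", "settings": {}},
            {"version": "1.2.0", "settings": {}, "currentVersion": True},
        ],
    },
    "status": {"version": "1.2.0", "settings": {}},
}
```

For the sample as written, spec and status agree, so no update would be triggered; moving the `currentVersion` flag to another entry is what would start a BIOS update under this reading.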

firmware update rfd

7dd9c4c

Signed-off-by: Artem Bortnikov <[email protected]>

aobort added the documentation label Sep 1, 2024

github-actions bot added the size/L label Sep 1, 2024

damyan reviewed Sep 2, 2024

aobort requested review from afritzler and damyan September 2, 2024 11:56

align with proposal template

adcb56d

Signed-off-by: Artem Bortnikov <[email protected]>

aobort force-pushed the issue-99/firmware-update-rfd branch from bfe3f2d to adcb56d Compare September 2, 2024 12:44

Nuckal777 reviewed Sep 2, 2024

afritzler requested changes Sep 6, 2024

defo89 reviewed Sep 9, 2024

update with maintenance state

9695a20

Signed-off-by: Artem Bortnikov <[email protected]>

updated proposal

b00dcd7

Signed-off-by: Artem Bortnikov <[email protected]>

aobort requested review from afritzler, defo89 and Nuckal777 September 20, 2024 08:11

updated

6cd847d

Signed-off-by: Artem Bortnikov <[email protected]>

aobort requested a review from stefanhipfel October 9, 2024 11:02

design doc update

d85db2e

Signed-off-by: Artem Bortnikov <[email protected]>

