Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BIOS/Firmware concept #117

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

aobort
Copy link
Contributor

@aobort aobort commented Sep 1, 2024

Proposed Changes

This is the design concept document for BIOS/Firmware update solution

Fixes #99

Signed-off-by: Artem Bortnikov <[email protected]>
@aobort aobort added the documentation Improvements or additions to documentation label Sep 1, 2024
@github-actions github-actions bot added the size/L label Sep 1, 2024
Copy link
Member

@damyan damyan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aobort Just a small formatting thing I stumbled upon when reading

#### DiscoveredFirmware

`DiscoveredFirmware` CR represents discovered firmware versions for a specific manufacturer-model.
The `.spec` of this type contains information about manufacture, concrete model,the desired number of versions to
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The .spec of this type contains information about manufacture, concrete model, the desired number of versions to

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, hard wrapping forced by IDE. Hope now it's better.

@afritzler
Copy link
Member

Thanks @aobort! Without having looked at the content, can we change the proposal structure to something like that https://github.com/gardener/gardener/blob/master/docs/proposals/00-template.md. We are using the same template in the ironcore project as well https://github.com/ironcore-dev/ironcore/blob/main/docs/proposals/00-template.md.

@aobort aobort requested review from afritzler and damyan September 2, 2024 11:56
Signed-off-by: Artem Bortnikov <[email protected]>
@aobort aobort force-pushed the issue-99/firmware-update-rfd branch from bfe3f2d to adcb56d Compare September 2, 2024 12:44
Copy link
Contributor

@Nuckal777 Nuckal777 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the proposal. ☀️

name: bar-group
spec:
manufacturer: Lenovo
model: 7x21
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How are Servers identified by model? The current api only mentions the manufacturer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, I do not have answer. Regarding Lenovo, we can get server's model from SKU. What about other manufacturers - no idea yet. Maybe some API exists from which it could be retrieved.
Thoughts behind the necessity of knowing the server's model is that different models could use different hardware and potentially different firmware versions. Hence to be able to automate the process we need to somehow deal with it.

s5 --> [*]
```

### Update service
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the update service stateful? If yes, it's state should likely be in CRDs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's supposed to be stateless.


Scheduler SHOULD have an embedded job runner component corresponding to the update strategy defined on the update service application start.

#### Job runner
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be sufficient, if the scheduler spawns Kubernetes jobs?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, the idea is to leverage kubernetes jobs. But the application which job would execute would depend on the update strategy, for instance:

  • in case of using vendor's CLI tool, the app inside job will execute that tool
  • in case of using prepared boot image the app inside job will create boot config and patch server object to use it
  • etc.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understood.

@Nuckal777
Copy link
Contributor

Do you envision the the update's service API, scheduler and job runner as independently deployed units? In my opinion I would mash them into the firmware-operator, until scaling issues are actually encountered.

@aobort
Copy link
Contributor Author

aobort commented Sep 4, 2024

Do you envision the the update's service API, scheduler and job runner as independently deployed units? In my opinion I would mash them into the firmware-operator, until scaling issues are actually encountered.

Exactly. Internal components of one service. Like several controllers in one operator

Copy link
Member

@afritzler afritzler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is how I understand the BIOS and Firmware update process (rough sketch):

graph TD
    subgraph ServerObject
        Server --> MaintenanceState
        Server --> ServerBootConfiguration
    end

    ServerBIOS --> Server
    ServerFirmwares --> Server

    subgraph ExternalUpdate
        ExternalUpdateBIOS --> ServerBIOS
        ExternalUpdateFirmwares --> ServerFirmwares
    end

    subgraph Controllers
        ServerBIOSController -.-> ServerBIOS
        ServerFirmwaresController -.-> ServerFirmwares
    end

    ServerBIOSController --|Detects Update|--> MaintenanceState
ServerFirmwaresController --|Detects Update|--> MaintenanceState

ServerBIOSController --|Create|--> ServerBootConfiguration
ServerFirmwaresController --|Create|--> ServerBootConfiguration

ServerBootConfiguration --|Boots into update mode|--> BIOSUpdateProcess
BIOSUpdateProcess --> ServerBIOS

ServerBootConfiguration --|Boots into update mode|--> FirmwareUpdateProcess
FirmwareUpdateProcess --> ServerFirmwares
Loading

We have 2 toplevel resources ServerBIOS and ServerFirmwares. An update to those resource (e.g. ExternalUpdateBIOS, initiated by e.g. a user) will trigger the ServerBIOSReconciler or ServerFirmwares reconciler to transition the Server into the Maintanance state. Once this happens, those reconcilers will create a ServerBootConfiguration containing an igintion + OS configuration to boot a Server into an update mode where BIOS or firmwares are patched. A similar approach is happening during the Discovery phase where we provision the metalprobe agent. Once the update is successful (we need to define here how we can figure this out) we update the ServerBIOS and/or ServerFirmwares status with the applied versions.

manufacturer: Intel
version: 2.0.0
status:
lastScanTime: 01-01-2001 01:00:00
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use a transition condition for this instead of having a dedicated status field?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea was not to make scans conditional, but to make sure that the scan will be launched if previous run was far ago enough. Hence if we'll rely on the condition's transition timestamp there will be no difference comparing with dedicated field. However this will cause the need of some additional computation of the server's state: by proposed design the update of the status will be done by update server after scan job reports it's results. Therefore, to set proper condition the update server will have to know also the desired state and to compute difference between firmware discovered by the scan job and desired firmware defined in object's spec, instead of just updating status with timestamp and firmware.

metadata:
name: foo
spec:
scanThreshold: 30m
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is by scanThreshold meant a re-sync period? If so we might think of a better name here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, naming could be better, for sure

scanThreshold: 30m
serverRef:
name: foo
bios:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we maybe want to incorporate a vendor/manufacturer in this struct as well? I know the Server via the ref should have this information, but it might makes sense to have it here as well. Wdyt?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since update service will anyways request for server object - at least to get related bmc for access type and credentials, I think it's not necessary to store manufacturer and model in this resource. But I have no strong opinion on that, since it would not affect anything.

version: 2.0.0
```

#### ServerFirmwareGroup
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need a grouping at this point already? Maybe we should start with the Firmware/BIOS version handling on an individual Server level by using the ServerFirmware resource defined above. We could generalize/add a higher level construct later on top.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, sure.


```yaml
apiVersion: metal.ironcore.dev/v1alpha1
kind: ServerFirmware
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternative proposal here: How about splitting this resource into a ServerBios and ServerFirmwares. Does it make sense to update the BIOS independently from the Firmware of individual components?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my perspective it might make sense only if there will be completely different workflows to update BIOS and other firmware. Otherwise we'll just duplicate controllers and double CRs.

updatesNotApplied: 1
```

#### DiscoveredFirmware
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really want to automatically "discover" new versions of a Firmware? Or should it be better instructed from the outside: Like I know my ServerFirmwares are xyz and now I want to upgrade to version zyx which I would then do via updating the ServerFirmwares CR and the machinery should ensure that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I bet we want. At least ops should be able to track new versions and for instance check if there are anything related to critical security issues.

- it MUST NOT allow running several jobs on the same target server simultaneously;
- it MAY discard incoming update or scan requests if the same jobs targeting the same server are already scheduled;
- it MAY discard incoming discovery requests if the job targeting the same manufacturer-model pair is already scheduled;
- it MUST have a mechanism to limit the number of parallel jobs;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you consider how this would work together with #76 ? Like, if we schedule a Firmware Upgrade how do we signal this intent to the Workload or Management cluster?

Related to above, we should get some kind of health status from Workload cluster when performing an update on a set of servers one by one. Like, when first server upgrade is finished we ensure that cluster is healthy before proceeding with next server.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea is to set "Maintenance" state for the server where updates are scheduled. What about upgrading of the cluster, so not to kill it - this might be implemented in update scheduler if there is any API which could help to determine whether current server is a cluster member.

- CancelTask(CancelTaskRequest) CancelTaskResponse;
- `CancelTaskRequest` MAY contain the timeout for graceful stop and a flag to force stop.
- `CancelTaskResponse` MUST contain the status of the request with error code if any.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we use this to pause upgrades if we find out that firmware has issues and investigation is needed before proceeding with next servers OR issues in cluster are noticed and we want to halt the upgrade process.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the goal for proposed servers grouping - first run updates on dedicated test group. If all is ok, then run updates on prod servers.

Signed-off-by: Artem Bortnikov <[email protected]>
@afritzler
Copy link
Member

From our discussion yesterday I guess it would make sense to split the BIOS from the Firmware update flow. It might actually make sense to also split it API wise. Wdyt?

@defo89
Copy link
Contributor

defo89 commented Sep 19, 2024

+1 for split the BIOS from the Firmware update flow + separate CRDs for this. Going further, should we also split BMC version update?

@afritzler
Copy link
Member

+1 for split the BIOS from the Firmware update flow + separate CRDs for this. Going further, should we also split BMC version update?

That is a good point. The BIOS update of the BMC should be addressed as well. Not sure if we should do this in this concept OR just focus here on the Server resource first.

Signed-off-by: Artem Bortnikov <[email protected]>
@aobort
Copy link
Contributor Author

aobort commented Sep 20, 2024

From our discussion yesterday I guess it would make sense to split the BIOS from the Firmware update flow. It might actually make sense to also split it API wise. Wdyt?

Tbh, I do not see any advantage in separating bios and firmware. If the bios and firmwares might be incompatible it makes sense to keep them in one object and provide some kind of compatibility matrix to make validation convenient and handy. If they would be stored in separate objects this will just create additional load on API server, when the amount of hardware will be big enough. Also additional objects to store in etcd.

@aobort
Copy link
Contributor Author

aobort commented Sep 27, 2024

@afritzler @defo89

  • There is a list of bios settings in Server's .spec.BIOS;
  • There is current bios settings in Server's .status.BIOS;

Current bios version, taken from Server's .status.BIOS.version, is reflected in ServerFirmware object's .spec.bios.version. Now the question: what object should be updated to trigger BIOS update?

If we'll update version in ServerFirmware object, then we'll need to update corresponding Server object's spec, but we cannot, bc we also need to specify BIOS settings. If we'll update Server's status, then it will result into discrepancy between spec and status.

If we'll update BIOS version (with settings) in Server's .spec.bios which will trigger BIOS update job, then it seems to make no sense to store BIOS version also in ServerFirmware object.

Considering all in above, I'd suggest to:

  1. Introduce ServerBIOS API type

    apiVersion: ...
    kind: ServerBIOS
    metadata:
      name: sample
    spec:
      serverRef:
        name: compute-1
      versions:
      - version: 1.0.0
        settings: {}  # map of bios settings
      - version: 1.2.0
        settings: {}  # map of bios settings
        currentVersion: true
    status:
      version: 1.2.0
      settings: {}. # map of bios settings
  2. Replace bios settings in Server's .spec.bios with reference to the ServerBIOS object

  3. Implement separate controller which will manage both BIOS version and settings

In total, separating only BIOS update flow from firmware update flow still seems to make no sense from my point of view. However, separating whole BIOS management including version and settings from server management and firmware management flows seems to be reasonable.

Signed-off-by: Artem Bortnikov <[email protected]>
@aobort aobort requested a review from stefanhipfel October 9, 2024 11:02
Signed-off-by: Artem Bortnikov <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation size/L
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Draft] BIOS/Firmware Update
5 participants