-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BIOS/Firmware concept #117
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: Artem Bortnikov <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@aobort Just a small formatting thing I stumbled upon when reading
docs/concepts/firmware-update/rfd.md
Outdated
#### DiscoveredFirmware | ||
|
||
`DiscoveredFirmware` CR represents discovered firmware versions for a specific manufacturer-model. | ||
The `.spec` of this type contains information about manufacture, concrete model,the desired number of versions to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The .spec
of this type contains information about manufacture, concrete model, the desired number of versions to
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, hard wrapping forced by IDE. Hope now it's better.
Thanks @aobort! Without having looked at the content, can we change the proposal structure to something like that https://github.com/gardener/gardener/blob/master/docs/proposals/00-template.md. We are using the same template in the |
Signed-off-by: Artem Bortnikov <[email protected]>
bfe3f2d
to
adcb56d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the proposal. ☀️
docs/proposals/01-firmware-update.md
Outdated
name: bar-group | ||
spec: | ||
manufacturer: Lenovo | ||
model: 7x21 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How are Servers identified by model? The current api only mentions the manufacturer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately, I do not have answer. Regarding Lenovo, we can get server's model from SKU. What about other manufacturers - no idea yet. Maybe some API exists from which it could be retrieved.
Thoughts behind the necessity of knowing the server's model is that different models could use different hardware and potentially different firmware versions. Hence to be able to automate the process we need to somehow deal with it.
docs/proposals/01-firmware-update.md
Outdated
s5 --> [*] | ||
``` | ||
|
||
### Update service |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the update service stateful? If yes, it's state should likely be in CRDs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's supposed to be stateless.
docs/proposals/01-firmware-update.md
Outdated
|
||
Scheduler SHOULD have an embedded job runner component corresponding to the update strategy defined on the update service application start. | ||
|
||
#### Job runner |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be sufficient, if the scheduler spawns Kubernetes jobs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, the idea is to leverage kubernetes jobs. But the application which job would execute would depend on the update strategy, for instance:
- in case of using vendor's CLI tool, the app inside job will execute that tool
- in case of using prepared boot image the app inside job will create boot config and patch server object to use it
- etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Understood.
Do you envision the the update's service API, scheduler and job runner as independently deployed units? In my opinion I would mash them into the firmware-operator, until scaling issues are actually encountered. |
Exactly. Internal components of one service. Like several controllers in one operator |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is how I understand the BIOS and Firmware update process (rough sketch):
graph TD
subgraph ServerObject
Server --> MaintenanceState
Server --> ServerBootConfiguration
end
ServerBIOS --> Server
ServerFirmwares --> Server
subgraph ExternalUpdate
ExternalUpdateBIOS --> ServerBIOS
ExternalUpdateFirmwares --> ServerFirmwares
end
subgraph Controllers
ServerBIOSController -.-> ServerBIOS
ServerFirmwaresController -.-> ServerFirmwares
end
ServerBIOSController --|Detects Update|--> MaintenanceState
ServerFirmwaresController --|Detects Update|--> MaintenanceState
ServerBIOSController --|Create|--> ServerBootConfiguration
ServerFirmwaresController --|Create|--> ServerBootConfiguration
ServerBootConfiguration --|Boots into update mode|--> BIOSUpdateProcess
BIOSUpdateProcess --> ServerBIOS
ServerBootConfiguration --|Boots into update mode|--> FirmwareUpdateProcess
FirmwareUpdateProcess --> ServerFirmwares
We have 2 toplevel resources ServerBIOS
and ServerFirmwares
. An update to those resource (e.g. ExternalUpdateBIOS
, initiated by e.g. a user) will trigger the ServerBIOSReconciler
or ServerFirmwares
reconciler to transition the Server
into the Maintanance
state. Once this happens, those reconcilers will create a ServerBootConfiguration
containing an igintion + OS configuration to boot a Server into an update mode where BIOS or firmwares are patched. A similar approach is happening during the Discovery
phase where we provision the metalprobe
agent. Once the update is successful (we need to define here how we can figure this out) we update the ServerBIOS
and/or ServerFirmwares
status with the applied versions.
docs/proposals/01-firmware-update.md
Outdated
manufacturer: Intel | ||
version: 2.0.0 | ||
status: | ||
lastScanTime: 01-01-2001 01:00:00 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we use a transition condition for this instead of having a dedicated status field?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea was not to make scans conditional, but to make sure that the scan will be launched if previous run was far ago enough. Hence if we'll rely on the condition's transition timestamp there will be no difference comparing with dedicated field. However this will cause the need of some additional computation of the server's state: by proposed design the update of the status will be done by update server after scan job reports it's results. Therefore, to set proper condition the update server will have to know also the desired state and to compute difference between firmware discovered by the scan job and desired firmware defined in object's spec, instead of just updating status with timestamp and firmware.
docs/proposals/01-firmware-update.md
Outdated
metadata: | ||
name: foo | ||
spec: | ||
scanThreshold: 30m |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is by scanThreshold
meant a re-sync period? If so we might think of a better name here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, naming could be better, for sure
scanThreshold: 30m | ||
serverRef: | ||
name: foo | ||
bios: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we maybe want to incorporate a vendor/manufacturer in this struct as well? I know the Server
via the ref should have this information, but it might makes sense to have it here as well. Wdyt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since update service will anyways request for server object - at least to get related bmc for access type and credentials, I think it's not necessary to store manufacturer and model in this resource. But I have no strong opinion on that, since it would not affect anything.
docs/proposals/01-firmware-update.md
Outdated
version: 2.0.0 | ||
``` | ||
|
||
#### ServerFirmwareGroup |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we really need a grouping at this point already? Maybe we should start with the Firmware/BIOS version handling on an individual Server
level by using the ServerFirmware
resource defined above. We could generalize/add a higher level construct later on top.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, sure.
docs/proposals/01-firmware-update.md
Outdated
|
||
```yaml | ||
apiVersion: metal.ironcore.dev/v1alpha1 | ||
kind: ServerFirmware |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alternative proposal here: How about splitting this resource into a ServerBios
and ServerFirmwares
. Does it make sense to update the BIOS independently from the Firmware of individual components?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From my perspective it might make sense only if there will be completely different workflows to update BIOS and other firmware. Otherwise we'll just duplicate controllers and double CRs.
docs/proposals/01-firmware-update.md
Outdated
updatesNotApplied: 1 | ||
``` | ||
|
||
#### DiscoveredFirmware |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we really want to automatically "discover" new versions of a Firmware? Or should it be better instructed from the outside: Like I know my ServerFirmwares
are xyz and now I want to upgrade to version zyx which I would then do via updating the ServerFirmwares
CR and the machinery should ensure that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I bet we want. At least ops should be able to track new versions and for instance check if there are anything related to critical security issues.
docs/proposals/01-firmware-update.md
Outdated
- it MUST NOT allow running several jobs on the same target server simultaneously; | ||
- it MAY discard incoming update or scan requests if the same jobs targeting the same server are already scheduled; | ||
- it MAY discard incoming discovery requests if the job targeting the same manufacturer-model pair is already scheduled; | ||
- it MUST have a mechanism to limit the number of parallel jobs; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you consider how this would work together with #76 ? Like, if we schedule a Firmware Upgrade how do we signal this intent to the Workload or Management cluster?
Related to above, we should get some kind of health status from Workload cluster when performing an update on a set of servers one by one. Like, when first server upgrade is finished we ensure that cluster is healthy
before proceeding with next server.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea is to set "Maintenance" state for the server where updates are scheduled. What about upgrading of the cluster, so not to kill it - this might be implemented in update scheduler if there is any API which could help to determine whether current server is a cluster member.
- CancelTask(CancelTaskRequest) CancelTaskResponse; | ||
- `CancelTaskRequest` MAY contain the timeout for graceful stop and a flag to force stop. | ||
- `CancelTaskResponse` MUST contain the status of the request with error code if any. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we use this to pause upgrades if we find out that firmware has issues and investigation is needed before proceeding with next servers OR issues in cluster are noticed and we want to halt the upgrade process.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the goal for proposed servers grouping - first run updates on dedicated test group. If all is ok, then run updates on prod servers.
Signed-off-by: Artem Bortnikov <[email protected]>
From our discussion yesterday I guess it would make sense to split the BIOS from the Firmware update flow. It might actually make sense to also split it API wise. Wdyt? |
|
That is a good point. The BIOS update of the BMC should be addressed as well. Not sure if we should do this in this concept OR just focus here on the |
Signed-off-by: Artem Bortnikov <[email protected]>
Tbh, I do not see any advantage in separating bios and firmware. If the bios and firmwares might be incompatible it makes sense to keep them in one object and provide some kind of compatibility matrix to make validation convenient and handy. If they would be stored in separate objects this will just create additional load on API server, when the amount of hardware will be big enough. Also additional objects to store in etcd. |
Current bios version, taken from If we'll update version in If we'll update BIOS version (with settings) in Considering all in above, I'd suggest to:
In total, separating only BIOS update flow from firmware update flow still seems to make no sense from my point of view. However, separating whole BIOS management including version and settings from server management and firmware management flows seems to be reasonable. |
Signed-off-by: Artem Bortnikov <[email protected]>
Signed-off-by: Artem Bortnikov <[email protected]>
Proposed Changes
This is the design concept document for BIOS/Firmware update solution
Fixes #99