-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement mass machine type transition job WIP #2
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: prnaraya <[email protected]>
After going back and forth on whether to implement using nested jobs or not, I settled for having a single job, with the calculation of the completion count being done in a go function. There are also some sections still commented out as they are WIP, notably dealing with the actual machine type strings themselves. Signed-off-by: prnaraya <[email protected]>
Signed-off-by: prnaraya <[email protected]>
Signed-off-by: prnaraya <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @prnaraya! This is surely a step forward!
Left some comments below.
informers.go
Outdated
machineType := vm.Spec.Template.Spec.Domain.Machine.Type | ||
machineTypeSubstrings := strings.Split(machineType, "-") | ||
version := machineTypeSubstrings[2] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we sure that every VMI would have a machine type of this format?
I've created a /examples/vmi-fedora.yaml
vmi and saw:
spec:
domain:
machine:
type: q35
In this case this job would crash.
I'd recommend to validate it's the right format.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do see
machine:
type: pc-q35-rhel9.2.0
under status though
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay so after some investigation I think I understand what needs to be done exactly.
Let's take the following as an example:
spec:
domain:
machine:
type: q35
status:
machine:
type: pc-q35-rhel8.2.0
So, what would need to be done is:
- Parse
status.machine.type
, specifically the last bit (split by-
like you did, take the 3rd string, split by.
, take the first string (which in this case is8
), then parse it to a number and check if it's<= 8
, like in this case)*. - If the machine type is of rhel8 or below, update
spec.domain.machine.type
to a RHEL9 version**. - Add the label.
Open questions:
* Can we count on the fact that all machine types at the status level look like pc-q35-rhel8.2.0
? Do we support other machine types with different formats?
** Which RHEL version should we update to? 9.0.0
? 9.2.0
?
informers.go
Outdated
machineTypeSubstrings := strings.Split(machineType, "-") | ||
version := machineTypeSubstrings[2] | ||
|
||
if strings.Contains(version, "rhel") && version < "rhel9.0.0" { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What did you try to achieve with version < "rhel9.0.0"
? Did you mean version != "rhel9.0.0"
?
What if the version is "rhel9.2.0"
or something similar?
Do all machine types have a version?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Basically the requirements for updating the machine types is updating the machine type of any VM with a machine type that is prior to RHEL 9. There are some other formats of machine type I believe so I will need validate to prevent this from crashing though.
informers.go
Outdated
|
||
func addWarningLabel (virtCli kubecli.KubevirtClient, vm *k6tv1.VirtualMachine) { | ||
addLabel := fmt.Sprint(`{"metadata": {"labels": {"restart-vm-required": "true"}}}}}`) | ||
virtCli.VirtualMachine(vm.Namespace).Patch(vm.Name, types.StrategicMergePatchType, []byte(addLabel), &k8sv1.PatchOptions{}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
WDYT about catching an error and logging it if exists?
informers.go
Outdated
|
||
removeLabel := fmt.Sprint(`[{"op": "remove", "path": "metadata/labels/restart-vm-required"}]`) | ||
virtCli.VirtualMachine(vmi.Namespace).Patch(vmi.Name, types.StrategicMergePatchType, []byte(removeLabel), &k8sv1.PatchOptions{}) | ||
needsRestart-- |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This occurs for every VMI that's being deleted.
So it can decrease the counter for a VMI that didn't need any updates in the first place.
main.go
Outdated
// for now this uses the number of vms to be restarted to determine | ||
// when to exit the job rather than seeing if the actual label exists | ||
// on any VMs before exiting the job | ||
|
||
for (needsRestart > 0) { | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would busy-wait for the variable to become zero, which is pretty costly in terms of performance. A better approach would be to sleep for a few seconds between iterations. Or even better, use a channel to indicate when to wake up and exit.
But, currently, this mechanism is broken since the counter decreases when every VMI is deleted, not necessarily a relevant one. I think you need to build a structure of VMIs that should be restarted (in updateMachineTypes
) to really track the relevant ones. Then, when the structure is empty, you can close a channel that would indicate to exit the job.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this was a temporary fix that I implemented just so I had something there to use when testing against a small cluster, but I wasn't exactly sure how to implement this in a better way.
informers.go
Outdated
needsRestart++ | ||
} | ||
|
||
func removeWarningLabel(obj interface{}) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no need to remove a label from a deleted VMI.
I think the main purpose of this event handler is to update that we have one less VMI to track
informers.go
Outdated
vmiInformer.AddEventHandler(cache.ResourceEventHandlerFuncs { | ||
DeleteFunc: removeWarningLabel, | ||
}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
WDYT about adding an event handler for newly created VMIs?
That way if a new VMI is being created with an unsupported machine type it could be added to the list of VMIs that need update, and a label could be added.
Although I'm not entirely sure it's necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could add that but I am also not sure if it is something that we would need.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's skip that for now. we can always add that in follow-ups
informers.go
Outdated
@@ -49,9 +54,12 @@ func updateMachineTypes(virtCli kubecli.KubevirtClient) error { | |||
machineType := vm.Spec.Template.Spec.Domain.Machine.Type | |||
machineTypeSubstrings := strings.Split(machineType, "-") | |||
version := machineTypeSubstrings[2] | |||
if len(machineTypeSubstrings != 3) { | |||
return nil | |||
} | |||
|
|||
if strings.Contains(version, "rhel") && version < "rhel9.0.0" { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to change version < "rhel9.0.0"
.
Basically, we need to check if the x-stream version (9 in this case) is >8. In other words, if it's less or equal to 8 (e.g. 8.4.0) it needs to be updated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure I understand. Since we only want to update machine types with versions prior to the specified version (in this case RHEL 9), wouldn't a lexographical comparison of the versions be sufficient? Unless there is a format of machine type that I am missing.
In this case, "rhel8.4.0" < "rhel9.0.0" returns true, which satisfies the conditions I believe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, you have a point.
As long as we're sure this is indeed always the format, your approach is completely valid. Saying that, it is never too bad to add some more validations, as this can go silently wrong if something changes (what if the string is empty for some reason? or if the format changes slightly, like rhel-8.0.0
instead of rhel8.0.0
).
But again, if you're certain that this is always the format, feel free to leave it as it 👍
and a stop channel for the job to wait until vmiList is empty Signed-off-by: prnaraya <[email protected]>
…running vmis immediately. Also added the base for the namespace and potentially labels flags to implement filtering of VMs in the future. Signed-off-by: prnaraya <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome progress @prnaraya! Well done!
main.go
Outdated
stopCh := make(chan struct{}) | ||
go vmiInformer.Run(stopCh) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can use exitJob
for that purpose too
stopCh := make(chan struct{}) | |
go vmiInformer.Run(stopCh) | |
exitJob = make(chan struct{}) | |
go vmiInformer.Run(exitJob) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will also initialize exitJob
which is never initialized currently
main.go
Outdated
go vmiInformer.Run(stopCh) | ||
|
||
|
||
updateMachineTypes(virtCli) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please handle the error (by either printing it to stdout or logging it)
informers.go
Outdated
return err | ||
} | ||
for _, vm := range vmList.Items { | ||
machineType := vm.Spec.Template.Spec.Domain.Machine.Type |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs to be fetched from status instead of spec
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've tried to fetch this but neither VirtualMachineStatus nor VirtualMachineInstanceStatus have a field for machine types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can find it here. As mentioned in this comment, this is how it looks like:
spec:
domain:
machine:
type: q35
status:
machine:
type: pc-q35-rhel8.2.0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I looked into this a little more, and I believe the reason why the VMI you created has just "q35" in the machine type in spec is because that is the alias for the default machine type (which is the most recent RHEL version I believe) if a machine type is not explicitly specified in the yaml. I created a VM where I explicitly set the machine type, and was able to see the full machine type string in the Spec for both the VM and the VMI. This also works if you modify examples/vmi-fedora.yaml to have an explicitly stated machine type.
That is to say, I think it is still correct to retrieve the machine type from the VM spec, and there already exists logic that will catch any machine types that do not fit the "pc-q35-rhelx.x.x" version, so VMs with a machine type of just "q35" will be ignored.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so VMs with a machine type of just "q35" will be ignored
In my opinion this is a bug.
What if the VMIs I've created were created at a time where 8.2.0
was the default and latest version? Then you're going to ignore these, although they need to be updated.
Am I missing anything?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I see what you mean. I still am not sure how to access the VMI status.machine.type because I just don't see it in any of the documentation anywhere, but it might be good then in the case of the machine type being q35
to update the machine type anyway to be safe.
informers.go
Outdated
) | ||
|
||
// using this as a const allows us to easily modify the program to update if a newer version is released | ||
// we generally want to be updating the machine types to the most recent version | ||
const latestVersion = "rhel9.2.0" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit:
const latestVersion = "rhel9.2.0" | |
const latestMachineTypeVersion = "rhel9.2.0" |
informers.go
Outdated
if err != nil { | ||
return err | ||
} | ||
vmiList = append(vmiList, vm.Name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please all add the VM's namespace, that way it's a unique identifier.
BTW, you have MetaNamespaceKeyFunc
function from Kubernetes (that you can import) that returns that string for you (i.e. <namespace>/<name>
)
informers.go
Outdated
listWatcher := cache.NewListWatchFromClient(virtCli.RestClient(), "virtualmachineinstances", k8sv1.NamespaceAll, fields.Everything()) | ||
vmiInformer := cache.NewSharedIndexInformer(listWatcher, &k6tv1.VirtualMachineInstance{}, 1*time.Hour, cache.Indexers{}) | ||
|
||
vmiInformer.AddEventHandler(cache.ResourceEventHandlerFuncs { | ||
DeleteFunc: removeWarningLabel, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not entirely sure, but I think that reboots might not delete the vmi entirely, but rather leave in a Succeeded
state. Just to be on the safe side, I'd also add an update handler and re-use the same function you already have.
I would also add a small validation to removeWarningLabel()
to make sure the VMI is either deleted or in a non-running phase.
DeleteFunc: removeWarningLabel, | |
DeleteFunc: removeWarningLabel, | |
UpdateFunc: func(_, newObj interface{}) { | |
removeWarningLabel(newObj) | |
}, |
informers.go
Outdated
//check if deleted VMI is in list of VMIs that need to be restarted | ||
vmiIndex := searchVMIList(vmi.Name) | ||
if vmiIndex == -1 { | ||
return | ||
} | ||
|
||
// remove deleted VMI from list | ||
vmiList = append(vmiList[:vmiIndex], vmiList[vmiIndex+1:]...) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not that it's critical, but I think this could have been much easier with a map.
The map could be of type vmisPendingUpdate := make(map[string]struct{})
.
Then, to search: _, exists := vmisPendingUpdate[<vm>]
To delete: delete vmisPendingUpdate[vmisPendingUpdate, <vm>
Signed-off-by: prnaraya <[email protected]>
Signed-off-by: prnaraya <[email protected]>
Signed-off-by: prnaraya <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome progress @prnaraya!! Well done! 👏
informers.go
Outdated
} | ||
|
||
// add label to running VMs that a restart is required for change to take place | ||
if vm.Status.Ready { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if vm.Status.Ready { | |
if vm.Status.Created { |
There's a slight difference between Created
, which means that a VMI object exists for this VM, and Ready
, which means that the VMI is up and ready (e.g. booted, connected, etc).
If we use Ready
here we might miss a VM that had just started but did not fully boot yet. However, at this point, the machine type is already determined.
func handleVmiDeletion(obj interface{}) { | ||
vmi, ok := obj.(*k6tv1.VirtualMachineInstance) | ||
if !ok { | ||
return |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please log / print an error in this case, since it never should happen
informers.go
Outdated
vmiInformer := cache.NewSharedIndexInformer(listWatcher, &k6tv1.VirtualMachineInstance{}, 1*time.Hour, cache.Indexers{}) | ||
|
||
vmiInformer.AddEventHandler(cache.ResourceEventHandlerFuncs { | ||
UpdateFunc: handleVmiUpdate, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry but it seems I've mislead you :(
I've tested this, and it seems that whenever a VMI is either stopped or restarted the VMI object is being deleted:
metadata:
deletionTimestamp: "2023-06-14T06:13:01Z"
So I guess we can drop the update handler after all
informers.go
Outdated
vmiInformer := cache.NewSharedIndexInformer(listWatcher, &k6tv1.VirtualMachineInstance{}, 1*time.Hour, cache.Indexers{}) | ||
func addWarningLabel (virtCli kubecli.KubevirtClient, vm *k6tv1.VirtualMachine) error { | ||
addLabel := fmt.Sprint(`{"metadata": {"labels": {"restart-vm-required": "true"}}}}}`) | ||
_, err := virtCli.VirtualMachineInstance(vm.Namespace).Patch(vm.Name, types.StrategicMergePatchType, []byte(addLabel), &k8sv1.PatchOptions{}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO we better add the label to the VM object instead of the VMI.
There are two reasons for that:
- Technically, a VMI object cannot be restarted, only a VM object
- The user-facing object is the VM object. This is meaningful, since most users are aware (in terms of monitoring, querying data, etc) of the VM object, and the VMI object is treated as an "implementation detail", similarly to the virt-launcher pod.
WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes sense to me. I think this might have been what I was originally planning to do (I can't remember why I decided not to but I can't think of a good reason for it).
informers.go
Outdated
removeWarningLabel(vmi) | ||
} | ||
|
||
func removeWarningLabel(vmi *k6tv1.VirtualMachineInstance) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I'd make the function name a bit more accurate, since it doesn't only remove the label, but also removed the VMI from the map, ends the job if necessary, etc.
WDYT about something like handleDeletedVmi
?
main.go
Outdated
restartEnv, exists := os.LookupEnv("RESTART_NOW") | ||
if exists { | ||
restartNow, err = strconv.ParseBool(restartEnv) | ||
if err != nil { | ||
fmt.Println(err) | ||
} | ||
} | ||
|
||
// update namespace if env is set | ||
namespaceEnv, exists := os.LookupEnv("NAMESPACE") | ||
if exists { | ||
namespace = namespaceEnv | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome! 👍
main.go
Outdated
os.Exit(1) | ||
} | ||
|
||
|
||
exitJob = make(chan struct{}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is already initialized above:
var (
vmisPendingUpdate = make(map[string]struct{})
exitJob = make(chan struct{})
...
spec: | ||
template: | ||
metadata: | ||
name: mass-machine-type-transition | ||
spec: | ||
containers: | ||
- name: machine-type-transition | ||
image: machine-type-transition:latest | ||
imagePullPolicy: IfNotPresent | ||
restartPolicy: OnFailure |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add the environment variables with default values to the job manifest. It will help guide users to use them properly.
apiVersion: batch/v1
kind: Job
metadata:
name: mass-machine-type-transition
spec:
template:
metadata:
name: mass-machine-type-transition
spec:
containers:
- name: machine-type-transition
image: machine-type-transition:latest
imagePullPolicy: IfNotPresent
env:
- name: NAMESPACE
value: "" # default is having no selector
- name: RESTART_NOW
value: false
restartPolicy: OnFailure
main.go
Outdated
vmiInformer, err := getVmiInformer() | ||
var err error | ||
// update restartNow if env is set | ||
restartEnv, exists := os.LookupEnv("RESTART_NOW") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
WDYT?
restartEnv, exists := os.LookupEnv("RESTART_NOW") | |
restartEnv, exists := os.LookupEnv("FORCE_RESTART") |
Added environment variables to the job yaml Signed-off-by: prnaraya <[email protected]>
Signed-off-by: prnaraya <[email protected]>
Signed-off-by: prnaraya <[email protected]>
…he VMI Also updated HandleDeletedVmi to remove the label from the VM accordingly. Signed-off-by: prnaraya <[email protected]>
The mass machine type job is implemented with a Docker image built locally from a golang file, and creating and applying a Job using that image. Working on adding way for user to specify namespace(s) to run the Job on rather than running on all namespaces, and a flag for the Job to force restart all running vms when changing the machine type instead of letting the user do it manually.