
Implement mass machine type transition job WIP #2

Draft · wants to merge 13 commits into main
Conversation

prnaraya (Owner):

The mass machine type transition job is implemented as a Docker image, built locally from a Go program, and a Kubernetes Job that is created and applied using that image. I am working on adding a way for the user to specify the namespace(s) to run the Job in, rather than running it in all namespaces, and a flag for the Job to force-restart all running VMs when changing the machine type, instead of leaving the user to do it manually.

prnaraya added 4 commits May 8, 2023 12:48
After going back and forth on whether to implement this using nested jobs, I settled on
having a single Job, with the completion count calculated in a Go function.
Some sections are still commented out as they are WIP, notably those dealing with the
actual machine type strings themselves.

Signed-off-by: prnaraya <[email protected]>
Signed-off-by: prnaraya <[email protected]>
@iholder101 (Contributor) left a comment:

Thanks @prnaraya! This is surely a step forward!

Left some comments below.

informers.go Outdated
Comment on lines 49 to 51
machineType := vm.Spec.Template.Spec.Domain.Machine.Type
machineTypeSubstrings := strings.Split(machineType, "-")
version := machineTypeSubstrings[2]
Contributor:

Are we sure that every VMI will have a machine type in this format?
I created a VMI from /examples/vmi-fedora.yaml and saw:

spec:
  domain:
    machine:
      type: q35

In this case the job would crash (machineTypeSubstrings[2] is out of range).
I'd recommend validating that the string has the expected format.

Contributor:

I do see

  machine:
    type: pc-q35-rhel9.2.0

under status though

Contributor:

Okay so after some investigation I think I understand what needs to be done exactly.
Let's take the following as an example:

spec:
  domain:
    machine:
      type: q35
status:
  machine:
    type: pc-q35-rhel8.2.0

So, what would need to be done is:

  1. Parse status.machine.type, specifically the last bit (split by - like you did, take the 3rd string, split by ., take the first string (which in this case is 8), then parse it to a number and check if it's <= 8, like in this case)*.
  2. If the machine type is of rhel8 or below, update spec.domain.machine.type to a RHEL9 version**.
  3. Add the label.

Open questions:
* Can we count on the fact that all machine types at the status level look like pc-q35-rhel8.2.0? Do we support other machine types with different formats?
** Which RHEL version should we update to? 9.0.0? 9.2.0?
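
For illustration, here is a minimal Go sketch of steps 1 and 2, assuming the pc-q35-rhelX.Y.Z format; the helper name needsMachineTypeUpdate and the choice to skip anything unparseable are assumptions, not code from this PR. It uses only the standard strings and strconv packages:

// needsMachineTypeUpdate reports whether a status-level machine type like
// "pc-q35-rhel8.2.0" has an x-stream version <= 8 and therefore needs updating.
func needsMachineTypeUpdate(machineType string) bool {
	parts := strings.Split(machineType, "-")
	if len(parts) != 3 || !strings.HasPrefix(parts[2], "rhel") {
		return false // e.g. plain "q35": not the expected format, skip it
	}
	// parts[2] is e.g. "rhel8.2.0"; take the x-stream part ("8") and parse it.
	xStream := strings.Split(strings.TrimPrefix(parts[2], "rhel"), ".")[0]
	major, err := strconv.Atoi(xStream)
	if err != nil {
		return false
	}
	return major <= 8
}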

informers.go Outdated
machineTypeSubstrings := strings.Split(machineType, "-")
version := machineTypeSubstrings[2]

if strings.Contains(version, "rhel") && version < "rhel9.0.0" {
Contributor:

What did you try to achieve with version < "rhel9.0.0"? Did you mean version != "rhel9.0.0"?
What if the version is "rhel9.2.0" or something similar?
Do all machine types have a version?

Owner Author:

Basically, the requirement is to update the machine type of any VM whose machine type is prior to RHEL 9. There are some other machine type formats, I believe, so I will need to add validation to prevent this from crashing.

informers.go Outdated

func addWarningLabel (virtCli kubecli.KubevirtClient, vm *k6tv1.VirtualMachine) {
addLabel := fmt.Sprint(`{"metadata": {"labels": {"restart-vm-required": "true"}}}}}`)
virtCli.VirtualMachine(vm.Namespace).Patch(vm.Name, types.StrategicMergePatchType, []byte(addLabel), &k8sv1.PatchOptions{})
Contributor:

WDYT about catching the error and logging it if one occurs?
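
For example, a sketch against the snippet above, with the error logged, the stray closing braces trimmed, and fmt.Sprint dropped (it is redundant for a constant string); the logging style is an assumption:

addLabel := `{"metadata": {"labels": {"restart-vm-required": "true"}}}`
_, err := virtCli.VirtualMachine(vm.Namespace).Patch(vm.Name, types.StrategicMergePatchType, []byte(addLabel), &k8sv1.PatchOptions{})
if err != nil {
	fmt.Printf("failed to label VM %s/%s: %v\n", vm.Namespace, vm.Name, err)
}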

informers.go Outdated

removeLabel := fmt.Sprint(`[{"op": "remove", "path": "metadata/labels/restart-vm-required"}]`)
virtCli.VirtualMachine(vmi.Namespace).Patch(vmi.Name, types.StrategicMergePatchType, []byte(removeLabel), &k8sv1.PatchOptions{})
needsRestart--
Contributor:

This occurs for every VMI that's being deleted.
So it can decrease the counter for a VMI that didn't need any updates in the first place.

main.go Outdated
Comment on lines 21 to 26
// for now this uses the number of vms to be restarted to determine
// when to exit the job rather than seeing if the actual label exists
// on any VMs before exiting the job

for (needsRestart > 0) {
}
Contributor:

This would busy-wait for the variable to become zero, which is pretty costly in terms of performance. A better approach would be to sleep for a few seconds between iterations. Or even better, use a channel to indicate when to wake up and exit.

But, currently, this mechanism is broken since the counter decreases when every VMI is deleted, not necessarily a relevant one. I think you need to build a structure of VMIs that should be restarted (in updateMachineTypes) to really track the relevant ones. Then, when the structure is empty, you can close a channel that would indicate to exit the job.
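
As a rough sketch of that channel idea (names are illustrative, and a real version would guard the map with a mutex, since informer callbacks and main run concurrently):

var (
	vmisPendingUpdate = make(map[string]struct{}) // VMIs still waiting for a restart
	exitJob           = make(chan struct{})
)

// called from the delete handler, only for VMIs that were actually tracked
func markVmiHandled(key string) {
	delete(vmisPendingUpdate, key)
	if len(vmisPendingUpdate) == 0 {
		close(exitJob) // wakes up main without any busy-waiting
	}
}

// in main: <-exitJob blocks until the last tracked VMI is gone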

Owner Author:

Yeah, this was a temporary fix that I implemented just so I had something there to use when testing against a small cluster, but I wasn't exactly sure how to implement this in a better way.

informers.go Outdated
needsRestart++
}

func removeWarningLabel(obj interface{}) {
Contributor:

No need to remove a label from a deleted VMI.
I think the main purpose of this event handler is to record that we have one less VMI to track.

informers.go Outdated
Comment on lines 36 to 38
vmiInformer.AddEventHandler(cache.ResourceEventHandlerFuncs {
DeleteFunc: removeWarningLabel,
})
Contributor:

WDYT about adding an event handler for newly created VMIs?
That way if a new VMI is being created with an unsupported machine type it could be added to the list of VMIs that need update, and a label could be added.

Although I'm not entirely sure it's necessary.

Owner Author:

I could add that but I am also not sure if it is something that we would need.

Contributor:

Let's skip that for now. We can always add it in a follow-up.

informers.go Outdated
@@ -49,9 +54,12 @@ func updateMachineTypes(virtCli kubecli.KubevirtClient) error {
machineType := vm.Spec.Template.Spec.Domain.Machine.Type
machineTypeSubstrings := strings.Split(machineType, "-")
version := machineTypeSubstrings[2]
if len(machineTypeSubstrings) != 3 {
return nil
}

if strings.Contains(version, "rhel") && version < "rhel9.0.0" {
Contributor:

We need to change version < "rhel9.0.0".
Basically, we need to check whether the x-stream version (9 in this case) is > 8. In other words, if it's less than or equal to 8 (e.g. 8.4.0), it needs to be updated.

Owner Author:

I am not sure I understand. Since we only want to update machine types with versions prior to the specified version (in this case RHEL 9), wouldn't a lexicographic comparison of the versions be sufficient? Unless there is a machine type format that I am missing.

In this case, "rhel8.4.0" < "rhel9.0.0" returns true, which satisfies the condition, I believe.

Contributor:

Yup, you have a point.

As long as we're sure this is indeed always the format, your approach is completely valid. That said, it never hurts to add some more validation, as this can go silently wrong if something changes (what if the string is empty for some reason? Or if the format changes slightly, like rhel-8.0.0 instead of rhel8.0.0?).

But again, if you're certain that this is always the format, feel free to leave it as is 👍
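
One way to add that safety net (a sketch; the regexp and function name are illustrative, not from this PR):

// isKnownRhelMachineType reports whether the string provably has the
// "pc-q35-rhelX.Y.Z" shape, so the lexicographic comparison stays meaningful.
var rhelMachineType = regexp.MustCompile(`^pc-q35-rhel\d+\.\d+\.\d+$`)

func isKnownRhelMachineType(machineType string) bool {
	return rhelMachineType.MatchString(machineType)
}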

and a stop channel for the job to wait until vmiList
is empty

Signed-off-by: prnaraya <[email protected]>
…running vmis

immediately. Also added the base for the namespace and potentially labels flags
to implement filtering of VMs in the future.

Signed-off-by: prnaraya <[email protected]>
@iholder101 (Contributor) left a comment:

Awesome progress @prnaraya! Well done!

main.go Outdated
Comment on lines 16 to 17
stopCh := make(chan struct{})
go vmiInformer.Run(stopCh)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use exitJob for that purpose too

Suggested change
stopCh := make(chan struct{})
go vmiInformer.Run(stopCh)
exitJob = make(chan struct{})
go vmiInformer.Run(exitJob)

Contributor:

This will also initialize exitJob, which is currently never initialized.

main.go Outdated
go vmiInformer.Run(stopCh)


updateMachineTypes(virtCli)
Contributor:

Please handle the error (by either printing it to stdout or logging it)

informers.go Outdated
return err
}
for _, vm := range vmList.Items {
machineType := vm.Spec.Template.Spec.Domain.Machine.Type
Contributor:

This needs to be fetched from status instead of spec

Owner Author:

I've tried to fetch this, but neither VirtualMachineStatus nor VirtualMachineInstanceStatus has a field for machine types.

Contributor:

You can find it here. As mentioned in this comment, this is what it looks like:

spec:
  domain:
    machine:
      type: q35
status:
  machine:
    type: pc-q35-rhel8.2.0

Owner Author:

I looked into this a little more, and I believe the reason the VMI you created has just "q35" as the machine type in its spec is that "q35" is the alias for the default machine type (the most recent RHEL version, I believe) when a machine type is not explicitly specified in the YAML. I created a VM where I explicitly set the machine type and was able to see the full machine type string in the spec of both the VM and the VMI. This also works if you modify examples/vmi-fedora.yaml to have an explicitly stated machine type.

That is to say, I think it is still correct to retrieve the machine type from the VM spec, and there already exists logic that will catch any machine types that do not fit the "pc-q35-rhelx.x.x" format, so VMs with a machine type of just "q35" will be ignored.

Contributor:

so VMs with a machine type of just "q35" will be ignored

In my opinion this is a bug.
What if the VMIs I've created were created at a time when 8.2.0 was the default and latest version? Then you're going to ignore these, although they need to be updated.

Am I missing anything?

Owner Author:

Ah, I see what you mean. I'm still not sure how to access the VMI's status.machine.type, because I just don't see it in the documentation anywhere, but then, in the case of the machine type being just "q35", it might be good to update the machine type anyway, to be safe.

informers.go Outdated
)

// using this as a const allows us to easily modify the program to update if a newer version is released
// we generally want to be updating the machine types to the most recent version
const latestVersion = "rhel9.2.0"
Contributor:

nit:

Suggested change
const latestVersion = "rhel9.2.0"
const latestMachineTypeVersion = "rhel9.2.0"

informers.go Outdated
if err != nil {
return err
}
vmiList = append(vmiList, vm.Name)
Contributor:

Please also add the VM's namespace; that way it's a unique identifier.
BTW, you have the MetaNamespaceKeyFunc function from Kubernetes (which you can import) that returns that string for you (i.e. <namespace>/<name>)
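
For example, a sketch against the snippet above (cache is k8s.io/client-go/tools/cache):

// cache.MetaNamespaceKeyFunc returns "<namespace>/<name>" for namespaced objects.
key, err := cache.MetaNamespaceKeyFunc(vm)
if err != nil {
	return err
}
vmiList = append(vmiList, key)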

informers.go Outdated
listWatcher := cache.NewListWatchFromClient(virtCli.RestClient(), "virtualmachineinstances", k8sv1.NamespaceAll, fields.Everything())
vmiInformer := cache.NewSharedIndexInformer(listWatcher, &k6tv1.VirtualMachineInstance{}, 1*time.Hour, cache.Indexers{})

vmiInformer.AddEventHandler(cache.ResourceEventHandlerFuncs {
DeleteFunc: removeWarningLabel,
Contributor:

I'm not entirely sure, but I think reboots might not delete the VMI entirely, but rather leave it in a Succeeded state. Just to be on the safe side, I'd also add an update handler and re-use the same function you already have.

I would also add a small validation to removeWarningLabel() to make sure the VMI is either deleted or in a non-running phase (see the sketch after the suggestion below).

Suggested change
DeleteFunc: removeWarningLabel,
DeleteFunc: removeWarningLabel,
UpdateFunc: func(_, newObj interface{}) {
removeWarningLabel(newObj)
},
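
The validation could look roughly like this (a sketch; the exact phase check is an assumption on my side):

func removeWarningLabel(obj interface{}) {
	vmi, ok := obj.(*k6tv1.VirtualMachineInstance)
	if !ok {
		return
	}
	// only act on VMIs that are being deleted or have stopped running
	if vmi.DeletionTimestamp == nil && vmi.Status.Phase == k6tv1.Running {
		return
	}
	// ... drop the VMI from tracking and clear the label as before
}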

informers.go Outdated
Comment on lines 100 to 107
//check if deleted VMI is in list of VMIs that need to be restarted
vmiIndex := searchVMIList(vmi.Name)
if vmiIndex == -1 {
return
}

// remove deleted VMI from list
vmiList = append(vmiList[:vmiIndex], vmiList[vmiIndex+1:]...)
Contributor:

Not that it's critical, but I think this could be much easier with a map.

The map could be of type vmisPendingUpdate := make(map[string]struct{}).
Then, to search: _, exists := vmisPendingUpdate[<vm>]
To delete: delete(vmisPendingUpdate, <vm>)
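
Putting it together, a tiny self-contained sketch of the map-based bookkeeping (the key value is hypothetical):

vmisPendingUpdate := make(map[string]struct{})

vmisPendingUpdate["default/my-vm"] = struct{}{} // track a VMI by <namespace>/<name>
if _, exists := vmisPendingUpdate["default/my-vm"]; exists {
	delete(vmisPendingUpdate, "default/my-vm") // remove once it has restarted
}
if len(vmisPendingUpdate) == 0 {
	// nothing left to track: the job can exit
}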

@iholder101 (Contributor) left a comment:

Awesome progress @prnaraya!! Well done! 👏

informers.go Outdated
}

// add label to running VMs that a restart is required for change to take place
if vm.Status.Ready {
Contributor:

Suggested change
if vm.Status.Ready {
if vm.Status.Created {

There's a slight difference between Created, which means that a VMI object exists for this VM, and Ready, which means that the VMI is up and ready (e.g. booted, connected, etc).

If we use Ready here we might miss a VM that had just started but did not fully boot yet. However, at this point, the machine type is already determined.

func handleVmiDeletion(obj interface{}) {
vmi, ok := obj.(*k6tv1.VirtualMachineInstance)
if !ok {
return
Contributor:

Please log / print an error in this case, since it should never happen
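
For instance (a sketch; the logging style is an assumption):

vmi, ok := obj.(*k6tv1.VirtualMachineInstance)
if !ok {
	// should never happen for a VMI informer, so make it visible if it does
	fmt.Printf("unexpected object type in delete handler: %T\n", obj)
	return
}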

informers.go Outdated
vmiInformer := cache.NewSharedIndexInformer(listWatcher, &k6tv1.VirtualMachineInstance{}, 1*time.Hour, cache.Indexers{})

vmiInformer.AddEventHandler(cache.ResourceEventHandlerFuncs {
UpdateFunc: handleVmiUpdate,
Contributor:

Sorry, but it seems I've misled you :(
I've tested this, and it seems that whenever a VMI is either stopped or restarted, the VMI object is deleted:

  metadata:
    deletionTimestamp: "2023-06-14T06:13:01Z"

So I guess we can drop the update handler after all

informers.go Outdated
vmiInformer := cache.NewSharedIndexInformer(listWatcher, &k6tv1.VirtualMachineInstance{}, 1*time.Hour, cache.Indexers{})
func addWarningLabel (virtCli kubecli.KubevirtClient, vm *k6tv1.VirtualMachine) error {
addLabel := fmt.Sprint(`{"metadata": {"labels": {"restart-vm-required": "true"}}}}}`)
_, err := virtCli.VirtualMachineInstance(vm.Namespace).Patch(vm.Name, types.StrategicMergePatchType, []byte(addLabel), &k8sv1.PatchOptions{})
Contributor:

IMO we'd better add the label to the VM object instead of the VMI.
There are two reasons for that:

  1. Technically, a VMI object cannot be restarted, only a VM object
  2. The user-facing object is the VM object. This is meaningful, since most users are aware (in terms of monitoring, querying data, etc) of the VM object, and the VMI object is treated as an "implementation detail", similarly to the virt-launcher pod.

WDYT?

Owner Author:

This makes sense to me. I think this might have been what I was originally planning to do (I can't remember why I decided not to, but I can't think of a good reason for it).

informers.go Outdated
removeWarningLabel(vmi)
}

func removeWarningLabel(vmi *k6tv1.VirtualMachineInstance) {
Contributor:

nit: I'd make the function name a bit more accurate, since it doesn't only remove the label; it also removes the VMI from the map, ends the job if necessary, etc.

WDYT about something like handleDeletedVmi?

main.go Outdated
Comment on lines 12 to 25
restartEnv, exists := os.LookupEnv("RESTART_NOW")
if exists {
restartNow, err = strconv.ParseBool(restartEnv)
if err != nil {
fmt.Println(err)
}
}

// update namespace if env is set
namespaceEnv, exists := os.LookupEnv("NAMESPACE")
if exists {
namespace = namespaceEnv
}

Contributor:

Awesome! 👍

main.go Outdated
os.Exit(1)
}


exitJob = make(chan struct{})
Contributor:

This is already initialized above:

var (
	vmisPendingUpdate = make(map[string]struct{})
	exitJob = make(chan struct{})
...

Comment on lines 5 to 14
spec:
  template:
    metadata:
      name: mass-machine-type-transition
    spec:
      containers:
      - name: machine-type-transition
        image: machine-type-transition:latest
        imagePullPolicy: IfNotPresent
      restartPolicy: OnFailure
Contributor:

Please add the environment variables with default values to the job manifest. It will help guide users to use them properly.

apiVersion: batch/v1
kind: Job
metadata:
  name: mass-machine-type-transition
spec:
  template:
    metadata:
      name: mass-machine-type-transition
    spec:
      containers:
      - name: machine-type-transition
        image: machine-type-transition:latest
        imagePullPolicy: IfNotPresent
        env:
        - name: NAMESPACE
          value: ""  # default is having no selector
        - name: RESTART_NOW
value: "false"
      restartPolicy: OnFailure

main.go Outdated
vmiInformer, err := getVmiInformer()
var err error
// update restartNow if env is set
restartEnv, exists := os.LookupEnv("RESTART_NOW")
Contributor:

WDYT?

Suggested change
restartEnv, exists := os.LookupEnv("RESTART_NOW")
restartEnv, exists := os.LookupEnv("FORCE_RESTART")

Added environment variables to the job yaml

Signed-off-by: prnaraya <[email protected]>
prnaraya marked this pull request as draft June 14, 2023 17:10