Implement mass machine type transition job WIP #2

Draft · wants to merge 13 commits into base: main · Changes from 9 commits
8 changes: 8 additions & 0 deletions Dockerfile
@@ -0,0 +1,8 @@
FROM golang:alpine
RUN mkdir /app
ADD . /app/
WORKDIR /app
RUN go build -o main .
RUN adduser -S -D -H -h /app appuser
USER appuser
CMD ["./main"]
3 changes: 3 additions & 0 deletions README.md
@@ -1,2 +1,5 @@
# Mass machine type transition
This project contains the k8s job that will perform a mass machine type transition on an OpenShift cluster.

## Notes
This is a more cohesive version of the mass machine type transition. The implementation uses Docker to build an image from the Go program that performs the machine type transition and blocks until all warnings are cleared. The .yaml file defines a job that uses this generated image; the job completes only once the program exits, i.e. when no warning labels remain on any VMs. To run this locally, build a Docker image with the tag "machine-type-transition", then start the job with kubectl apply. This is not comprehensively tested yet; in the future the process of running this job will be automated, and flags will be added so the user can update machine types only in specific namespaces, or force a restart/shutdown of all running VMs when updating their machine types.
142 changes: 135 additions & 7 deletions informers.go
@@ -1,12 +1,32 @@
package mass_machine_type_transition

import (
"fmt"
"strings"
"time"

k8sv1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/apimachinery/pkg/fields"
"k8s.io/apimachinery/pkg/types"
"k8s.io/client-go/tools/cache"
k6tv1 "kubevirt.io/api/core/v1"
"kubevirt.io/client-go/kubecli"
)

// using a const here makes it easy to update the program if a newer version is released;
// we generally want to update machine types to the most recent version
const latestMachineTypeVersion = "rhel9.2.0"

var (
vmisPendingUpdate = make(map[string]struct{})
exitJob = make(chan struct{})

// by default, update machine type across all namespaces
namespace = k8sv1.NamespaceAll
//labels []string

// by default, should require manual restarting of VMIs
restartNow = false
)

func getVirtCli() (kubecli.KubevirtClient, error) {
@@ -23,14 +43,122 @@ func getVirtCli() (kubecli.KubevirtClient, error) {
return virtCli, err
}

func getVmiInformer(virtCli kubecli.KubevirtClient) (cache.SharedIndexInformer, error) {
listWatcher := cache.NewListWatchFromClient(virtCli.RestClient(), "virtualmachineinstances", namespace, fields.Everything())
vmiInformer := cache.NewSharedIndexInformer(listWatcher, &k6tv1.VirtualMachineInstance{}, 1*time.Hour, cache.Indexers{})

vmiInformer.AddEventHandler(cache.ResourceEventHandlerFuncs {
UpdateFunc: handleVmiUpdate,
Contributor:
Sorry but it seems I've misled you :(
I've tested this, and it seems that whenever a VMI is either stopped or restarted, the VMI object is deleted:

  metadata:
    deletionTimestamp: "2023-06-14T06:13:01Z"

So I guess we can drop the update handler after all.
DeleteFunc: handleVmiDeletion,
})

return vmiInformer, nil
}

func updateMachineTypes(virtCli kubecli.KubevirtClient) error {
vmList, err := virtCli.VirtualMachine(namespace).List(&k8sv1.ListOptions{})
if err != nil {
return err
}
for _, vm := range vmList.Items {
machineType := vm.Spec.Template.Spec.Domain.Machine.Type
Contributor:
This needs to be fetched from status instead of spec.

Owner Author:
I've tried to fetch this, but neither VirtualMachineStatus nor VirtualMachineInstanceStatus has a field for machine types.

Contributor:
You can find it here. As mentioned in this comment, this is how it looks:

spec:
  domain:
    machine:
      type: q35
status:
  machine:
    type: pc-q35-rhel8.2.0

Owner Author:
I looked into this a little more, and I believe the reason the VMI you created has just "q35" as the machine type in spec is that "q35" is the alias for the default machine type (the most recent RHEL version, I believe) when a machine type is not explicitly specified in the yaml. I created a VM where I explicitly set the machine type, and was able to see the full machine type string in the Spec for both the VM and the VMI. This also works if you modify examples/vmi-fedora.yaml to have an explicitly stated machine type.

That is to say, I think it is still correct to retrieve the machine type from the VM spec, and there already exists logic that will catch any machine types that do not fit the "pc-q35-rhelx.x.x" format, so VMs with a machine type of just "q35" will be ignored.

Contributor:
> so VMs with a machine type of just "q35" will be ignored

In my opinion this is a bug. What if the VMIs I've created were created at a time when 8.2.0 was the default and latest version? Then you're going to ignore them, although they need to be updated.

Am I missing anything?

Owner Author:
Ah, I see what you mean. I'm still not sure how to access the VMI status.machine.type, because I just don't see it in the documentation anywhere, but in the case of a machine type of just "q35" it might be good to update the machine type anyway, to be safe.

machineTypeSubstrings := strings.Split(machineType, "-")
version := machineTypeSubstrings[2]
Contributor:
Are we sure that every VMI would have a machine type of this format? I've created a /examples/vmi-fedora.yaml vmi and saw:

spec:
  domain:
    machine:
      type: q35

In this case this job would crash. I'd recommend validating that it's the right format.

Contributor:
I do see

  machine:
    type: pc-q35-rhel9.2.0

under status though.

Contributor:
Okay, so after some investigation I think I understand what needs to be done exactly. Let's take the following as an example:

spec:
  domain:
    machine:
      type: q35
status:
  machine:
    type: pc-q35-rhel8.2.0

So, what would need to be done is:

  1. Parse status.machine.type, specifically the last bit (split by - like you did, take the 3rd string, split by ., take the first string (which in this case is 8), then parse it to a number and check if it's <= 8, like in this case)*.
  2. If the machine type is of rhel8 or below, update spec.domain.machine.type to a RHEL9 version**.
  3. Add the label.

Open questions:
* Can we count on the fact that all machine types at the status level look like pc-q35-rhel8.2.0? Do we support other machine types with different formats?
** Which RHEL version should we update to? 9.0.0? 9.2.0?
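The three steps the contributor outlines can be sketched as a small Go helper. This is an illustration rather than code from the PR; the function name `needsUpdate` and the assumption that status-level machine types look like `pc-q35-rhel8.2.0` are mine:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// needsUpdate reports whether a status-level machine type such as
// "pc-q35-rhel8.2.0" has an x-stream (major) version of 8 or below.
// Anything that does not match the expected "pc-q35-rhelX.Y.Z" shape
// returns false instead of crashing the job.
func needsUpdate(machineType string) bool {
	parts := strings.Split(machineType, "-")
	if len(parts) != 3 || !strings.HasPrefix(parts[2], "rhel") {
		return false
	}
	version := strings.TrimPrefix(parts[2], "rhel")
	major, err := strconv.Atoi(strings.Split(version, ".")[0])
	if err != nil {
		return false
	}
	return major <= 8
}

func main() {
	fmt.Println(needsUpdate("pc-q35-rhel8.2.0")) // true
	fmt.Println(needsUpdate("pc-q35-rhel9.2.0")) // false
	fmt.Println(needsUpdate("q35"))              // false
}
```

Malformed or alias-only machine types (like a bare "q35") simply report false here; whether those should instead trigger an update, as discussed above, is an open design question.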

if len(machineTypeSubstrings) != 3 {
continue
}

if strings.Contains(version, "rhel") && version < "rhel9.0.0" {
Contributor:
What did you try to achieve with version < "rhel9.0.0"? Did you mean version != "rhel9.0.0"? What if the version is "rhel9.2.0" or something similar? Do all machine types have a version?

Owner Author:
Basically, the requirement is to update the machine type of any VM whose machine type predates RHEL 9. There are some other formats of machine type, I believe, so I will need to add validation to prevent this from crashing.

Contributor:
We need to change version < "rhel9.0.0". Basically, we need to check if the x-stream version (9 in this case) is >8. In other words, if it's less than or equal to 8 (e.g. 8.4.0) it needs to be updated.

Owner Author:
I am not sure I understand. Since we only want to update machine types with versions prior to the specified version (in this case RHEL 9), wouldn't a lexicographical comparison of the versions be sufficient? Unless there is a format of machine type that I am missing.

In this case, "rhel8.4.0" < "rhel9.0.0" returns true, which satisfies the conditions I believe.

Contributor:
Yup, you have a point.

As long as we're sure this is indeed always the format, your approach is completely valid. That said, it never hurts to add some more validation, as this can go silently wrong if something changes (what if the string is empty for some reason? Or if the format changes slightly, like rhel-8.0.0 instead of rhel8.0.0?).

But again, if you're certain that this is always the format, feel free to leave it as is 👍
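A footnote on the lexicographic comparison settled on above: it is valid for the single-digit majors in play, but plain string ordering would misjudge a hypothetical double-digit major, which is one more argument for the extra validation suggested. A quick self-contained check:

```go
package main

import "fmt"

func main() {
	// single-digit majors compare correctly as plain strings
	fmt.Println("rhel8.4.0" < "rhel9.0.0") // true
	// a hypothetical "rhel10" would sort BEFORE "rhel9",
	// because '1' < '9' when comparing byte by byte
	fmt.Println("rhel10.0.0" < "rhel9.0.0") // true (numerically the wrong order)
}
```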

machineTypeSubstrings[2] = latestMachineTypeVersion
machineType = strings.Join(machineTypeSubstrings, "-")
updateMachineType := fmt.Sprintf(`{"spec": {"template": {"spec": {"domain": {"machine": {"type": "%s"}}}}}}`, machineType)

_, err = virtCli.VirtualMachine(vm.Namespace).Patch(vm.Name, types.StrategicMergePatchType, []byte(updateMachineType), &k8sv1.PatchOptions{})
if err != nil {
return err
}

// add label to running VMs that a restart is required for change to take place
if vm.Status.Ready {
Contributor:
Suggested change:

-if vm.Status.Ready {
+if vm.Status.Created {

There's a slight difference between Created, which means that a VMI object exists for this VM, and Ready, which means that the VMI is up and ready (e.g. booted, connected, etc.).

If we use Ready here, we might miss a VM that has just started but not fully booted yet. However, at this point, the machine type is already determined.

// add the warning label regardless of whether we restart the VMs now or the user restarts them manually;
// it shouldn't matter, since deletion of the VMI will remove the label and remove the VMI from the list anyway
err = addWarningLabel(virtCli, &vm)
if err != nil {
return err
}

if restartNow {
err = virtCli.VirtualMachine(vm.Namespace).Restart(vm.Name, &k6tv1.RestartOptions{})
if err != nil {
return err
}
}
}
}
}
return nil
}

func addWarningLabel(virtCli kubecli.KubevirtClient, vm *k6tv1.VirtualMachine) error {
addLabel := `{"metadata": {"labels": {"restart-vm-required": "true"}}}`
_, err := virtCli.VirtualMachineInstance(vm.Namespace).Patch(vm.Name, types.StrategicMergePatchType, []byte(addLabel), &k8sv1.PatchOptions{})
Contributor:
IMO we'd better add the label to the VM object instead of the VMI. There are two reasons for that:

  1. Technically, a VMI object cannot be restarted, only a VM object.
  2. The user-facing object is the VM object. This is meaningful, since most users are aware (in terms of monitoring, querying data, etc.) of the VM object, and the VMI object is treated as an "implementation detail", similarly to the virt-launcher pod.

WDYT?

Owner Author:
This makes sense to me. I think this might have been what I was originally planning to do (I can't remember why I decided not to, but I can't think of a good reason for it).
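On the patch itself: hand-written patch strings make it easy to end up with unbalanced braces, so one option is to build the JSON with encoding/json. A minimal sketch (the helper name `labelPatch` is mine, not part of the PR):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// labelPatch builds a strategic-merge patch that adds one label.
// Marshaling a map instead of hand-writing the JSON guarantees
// balanced braces and correct string escaping.
func labelPatch(key, value string) ([]byte, error) {
	patch := map[string]interface{}{
		"metadata": map[string]interface{}{
			"labels": map[string]string{key: value},
		},
	}
	return json.Marshal(patch)
}

func main() {
	p, err := labelPatch("restart-vm-required", "true")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(p)) // a well-formed strategic-merge patch body
}
```

The resulting bytes could then go to the Patch call on whichever object (VM or VMI) the discussion above lands on.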

if err != nil {
return err
}

// get VM name in the format namespace/name
vmKey, err := cache.MetaNamespaceKeyFunc(vm)
if err != nil {
return err
}
vmisPendingUpdate[vmKey] = struct{}{}

return nil
}

func handleVmiUpdate(oldObj interface{}, newObj interface{}) {
vmi, ok := newObj.(*k6tv1.VirtualMachineInstance)
if !ok {
return
}

// check VMI status for Succeeded
if vmi.Status.Phase != k6tv1.Succeeded {
return
}

removeWarningLabel(vmi)
}

func handleVmiDeletion(obj interface{}) {
vmi, ok := obj.(*k6tv1.VirtualMachineInstance)
if !ok {
return
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please log / print an error in this case, since it never should happen

}

removeWarningLabel(vmi)
}

func removeWarningLabel(vmi *k6tv1.VirtualMachineInstance) {
Contributor:
nit: I'd make the function name a bit more accurate, since it doesn't only remove the label; it also removes the VMI from the map, ends the job if necessary, etc.

WDYT about something like handleDeletedVmi?


// get VMI name in the format namespace/name
vmiKey, err := cache.MetaNamespaceKeyFunc(vmi)
if err != nil {
fmt.Println(err)
return
}

// check if the deleted VMI is in the list of VMIs that need to be restarted
_, exists := vmisPendingUpdate[vmiKey]
if !exists {
return
}

// remove deleted VMI from list
delete(vmisPendingUpdate, vmiKey)

// check if VMI list is now empty, to signal exiting the job
if len(vmisPendingUpdate) == 0 {
close(exitJob)
}
}
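The job's exit signaling (closing the exitJob channel once the map of pending VMIs empties) is a standard Go pattern. Reduced to its essentials, with shortened names and a single goroutine doing the deletes, it looks like:

```go
package main

import "fmt"

func main() {
	pending := map[string]struct{}{"ns/vm-a": {}, "ns/vm-b": {}}
	done := make(chan struct{})

	remove := func(key string) {
		delete(pending, key) // deleting an absent key is a no-op
		if len(pending) == 0 {
			close(done) // a closed channel unblocks every receiver
		}
	}

	go func() {
		remove("ns/vm-a")
		remove("ns/vm-b")
	}()

	<-done // blocks until the last pending entry is removed
	fmt.Println("no pending VMIs left; the job can exit")
}
```

In the real code the existence check on the map guards against closing the channel twice, which would panic.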
43 changes: 38 additions & 5 deletions main.go
@@ -1,16 +1,49 @@
package mass_machine_type_transition

import (
"fmt"
"os"
"strconv"
)

func main() {
var err error
// update restartNow if env is set
restartEnv, exists := os.LookupEnv("RESTART_NOW")
Contributor:
WDYT?

Suggested change:

-restartEnv, exists := os.LookupEnv("RESTART_NOW")
+restartEnv, exists := os.LookupEnv("FORCE_RESTART")

if exists {
restartNow, err = strconv.ParseBool(restartEnv)
if err != nil {
fmt.Println(err)
}
}

// update namespace if env is set
namespaceEnv, exists := os.LookupEnv("NAMESPACE")
if exists {
namespace = namespaceEnv
}

Contributor:
Awesome! 👍

virtCli, err := getVirtCli()
if err != nil {
os.Exit(1)
}

vmiInformer, err := getVmiInformer(virtCli)
if err != nil {
os.Exit(1)
}


exitJob = make(chan struct{})
Contributor:
This is already initialized above:

var (
	vmisPendingUpdate = make(map[string]struct{})
	exitJob = make(chan struct{})
...

go vmiInformer.Run(exitJob)


err = updateMachineTypes(virtCli)
if err != nil {
fmt.Println(err)
}

// wait for list of VMIs that need restart to be empty
<-exitJob

os.Exit(0)
}
14 changes: 14 additions & 0 deletions mass_machine_type_transition_job.yaml
@@ -0,0 +1,14 @@
apiVersion: batch/v1
kind: Job
metadata:
name: mass-machine-type-transition
spec:
template:
metadata:
name: mass-machine-type-transition
spec:
containers:
- name: machine-type-transition
image: machine-type-transition:latest
imagePullPolicy: IfNotPresent
restartPolicy: OnFailure
Contributor:
Please add the environment variables with default values to the job manifest. It will help guide users to use them properly.

apiVersion: batch/v1
kind: Job
metadata:
  name: mass-machine-type-transition
spec:
  template:
    metadata:
      name: mass-machine-type-transition
    spec:
      containers:
      - name: machine-type-transition
        image: machine-type-transition:latest
        imagePullPolicy: IfNotPresent
        env:
        - name: NAMESPACE
          value: ""  # default is having no selector
        - name: RESTART_NOW
          value: "false"  # env values must be strings
      restartPolicy: OnFailure

45 changes: 45 additions & 0 deletions massmachinetypejob.go
@@ -0,0 +1,45 @@
package mass_machine_type_transition
/*
import (
"context"
//"strings"

batchv1 "k8s.io/api/batch/v1"
v1 "k8s.io/api/core/v1"
k8sv1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"kubevirt.io/client-go/kubecli"
)


// leftover from nested job approach
func generateMachineTypeTransitionJob(virtCli kubecli.KubevirtClient, namespace *string, completionCount *int32) error {
jobs := virtCli.BatchV1().Jobs("default")
jobSpec := &batchv1.Job{
ObjectMeta: k8sv1.ObjectMeta{
Name: "mass-machine-type-transition",
Namespace: *namespace,
},
Spec: batchv1.JobSpec{
Completions: completionCount,
Template: v1.PodTemplateSpec{
Spec: v1.PodSpec{
Containers: []v1.Container{
{
Name: "mass-machine-type-transition",
// not 100% sure what to use for image here
Image: "",
Command: "",
},
},
RestartPolicy: v1.RestartPolicyOnFailure,
},
},
},
}

_, err := jobs.Create(context.TODO(), jobSpec, k8sv1.CreateOptions{})
if err != nil {
return err
}
return nil
}*/