
✨ Fast Deploy (Experimental) #823

Open
wants to merge 1 commit into main
Conversation

@akutz (Collaborator) commented Dec 12, 2024

What does this PR do, and why is it needed?

This patch adds support for the Fast Deploy feature, i.e. the ability to quickly provision a VM as a linked clone, as an experimental feature that must be enabled manually. There are many things about this feature that may change prior to it being ready for production.

The patch notes below are broken down into several sections:

  • Goals -- What is currently supported
  • Non-goals -- What is not on the table right now
  • Architecture
    • Activation -- How to enable this experimental feature
    • Placement -- Request datastore recommendations
    • Disk cache -- Per-datastore cache for Content Library item disk(s)
    • Create VM -- Create linked clone directly from cached disk

Goals

The following goals are considered in scope for this experimental feature at this time. The absence of an item from this list does not mean it will not be added before the feature is made generally available:

  • Support all VM images that are OVFs
  • Support multiple zones
  • Support workload-domain isolation
  • Support all datastore types, including host-local and vSAN

Non-goals

The following items are not in scope at this time, although most of them should be revisited before this feature graduates to production:

  • Support VM encryption

    Child disks can only be encrypted if their parent disks are encrypted. Users could deploy an encrypted VM without using Fast Deploy, and then publish that VM as an image to then be used as the source for provisioning encrypted VMs using Fast Deploy.

    However, child disks must also use the same encryption key as their parent disks. This limitation conflicts with the upcoming Bring Your Own Key (BYOK) provider feature.

    To accommodate this feature, online disk promotion will be an option once the VM is deployed. This means VMs will be deployed as linked clones, benefiting from the deploy speed a linked clone affords. However, once the VM is created, even if it is powered on, its disks will be promoted so they no longer point back to their parents. While the VM will no longer save the storage space a linked clone offers, it will be able to support encryption.

  • Support VM images that are VM templates (VMTX)

    The architecture behind Fast Deploy makes it trivial to support deploying VM images that point to VM templates. While not in scope at this time, it is likely this becomes part of the feature prior to it graduating to production-ready.

  • Support for backup/restore

    The qualified backup/restore workflows for VM Service VMs have never been validated with linked clones as they have not been supported by VM Service up until this point.

    Due to how the linked clones are created in this feature, users should not expect existing backup/restore software to work with VMs provisioned with Fast Deploy at this time.

    To accommodate this feature, online disk promotion will be an option once the VM is deployed. This means VMs will be deployed as linked clones, benefiting from the deploy speed a linked clone affords. However, once the VM is created, even if it is powered on, its disks will be promoted so they no longer point back to their parents. While the VM will no longer save the storage space a linked clone offers, it will be able to support backup/restore.

  • Support for site replication

    Similar to backup/restore, site replication workflows may not work with linked clones from bare disks either.

    To accommodate this feature, online disk promotion will be an option once the VM is deployed. This means VMs will be deployed as linked clones, benefiting from the deploy speed a linked clone affords. However, once the VM is created, even if it is powered on, its disks will be promoted so they no longer point back to their parents. While the VM will no longer save the storage space a linked clone offers, it will be able to support site replication.

  • Support for datastore maintenance/migration

    Existing datastore maintenance/migration workflows may not be aware of or know how to handle the top-level .contentlib-cache directories created to cache disks from Content Library items on recommended datastores.

    To accommodate this feature, the goal is to transition the cached disks to be First Class Disks (FCD), but that requires some features not yet available to FCDs, such as the ability to query for the existence of an FCD based on its metadata.

Architecture

The architecture is broken down into the following sections:

  • Activation -- How to enable this experimental feature
  • Placement -- Request datastore recommendations
  • Disk cache -- Per-datastore cache for Content Library item disk(s)
  • Create VM -- Create linked clone directly from cached disk

Activation

Enabling the experimental Fast Deploy feature requires setting the environment variable FSS_WCP_VMSERVICE_FAST_DEPLOY to true in the VM Operator deployment.

Please note, even when the feature is activated, it is possible to bypass the feature altogether by specifying the following annotation on a VM: vmoperator.vmware.com/fast-deploy: "false". This annotation is completely ignored unless the feature is already activated via environment variable as described above.
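The precedence described above can be sketched as follows. This is a hypothetical, simplified illustration of the gating logic; the function name and signature are not taken from the actual codebase, only the environment variable value and annotation key are from this description:

```go
package main

import "fmt"

const fastDeployAnnotation = "vmoperator.vmware.com/fast-deploy"

// fastDeployEnabled sketches the activation precedence: the feature is
// in effect only when the feature switch is "true" AND the VM has not
// opted out via the annotation. When the switch is off, the annotation
// is ignored entirely.
func fastDeployEnabled(fssFastDeploy string, vmAnnotations map[string]string) bool {
	if fssFastDeploy != "true" {
		return false // feature switch off; annotation is ignored
	}
	return vmAnnotations[fastDeployAnnotation] != "false"
}

func main() {
	fmt.Println(fastDeployEnabled("true", nil))
	fmt.Println(fastDeployEnabled("true", map[string]string{fastDeployAnnotation: "false"}))
	fmt.Println(fastDeployEnabled("false", map[string]string{fastDeployAnnotation: "false"}))
}
```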

Placement

The following steps provide a broad overview of how placement works:

  1. The ConfigSpec used to create/place the VM now includes:

    1. The disks and controllers used by the disks from the image.

      The disks also specify the VM spec's storage class's underlying storage policy ID.

    2. The image's guest ID if none was specified by the VM class or VM spec.

    3. The root VMProfile now specifies the VM spec's storage class's underlying storage policy ID.

  2. A placement recommendation for datastores is always required, which uses the storage policies specified in the ConfigSpec to recommend a compatible datastore.

  3. A path is constructed that points to where the VM will be created on the recommended datastore, ex.: [<DATASTORE>] <KUBE_VM_OBJ_UUID>/<KUBE_VM_NAME>.vmx
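The path construction in step 3 can be sketched with a small helper. The helper name is illustrative, not from the codebase; only the path format is from the description above:

```go
package main

import "fmt"

// vmxPath builds the datastore path where the VM will be created, in
// the form "[<DATASTORE>] <KUBE_VM_OBJ_UUID>/<KUBE_VM_NAME>.vmx".
func vmxPath(datastore, vmUUID, vmName string) string {
	return fmt.Sprintf("[%s] %s/%s.vmx", datastore, vmUUID, vmName)
}

func main() {
	fmt.Println(vmxPath("vsanDatastore", "uuid-1234", "my-vm"))
	// → [vsanDatastore] uuid-1234/my-vm.vmx
}
```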

Disk cache

The disk(s) from a Content Library item are cached on-demand on the
recommended datastore:

  1. The path(s) to the image's VMDK file(s) from the underlying Content Library Item are retrieved.

  2. A special, top-level directory named .contentlib-cache is created, if it does not exist, at the root of the recommended datastore.

    Please note, this feature does support vSAN, where the top-level directory may actually be a UUID that resolves to .contentlib-cache.

  3. A path is constructed that points to where the disk(s) for the library item are expected to be cached on the recommended datastore, ex.: [<DATASTORE>] .contentlib-cache/<LIB_ITEM_ID>/<LIB_ITEM_CONTENT_VERSION>

    If this path does not exist, it is created.

  4. The following occurs for each of the library item's VMDK files:

    1. The first 17 characters of a SHA-1 sum of the VMDK file name are used to build the expected path to the VMDK file's cached location on the recommended datastore, ex.: [<DATASTORE>] .contentlib-cache/<LIB_ITEM_ID>/<LIB_ITEM_CONTENT_VERSION>/<17_CHAR_SHA1_SUM>.vmdk

    2. If there is no VMDK at the above path, the VMDK file is copied to the above path.
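The cache-path construction in step 4 can be sketched as below. Note the assumption that "the first 17 characters of a SHA-1 sum" means the first 17 characters of the hex-encoded digest of the file name; that detail, and the helper name, are inferred for illustration:

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
)

// cachedDiskPath builds the expected cache location of a library item's
// VMDK on the recommended datastore, using the first 17 characters of
// the (assumed hex-encoded) SHA-1 sum of the VMDK file name.
func cachedDiskPath(datastore, itemID, contentVersion, vmdkName string) string {
	sum := sha1.Sum([]byte(vmdkName))
	short := hex.EncodeToString(sum[:])[:17]
	return fmt.Sprintf("[%s] .contentlib-cache/%s/%s/%s.vmdk",
		datastore, itemID, contentVersion, short)
}

func main() {
	fmt.Println(cachedDiskPath("vsanDatastore", "lib-item-id", "v1", "photon-disk1.vmdk"))
}
```

Because the hash is derived only from the file name, re-deploying from the same library item content version always resolves to the same cached path, which is what makes the copy in step 4b idempotent.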

The cached disks and entire cache folder structure are automatically removed once there are no longer any VMs deployed as linked clones using a cached disk.

This will likely change in the future so that a disk need not be re-cached just because the VMs deployed from it are no longer using it. Otherwise, disks may need to be cached repeatedly, which reduces the value this feature provides.

Create VM

  1. The VirtualDisk devices in the ConfigSpec used to create the VM are updated with VirtualDiskFlatVer2BackingInfo backings that specify a parent backing which refers to the cached, base disk from above.

    The path to each of the VM's disks is constructed based on the index of the disk, ex.: [<DATASTORE>] <KUBE_VM_OBJ_UUID>/<KUBE_VM_NAME>-<DISK_INDEX>.vmdk.

  2. The CreateVM_Task VMODL1 API is used to create the VM. Because the VM's disks have parent backings, this new VM is effectively a linked clone.
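The parent-backing arrangement in step 1 can be sketched as below, using simplified stand-ins for the vim25 types of the same names (the real govmomi types carry many more fields, and the helper functions are hypothetical):

```go
package main

import "fmt"

// VirtualDiskFlatVer2BackingInfo is a simplified stand-in for the
// govmomi/vim25 type of the same name.
type VirtualDiskFlatVer2BackingInfo struct {
	FileName string
	DiskMode string
	Parent   *VirtualDiskFlatVer2BackingInfo
}

// linkedCloneBacking builds a child-disk backing whose Parent refers to
// the cached base disk; having a parent backing is what makes the
// created VM effectively a linked clone.
func linkedCloneBacking(childPath, cachedBasePath string) *VirtualDiskFlatVer2BackingInfo {
	return &VirtualDiskFlatVer2BackingInfo{
		FileName: childPath,
		DiskMode: "persistent",
		Parent: &VirtualDiskFlatVer2BackingInfo{
			FileName: cachedBasePath,
		},
	}
}

// childDiskPath builds the per-disk path based on the disk's index, in
// the form "[<DATASTORE>] <KUBE_VM_OBJ_UUID>/<KUBE_VM_NAME>-<DISK_INDEX>.vmdk".
func childDiskPath(datastore, vmUUID, vmName string, idx int) string {
	return fmt.Sprintf("[%s] %s/%s-%d.vmdk", datastore, vmUUID, vmName, idx)
}

func main() {
	b := linkedCloneBacking(
		childDiskPath("ds1", "uuid-1234", "my-vm", 0),
		"[ds1] .contentlib-cache/item/v1/cached-base.vmdk",
	)
	fmt.Println(b.FileName, "->", b.Parent.FileName)
}
```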

Which issue(s) is/are addressed by this PR? (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):

Fixes NA

Are there any special notes for your reviewer:

Please add a release note if necessary:

Experimental support for Fast Deploy (linked clones)

@github-actions github-actions bot added the size/XXL Denotes a PR that changes 1000+ lines. label Dec 12, 2024
@akutz akutz requested a review from bryanv December 12, 2024 21:36
@akutz akutz force-pushed the feature/fast-deploy-internal-poc branch 9 times, most recently from 75d6b44 to db5ac02 on December 16, 2024 at 18:28
@akutz akutz force-pushed the feature/fast-deploy-internal-poc branch from db5ac02 to 4f04b1a on December 19, 2024 at 18:49

Code Coverage

Package Line Rate
github.com/vmware-tanzu/vm-operator/controllers/contentlibrary/clustercontentlibraryitem 82%
github.com/vmware-tanzu/vm-operator/controllers/contentlibrary/contentlibraryitem 86%
github.com/vmware-tanzu/vm-operator/controllers/contentlibrary/utils 97%
github.com/vmware-tanzu/vm-operator/controllers/infra/capability/configmap 86%
github.com/vmware-tanzu/vm-operator/controllers/infra/capability/crd 93%
github.com/vmware-tanzu/vm-operator/controllers/infra/configmap 71%
github.com/vmware-tanzu/vm-operator/controllers/infra/node 77%
github.com/vmware-tanzu/vm-operator/controllers/infra/secret 77%
github.com/vmware-tanzu/vm-operator/controllers/infra/validatingwebhookconfiguration 85%
github.com/vmware-tanzu/vm-operator/controllers/infra/zone 76%
github.com/vmware-tanzu/vm-operator/controllers/storageclass 95%
github.com/vmware-tanzu/vm-operator/controllers/storagepolicyquota 97%
github.com/vmware-tanzu/vm-operator/controllers/util/encoding 73%
github.com/vmware-tanzu/vm-operator/controllers/virtualmachine/storagepolicyusage 99%
github.com/vmware-tanzu/vm-operator/controllers/virtualmachine/virtualmachine 84%
github.com/vmware-tanzu/vm-operator/controllers/virtualmachine/volume 86%
github.com/vmware-tanzu/vm-operator/controllers/virtualmachineclass 75%
github.com/vmware-tanzu/vm-operator/controllers/virtualmachinepublishrequest 81%
github.com/vmware-tanzu/vm-operator/controllers/virtualmachinereplicaset 67%
github.com/vmware-tanzu/vm-operator/controllers/virtualmachineservice 83%
github.com/vmware-tanzu/vm-operator/controllers/virtualmachineservice/providers 92%
github.com/vmware-tanzu/vm-operator/controllers/virtualmachinesetresourcepolicy 80%
github.com/vmware-tanzu/vm-operator/controllers/virtualmachinewebconsolerequest 72%
github.com/vmware-tanzu/vm-operator/controllers/virtualmachinewebconsolerequest/v1alpha1 72%
github.com/vmware-tanzu/vm-operator/controllers/virtualmachinewebconsolerequest/v1alpha1/conditions 88%
github.com/vmware-tanzu/vm-operator/controllers/virtualmachinewebconsolerequest/v1alpha1/patch 78%
github.com/vmware-tanzu/vm-operator/pkg/bitmask 100%
github.com/vmware-tanzu/vm-operator/pkg/builder 95%
github.com/vmware-tanzu/vm-operator/pkg/conditions 88%
github.com/vmware-tanzu/vm-operator/pkg/config 100%
github.com/vmware-tanzu/vm-operator/pkg/config/capabilities 100%
github.com/vmware-tanzu/vm-operator/pkg/config/env 100%
github.com/vmware-tanzu/vm-operator/pkg/context/generic 100%
github.com/vmware-tanzu/vm-operator/pkg/context/operation 100%
github.com/vmware-tanzu/vm-operator/pkg/patch 78%
github.com/vmware-tanzu/vm-operator/pkg/prober 91%
github.com/vmware-tanzu/vm-operator/pkg/prober/probe 90%
github.com/vmware-tanzu/vm-operator/pkg/prober/worker 77%
github.com/vmware-tanzu/vm-operator/pkg/providers/vsphere 72%
github.com/vmware-tanzu/vm-operator/pkg/providers/vsphere/client 80%
github.com/vmware-tanzu/vm-operator/pkg/providers/vsphere/clustermodules 71%
github.com/vmware-tanzu/vm-operator/pkg/providers/vsphere/config 89%
github.com/vmware-tanzu/vm-operator/pkg/providers/vsphere/contentlibrary 74%
github.com/vmware-tanzu/vm-operator/pkg/providers/vsphere/credentials 100%
github.com/vmware-tanzu/vm-operator/pkg/providers/vsphere/network 80%
github.com/vmware-tanzu/vm-operator/pkg/providers/vsphere/placement 80%
github.com/vmware-tanzu/vm-operator/pkg/providers/vsphere/session 71%
github.com/vmware-tanzu/vm-operator/pkg/providers/vsphere/storage 44%
github.com/vmware-tanzu/vm-operator/pkg/providers/vsphere/sysprep 100%
github.com/vmware-tanzu/vm-operator/pkg/providers/vsphere/vcenter 81%
github.com/vmware-tanzu/vm-operator/pkg/providers/vsphere/virtualmachine 84%
github.com/vmware-tanzu/vm-operator/pkg/providers/vsphere/vmlifecycle 65%
github.com/vmware-tanzu/vm-operator/pkg/record 87%
github.com/vmware-tanzu/vm-operator/pkg/topology 91%
github.com/vmware-tanzu/vm-operator/pkg/util 88%
github.com/vmware-tanzu/vm-operator/pkg/util/annotations 100%
github.com/vmware-tanzu/vm-operator/pkg/util/cloudinit 89%
github.com/vmware-tanzu/vm-operator/pkg/util/cloudinit/validate 91%
github.com/vmware-tanzu/vm-operator/pkg/util/image 100%
github.com/vmware-tanzu/vm-operator/pkg/util/kube 89%
github.com/vmware-tanzu/vm-operator/pkg/util/kube/cource 100%
github.com/vmware-tanzu/vm-operator/pkg/util/kube/internal 100%
github.com/vmware-tanzu/vm-operator/pkg/util/kube/proxyaddr 73%
github.com/vmware-tanzu/vm-operator/pkg/util/kube/spq 100%
github.com/vmware-tanzu/vm-operator/pkg/util/netplan 100%
github.com/vmware-tanzu/vm-operator/pkg/util/ovfcache 75%
github.com/vmware-tanzu/vm-operator/pkg/util/ovfcache/internal 100%
github.com/vmware-tanzu/vm-operator/pkg/util/paused 100%
github.com/vmware-tanzu/vm-operator/pkg/util/ptr 100%
github.com/vmware-tanzu/vm-operator/pkg/util/resize 97%
github.com/vmware-tanzu/vm-operator/pkg/util/vmopv1 80%
github.com/vmware-tanzu/vm-operator/pkg/util/vsphere/client 64%
github.com/vmware-tanzu/vm-operator/pkg/util/vsphere/library 100%
github.com/vmware-tanzu/vm-operator/pkg/util/vsphere/vm 79%
github.com/vmware-tanzu/vm-operator/pkg/util/vsphere/watcher 87%
github.com/vmware-tanzu/vm-operator/pkg/vmconfig 95%
github.com/vmware-tanzu/vm-operator/pkg/vmconfig/crypto 98%
github.com/vmware-tanzu/vm-operator/pkg/webconsolevalidation 100%
github.com/vmware-tanzu/vm-operator/services/vm-watcher 92%
github.com/vmware-tanzu/vm-operator/webhooks/common 100%
github.com/vmware-tanzu/vm-operator/webhooks/persistentvolumeclaim/validation 95%
github.com/vmware-tanzu/vm-operator/webhooks/unifiedstoragequota/validation 89%
github.com/vmware-tanzu/vm-operator/webhooks/virtualmachine/mutation 87%
github.com/vmware-tanzu/vm-operator/webhooks/virtualmachine/validation 95%
github.com/vmware-tanzu/vm-operator/webhooks/virtualmachineclass/mutation 62%
github.com/vmware-tanzu/vm-operator/webhooks/virtualmachineclass/validation 89%
github.com/vmware-tanzu/vm-operator/webhooks/virtualmachinepublishrequest/validation 92%
github.com/vmware-tanzu/vm-operator/webhooks/virtualmachinereplicaset/validation 90%
github.com/vmware-tanzu/vm-operator/webhooks/virtualmachineservice/mutation 67%
github.com/vmware-tanzu/vm-operator/webhooks/virtualmachineservice/validation 92%
github.com/vmware-tanzu/vm-operator/webhooks/virtualmachinesetresourcepolicy/validation 89%
github.com/vmware-tanzu/vm-operator/webhooks/virtualmachinewebconsolerequest/v1alpha1/validation 92%
github.com/vmware-tanzu/vm-operator/webhooks/virtualmachinewebconsolerequest/validation 92%
Summary 82% (10915 / 13232)

Minimum allowed line rate is 79%

Labels
cla-not-required size/XXL Denotes a PR that changes 1000+ lines.