Enhance xenorchestra_vm to provide a method for graceful termination #222

Open · wants to merge 5 commits into base: master
Conversation

ddelnano (Collaborator) commented Dec 14, 2022

This is a continuation of #212.

I discovered that #220 needed to be addressed to avoid a race condition that existed when the XO client inside the provider tried to shut down VMs. In addition to this, I decided this was better to provide as an opt-in rather than as the default behavior.
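For reference, this is the opt-in delete path in clean form, reconstructed from the removed lines in the diff under Testing below; the trailing return stands in for the unchanged remainder of the destroy logic:

func resourceVmDelete(d *schema.ResourceData, m interface{}) error {
	c := m.(client.XOClient)
	vmId := d.Id()

	// Opt-in: only attempt a clean guest shutdown when the user asked for it.
	if d.Get("use_graceful_termination").(bool) {
		vm, err := c.GetVm(client.Vm{Id: vmId})
		if err != nil {
			return err
		}

		// Halting a VM that isn't running errors out, so guard on power state.
		if vm.PowerState == "Running" {
			if err := c.HaltVm(vmId); err != nil {
				return err
			}
		}
	}

	// ... the rest of the destroy logic is unchanged.
	return nil
}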

Testing

  • make testacc passes
  • New tests pass
  • New tests fail with previous xenorchestra_vm delete behavior
# Verify it is using previous destroy behavior
$ git diff
diff --git a/xoa/resource_xenorchestra_vm.go b/xoa/resource_xenorchestra_vm.go
index 74a86ec..71c97da 100644
--- a/xoa/resource_xenorchestra_vm.go
+++ b/xoa/resource_xenorchestra_vm.go
@@ -837,22 +837,22 @@ func resourceVmUpdate(d *schema.ResourceData, m interface{}) error {

 func resourceVmDelete(d *schema.ResourceData, m interface{}) error {
        c := m.(client.XOClient)
-       gracefulTermination := d.Get("use_graceful_termination").(bool)
+       // gracefulTermination := d.Get("use_graceful_termination").(bool)
        vmId := d.Id()

-       if gracefulTermination {
-               vm, err := c.GetVm(client.Vm{Id: vmId})
-               if err != nil {
-                       return err
-               }
-
-               if vm.PowerState == "Running" {
-                       err = c.HaltVm(vmId)
-                       if err != nil {
-                               return err
-                       }
-               }
-       }
+       // if gracefulTermination {
+       //      vm, err := c.GetVm(client.Vm{Id: vmId})
+       //      if err != nil {
+       //              return err
+       //      }
+
+       //      if vm.PowerState == "Running" {
+       //              err = c.HaltVm(vmId)
+       //              if err != nil {
+       //                      return err
+       //              }
+       //      }
+       // }

$ TEST=TestAccXenorchestraVm_gracefulTermination make testacc

=== RUN   TestAccXenorchestraVm_gracefulTermination
=== PAUSE TestAccXenorchestraVm_gracefulTermination
=== RUN   TestAccXenorchestraVm_gracefulTerminationForShutdownVm
=== PAUSE TestAccXenorchestraVm_gracefulTerminationForShutdownVm
=== CONT  TestAccXenorchestraVm_gracefulTermination
=== CONT  TestAccXenorchestraVm_gracefulTerminationForShutdownVm
=== CONT  TestAccXenorchestraVm_gracefulTermination
    resource_xenorchestra_vm_test.go:966: Step 2/2 error: Error running destroy: exit status 1

        Error: mock client did not receive a stopped Vm. Graceful termination was bypassed!



    testing_new.go:63: Error running post-test destroy, there may be dangling resources: exit status 1

        Error: mock client did not receive a stopped Vm. Graceful termination was bypassed!



--- FAIL: TestAccXenorchestraVm_gracefulTermination (42.79s)
--- PASS: TestAccXenorchestraVm_gracefulTerminationForShutdownVm (109.80s)
FAIL

4censord and others added 4 commits December 14, 2022 14:23
This allows the VM to terminate its work properly.
As discussed in #212

Trying to stop a VM that is not running results in an error, so we only
do shutdowns for running VMs now

Possible power-states are "Running, Halted, Paused, Suspended"
Add acceptance tests that verify if graceful termination was successful
ddelnano (Collaborator, Author) commented Dec 14, 2022

@4censord please review this to verify it will work for your use case. The only functional difference should be that this functionality must be opted into with a new xenorchestra_vm resource argument -- use_graceful_termination.

After discovering #220, making this the default has the potential to significantly slow down the tests and the general case. From running the test suite a few times against #212, the default 2-minute timeout for validating that PV drivers are present causes more flakiness in the build and longer test times.

I think it should be expected that when Terraform destroys resources, it may not be graceful. Therefore, I think making this opt-in is the best solution.
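For illustration, the new argument could be declared roughly like this -- a minimal sketch assuming the terraform-plugin-sdk/v2 schema helpers; the exact declaration in this PR may differ:

package xoa

import "github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema"

// Sketch only: defaulting to false preserves the previous (non-graceful)
// destroy behavior unless a user explicitly opts in.
var useGracefulTerminationSchema = map[string]*schema.Schema{
	"use_graceful_termination": {
		Type:     schema.TypeBool,
		Optional: true,
		Default:  false,
	},
}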

4censord (Contributor) commented Dec 15, 2022

This would work for me.

I would still argue that it should be the default behavior:

> I think it should be expected that when Terraform destroys resources, it may not be graceful. Therefore, I think making this opt-in is the best solution.

Most other providers seem to work this way. But I understand that it is not practical to do so right now.

When attempting to shut down a VM without the management agent installed, this now times out, but it does not clearly log the problem:

xenorchestra_vm.vm: Still destroying... [id=361d045e-c9cf-661d-4863-1d9f5d38681e, 2m0s elapsed]
╷
│ Error: failed to gracefully halt the vm with id: 361d045e-c9cf-661d-4863-1d9f5d38681e and error: timeout while waiting for state to become 'true' (last state: 'false', timeout: 2m0s)
│ 
│ 
╵

4censord (Contributor) commented:

Changing the use_graceful_termination attribute takes almost 30s:

Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols:
  ~ update in-place

Terraform will perform the following actions:

  # xenorchestra_vm.vm will be updated in-place
  ~ resource "xenorchestra_vm" "vm" {
        id                       = "361d045e-c9cf-661d-4863-1d9f5d38681e"
        tags                     = [
            "dev",
        ]
      ~ use_graceful_termination = true -> false
        # (20 unchanged attributes hidden)

        # (3 unchanged blocks hidden)
    }

Plan: 0 to add, 1 to change, 0 to destroy.
xenorchestra_vm.vm: Modifying... [id=361d045e-c9cf-661d-4863-1d9f5d38681e]
xenorchestra_vm.vm: Still modifying... [id=361d045e-c9cf-661d-4863-1d9f5d38681e, 10s elapsed]
xenorchestra_vm.vm: Still modifying... [id=361d045e-c9cf-661d-4863-1d9f5d38681e, 20s elapsed]
xenorchestra_vm.vm: Modifications complete after 26s [id=361d045e-c9cf-661d-4863-1d9f5d38681e]

Apply complete! Resources: 0 added, 1 changed, 0 destroyed.

ddelnano (Collaborator, Author) commented:

> Most other providers seem to work this way.

It seems vsphere does this, but uses a 3 min timeout by default. What other providers have you seen?

> When attempting to shut down a VM without the management agent installed, this now times out.

#220 may need a slightly different heuristic. I'll see if the XO API exposes the presence of PV drivers rather than relying on the management agent being present. We should potentially handle the case where there aren't PV drivers at all. Allowing people to opt into a forceful stop, and using it if the graceful method fails or times out, could satisfy that, but I wanted to see how XO handles things first.
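A hedged sketch of that fallback idea -- ForceStopVm is hypothetical (only GetVm/HaltVm appear in this PR), and the small interface exists just to keep the example self-contained:

package xoa

import "time"

// haltClient is the subset of client.XOClient this sketch needs.
type haltClient interface {
	HaltVm(id string) error
	ForceStopVm(id string) error // hypothetical; not in the current client
}

// gracefulThenForce attempts a clean guest shutdown and falls back to a hard
// power-off if the guest hasn't halted within timeout (e.g. no PV drivers).
func gracefulThenForce(c haltClient, vmId string, timeout time.Duration) error {
	errCh := make(chan error, 1)
	go func() { errCh <- c.HaltVm(vmId) }()

	select {
	case err := <-errCh:
		// Graceful halt returned (successfully or with an error) in time.
		return err
	case <-time.After(timeout):
		// Guest never acknowledged the shutdown; force it off instead of
		// surfacing the opaque "timeout while waiting for state" error.
		return c.ForceStopVm(vmId)
	}
}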

> Changing the use_graceful_termination attribute takes almost 30s

Modifying VMs unfortunately relies on a 25-second sleep (source). That's another area with a race condition: not being able to detect that the value has persisted causes problems for the test suite.

I don't think I can fix that in this PR, but I can look into it early next year.
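One way that sleep could eventually become a poll -- a sketch assuming terraform-plugin-sdk/v2's resource.Retry helper; the GetVmNameLabel accessor and the 30-second bound are made up for illustration:

package xoa

import (
	"fmt"
	"time"

	"github.com/hashicorp/terraform-plugin-sdk/v2/helper/resource"
)

// vmReader is the subset of the XO client this sketch needs; GetVmNameLabel
// is a hypothetical accessor.
type vmReader interface {
	GetVmNameLabel(id string) (string, error)
}

// waitForVmUpdate polls until an updated field is visible through the API,
// rather than sleeping for a fixed 25 seconds and hoping it persisted.
func waitForVmUpdate(c vmReader, vmId, want string) error {
	return resource.Retry(30*time.Second, func() *resource.RetryError {
		got, err := c.GetVmNameLabel(vmId)
		if err != nil {
			return resource.NonRetryableError(err)
		}
		if got != want {
			return resource.RetryableError(fmt.Errorf("update not visible yet: %q", got))
		}
		return nil
	})
}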

4censord (Contributor) commented:

> It seems vsphere does this, but uses a 3 min timeout by default. What other providers have you seen?

Does or attempts graceful termination:

  • hyperv
  • vsphere
  • proxmox
  • openstack

Does not do graceful termination:

  • libvirt

> Changing the use_graceful_termination attribute takes almost 30s

> Modifying VMs unfortunately relies on a 25-second sleep (source).

Oh, I see.

ddelnano (Collaborator, Author) commented:

#223 will address the regression of requiring the management agent and will properly detect PV drivers.

ddelnano (Collaborator, Author) commented Dec 15, 2022

I think I'd like to take another pass at this and implement things closer to how hyperv works:

  • Has the ability to configure a shutdown timeout
  • Has a configuration setting for forceful shutdown (something the XO provider doesn't have yet), which is used after the shutdown timeout expires

Having forceful shutdown should prevent the test suite flakiness that I experienced when setting graceful shutdown as the default. So hopefully that should be more in line with what you were initially advocating for.
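Sketched as schema attributes, those two knobs might look like this (names and defaults are assumptions, not API committed by this PR):

package xoa

import "github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema"

// Sketch of hyperv-style shutdown knobs; both attribute names are hypothetical.
var shutdownKnobs = map[string]*schema.Schema{
	"shutdown_timeout": {
		Type:        schema.TypeInt,
		Optional:    true,
		Default:     120, // seconds to wait for a graceful guest shutdown
		Description: "How long to wait for a graceful shutdown before giving up.",
	},
	"force_shutdown": {
		Type:        schema.TypeBool,
		Optional:    true,
		Default:     false,
		Description: "Hard power-off the VM once shutdown_timeout expires.",
	},
}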
