Migration Failed, Removed NICs, Couldn't Migrate without #90

Open
Smithx10 opened this issue Mar 18, 2023 · 2 comments

@Smithx10

I went to migrate a VM to a newly provisioned CN. The first attempt failed because imgadm was misconfigured on the target CN.

Error from Migrate Begin:

  "migrationTask": {
    "action": "begin",
    "record": {
      "action": "begin",
      "automatic": true,
      "created_timestamp": "2023-03-18T03:14:48.264Z",
      "id": "869c15a8-bd90-4a11-a0c9-763c0cc3bbda",
      "num_sync_phases": 0,
      "phase": "begin",
      "progress_history": [
        {
          "type": "progress",
          "message": "reserving instance failed - \"{\\\"name\\\":\\\"imgadm\\\",\\\"req_id\\\":\\\"0ab3eef9-603f-4d70-b5f4-8bd19e9cf88f\\\",\\\"hostname\\\":\\\"3c-ec-ef-d8-14-da\\\",\\\"pid\\\":140051,\\\"level\\\":20,\\\"opts\\\":{},\\\"args\\\":[\\\"import\\\",\\\"-q\\\",\\\"-P\\\",\\\"zones\\\",\\\"b4632bed-a140-4c5c-88c8-f9c36a2d012e\\\"],\\\"cli\\\":true,\\\"msg\\\":\\\"cli init\\\",\\\"time\\\":\\\"2023-03-18T03:14:58.289Z\\\",\\\"v\\\":0}",
          "phase": "begin",
          "state": "failed",
          "started_timestamp": "2023-03-18T03:14:53.083Z",
          "current_progress": 1,
          "total_progress": 100,
          "job_uuid": "c922d71f-6781-46f4-b594-cd5f2e7e8ac3",
          "errorDetail": "\"{\\\"name\\\":\\\"imgadm\\\",\\\"req_id\\\":\\\"0ab3eef9-603f-4d70-b5f4-8bd19e9cf88f\\\",\\\"hostname\\\":\\\"3c-ec-ef-d8-14-da\\\",\\\"pid\\\":140051,\\\"level\\\":20,\\\"opts\\\":{},\\\"args\\\":[\\\"import\\\",\\\"-q\\\",\\\"-P\\\",\\\"zones\\\",\\\"b4632bed-a140-4c5c-88c8-f9c36a2d012e\\\"],\\\"cli\\\":true,\\\"msg\\\":\\\"cli init\\\",\\\"time\\\":\\\"2023-03-18T03:14:58.289Z\\\",\\\"v\\\":0}\\n{\\\"name\\\":\\\"imgadm\\\",\\\"req_id\\\":\\\"0ab3eef9-603f-4d70-b5f4-8bd19e9cf88f\\\",\\\"hostname\\\":\\\"3c-ec-ef-d8-14-da\\\",\\\"pid\\\":140051,\\\"level\\\":30,\\\"uuid\\\":\\\"b4632bed-a140-4c5c-88c8-f9c36a2d012e\\\",\\\"arg\\\":\\\"b4632bed-a140-4c5c-88c8-f9c36a2d012e\\\",\\\"msg\\\":\\\"image-id validated\\\",\\\"time\\\":\\\"2023-03-18T03:14:58.298Z\\\",\\\"v\\\":0}\\n{\\\"name\\\":\\\"imgadm\\\",\\\"req_id\\\":\\\"0ab3eef9-603f-4d70-b5f4-8bd19e9cf88f\\\",\\\"hostname\\\":\\\"3c-ec-ef-d8-14-da\\\",\\\"pid\\\":140051,\\\"level\\\":20,\\\"subcmd\\\":\\\"import\\\",\\\"exitStatus\\\":1,\\\"cli\\\":true,\\\"msg\\\":\\\"cli exit\\\",\\\"time\\\":\\\"2023-03-18T03:15:13.531Z\\\",\\\"v\\\":0}\\nimgadm import: error (ENOTFOUND): Error: getaddrinfo ENOTFOUND\\n    at errnoException (dns.js:37:11)\\n    at Object.onanswer [as oncomplete] (dns.js:124:16)\"",
          "finished_timestamp": "2023-03-18T03:15:20.854Z",
          "error": "\"{\\\"name\\\":\\\"imgadm\\\",\\\"req_id\\\":\\\"0ab3eef9-603f-4d70-b5f4-8bd19e9cf88f\\\",\\\"hostname\\\":\\\"3c-ec-ef-d8-14-da\\\",\\\"pid\\\":140051,\\\"level\\\":20,\\\"opts\\\":{},\\\"args\\\":[\\\"import\\\",\\\"-q\\\",\\\"-P\\\",\\\"zones\\\",\\\"b4632bed-a140-4c5c-88c8-f9c36a2d012e\\\"],\\\"cli\\\":true,\\\"msg\\\":\\\"cli init\\\",\\\"time\\\":\\\"2023-03-18T03:14:58.289Z\\\",\\\"v\\\":0}",
          "disallowRetry": true
        }
      ],
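
Most of `errorDetail` above is imgadm's JSON log output; the readable failure is only in the last few lines. A minimal sketch for pulling those lines out, assuming the field is a JSON-encoded string whose lines are either JSON log records or plain text. The helper name `plainErrorLines` is made up for illustration and is not part of sdc-vmapi.

```javascript
// Hypothetical helper, not part of sdc-vmapi.
// Assumption: errorDetail is a JSON-encoded string holding newline-separated
// JSON log records followed by plain-text error output.
function plainErrorLines(errorDetail) {
    var text = errorDetail;
    try {
        var decoded = JSON.parse(errorDetail);
        if (typeof decoded === 'string') {
            text = decoded;
        }
    } catch (e) {
        // Not valid JSON (for example a truncated value): use the raw text.
    }
    return text.split('\n').filter(function (line) {
        if (line.trim() === '') {
            return false;
        }
        try {
            // Drop lines that parse as JSON objects (log records).
            var parsed = JSON.parse(line);
            return (typeof parsed !== 'object' || parsed === null);
        } catch (e) {
            // Not JSON: keep it as human-readable output.
            return true;
        }
    });
}
```

Run against the record above, this should leave just the `imgadm import: error (ENOTFOUND)` line and its stack trace.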

Error from provision, showing that the NICs existed on this first attempt:

  "filteredNetworks": {
    "netInfo": [
      {
        "family": "ipv4",
        "mtu": 1500,
        "nic_tag": "external",
        "name": "external_2",
        "provision_end_ip": "10.91.210.253",
        "provision_start_ip": "10.91.210.2",
        "subnet": "10.91.210.0/24",
        "uuid": "0f8cf346-73b6-4a55-9e32-ae0376da9cea",
        "vlan_id": 2042,
        "resolvers": [
          "10.45.137.14",
          "10.45.137.15"
        ],
        "gateway": "10.91.210.254",
        "routes": {},
        "netmask": "255.255.255.0"
      },
      {
        "family": "ipv4",
        "mtu": 9000,
        "nic_tag": "external",
        "name": "drbd-0",
        "provision_end_ip": "40.40.41.254",
        "provision_start_ip": "40.40.41.0",
        "subnet": "40.40.40.0/22",
        "uuid": "0546aa0a-2d03-4bff-82d6-4cc91cf07e74",
        "vlan_id": 2084,
        "resolvers": [],
        "routes": {},
        "description": "drbd replication network",
        "owner_uuids": [
          "b083728b-8d87-42bf-f633-be04f3038460"
        ],
        "netmask": "255.255.252.0"
      }
    ],

Root-cause error from imgadm (we deployed without the latest GZ tools and needed to remove an unresolvable imgapi server):

    {
      "result": "",
      "error": "{\"name\":\"imgadm\",\"req_id\":\"0ab3eef9-603f-4d70-b5f4-8bd19e9cf88f\",\"hostname\":\"3c-ec-ef-d8-14-da\",\"pid\":140051,\"level\":20,\"opts\":{},\"args\":[\"import\",\"-q\",\"-P\",\"zones\",\"b4632bed-a140-4c5c-88c8-f9c36a2d012e\"],\"cli\":true,\"msg\":\"cli init\",\"time\":\"2023-03-18T03:14:58.289Z\",\"v\":0}\n{\"name\":\"imgadm\",\"req_id\":\"0ab3eef9-603f-4d70-b5f4-8bd19e9cf88f\",\"hostname\":\"3c-ec-ef-d8-14-da\",\"pid\":140051,\"level\":30,\"uuid\":\"b4632bed-a140-4c5c-88c8-f9c36a2d012e\",\"arg\":\"b4632bed-a140-4c5c-88c8-f9c36a2d012e\",\"msg\":\"image-id validated\",\"time\":\"2023-03-18T03:14:58.298Z\",\"v\":0}\n{\"name\":\"imgadm\",\"req_id\":\"0ab3eef9-603f-4d70-b5f4-8bd19e9cf88f\",\"hostname\":\"3c-ec-ef-d8-14-da\",\"pid\":140051,\"level\":20,\"subcmd\":\"import\",\"exitStatus\":1,\"cli\":true,\"msg\":\"cli exit\",\"time\":\"2023-03-18T03:15:13.531Z\",\"v\":0}\nimgadm import: error (ENOTFOUND): Error: getaddrinfo ENOTFOUND\n    at errnoException (dns.js:37:11)\n    at Object.onanswer [as oncomplete] (dns.js:124:16)",
      "name": "cnapi.wait_task_ensure_image",
      "started_at": "2023-03-18T03:14:56.702Z",
      "finished_at": "2023-03-18T03:15:13.678Z"
    }

After fixing the imgadm config, I aborted the migration and ran it again. This time it failed with "Invalid VM parameters".
Here are filteredNetworks and nics, both now []:

  "filteredNetworks": {
    "netInfo": [],
    "networks": [],
    "fabrics": [],
    "pools": [],
    "nics": []
  },
 "last_modified": "2023-03-18T03:15:47.000Z",
    "limit_priv": "default,-file_link_any,-net_access,-proc_fork,-proc_info,-proc_session",
    "max_locked_memory": 16640,
    "max_lwps": 4000,
    "max_physical_memory": 16640,
    "max_swap": 32768,
    "nics": [],
    "owner_uuid": "b083728b-8d87-42bf-f633-be04f3038460",
    "platform_buildstamp": "20220505T001410Z",
    "quota": 1,
    "ram": 16384,
    "resolvers": [],
    "server_uuid": "00cc1290-cb60-ea11-8000-ac1f6bbbd082",
    "snapshots": [],
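
Comparing the two jobs by eye is how the missing NICs were eventually spotted. A small sketch of doing that comparison mechanically: given the `filteredNetworks` object from the first attempt and from the second, list the array-valued keys that went from non-empty to empty. `emptiedKeys` is a hypothetical helper, and the sample objects in the usage are trimmed stand-ins for the job output above.

```javascript
// Hypothetical helper: report array-valued keys that were non-empty in
// `before` but are empty (or missing) in `after`.
function emptiedKeys(before, after) {
    return Object.keys(before).filter(function (key) {
        var was = before[key];
        var now = after[key];
        return Array.isArray(was) && was.length > 0 &&
            (!Array.isArray(now) || now.length === 0);
    });
}

// Usage with trimmed stand-ins for the two filteredNetworks objects.
var firstAttempt = { netInfo: [{ name: 'external_2' }, { name: 'drbd-0' }] };
var secondAttempt = { netInfo: [], networks: [], fabrics: [], pools: [], nics: [] };
console.log(emptiedKeys(firstAttempt, secondAttempt));
```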

The error message rightly says we have invalid VM parameters, but I think validation could do better and say which key is invalid, so the operator knows which VM parameter to look at.

      "progress_history": [
        {
          "type": "progress",
          "message": "reserving instance failed - Invalid VM parameters",
          "phase": "begin",
          "state": "failed",
          "started_timestamp": "2023-03-18T03:32:20.251Z",
          "current_progress": 1,
          "total_progress": 100,
          "job_uuid": "c696d71d-05fd-46c2-a6d6-d8c8e350d524",
          "finished_timestamp": "2023-03-18T03:32:21.486Z",
          "error": "Invalid VM parameters",
          "disallowRetry": true
        }
      ],
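
A rough sketch of what a more specific message could look like. It assumes the validation error carries a per-field list shaped like `[{ field, code, message }]` on `err.errors` or `err.body.errors`; that shape is an assumption for illustration and should be checked against the actual error object in validation.js. `describeValidationError` is a hypothetical name.

```javascript
// Hypothetical helper, not the existing sdc-vmapi code.
// Assumption: the validation error exposes per-field details as
// [{ field, code, message }] on err.errors or err.body.errors.
function describeValidationError(err) {
    var base = (err && err.message) || 'Invalid VM parameters';
    var fieldErrors = (err && (err.errors || (err.body && err.body.errors))) || [];
    if (!Array.isArray(fieldErrors) || fieldErrors.length === 0) {
        return base;
    }
    var details = fieldErrors.map(function (fe) {
        var detail = fe.field || 'unknown field';
        if (fe.code) {
            detail += ' (' + fe.code + ')';
        }
        if (fe.message) {
            detail += ': ' + fe.message;
        }
        return detail;
    });
    return base + ' - ' + details.join('; ');
}

// Usage with a made-up error object; the field details are illustrative only.
var example = {
    message: 'Invalid VM parameters',
    errors: [{ field: 'nics', code: 'Invalid', message: 'must not be empty' }]
};
console.log('reserving instance failed - ' + describeValidationError(example));
```

With something like this in the progress message, the operator would see the offending key without having to add console.log calls.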

Here is where we wrap up errors coming from migration:
https://github.com/TritonDataCenter/sdc-vmapi/blob/master/lib/workflows/vm-migration/begin.js#L477

Here is where validation is handled and "Invalid VM parameters" is returned, with the original error passed up:
https://github.com/TritonDataCenter/sdc-vmapi/blob/master/lib/common/validation.js#L1371

I wasted some time before realizing that the NICs had gone missing (lol); in the end I had to add a console.log and read the original error. From what I can tell the original error isn't logged anywhere, so it might be easy for others to waste time here too.

After adding the NIC back, the migration went through.

@bahamat
Member

bahamat commented Mar 19, 2023

Abort tells the system to just abandon everything and let the operator deal with the state. So in that regard, it did what you asked it to.

However, the correct action in this case would have been to execute a rollback which should have moved the nics back to the original instance and cleaned up the failure on the new CN.

@Smithx10
Author

@bahamat AHhhhhh, thanks for clearing that part up.

What do you think about improving the error to return the parameter that was invalid?
