Currently XDE creates and deletes OPTE virtual-to-physical mapping information for a guest IP when a caller asks to create or destroy an XDE port with that IP. In an Omicron-managed environment, this happens whenever sled agent creates a VMM on a sled (the VMM initialization process creates a port for each of the instance's NICs so that it can pass the ports to Propolis to use as network backends) and whenever a Propolis VMM shuts down (Propolis VM shutdown triggers a sled agent destruction sequence that destroys all of the VMM's associated ports).
This behavior combines with the (WIP) Nexus live migration protocol in ways that make it somewhat challenging for Nexus to ensure that all sleds have the latest mappings for a migrating instance. Some of the possible races are described in the fold below.
It would be lovely, from the perspective of keeping our control plane synchronization as simple as possible, if Nexus and sled agent could assume (or could ask to be able to assume) that they fully control the V2P mappings on each sled and that mappings won't be created or destroyed without an explicit command from the control plane.
The gory details
Nexus's LM protocol generally tries to avoid monitoring or tightly synchronizing with ongoing migration work on any particular sled. That is, the migration saga does just enough to initiate migration and then exits; when the migration succeeds or fails, the sled agents involved push instance runtime state updates that indicate which sled the instance ended up running on, with no additional coordination with Nexus (or the other sled agent!) required. This is nice for robustness (it minimizes the number of parties who have to send messages successfully in order for migration to succeed) but meshes with XDE's creation/deletion of ports in interesting ways:
- If an instance starts migrating and fails, Nexus has to make sure to push "instance is on S" mappings to sled T even though the instance didn't ultimately move (because merely creating a VMM on T created "instance is on T" mappings for its VIPs).
- Nexus would like to propagate "instance is on T" mappings as soon as it learns that an instance has successfully migrated, because until it does, other instances in the same VPC might not be able to reach it. This creates a race, though:
  1. Instance migrates from S to T
  2. Nexus pushes "instance is on T" mappings to S
  3. The source VMM on S shuts down, destroying all the XDE ports
  4. S deletes the mappings associated with the instance's VIPs, but it should keep the updated mappings instead
This latter problem is particularly thorny because Nexus would really like to avoid waiting for an extra message from the source (that might be delayed, or indeed never arrive at all) before starting to propagate mappings for an instance that it knows has successfully migrated.
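To make the two cases above concrete, here is a toy Python model of them. It is a sketch only: the `Sled` class, its method names, and the VIP address are all hypothetical and are not XDE, OPTE, or sled agent APIs. The only behavior it encodes is what is described above, namely that creating a port creates a local mapping and destroying a port deletes whatever mapping is present.

```python
from dataclasses import dataclass, field

VIP = "172.30.0.5"  # hypothetical guest IP, for illustration only


@dataclass
class Sled:
    """Toy model of one sled's V2P table: guest VIP -> name of hosting sled."""
    name: str
    v2p: dict = field(default_factory=dict)

    def create_port(self, vip):
        # Port creation implicitly creates a "VIP is on this sled" mapping.
        self.v2p[vip] = self.name

    def destroy_port(self, vip):
        # Port destruction implicitly deletes the mapping, whatever it points at.
        self.v2p.pop(vip, None)

    def set_mapping(self, vip, sled_name):
        # Explicit push from the control plane (Nexus in this issue).
        self.v2p[vip] = sled_name


def successful_migration_race():
    """Nexus's push to S arrives before the source VMM's ports are destroyed."""
    s, t = Sled("S"), Sled("T")
    s.create_port(VIP)        # instance starts out on S
    t.create_port(VIP)        # target VMM is created on T
    s.set_mapping(VIP, "T")   # Nexus pushes "instance is on T" to S
    s.destroy_port(VIP)       # source VMM shuts down and its ports go away
    return s.v2p.get(VIP), t.v2p.get(VIP)


def failed_migration():
    """Creating the target VMM leaves a stale mapping on T until Nexus fixes it."""
    s, t = Sled("S"), Sled("T")
    s.create_port(VIP)
    t.create_port(VIP)        # T now holds "instance is on T"
    stale = t.v2p.get(VIP)
    t.set_mapping(VIP, "S")   # Nexus must push "instance is on S" to T
    return stale, t.v2p.get(VIP)
```

In this model `successful_migration_race()` leaves S with no mapping at all for the VIP, even though Nexus had already pushed the correct one, and `failed_migration()` shows T holding a mapping that points at itself until Nexus overwrites it.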
There's a case to be made for having Nexus propagate V2P mappings using a reliable persistent workflow that knows how to correct itself if it applies stale mappings. (That is: RPW task reads that an instance is on sled A at generation 4; it starts propagating mappings; meanwhile the instance moves to sled B at generation 5; before exiting, the task notices the generation/location change and does another lap to make sure the right mappings are propagated.) The main downside here is that, depending on when the task runs and what it observes when, it can temporarily destroy otherwise-correct mappings and damage connectivity to an instance. (The LM protocol otherwise avoids this problem by ensuring that a migrating instance can't migrate again until any configuration affected by the migration has been updated.)
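A minimal sketch of that self-correcting loop, assuming a hypothetical `read_location()` that returns `(sled, generation)` and a hypothetical `push_mappings(sled)` that propagates mappings to every sled. Neither is a real Nexus interface, and `FakeControlPlane` exists only to simulate an instance that moves while the first lap is in flight.

```python
def propagate_until_stable(read_location, push_mappings, max_laps=10):
    """Push mappings, then re-read the location; do another lap if it changed.

    Returns (sled, generation, laps) once a lap finishes with no change.
    """
    for lap in range(1, max_laps + 1):
        sled, generation = read_location()
        push_mappings(sled)
        # Before exiting, check whether the instance moved while we were pushing.
        if read_location() == (sled, generation):
            return sled, generation, lap
    raise RuntimeError("instance location kept changing; giving up")


class FakeControlPlane:
    """Hypothetical stand-in for the database and the sleds."""

    def __init__(self):
        self.location = ("A", 4)
        self.pushed = []

    def read_location(self):
        return self.location

    def push_mappings(self, sled):
        self.pushed.append(sled)
        if len(self.pushed) == 1:
            # Simulate the instance migrating to sled B during the first lap.
            self.location = ("B", 5)


def demo():
    cp = FakeControlPlane()
    result = propagate_until_stable(cp.read_location, cp.push_mappings)
    return result, cp.pushed
```

In this simulation the task converges on sled B at generation 5 after two laps, but the list of pushes also records that the stale "instance is on A" mappings were applied first, which is the temporary connectivity damage described above.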