
Newly-added sled wasn't chosen for instance allocation (for a while?) #5181

Open · jgallagher opened this issue Mar 1, 2024 · 3 comments

@jgallagher (Contributor)

During today's demos, I attempted to show adding a sled to madrid and seeing newly-started instances running on it. Before the demo, I started 10 instances (with only 3 sleds available), and the distribution was:

   1 BRM42220046
   5 BRM42220081
   4 BRM44220001

During the demo, I started 30 instances, and the distribution was:

  10  BRM42220046
  10  BRM42220081
   9  BRM44220001
   0  BRM42220004 (added sled)

Each of these instances wanted 8 CPUs, so at this point the CPU resource usage was:

  11 instances =  88 CPUs  BRM42220046
  15 instances = 120 CPUs  BRM42220081
  13 instances = 104 CPUs  BRM44220001

We started a call to debug why the new sled wasn't receiving instances. Without making any changes to madrid, I started 20 more instances, and they were allocated to these sleds in this order:

BRM42220081 (new total = 16)
BRM42220004 (new total =  1)
BRM42220046 (new total = 12)
BRM42220004 (new total =  2)
BRM42220046 (new total = 13)
BRM42220046 (new total = 14)
BRM42220046 (new total = 15)
BRM42220004 (new total =  3)
BRM42220004 (new total =  4)
BRM42220004 (new total =  5)
BRM44220001 (new total = 14)
BRM42220004 (new total =  6)
BRM42220004 (new total =  7)
BRM42220046 (new total = 16)
BRM42220004 (new total =  8)
BRM42220004 (new total =  9)
BRM44220001 (new total = 15)
BRM44220001 (new total = 16)
BRM42220004 (new total = 10)
BRM42220004 (new total = 11)

Once a sled hits 16 instances, we expect not to choose it again, since 16 × 8 = 128 vCPUs maxes out its usable hardware threads. Purely eyeballing it, this final list of 20 looks believable if we're choosing randomly, but getting 0 of the first 30 instances on the new sled is exceedingly unlikely (~0.02%, assuming a 25% chance for each instance).
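
A quick sanity check on that figure (a sketch, assuming each of the first 30 placements independently had a 25% chance of landing on the new sled):

    fn main() {
        // If each of the first 30 placements independently had a 1-in-4 chance of
        // landing on the new sled, the chance it gets none of them is (3/4)^30.
        let p_all_miss = 0.75_f64.powi(30);
        println!("{:.4}%", p_all_miss * 100.0); // prints ~0.0179%
    }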

The exact query we're using to select a sled for one of the above instances is:

SELECT "sled"."id" FROM ("sled" LEFT OUTER JOIN "sled_resource" ON ("sled_resource"."sled_id" = "sled"."id")) WHERE ((("sled"."time_deleted" IS NULL) AND ("sled"."sled_policy" = 'in_service')) AND ("sled"."sled_state" = 'active')) GROUP BY "sled"."id" HAVING ((((COALESCE(SUM(CAST(hardware_threads as INT8)), 0) + 8) <= "sled"."usable_hardware_threads") AND ((COALESCE(SUM(CAST(rss_ram as INT8)), 0) + 0) <= "sled"."usable_physical_ram")) AND ((COALESCE(SUM(CAST(reservoir_ram as INT8)), 0) + 17179869184) <= "sled"."reservoir_size")) ORDER BY random() LIMIT 1

If we remove the LIMIT 1 and run the query with CRDB's AS OF SYSTEM TIME set to just after the new sled was added, we see all four sleds were available. Our best guesses at the moment are:

1. There is something suspect with ORDER BY random() that caused us to be unlucky.
2. We hit something like oxidecomputer/async-bb8-diesel#47, and the Nexus(es) choosing sled placement were seeing stale CRDB data. We checked for and saw no open transactions, so if this was happening, we don't have any postmortem evidence of it.

@davepacheco (Collaborator)

> There is something suspect with ORDER BY random() that caused us to be unlucky.

I tried to gauge this by running the ORDER BY random() LIMIT 1 version a bunch of times too. I pasted this query:

SELECT "sled"."id" FROM ("sled" LEFT OUTER JOIN "sled_resource" ON ("sled_resource"."sled_id" = "sled"."id")) AS OF SYSTEM TIME '2024-03-01 19:14:35.337329+00' WHERE ((("sled"."time_deleted" IS NULL) AND ("sled"."sled_policy" = 'in_service')) AND ("sled"."sled_state" = 'active')) GROUP BY "sled"."id" HAVING ((((COALESCE(SUM(CAST(hardware_threads as INT8)), 0) + 8) <= "sled"."usable_hardware_threads") AND ((COALESCE(SUM(CAST(rss_ram as INT8)), 0) + 0) <= "sled"."usable_physical_ram")) AND ((COALESCE(SUM(CAST(reservoir_ram as INT8)), 0) + 17179869184) <= "sled"."reservoir_size")) ORDER BY random() LIMIT 1;

100 times into a file, preceded by \x to get extended output mode. Then I ran it:

# grep id dap.out | sort | uniq -c | sort -n
  20 id | fa3c33bf-bc6c-4d29-8dcc-c16b9ab1ec89
  24 id | a243c1d0-9051-4b94-ab3e-f2a93fd0ae4f
  25 id | 1762267a-a0cb-4c9a-88b4-8ce828cbc021
  31 id | a6634f64-f6fb-426c-b00b-9b30aed9f53a
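
A rough quantitative check on those counts (a sketch; 7.81 is the standard chi-squared critical value at p = 0.05 with 3 degrees of freedom):

    fn main() {
        // Observed picks per sled over the 100 runs above, vs. the 25 each we'd
        // expect from a uniform random choice among the four eligible sleds.
        let observed = [20.0_f64, 24.0, 25.0, 31.0];
        let expected = 25.0;
        let chi_squared: f64 = observed.iter().map(|&o| (o - expected).powi(2) / expected).sum();
        // chi_squared ≈ 2.48, well below 7.81, so these counts look consistent
        // with a uniform random choice.
        println!("chi-squared = {chi_squared:.2}");
    }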

So it doesn't seem like either random() or ORDER BY random() is the problem.


> We hit something like oxidecomputer/async-bb8-diesel#47, and the Nexus(es) choosing sled placement were seeing stale CRDB data. We checked for and saw no open transactions, so if this was happening, we don't have any postmortem evidence of it.

@smklein also pointed out that async-bb8-diesel verifies that at least diesel thinks we're not in a transaction. So for something like this to have happened, we'd expect some other code path to have entered a transaction without using diesel (e.g., by executing BEGIN;), never closed it, and returned the connection to the pool in that state... but then something else would have had to close it (for us not to find it when we checked).

@davepacheco (Collaborator)

Probably useless, but for posterity: I wondered whether it was possible that we somehow did pick the new sled, failed to use it for some reason, and then tried one of the others. I found no code path that would do this. I also looked in the saga log for all instance-start sagas during the demo-day window where this happened and printed out the sled allocated for each one:

root@[fd00:1122:3344:101::3]:32221/omicron> select data,count(*) FROM saga_node_event WHERE saga_id IN (
  select id from saga WHERE name = 'instance-start' AND time_created >= '2024-03-01 19:14:41.236772+00'
  AND time_created <= '2024-03-01 19:24:31.438626+00') AND node_id = 1 AND event_type <> 'started' GROUP BY data;
                   data                  | count
-----------------------------------------+--------
  "1762267a-a0cb-4c9a-88b4-8ce828cbc021" |     9
  "fa3c33bf-bc6c-4d29-8dcc-c16b9ab1ec89" |    10
  "a6634f64-f6fb-426c-b00b-9b30aed9f53a" |    10
(3 rows)


Time: 40ms total (execution 39ms / network 0ms)

I think that rules out the idea that we did pick this sled, then something failed, and we went and used a different sled. If that had happened, I'd expect to see extra instance-start sagas and/or some that had picked the new sled.

jgallagher added a commit that referenced this issue Mar 1, 2024: "We hope this gives us some insight into #5181."
jgallagher added a commit that referenced this issue Mar 4, 2024
@jgallagher (Contributor, Author) commented Mar 4, 2024

I tried reproducing this today after landing #5182, and it did not reproduce. Again I started 10 instances prior to adding the sled and 30 after. The 30 instances created after setting up the new sled had this distribution:

  12 BRM42220004 (new sled)
   4 BRM42220046 (total=8)
   6 BRM42220081 (total=10)
   8 BRM44220001 (total=10)

I noticed two things while doing this:

  • All of the instance-start sagas were handled by one Nexus instance. (I don't know whether this is expected or problematic.)
  • When I manually ran the SELECT described above to see which sleds are eligible for new instances, the new sled was returned even before it had an NTP zone.

I'm going to take another lap and check what happens if I start instances after adding the sled but before going through Reconfigurator to give it NTP / crucible zones; that may result in a separate issue. Afterwards I'll try the baseline experiment a few more times to try to reproduce what we saw on Friday.
