
Newly-added sled wasn't chosen for instance allocation (for a while?) #5181

Open · jgallagher opened this issue Mar 1, 2024 · 3 comments

@jgallagher (Contributor)

During today's demos, I attempted to show adding a sled to madrid and seeing newly-started instances running on it. Before the demo, I started 10 instances (with only 3 sleds available), and the distribution was:

   1 BRM42220046
   5 BRM42220081
   4 BRM44220001

During the demo, I started 30 instances, and the distribution was:

  10  BRM42220046
  10  BRM42220081
   9  BRM44220001
   0  BRM42220004 (added sled)

Each of these instances wanted 8 CPUs, so at this point the CPU resource usage was:

  11 instances =  88 CPUs  BRM42220046
  15 instances = 120 CPUs  BRM42220081
  13 instances = 104 CPUs  BRM44220001

We started a call to debug why the new sled wasn't receiving instances. Without making any changes to madrid, I started 20 more instances, and they were allocated to these sleds in this order:

BRM42220081 (new total = 16)
BRM42220004 (new total =  1)
BRM42220046 (new total = 12)
BRM42220004 (new total =  2)
BRM42220046 (new total = 13)
BRM42220046 (new total = 14)
BRM42220046 (new total = 15)
BRM42220004 (new total =  3)
BRM42220004 (new total =  4)
BRM42220004 (new total =  5)
BRM44220001 (new total = 14)
BRM42220004 (new total =  6)
BRM42220004 (new total =  7)
BRM42220046 (new total = 16)
BRM42220004 (new total =  8)
BRM42220004 (new total =  9)
BRM44220001 (new total = 15)
BRM44220001 (new total = 16)
BRM42220004 (new total = 10)
BRM42220004 (new total = 11)

Once a sled hits 16 instances, we expect not to choose it again, since 16 × 8 = 128 vCPUs maxes out its usable hardware threads. Purely eyeballing it, this final list of 20 looks believable if we're choosing randomly, but getting 0 of the first 30 instances on the new sled is exceedingly unlikely (~0.02%, assuming a 25% chance for each instance).
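
A quick sanity check on that figure (a sketch, assuming each of the first 30 placements independently had a 25% chance of landing on the new sled):

    fn main() {
        // If each of the first 30 placements independently had a 1-in-4 chance of
        // landing on the new sled, the chance it gets none of them is (3/4)^30.
        let p_all_miss = 0.75_f64.powi(30);
        println!("{:.4}%", p_all_miss * 100.0); // prints ~0.0179%
    }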

The exact query we're using to select a sled for one of the above instances is:

SELECT "sled"."id" FROM ("sled" LEFT OUTER JOIN "sled_resource" ON ("sled_resource"."sled_id" = "sled"."id")) WHERE ((("sled"."time_deleted" IS NULL) AND ("sled"."sled_policy" = 'in_service')) AND ("sled"."sled_state" = 'active')) GROUP BY "sled"."id" HAVING ((((COALESCE(SUM(CAST(hardware_threads as INT8)), 0) + 8) <= "sled"."usable_hardware_threads") AND ((COALESCE(SUM(CAST(rss_ram as INT8)), 0) + 0) <= "sled"."usable_physical_ram")) AND ((COALESCE(SUM(CAST(reservoir_ram as INT8)), 0) + 17179869184) <= "sled"."reservoir_size")) ORDER BY random() LIMIT 1

If we remove the LIMIT 1 and run the query with CRDB's AS OF SYSTEM TIME set to just after the new sled was added, we see all four sleds were available. Our best guesses at the moment are:

1. There is something suspect with ORDER BY random() that caused us to be unlucky.
2. We hit something like oxidecomputer/async-bb8-diesel#47, and the Nexus(es) choosing sled placement were seeing stale CRDB data. We checked for and saw no open transactions, so if this was happening, we don't have any postmortem evidence of it.

@davepacheco (Collaborator)

> There is something suspect with ORDER BY random() that caused us to be unlucky.

I tried to gauge this by running the ORDER BY random() LIMIT 1 version a bunch of times too. I pasted this query:

SELECT "sled"."id" FROM ("sled" LEFT OUTER JOIN "sled_resource" ON ("sled_resource"."sled_id" = "sled"."id")) AS OF SYSTEM TIME '2024-03-01 19:14:35.337329+00' WHERE ((("sled"."time_deleted" IS NULL) AND ("sled"."sled_policy" = 'in_service')) AND ("sled"."sled_state" = 'active')) GROUP BY "sled"."id" HAVING ((((COALESCE(SUM(CAST(hardware_threads as INT8)), 0) + 8) <= "sled"."usable_hardware_threads") AND ((COALESCE(SUM(CAST(rss_ram as INT8)), 0) + 0) <= "sled"."usable_physical_ram")) AND ((COALESCE(SUM(CAST(reservoir_ram as INT8)), 0) + 17179869184) <= "sled"."reservoir_size")) ORDER BY random() LIMIT 1;

100 times into a file, preceded by \x to get extended output mode. Then I ran it:

# grep id dap.out | sort | uniq -c | sort -n
  20 id | fa3c33bf-bc6c-4d29-8dcc-c16b9ab1ec89
  24 id | a243c1d0-9051-4b94-ab3e-f2a93fd0ae4f
  25 id | 1762267a-a0cb-4c9a-88b4-8ce828cbc021
  31 id | a6634f64-f6fb-426c-b00b-9b30aed9f53a
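
A rough quantitative check on those counts (a sketch; 7.81 is the standard chi-squared critical value at p = 0.05 with 3 degrees of freedom):

    fn main() {
        // Observed picks per sled over the 100 runs above, vs. the 25 each we'd
        // expect from a uniform random choice among the four eligible sleds.
        let observed = [20.0_f64, 24.0, 25.0, 31.0];
        let expected = 25.0;
        let chi_squared: f64 = observed.iter().map(|&o| (o - expected).powi(2) / expected).sum();
        // chi_squared ≈ 2.48, well below 7.81, so these counts look consistent
        // with a uniform random choice.
        println!("chi-squared = {chi_squared:.2}");
    }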

So it doesn't seem like either random() or ORDER BY random() is the problem.


> We hit something like oxidecomputer/async-bb8-diesel#47, and the Nexus(es) choosing sled placement were seeing stale CRDB data. We checked for and saw no open transactions, so if this was happening, we don't have any postmortem evidence of it.

@smklein also pointed out that async-bb8-diesel verifies that at least diesel thinks we're not in a transaction. So for something like this to have happened, we'd expect some other code path to have entered a transaction without using diesel (e.g., by executing BEGIN;), never closed it, and returned the connection to the pool in that state... but then something else would have had to close it (for us not to find it when we checked).

@davepacheco (Collaborator)

Probably useless, but for posterity: I wondered whether it was possible that we somehow did pick the new sled, failed to use it for some reason, and then tried one of the others. I found no code path that would do this. I also looked in the saga log for all instance-start sagas during the demo-day window where this happened and printed out the sled allocated for each one:

root@[fd00:1122:3344:101::3]:32221/omicron> select data,count(*) FROM saga_node_event WHERE saga_id IN (
  select id from saga WHERE name = 'instance-start' AND time_created >= '2024-03-01 19:14:41.236772+00'
  AND time_created <= '2024-03-01 19:24:31.438626+00') AND node_id = 1 AND event_type <> 'started' GROUP BY data;
                   data                  | count
-----------------------------------------+--------
  "1762267a-a0cb-4c9a-88b4-8ce828cbc021" |     9
  "fa3c33bf-bc6c-4d29-8dcc-c16b9ab1ec89" |    10
  "a6634f64-f6fb-426c-b00b-9b30aed9f53a" |    10
(3 rows)


Time: 40ms total (execution 39ms / network 0ms)

I think that rules out the idea that we did pick this sled, then something failed, and we went and used a different sled. If that had happened, I'd expect to see extra instance-start sagas and/or some that had picked the new sled.

jgallagher added a commit that referenced this issue Mar 1, 2024: "We hope this gives us some insight into #5181."
jgallagher added a commit that referenced this issue Mar 4, 2024
@jgallagher (Contributor, Author) commented Mar 4, 2024

I tried reproducing this today after landing #5182, and it did not reproduce. Again I started 10 instances prior to adding the sled and 30 after. The 30 instances created after setting up the new sled had this distribution:

  12 BRM42220004 (new sled)
   4 BRM42220046 (total=8)
   6 BRM42220081 (total=10)
   8 BRM44220001 (total=10)

I noticed two things while doing this:

  • All of the instance-start sagas were handled by one Nexus instance. (I don't know whether this is expected or problematic.)
  • When I manually ran the SELECT described above to see which sleds are eligible for new instances, the new sled was returned even before it had an NTP zone.

I'm going to take another lap and check what happens if I start instances after adding the sled but before going through Reconfigurator to give it NTP / crucible zones; that may result in a separate issue. Afterwards I'll try the baseline experiment a few more times to try to reproduce what we saw on Friday.
