Newly-added sled wasn't chosen for instance allocation (for a while?) #5181
I tried to gauge this by running the query `SELECT "sled"."id" FROM ("sled" LEFT OUTER JOIN "sled_resource" ON ("sled_resource"."sled_id" = "sled"."id")) AS OF SYSTEM TIME '2024-03-01 19:14:35.337329+00' WHERE ((("sled"."time_deleted" IS NULL) AND ("sled"."sled_policy" = 'in_service')) AND ("sled"."sled_state" = 'active')) GROUP BY "sled"."id" HAVING ((((COALESCE(SUM(CAST(hardware_threads as INT8)), 0) + 8) <= "sled"."usable_hardware_threads") AND ((COALESCE(SUM(CAST(rss_ram as INT8)), 0) + 0) <= "sled"."usable_physical_ram")) AND ((COALESCE(SUM(CAST(reservoir_ram as INT8)), 0) + 17179869184) <= "sled"."reservoir_size")) ORDER BY random() LIMIT 1;` 100 times into a file, preceded by
So it doesn't seem like either
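As a side note on the 100-run experiment above, here is a minimal sketch of how the resulting file might be tallied. It assumes, purely for illustration, a file named `sled_choices.txt` containing one chosen sled id per line (one line per run of the query); the filename and format are assumptions, not part of the original report.

```rust
// Tally how often each sled id appears across repeated runs of the
// allocation query, assuming one chosen sled id per line in the file.
use std::collections::HashMap;
use std::fs;

fn main() -> std::io::Result<()> {
    let contents = fs::read_to_string("sled_choices.txt")?;
    let mut counts: HashMap<&str, usize> = HashMap::new();
    for line in contents.lines().map(str::trim).filter(|l| !l.is_empty()) {
        *counts.entry(line).or_default() += 1;
    }
    // Print the sleds most-picked first so the skew (if any) is obvious.
    let mut sorted: Vec<_> = counts.into_iter().collect();
    sorted.sort_by_key(|&(_, n)| std::cmp::Reverse(n));
    for (sled_id, n) in sorted {
        println!("{sled_id}: {n}");
    }
    Ok(())
}
```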
@smklein also pointed out that async-bb8-diesel verifies that, at least as far as diesel knows, we're not in a transaction. So for something like this to have happened, we'd expect some other code path to have entered a transaction without going through diesel (e.g., by executing raw SQL that opens one).
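To make that reasoning concrete, here is a purely hypothetical sketch (not the actual async-bb8-diesel or diesel code) of the kind of guard being described: a connection wrapper that tracks transaction depth, with the caller asserting the depth is zero before running the allocation query. A transaction opened by executing raw SQL directly would never touch this counter, which is exactly the bypass being speculated about.

```rust
// Hypothetical depth-tracking guard; illustrative only.
struct TrackedConn {
    transaction_depth: u32,
}

impl TrackedConn {
    fn begin(&mut self) {
        self.transaction_depth += 1;
    }
    fn commit(&mut self) {
        self.transaction_depth = self.transaction_depth.saturating_sub(1);
    }
    /// Refuse to proceed if (as far as this wrapper knows) a transaction is open.
    fn assert_not_in_transaction(&self) -> Result<(), String> {
        if self.transaction_depth == 0 {
            Ok(())
        } else {
            Err(format!(
                "unexpected open transaction (depth {})",
                self.transaction_depth
            ))
        }
    }
}

fn main() {
    let mut conn = TrackedConn { transaction_depth: 0 };
    conn.begin();
    assert!(conn.assert_not_in_transaction().is_err());
    conn.commit();
    assert!(conn.assert_not_in_transaction().is_ok());
    // A transaction begun via raw SQL would never increment the counter,
    // so the assertion above would pass despite the open transaction.
}
```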
Probably useless but for posterity: I wondered if it were possible that we somehow did pick the new sled, but failed to use it for some reason, and then tried to use one of the others. I found no code path that would do this. I also looked in the saga log for all
I think that rules out the idea that we did pick this sled, then something failed, and we went and used a different sled. If that had happened, I'd expect to see extra
We hope this gives us some insight into #5181.
I tried reproducing this today after landing #5182, and it did not reproduce. Again I started 10 instances prior to adding the sled and 30 after. The 30 instances created after setting up the new sled had this distribution:
I noticed two things while doing this:
I'm going to take another lap and check what happens if I start instances after adding the sled but before going through Reconfigurator to give it NTP / crucible zones; that may result in a separate issue. Afterwards I'll try the baseline experiment a few more times to try to reproduce what we saw on Friday.
During today's demos, I attempted to show adding a sled to `madrid` and seeing newly-started instances running on it. Before the demo, I started 10 instances (with only 3 sleds available), and the distribution was:

During the demo, I started 30 instances, and the distribution was:
Each of these instances wanted 8 CPUs, so at this point the CPU resource usage was:
We started a call to debug why the new sled wasn't receiving instances. Without making any changes to `madrid`, I started 20 more instances, and they were allocated to these sleds in this order:

Once a sled hits 16 instances, we expect not to choose it again, as that maxes out its CPUs at 128. Purely eyeballing, this final list of 20 looks believable if we're choosing randomly, but getting 0 out of the first 30 on the new sled is exceedingly unlikely (~0.02%, assuming a 25% chance on each instance).
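A quick sanity check of that ~0.02% figure: with four eligible sleds and a uniform 25% chance per placement, the probability that all of the first 30 instances miss one particular sled is 0.75^30.

```rust
// Probability that 30 independent uniform picks across 4 sleds (25% each)
// all avoid one particular sled.
fn main() {
    let p = 0.75_f64.powi(30);
    println!("{:.4}%", p * 100.0); // roughly 0.0179%
}
```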
The exact query we're using to select a sled for one of the above instances is:
If we remove the `LIMIT 1` and run the query with CRDB's `AS OF SYSTEM TIME` set to just after the new sled was added, we see all four sleds were available. Our best guesses at the moment are:

- Something about `ORDER BY random()` that caused us to be unlucky (see the simulation sketch below).
- We hit async-bb8-diesel#47 (`transaction_async` is not cancel-safe or panic-safe, and bedlam ensues), and the Nexus(es) choosing sled placement were seeing stale CRDB data. We checked for and saw no open transactions, so if this was happening we don't have any postmortem evidence of it.
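For comparison with the `ORDER BY random()` guess, here is a small simulation of what a uniformly random choice should produce. It is a sketch only, not the actual allocator: the sled names are made up, it assumes the `rand` crate, and the 8-thread reservation / 128-usable-thread capacity mirror the figures in this thread (matching the query's `HAVING` clause by excluding sleds that are already full).

```rust
// Place 30 instances of 8 threads each uniformly at random across 4 sleds
// with 128 usable hardware threads apiece, skipping full sleds.
use rand::seq::SliceRandom;
use std::collections::HashMap;

fn main() {
    let sleds = ["sled-a", "sled-b", "sled-c", "sled-new"];
    let mut used_threads: HashMap<&str, u32> = sleds.iter().map(|&s| (s, 0)).collect();
    let mut rng = rand::thread_rng();

    for _ in 0..30 {
        // Only sleds with room for another 8-thread reservation are eligible.
        let eligible: Vec<&str> = sleds
            .iter()
            .copied()
            .filter(|s| used_threads[s] + 8 <= 128)
            .collect();
        let chosen = eligible.choose(&mut rng).expect("at least one sled has room");
        *used_threads.get_mut(chosen).unwrap() += 8;
    }

    for sled in sleds {
        println!("{sled}: {} instances", used_threads[&sled] / 8);
    }
}
```

Run repeatedly, this lands roughly 7–8 instances per sled on average, which is why zero out of thirty on one sled stands out so sharply.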