
Add script to generate resource allocation (nodeshare) choices #3030

Merged
merged 12 commits into 2i2c-org:master on Oct 24, 2023

Conversation

@yuvipanda (Member) commented Aug 25, 2023

When the end user looks at the profile list, the list needs to be presented in such a way that they can make an informed choice on what to select, with specific behavior that is triggered whenever their usage goes over the selected numbers.

Factors

  • Server startup time! If everyone gets an instance just for themselves, servers take forever to start. Usually, many users are active at the same time, and we can decrease server startup time by putting many users on the same machine in a way that they don't step on each other's toes.

  • Cloud cost. If we pick really large machines, fewer scale-up events need to be triggered, so server startup is much faster. However, we pay for instances regardless of how 'full' they are, so if we have a 64GB instance with only 1GB used, we're paying extra for that. So a trade-off has to be made for machine size. This can be quantified, though, and that helps us make the trade-off.

  • Resource limits, which the end user can consistently observe. These are easy to explain to end users - if you go over the memory limit, your kernel dies. If you go over the CPU limit, well, you can't - you get throttled. If we set limits appropriately, they will also helpfully show up in the status bar via jupyter-resource-usage.

  • Resource requests are harder for end users to observe, as they are primarily meant for the scheduler, which uses them to pack user pods onto nodes for higher utilization. This has an 'oversubscription' factor, relying on the fact that most users don't actually use resources up to their limit. However, this factor varies from community to community and must be carefully tuned. Users may use more resources than they are guaranteed sometimes, but then get their kernels killed or CPU throttled at other times, based on what other users are doing. This inconsistent behavior is confusing to end users, so we should be careful in figuring this out.

So in summary, there are two kinds of factors:

  1. Noticeable by users

    1. Server startup time
    2. Memory Limit
    3. CPU Limit
  2. Noticeable by infrastructure & hub admins:

    1. Cloud cost

The variables available to Infrastructure Engineers and hub admins to tune are:

  1. Size of instances offered

  2. "Oversubscription" factor for memory - this is ratio of memory limit to memory guarantee. If users are using memory > guarantee but < limit, they may get their kernels killed. Based on our knowledge of this community, we can tune this variable to reduce cloud cost while also reducing disruption in terms of kernels being killed

  3. "Oversubscription" factor for CPU. This is easier to handle, as CPUs can be throttled easily. A user may use 4 CPUs for a minute, but then go back to 2 cpus next minute without anything being "killed". This is unlike memory, where memory once given can not be taken back. If a user is over the guarantee and another user who is under the guarantee needs the memory, the first users's kernel will be killed. Since this doesn't happen with CPUs, we can be more liberal in oversubscribing CPUs.

Goals

The goal is the following:

  1. Profile options should be automatically generated by a script, with various options to be tuned by whoever is running it. Engineers should have an easy time making these choices.

  2. The end user should be able to easily understand the ramifications of the options they choose, and it should be visible to them after they start their notebook as well.

  3. It's alright for users who want more resources to have to wait longer for a server start than users who want fewer resources. This is an incentive to start with fewer resources and then size up.

Generating Choices

This PR adds a new deployer command, generate-resource-allocation-choices, to be run by an engineer setting up a hub. It currently supports a single node type, and will generate appropriate Resource Allocation choices based on a given strategy. This PR implements one specific strategy, worked out in discussion with the Openscapes community (#2882), that might be useful for other communities as well - the proportionate memory strategy.

Proportionate Memory Allocation Strategy

Used primarily in research cases where:

  1. Workloads are more memory constrained than CPU constrained
  2. End users can be expected to select the appropriate amount of memory they need for a given workload, either from their own knowledge or as instructed by an instructor.

It features:

  1. No memory overcommit at all, as end users are expected to ask for as much memory as they need.
  2. CPU guarantees are proportional to the memory guarantee - the more memory you ask for, the more CPU you are guaranteed. This allows end users to pick resources based purely on memory, simplifying the mental model. It also allows for maximum packing of user pods onto a node, as we will not run out of CPU on a node before running out of memory.
  3. No CPU limits at all, as CPU is a more flexible resource. The CPU guarantee will ensure that users will not be starved of CPU.
  4. Each choice the user can make has approximately half as many resources as the next largest choice, with the largest being a full node. This offers a decent compromise - if you pick the largest option, you will most likely have to wait for a full node to spawn, while smaller options are much more likely to land on a shared node that is already running (a rough sketch of this generation logic follows below).
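
A minimal sketch of how such a halving scheme could be generated; this is not the code added by this PR - the function name and the example node numbers are invented for illustration.

```python
def proportionate_memory_choices(node_mem_gb, node_cpu, num_choices=5):
    """Start from a full node and halve the memory for each smaller choice.

    Memory is never overcommitted (limit == guarantee), the CPU guarantee is
    proportional to the memory guarantee, and no CPU limit is set.
    """
    choices = []
    mem = node_mem_gb
    for _ in range(num_choices):
        fraction = mem / node_mem_gb
        choices.append({
            "mem_guarantee_gb": mem,
            "mem_limit_gb": mem,                   # no memory overcommit
            "cpu_guarantee": node_cpu * fraction,  # proportional to memory
            "cpu_limit": None,                     # no CPU limit
        })
        mem /= 2
    return list(reversed(choices))                 # smallest option first

# e.g. a hypothetical node with ~28 GB RAM and ~3.6 CPU available for user pods
for choice in proportionate_memory_choices(28, 3.6):
    print(choice)
```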

In the future, other strategies will be added and experimented with.

Node Capacity Information

To generate these choices, we must have Node Capacity Information - specifically, exactly how much RAM and CPU are available for user pods on nodes of a particular type. Instead of using heuristics here, we calculate this accurately:

Resource Available = Node Capacity - System Components (kubelet, systemd, etc) - Daemonsets

A JSON file, node-capacity-info.json, holds this information and is updated with the update-node-capacity-info command. This requires that a node of the given instance type be actively running so we can perform these calculations. The information will need to be recalculated every time we upgrade Kubernetes (as system components might take more resources) or adjust resource allocations for our daemonsets.
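
As a rough illustration of the calculation above, and of the shape of data the later review snippet reads via nodeinfo["available"], here is a hedged sketch; the function name and all numbers are hypothetical, and this is not the actual deployer code.

```python
def compute_available(allocatable, daemonset_requests):
    """Resource Available = what k8s reports as allocatable on the node
    (capacity minus kubelet/systemd reservations) minus the requests of the
    daemonsets running on every node. Memory in bytes, CPU in cores."""
    return {
        "available": {
            "memory": allocatable["memory"] - daemonset_requests["memory"],
            "cpu": allocatable["cpu"] - daemonset_requests["cpu"],
        }
    }

# Hypothetical numbers, roughly the scale of a 32 GB / 4 CPU node:
nodeinfo = compute_available(
    allocatable={"memory": 29 * 1024**3, "cpu": 3.92},
    daemonset_requests={"memory": 1 * 1024**3, "cpu": 0.3},
)
print(nodeinfo["available"]["memory"], nodeinfo["available"]["cpu"])
```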

This has been generated in this PR for a couple of common instances.

TODO

  • Documentation on how to update node-capacity-info.json
  • Documentation on how to generate choices, and when to use these
  • Documentation on how to choose the instance size

Thanks to @consideRatio for working on a lot of this earlier.

@yuvipanda requested a review from a team as a code owner August 25, 2023 03:41
@yuvipanda marked this pull request as draft August 25, 2023 03:41
@yuvipanda force-pushed the resource-allocation branch from 46e470f to 33a2b30 on August 25, 2023 03:42
yuvipanda added a commit to yuvipanda/pilot-hubs that referenced this pull request Aug 25, 2023
Based on discussing how profiles were actually being used
with the openscapes folks
(2i2c-org#2882)

Generated via the scripts in
2i2c-org#3030
@GeorgianaElena (Member) left a comment


I love this <3!!! Thank you @yuvipanda

@yuvipanda changed the title Add script to generate nodeshare choices Add script to generate resource allocation (nodeshare) choices Aug 25, 2023
@yuvipanda (Member Author)

[image]

The choice of the phrase 'resource allocation' rather than 'node share' is also based on conversations with the Openscapes folks.

Comment on lines 47 to 53
# We operate on *available* memory, which already accounts for system components (like kubelet & systemd)
# as well as daemonsets we run on every node. This represents the resources that are available
# for user pods.
available_node_mem = nodeinfo["available"]["memory"]
available_node_cpu = nodeinfo["available"]["cpu"]
@consideRatio (Contributor) Sep 4, 2023

Change needed - headroom for memory/cpu requests

Motivation

This script is operating on available memory/CPU based on one-off measurements, but there are many reasons why adjusting to this is tricky:

  1. Instance types
    These are relatively easy to adjust to, I think - we take a measurement in k8s for each node type we plan to use.
  2. The managed k8s cluster's needs changing
    The daemonsets running on each node as managed by the k8s service may vary depending on the features enabled (config connector, network policy enforcement, logging), the k8s version, and needs determined by vertical autoscaling.
  3. Our needs changing
    We could end up wanting to add more CPU/RAM requests to the cryptnono daemonset, for example, or add another service to run on each user node - then we need to account for that as well.

With that in mind, I think we shouldn't try to be so exact - because if we are, there is no buffer to ensure we can still schedule a user requesting 100%.

Suggestion

I suggest we look at the following to establish a conservative baseline, and then add some headroom to that, such as 100m CPU and maybe 400MB RAM.

  • the capacity of RAM/CPU exposed to k8s pods for various machine types
  • the overhead from system pods in GKE, EKS, and AKS respectively - looking at several clusters of each provider to capture variations between enabled features and possibly also k8s versions
  • the overhead from our support chart's pods

@yuvipanda (Member Author)

I agree we should add some overhead here. I'll incorporate that. However, we should still try to be as accurate as possible - you will note that the code that measures how much is available takes into consideration all the other factors you listed, including the pods that run as part of our support charts and whatever the clusters themselves run. I'll actually work on making this even more automated. The overhead is to allow for drift here, as this can change without us noticing - I'll work on figuring out how best we can 'notice' and correct for it.

@yuvipanda (Member Author)

@consideRatio 62644fe adds some more flexibility, but much less than what you recommended. This is because the script already accounts for the three things you listed as needing headroom. This flexibility is now purely to account for changes that we miss. I'll work on listing when update-nodeinfo needs to be run, and add some inline comments in there.

@consideRatio (Contributor) commented Sep 5, 2023

Choice about overlapping requests

[image]

In the setup above, each profile list entry represents a different machine type to run on. This means that you could have requests like ~4 CPU / ~32 GB on an n2-highmem-4 instance and ~4 CPU / ~32 GB on an n2-highmem-16 instance that overlap, resulting in very similar requests but with different machine types.

I think it's a good decision to reduce those similar choices to just one, but it's not obvious which one, and it depends on usage patterns - patterns which could also change depending on whether an event is running, etc.

I think there is an incremental improvement to go for in the future, where we allow the cutoff between two server types to be adjusted. For example:

1, 2, 4, 8, 16, and 32 GB requests currently go to n2-highmem-4 machines, and 64 and 128 GB requests go to n2-highmem-16 machines, but it could be useful to allow that cutoff to slide so that the 16 and 32 GB requests end up on the n2-highmem-16 machine instead, for example - fitting up to 8 users on those.
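
For illustration only, a tiny sketch of what such a sliding cutoff could look like - the helper name is invented and is not part of this PR.

```python
def pick_machine_type(mem_request_gb, cutoff_gb=64):
    """Requests below the cutoff go to the smaller machine type, the rest to
    the larger one. Sliding cutoff_gb down to 16 would move the 16 and 32 GB
    requests onto the n2-highmem-16 machines instead."""
    return "n2-highmem-4" if mem_request_gb < cutoff_gb else "n2-highmem-16"

# With the default cutoff: 1-32 GB -> n2-highmem-4, 64/128 GB -> n2-highmem-16
print([pick_machine_type(m) for m in (1, 2, 4, 8, 16, 32, 64, 128)])
```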


EDIT: #3262 is a PR to use a 128 GB machine instead of a 32 GB machine by default during an event for resource requests of ~16 and ~32 GB, even though they would fit on a ~32 GB machine - because it puts more users per node and reduces overall startup delay.

@yuvipanda (Member Author)

@consideRatio there are two things with respect to overlap:

  1. Provide explicit guidance on what to do for events. I have some ideas in mind and will write them out explicitly.
  2. There should probably be no overlap - because when there is overlap, the end user has to understand what an 'instance size' is, how that relates to spawning speed, and how that relates to a variety of other factors. I agree there should be guidance on how to determine which sizes go on which instances. I'm currently observing usage on Openscapes, and will use that information to write out this guidance.

So the TODO out of here is:

  • Write guidance on how to pick instance sizes
  • Write guidance for events
  • Add some extra headroom to account for drift in our availability measurement
  • Think of ways to measure and account for this drift.

@yuvipanda (Member Author)

yuvipanda commented Sep 11, 2023

Checking in on Openscapes, since this was deployed to them on Aug 25.

Baseline cloud costs

So, Openscapes got the older-style node sharing when #2684 was merged on June 21, and on Aug 25 we switched to the setup generated by this script. While there are some confounding factors (primarily, some events), I think the baseline cost has definitely gone way down with this new setup!

[image]

Startup speed

The baseline cost has come down, but is this at the cost of startup speed?

[image]

There's no discernible difference in server startup speeds! (Data is missing for big parts of July, though.)

Qualitative feedback

Talking to folks in the Openscapes Slack, the response to this has been generally positive. End users are less confused about what is needed, and the limits are now visible in JupyterLab.

I'll get working on documenting this so others can use it too, but I now believe that, at least for Openscapes-style hubs, this is a good improvement over the status quo.

@yuvipanda (Member Author)

yuvipanda commented Sep 11, 2023

I would, however, like this to be a little more automated than it is right now. In particular, I don't want us to have to do a big set of manual tweaks to all of these options every time we update node capacity information, as that's error-prone and toil-heavy. I'll look at ways of making that happen.

Comment on lines 105 to 120
# Add a little bit of wiggle room, to account for:
# 1. Changes in requests for system components as k8s versions upgrade or
# cloud providers roll out new components
# 2. We deploy support components but forget to update node capacity info
# 3. Whatever other things we aren't currently thinking of.
# A small amount of memory and CPU to sacrifice for the sake of
# operational flexibility. However, we *must* regenerate and update node
# information each time the following events occur:
# 1. We upgrade kubernetes versions
# 2. We change resource requirements for *daemonsets* in our support chart.
# 3. We upgrade z2jh version

# 128 MiB memory buffer
mem_available = mem_available - (128 * 1024 * 1024)
# 0.05 CPU overhead
cpu_available = cpu_available - 0.05
@consideRatio (Contributor)

I'm leaning heavily towards ensuring that we don't run into "fail to start server" issues caused by this, rather than towards minimizing the headroom. Can we make it at least something like 256 Mi and 0.15 CPU of headroom?

I think a memory overhead of 128 MiB is cutting it closer than merited, given that our nodes have 32GB+ and that a failure to get this right can cause an issue we won't observe until there is a runtime failure detected by users. By not cutting it close, the memory can also provide a buffer for when the node includes workloads that don't have requests=limits, so it's not wasted either.

I think the CPU overhead is too small as well, mostly because I don't see a good enough motivation for cutting it close, since the CPU overhead won't go to waste as long as we don't put super tight CPU limits on.

@yuvipanda (Member Author)

Thank you for writing this out clearly, @consideRatio.

> I think the CPU overhead is too small as well, mostly because I don't see a good enough motivation for cutting it close, since the CPU overhead won't go to waste as long as we don't put super tight CPU limits on.

I think this makes 100% sense for the one strategy introduced in this PR, as it only tackles CPU requests, not limits. So I'll increase the headroom to 0.15 (or 0.20) but move the headroom calculation into the strategy code - when we introduce other strategies in the future, they can make their own choices.

I'm trying to understand when exactly this will actually cause a "fail to start server" issue, rather than an issue where utilization of a server is not 100% when packed. I think this only happens in the case where the resource allocation sets the guarantee to 100% of available memory. Looking at the set of choices currently deployed to Openscapes, that's 2 of the 6 choices. So I understand that in those cases, if circumstances change such that the static calculation here for 'available' no longer matches reality, users would end up with a hanging pod that never gets scheduled anywhere. In the other 4 choices, what will happen instead is wasted resources, as a new node will be spun up when a user might have already fit on the previous node.

So to handle the memory requests case, I will do the following:

  1. Increase the general overhead calculation, from which all resource allocations are measured, from 128Mi to 256Mi (roughly sketched in the snippet after this list).
  2. Specifically for the choices where we are allocating a full node, increase this number even further (probably to 512Mi), as a measurement mismatch here would cause a bigger user problem than with (1).
  3. Open an issue to consider changing the strategy to have limits that don't count the overhead, and requests that do count the overhead. However, this makes the process a little more complex, and so I'd prefer to not do this on the first run. Sacrificing a little bit of extra RAM for the simplicity seems a worthwhile tradeoff.
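
Roughly what (1) and (2) could look like relative to the snippet quoted above - a sketch under the assumptions stated here, not the exact code merged in the PR; apply_headroom and is_full_node are invented names.

```python
def apply_headroom(mem_available, cpu_available, is_full_node):
    """Sketch of the plan above: 256 MiB + 0.15 CPU of general headroom for
    every choice, and extra memory headroom (~512 MiB total) for choices that
    allocate a full node, where a mismatch means the pod never schedules
    rather than just wasting capacity."""
    mem_available -= 256 * 1024 * 1024
    cpu_available -= 0.15
    if is_full_node:
        mem_available -= 256 * 1024 * 1024
    return mem_available, cpu_available
```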

I think over time, as we gain more experience with this strategy as well as more data, we can fine-tune this some more. Ideally, the "mem_available" data would be dynamic rather than static - I think this would increase confidence that this drift would not occur. I'll think of ideas for how this can be done, but won't block progress on this PR on that.

Does this satisfy you, @consideRatio?

@consideRatio (Contributor)

Thank you for that chat, Yuvi!! I opened #3132 to describe my proposed simple strategy.

yuvipanda and others added 6 commits October 23, 2023 13:50
Allows us to construct multiple profile choices with
multiple node types.
@yuvipanda force-pushed the resource-allocation branch from aa27637 to 73cad86 on October 23, 2023 08:20

@yuvipanda marked this pull request as ready for review October 23, 2023 09:00
@consideRatio (Contributor) left a comment

Thank you for working on this, @yuvipanda! Based on agreement via other channels, let's merge this and iterate on it in separate PRs over time.

@consideRatio merged commit d4224ce into 2i2c-org:master Oct 24, 2023
32 checks passed
@github-actions

🎉🎉🎉🎉

Monitor the deployment of the hubs here 👉 https://github.com/2i2c-org/infrastructure/actions/runs/6624047336

Labels
tech:cloud-infra Optimization of cloud infra to reduce costs etc.