
Add script to generate resource allocation (nodeshare) choices #3030

Merged
merged 12 commits into 2i2c-org:master on Oct 24, 2023

Conversation

@yuvipanda (Member) commented Aug 25, 2023

When the end user looks at the profile list, the list needs to be presented in such a way that they can make an informed choice on what to select, with specific behavior that is triggered whenever their usage goes over the selected numbers.

Factors

  • Server startup time! If everyone gets an instance just for themselves, servers take forever to start. Usually, many users are active at the same time, and we can decrease server startup time by putting many users on the same machine in a way that they don't step on each other's toes.

  • Cloud cost. If we pick really large machines, fewer scale-up events need to be triggered, so server startup is much faster. However, we pay for instances regardless of how 'full' they are, so if we have a 64GB instance with only 1GB used, we're paying extra for that. So a trade-off has to be made for machine size. This can be quantified, though, and that helps us make the trade-off.

  • Resource limits, which the end user can consistently observe. These are easy to explain to end users - if you go over the memory limit, your kernel dies. If you go over the CPU limit, well, you can't - you get throttled. If we set limits appropriately, they will also helpfully show up in the status bar via jupyter-resource-usage.

  • Resource requests are harder for end users to observe, as they are primarily meant for the scheduler, which uses them to pack user pods onto nodes for higher utilization. This has an 'oversubscription' factor, relying on the fact that most users don't actually use resources up to their limit. However, this factor varies from community to community and must be carefully tuned. Users may use more resources than they are guaranteed sometimes, but then get their kernels killed or CPU throttled at other times, based on what other users are doing. This inconsistent behavior is confusing to end users, so we should be careful in figuring this out.

So in summary, there are two kinds of factors:

  1. Noticeable by users

    1. Server startup time
    2. Memory Limit
    3. CPU Limit
  2. Noticeable by infrastructure & hub admins:

    1. Cloud cost

The variables available to Infrastructure Engineers and hub admins to tune are:

  1. Size of instances offered

  2. "Oversubscription" factor for memory - this is ratio of memory limit to memory guarantee. If users are using memory > guarantee but < limit, they may get their kernels killed. Based on our knowledge of this community, we can tune this variable to reduce cloud cost while also reducing disruption in terms of kernels being killed

  3. "Oversubscription" factor for CPU. This is easier to handle, as CPUs can be throttled easily. A user may use 4 CPUs for a minute, but then go back to 2 cpus next minute without anything being "killed". This is unlike memory, where memory once given can not be taken back. If a user is over the guarantee and another user who is under the guarantee needs the memory, the first users's kernel will be killed. Since this doesn't happen with CPUs, we can be more liberal in oversubscribing CPUs.

Goals

The goal is the following:

  1. Profile options should be automatically generated by a script, with various options to be tuned by whoever is running it. Engineers should have an easy time making these choices.

  2. The end user should be able to easily understand the ramifications of the options they choose, and it should be visible to them after they start their notebook as well.

  3. It's alright for users who want more resources to have to wait longer for a server start than users who want fewer resources. This is an incentive to start with fewer resources and then size up.

Generating Choices

This PR adds a new deployer command, generate-resource-allocation-choices, to be run by an engineer setting up a hub. It currently supports a single node type, and will generate appropriate Resource Allocation choices based on a given strategy. This PR implements one specific strategy, worked out in discussion with the Openscapes community (#2882), that might be useful for other communities as well - the proportionate memory strategy.

Proportionate Memory Allocation Strategy

Used primarily in research cases where:

  1. Workloads are more memory constrained than CPU constrained
  2. End users can be expected to select the appropriate amount of memory they need for a given workload, either from their own knowledge or as instructed by an instructor.

It features:

  1. No memory overcommit at all, as end users are expected to ask for as much memory as they need.
  2. CPU guarantees are proportional to the memory guarantee - the more memory you ask for, the more CPU you are guaranteed. This allows end users to pick resources based purely on memory, simplifying the mental model. It also allows for maximum packing of user pods onto a node, as we will not run out of CPU on a node before running out of memory.
  3. No CPU limits at all, as CPU is a more flexible resource. The CPU guarantee will ensure that users will not be starved of CPU.
  4. Each choice the user can make has approximately half as many resources as the next largest choice, with the largest being a full node. This offers a decent compromise - if you pick the largest option, you will most likely have to wait for a full node to spawn, while smaller options are much more likely to land on a shared node that is already running (a rough sketch of this generation logic follows below).
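
A minimal sketch of how such a halving scheme could be generated; this is not the code added by this PR - the function name and the example node numbers are invented for illustration.

```python
def proportionate_memory_choices(node_mem_gb, node_cpu, num_choices=5):
    """Start from a full node and halve the memory for each smaller choice.

    Memory is never overcommitted (limit == guarantee), the CPU guarantee is
    proportional to the memory guarantee, and no CPU limit is set.
    """
    choices = []
    mem = node_mem_gb
    for _ in range(num_choices):
        fraction = mem / node_mem_gb
        choices.append({
            "mem_guarantee_gb": mem,
            "mem_limit_gb": mem,                   # no memory overcommit
            "cpu_guarantee": node_cpu * fraction,  # proportional to memory
            "cpu_limit": None,                     # no CPU limit
        })
        mem /= 2
    return list(reversed(choices))                 # smallest option first

# e.g. a hypothetical node with ~28 GB RAM and ~3.6 CPU available for user pods
for choice in proportionate_memory_choices(28, 3.6):
    print(choice)
```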

In the future, other strategies will be added and experimented with.

Node Capacity Information

To generate these choices, we must have Node Capacity Information - specifically, exactly how much RAM and CPU are available for user pods on nodes of a particular type. Instead of using heuristics here, we calculate this accurately:

Resource Available = Node Capacity - System Components (kubelet, systemd, etc) - Daemonsets

A JSON file, node-capacity-info.json, holds this information and is updated with the update-node-capacity-info command. This requires that a node of the given instance type be actively running so we can perform these calculations. The information will need to be recalculated every time we upgrade Kubernetes (as system components might take more resources) or adjust resource allocations for our daemonsets.
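
As a rough illustration of the calculation above, and of the shape of data the later review snippet reads via nodeinfo["available"], here is a hedged sketch; the function name and all numbers are hypothetical, and this is not the actual deployer code.

```python
def compute_available(allocatable, daemonset_requests):
    """Resource Available = what k8s reports as allocatable on the node
    (capacity minus kubelet/systemd reservations) minus the requests of the
    daemonsets running on every node. Memory in bytes, CPU in cores."""
    return {
        "available": {
            "memory": allocatable["memory"] - daemonset_requests["memory"],
            "cpu": allocatable["cpu"] - daemonset_requests["cpu"],
        }
    }

# Hypothetical numbers, roughly the scale of a 32 GB / 4 CPU node:
nodeinfo = compute_available(
    allocatable={"memory": 29 * 1024**3, "cpu": 3.92},
    daemonset_requests={"memory": 1 * 1024**3, "cpu": 0.3},
)
print(nodeinfo["available"]["memory"], nodeinfo["available"]["cpu"])
```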

This has been generated in this PR for a couple of common instances.

TODO

  • Documentation on how to update node-capacity-info.json
  • Documentation on how to generate choices, and when to use these
  • Documentation on how to choose the instance size

Thanks to @consideRatio for working on a lot of this earlier.

@yuvipanda requested a review from a team as a code owner August 25, 2023 03:41
@yuvipanda marked this pull request as draft August 25, 2023 03:41
@yuvipanda force-pushed the resource-allocation branch from 46e470f to 33a2b30 on August 25, 2023 03:42
yuvipanda added a commit to yuvipanda/pilot-hubs that referenced this pull request Aug 25, 2023
Based on discussing how profiles were actually being used
with the openscapes folks
(2i2c-org#2882)

Generated via the scripts in
2i2c-org#3030
@GeorgianaElena (Member) left a comment


I love this <3!!! Thank you @yuvipanda

@yuvipanda changed the title Add script to generate nodeshare choices Add script to generate resource allocation (nodeshare) choices Aug 25, 2023
@yuvipanda (Member Author)

[image]

The choice of the phrase 'resource allocation' rather than 'node share' is also based on conversations with the Openscapes folks.

Comment on lines 47 to 53
# We operate on *available* memory, which already accounts for system components (like kubelet & systemd)
# as well as daemonsets we run on every node. This represents the resources that are available
# for user pods.
available_node_mem = nodeinfo["available"]["memory"]
available_node_cpu = nodeinfo["available"]["cpu"]
@consideRatio (Contributor) Sep 4, 2023

Change needed - headroom for memory/cpu requests

Motivation

This script is operating on available memory/CPU based on one-off measurements, but there are many reasons why adjusting to this is tricky:

  1. Instance types
    These are relatively easy to adjust to, I think - we take a measurement in k8s for each node type we plan to use.
  2. The managed k8s cluster's needs changing
    The daemonsets running on each node as managed by the k8s service may vary depending on the features enabled (config connector, network policy enforcement, logging), the k8s version, and needs determined by vertical autoscaling.
  3. Our needs changing
    We could end up wanting to add more CPU/RAM requests to the cryptnono daemonset, for example, or add another service to run on each user node - then we need to account for that as well.

With that in mind, I think we shouldn't try to be so exact - because if we are, there is no buffer to ensure we can still schedule a user requesting 100%.

Suggestion

I suggest we look at the following to establish a conservative baseline, and then add some headroom to that, such as 100m CPU and maybe 400MB RAM.

  • the capacity of RAM/CPU exposed to k8s pods for various machine types
  • the overhead from system pods in GKE, EKS, and AKS respectively - looking at several clusters of each provider to capture variations between enabled features and possibly also k8s versions
  • the overhead from our support chart's pods

@yuvipanda (Member Author)

I agree we should add some overhead here. I'll incorporate that. However, we should still try to be as accurate as possible - you will note that the code that measures how much is available takes into consideration all the other factors you listed, including the pods that run as part of our support charts and whatever the clusters themselves run. I'll actually work on making this even more automated. The overhead is to allow for drift here, as this can change without us noticing - I'll work on figuring out how best we can 'notice' and correct for it.

@yuvipanda (Member Author)

@consideRatio 62644fe adds some more flexibility, but much less than what you recommended. This is because the script already accounts for the three things you listed as needing headroom. This flexibility is now purely to account for changes that we miss. I'll work on listing when update-nodeinfo needs to be run, and add some inline comments in there.

@consideRatio (Contributor) commented Sep 5, 2023

Choice about overlapping requests

[image]

In the setup above, each profile list entry represents a different machine type to run on. This means that you could have requests like ~4 CPU / ~32 GB on an n2-highmem-4 instance and ~4 CPU / ~32 GB on an n2-highmem-16 instance that overlap, resulting in very similar requests but with different machine types.

I think it's a good decision to reduce those similar choices to just one, but it's not obvious which one, and it depends on usage patterns - patterns which could also change depending on whether an event is running, etc.

I think there is an incremental improvement to go for in the future, where we allow the cutoff between two server types to be adjusted. For example:

1, 2, 4, 8, 16, and 32 GB requests currently go to n2-highmem-4 machines, and 64 and 128 GB requests go to n2-highmem-16 machines, but it could be useful to allow that cutoff to slide so that the 16 and 32 GB requests end up on the n2-highmem-16 machine instead, for example - fitting up to 8 users on those.
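
For illustration only, a tiny sketch of what such a sliding cutoff could look like - the helper name is invented and is not part of this PR.

```python
def pick_machine_type(mem_request_gb, cutoff_gb=64):
    """Requests below the cutoff go to the smaller machine type, the rest to
    the larger one. Sliding cutoff_gb down to 16 would move the 16 and 32 GB
    requests onto the n2-highmem-16 machines instead."""
    return "n2-highmem-4" if mem_request_gb < cutoff_gb else "n2-highmem-16"

# With the default cutoff: 1-32 GB -> n2-highmem-4, 64/128 GB -> n2-highmem-16
print([pick_machine_type(m) for m in (1, 2, 4, 8, 16, 32, 64, 128)])
```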


EDIT: #3262 is a PR to use a 128 GB machine instead of a 32 GB machine by default during an event for resource requests of ~16 and ~32 GB, even though they would fit on a ~32 GB machine - because it puts more users per node and reduces overall startup delay.

@yuvipanda (Member Author)

@consideRatio there are two things with respect to overlap:

  1. Provide explicit guidance on what to do for events. I have some ideas in mind and will write them out explicitly.
  2. There should probably be no overlap - because when there is overlap, the end user has to understand what an 'instance size' is, how that relates to spawning speed, and how that relates to a variety of other factors. I agree there should be guidance on how to determine which sizes go on which instances. I'm currently observing usage on Openscapes, and will use that information to write out this guidance.

So the TODO out of here is:

  • Write guidance on how to pick instance sizes
  • Write guidance for events
  • Add some extra headroom to account for drift in our availability measurement
  • Think of ways to measure and account for this drift.

@yuvipanda (Member Author)

yuvipanda commented Sep 11, 2023

Checking in on Openscapes, since this was deployed to them on Aug 25.

Baseline cloud costs

So, Openscapes got the older-style node sharing when #2684 was merged on June 21, and on Aug 25 we switched to the setup generated by this script. While there are some confounding factors (primarily, some events), I think the baseline cost has definitely gone way down with this new setup!

[image]

Startup speed

The baseline cost has come down, but is this at the cost of startup speed?

[image]

There's no discernible difference in server startup speeds! (Data is missing for big parts of July, though.)

Qualitative feedback

Talking to folks in the Openscapes Slack, the response to this has been generally positive. End users are less confused about what is needed, and the limits are now visible in JupyterLab.

I'll get working on documenting this so others can use it too, but I now believe that, at least for Openscapes-style hubs, this is a good improvement over the status quo.

@yuvipanda (Member Author)

yuvipanda commented Sep 11, 2023

I would, however, like this to be a little more automated than it is right now. In particular, I don't want us to have to do a big set of manual tweaks to all of these options every time we update node capacity information, as that's error-prone and toil-heavy. I'll look at ways of making that happen.

Comment on lines 105 to 120
# Add a little bit of wiggle room, to account for:
# 1. Changes in requests for system components as k8s versions upgrade or
# cloud providers roll out new components
# 2. We deploy support components but forget to update node capacity info
# 3. Whatever other things we aren't currently thinking of.
# A small amount of memory and CPU to sacrifice for the sake of
# operational flexibility. However, we *must* regenerate and update node
# information each time the following events occur:
# 1. We upgrade kubernetes versions
# 2. We change resource requirements for *daemonsets* in our support chart.
# 3. We upgrade z2jh version

# 128 MiB memory buffer
mem_available = mem_available - (128 * 1024 * 1024)
# 0.05 CPU overhead
cpu_available = cpu_available - 0.05
@consideRatio (Contributor)

I'm leaning heavily towards ensuring that we don't run into "fail to start server" issues caused by this, rather than towards minimizing the headroom. Can we make it at least something like 256 Mi and 0.15 CPU of headroom?

I think a memory overhead of 128 MiB is cutting it closer than merited, given that our nodes have 32GB+ and that a failure to get this right can cause an issue we won't observe until there is a runtime failure detected by users. By not cutting it close, the memory can also provide a buffer for when the node includes workloads that don't have requests=limits, so it's not wasted either.

I think the CPU overhead is too small as well, mostly because I don't see a good enough motivation for cutting it close, since the CPU overhead won't go to waste as long as we don't put super tight CPU limits on.

@yuvipanda (Member Author)

Thank you for writing this out clearly, @consideRatio.

> I think the CPU overhead is too small as well, mostly because I don't see a good enough motivation for cutting it close, since the CPU overhead won't go to waste as long as we don't put super tight CPU limits on.

I think this makes 100% sense for the one strategy introduced in this PR, as it only tackles CPU requests, not limits. So I'll increase the headroom to 0.15 (or 0.20) but move the headroom calculation into the strategy code - when we introduce other strategies in the future, they can make their own choices.

I'm trying to understand when exactly this will actually cause a "fail to start server" issue, rather than an issue where utilization of a server is not 100% when packed. I think this only happens in the case where the resource allocation sets the guarantee to 100% of available memory. Looking at the set of choices currently deployed to Openscapes, that's 2 of the 6 choices. So I understand that in those cases, if circumstances change such that the static calculation here for 'available' no longer matches reality, users would end up with a hanging pod that never gets scheduled anywhere. In the other 4 choices, what will happen instead is wasted resources, as a new node will be spun up when a user might have already fit on the previous node.

So to handle the memory requests case, I will do the following:

  1. Increase the general overhead calculation, from which all resource allocations are measured, from 128Mi to 256Mi (roughly sketched in the snippet after this list).
  2. Specifically for the choices where we are allocating a full node, increase this number even further (probably to 512Mi), as a measurement mismatch here would cause a bigger user problem than with (1).
  3. Open an issue to consider changing the strategy to have limits that don't count the overhead, and requests that do count the overhead. However, this makes the process a little more complex, and so I'd prefer to not do this on the first run. Sacrificing a little bit of extra RAM for the simplicity seems a worthwhile tradeoff.
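
Roughly what (1) and (2) could look like relative to the snippet quoted above - a sketch under the assumptions stated here, not the exact code merged in the PR; apply_headroom and is_full_node are invented names.

```python
def apply_headroom(mem_available, cpu_available, is_full_node):
    """Sketch of the plan above: 256 MiB + 0.15 CPU of general headroom for
    every choice, and extra memory headroom (~512 MiB total) for choices that
    allocate a full node, where a mismatch means the pod never schedules
    rather than just wasting capacity."""
    mem_available -= 256 * 1024 * 1024
    cpu_available -= 0.15
    if is_full_node:
        mem_available -= 256 * 1024 * 1024
    return mem_available, cpu_available
```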

I think over time, as we gain more experience with this strategy as well as more data, we can fine-tune this some more. Ideally, the "mem_available" data would be dynamic rather than static - I think this would increase confidence that this drift would not occur. I'll think of ideas for how this can be done, but won't block progress on this PR on that.

Does this satisfy you, @consideRatio?

@consideRatio (Contributor)

Thank you for that chat, Yuvi!! I opened #3132 to describe my proposed simple strategy.

yuvipanda and others added 6 commits October 23, 2023 13:50
Allows us to construct multiple profile choices with
multiple node types.
@yuvipanda force-pushed the resource-allocation branch from aa27637 to 73cad86 on October 23, 2023 08:20

@yuvipanda marked this pull request as ready for review October 23, 2023 09:00
@consideRatio (Contributor) left a comment

Thank you for working on this, @yuvipanda! Based on agreement via other channels, let's merge this and iterate on it in separate PRs over time.

@consideRatio merged commit d4224ce into 2i2c-org:master Oct 24, 2023
32 checks passed
@github-actions

🎉🎉🎉🎉

Monitor the deployment of the hubs here 👉 https://github.com/2i2c-org/infrastructure/actions/runs/6624047336

Labels
tech:cloud-infra Optimization of cloud infra to reduce costs etc.