Commit
Merge pull request 2i2c-org#3522 from GeorgianaElena/event-prep
[documentation] Start an event preparation guide
GeorgianaElena authored Dec 15, 2023
2 parents 4b07607 + ac9e376 commit 4454050
Showing 10 changed files with 220 additions and 1 deletion.
1 change: 1 addition & 0 deletions docs/conf.py
@@ -17,6 +17,7 @@
"sphinx_design",
"sphinxcontrib.mermaid",
"sphinxcontrib.jquery",
"sphinx_togglebutton",
]

intersphinx_mapping = {
1 change: 1 addition & 0 deletions docs/howto/features/dedicated-nodepool.md
@@ -1,3 +1,4 @@
(features:shared-cluster:dedicated-nodepool)=
# Setup a dedicated nodepool for a hub on a shared cluster

Some hubs on shared clusters require dedicated nodepools, for a few reasons:
198 changes: 198 additions & 0 deletions docs/howto/prepare-for-events/event-prep.md
@@ -0,0 +1,198 @@
# Event infrastructure preparation checklist

Listed below are the main aspects to consider adjusting on a hub to prepare it for an event:

## 1. Quotas

We must ensure that the quotas from the cloud provider are high enough to handle the expected usage. The number of users attending the event might be very large, their expected resource usage might be high, or both. Either way, we need to check that the existing quotas will accommodate the new numbers.

```{admonition} Action to take
:class: tip
- follow the [AWS quota guide](hub-deployment-guide:cloud-accounts:aws-quotas) for information about how to check the quotas in an AWS project
- follow the [GCP quota guide](hub-deployment-guide:cloud-accounts:gcp-quotas) for information about how to check the quotas in a GCP project
```
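As a quick sanity check before filing any quota increase requests, the headroom math can be sketched in a few lines. Every number below is an illustrative assumption (hypothetical user counts and quotas), not a real project's limits:

```python
# Hypothetical pre-event quota check; all numbers here are
# illustrative assumptions, not real project quotas.
expected_users = 100     # from the event GitHub issue
cores_per_user = 0.5     # assumed CPU request per user server
mem_per_user_gb = 4      # assumed memory request per user server

# Assumed current quotas in the cloud project/region
quota_cpus = 128
quota_memory_gb = 1024

needed_cpus = expected_users * cores_per_user
needed_memory_gb = expected_users * mem_per_user_gb

print(f"Need ~{needed_cpus:.0f} vCPUs, quota is {quota_cpus}")
print(f"Need ~{needed_memory_gb:.0f} GB RAM, quota is {quota_memory_gb}")
if needed_cpus > quota_cpus or needed_memory_gb > quota_memory_gb:
    print("File a quota increase request well before the event!")
```

Keep in mind that core/system pods also consume node capacity, so leave some headroom on top of these numbers.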

## 2. Consider dedicated nodepools on shared clusters

If the hub having an event runs on a shared cluster, then we might want to consider putting it on a dedicated nodepool, as that will help with cost isolation and effective scaling up/down, and avoid impacting the performance of other hubs' users.

```{admonition} Action to take
:class: tip
Follow the guide at [](features:shared-cluster:dedicated-nodepool) in order to set up a dedicated nodepool before an event.
```

## 3. Pre-warm the hub to reduce wait times

There are two mechanisms that we can use to pre-warm a hub before an event:

- making sure some **nodes are ready** when users arrive

This can be done using node sharing via profile lists or by setting a minimum node count.

```{note}
You can read more about what to consider when setting resource allocation options in profile lists in [](topic:resource-allocation).
```
```{admonition} Expand this to find out the benefits of node sharing via profile lists
:class: dropdown
Specifically for events, the benefits of node sharing via profile lists vs. setting a minimum node count are:
- **no `terraform`/`eks` infrastructure changes**
  profile lists shouldn't require modifying terraform/eks code to change the underlying cluster architecture, thanks to [](topic:cluster-design:instance-type), which should cover most usage needs
- **more cost flexibility**
  we can set up the infrastructure a few days before the event by opening a PR, then merge it as close to the event as possible. Deploying an infrastructure change a few days early isn't as costly as starting "x" nodes in advance, which would require an engineer to be available to make terraform changes as close to the event as possible because of the costs involved
- **less engineering intervention needed**
  the instructors are empowered to "pre-warm" the hub by starting notebook servers on the nodes they wish to have ready.
```
- making sure the user **image is not huge**, as otherwise pre-pulling it must be considered

### 3.1. Node sharing via profile lists
```{important}
Currently, this is the recommended way to handle an event on a hub. However, for communities that don't already use profile lists, setting one up just before an event might be confusing; in that case, we might want to consider setting a minimum node count instead.
```

During events, we want to tilt the balance towards reducing server startup time. The docs at [](topic:resource-allocation) have more information about all the factors that should be considered during resource allocation.

Assuming the hub already has a profile list, you should check the following before an event:

1. **Information is available**

Make sure the information in the event GitHub issue has been filled in, especially the number of expected users and their expected resource needs (if the community can know that beforehand).

2. **Given the current setup, calculate how many users will fit on a node**

Check that the current number of users per node respects the following general event wishlist.

3. **Minimize startup time**

- have at least `3-4 people on a node`, since too few users per node causes longer startup times, but [no more than ~100](https://kubernetes.io/docs/setup/best-practices/cluster-large/#:~:text=No%20more%20than%20110%20pods,more%20than%20300%2C000%20total%20containers)

- don't have more than 30% of the users waiting for a node to come up

````{admonition} Action to take
:class: tip
If the current number of users per node doesn't respect the rules above, you should adjust the instance type so that it does.
Note that if you change the instance type, you should also consider regenerating the allocation options, especially if you move to a smaller machine than the original one:
```{code-block}
deployer generate resource-allocation choices <instance type>
```
````
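The two rules above can be sketched as a small check. The user counts below are made-up, and the 30% rule is interpreted here as an assumption: the share of users left waiting for a single node to come up (i.e. one node's worth of users) should stay under 30% of the total:

```python
# Made-up numbers for a hypothetical event.
import math

expected_users = 60
users_per_node = 8   # with the current resource requests / instance type

nodes_needed = math.ceil(expected_users / users_per_node)

# Rule 1: at least 3-4 users per node, but well under the
# ~100 pods-per-node Kubernetes limit.
assert 4 <= users_per_node <= 100

# Rule 2 (interpreted): at most one node's worth of users is waiting
# during a scale-up, and that should be under 30% of the total.
waiting_fraction = users_per_node / expected_users
assert waiting_fraction <= 0.30

print(f"{nodes_needed} nodes needed; at most "
      f"{waiting_fraction:.0%} of users wait on a scale-up")
```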
4. **Don't oversubscribe resources**
The oversubscription factor is how much larger a limit is than the actual request (i.e. the minimum guaranteed amount of a resource reserved for a container). A larger factor allows more efficient node packing, because most users don't use resources up to their limit, so more users can fit on a node.
However, a larger oversubscription factor also means that users who consume more than their guarantee can have their kernels killed or CPUs throttled at times, depending on what other users are doing. This inconsistent behavior is confusing to end users, so we should try to avoid it during events.
`````{admonition} Action to take
:class: tip
For an event, you should consider an oversubscription factor of 1.
- if the instance type remains unchanged, then just adjust the limit to match the memory guarantee, if that's not already the case
- if the instance type also changes, then you can use the `deployer generate resource-allocation` command, passing it the new instance type and optionally the number of choices.
You can then use its output to:
- either replace all allocation options with the ones for the new node type
- or pick only the choice(s) that will be used during the event, based on expected usage, and don't show the others
````{admonition} Example
For example, if the community expects to use only ~3GB of memory during the event, and no other users are expected on the hub for its duration, then you can choose to make available only that one option.
Assuming they had 4 options on an `n2-highmem-2` machine and we wish to move them to an `n2-highmem-4` for the event, we could run:
```{code-block}
deployer generate resource-allocation choices n2-highmem-4 --num-allocations 4
```
which will output:
```{code-block}
# pick this option to present the single ~3GB memory option for the event
mem_3_4:
display_name: 3.4 GB RAM, upto 3.485 CPUs
kubespawner_override:
mem_guarantee: 3662286336
mem_limit: 3662286336
cpu_guarantee: 0.435625
cpu_limit: 3.485
node_selector:
node.kubernetes.io/instance-type: n2-highmem-4
default: true
mem_6_8:
display_name: 6.8 GB RAM, upto 3.485 CPUs
kubespawner_override:
mem_guarantee: 7324572672
mem_limit: 7324572672
cpu_guarantee: 0.87125
cpu_limit: 3.485
node_selector:
node.kubernetes.io/instance-type: n2-highmem-4
(...2 more options)
```
And we would have this in the profileList configuration:
```{code-block}
profileList:
- display_name: Workshop
description: Workshop environment
default: true
kubespawner_override:
image: python:6ee57a9
profile_options:
requests:
display_name: Resource Allocation
choices:
mem_3_4:
display_name: 3.4 GB RAM, upto 3.485 CPUs
kubespawner_override:
mem_guarantee: 3662286336
mem_limit: 3662286336
cpu_guarantee: 0.435625
cpu_limit: 3.485
node_selector:
node.kubernetes.io/instance-type: n2-highmem-4
```
````
````{warning}
The `deployer generate resource-allocation` command:
- can only generate options where guarantees (requests) equal limits!
- only supports the instance types listed in the `node-capacity-info.json` file
````
`````
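As an illustration of the factor-of-1 recommendation, here is the arithmetic using the `mem_3_4` option generated above:

```python
# Values copied from the generated mem_3_4 option above.
mem_guarantee = 3662286336   # bytes reserved per user server
mem_limit = 3662286336       # hard memory cap per user server

# Oversubscription factor: how much larger the limit is than the request.
oversubscription = mem_limit / mem_guarantee
print(f"Oversubscription factor: {oversubscription}")  # 1.0

# With a factor of 1, users can never exceed their guarantee, so their
# kernels won't be killed because of other users' activity -- at the
# cost of packing fewer users onto each node.
assert oversubscription == 1.0
```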
### 3.2. Setting a minimum node count on a specific node pool
```{warning}
This section is a Work in Progress!
```
### 3.3. Pre-pulling the image
```{warning}
This section is a Work in Progress!
```
Relevant discussions:
- https://github.com/2i2c-org/infrastructure/issues/2541
- https://github.com/2i2c-org/infrastructure/pull/3313
- https://github.com/2i2c-org/infrastructure/pull/3341
```{important}
To get a deeper understanding of the resource allocation topic, you can read through these issues and documentation pieces:
- https://github.com/2i2c-org/infrastructure/issues/2121
- https://github.com/2i2c-org/infrastructure/pull/3030
- https://github.com/2i2c-org/infrastructure/issues/3132
- https://github.com/2i2c-org/infrastructure/issues/3293
- https://infrastructure.2i2c.org/topic/resource-allocation/#factors-to-balance
```
File renamed without changes.
15 changes: 15 additions & 0 deletions docs/howto/prepare-for-events/index.md
@@ -0,0 +1,15 @@
# Manage events on 2i2c hubs

A hub's specific setup is usually optimized based on day-to-day usage expectations. But because events usually imply a different usage pattern, the infrastructure might need to be adjusted to accommodate the spikes in activity.

```{important}
The communities we serve are responsible for notifying us about an event they have planned on a 2i2c hub [at least three weeks before](https://docs.2i2c.org/community/events/#notify-the-2i2c-team-about-the-event) the event starts. This should give us enough time to plan and prepare the infrastructure for the event properly, if needed.
```

Events vary in type, so the following list is not complete and does not (yet) cover all of them. The most common event types are exams and workshops.

```{toctree}
:maxdepth: 2
event-prep.md
exam.md
```
@@ -43,6 +43,7 @@ More information on these terms can be found in [](cloud-access:aws).
You have successfully created a new AWS account and connected it to our AWS Organization's Management Account!
Now, [setup a new cluster](new-cluster:aws) inside it via Terraform.

(hub-deployment-guide:cloud-accounts:aws-quotas)=
## Checking quotas and requesting increases

Cloud providers like AWS require their users to request a _Service Quota
@@ -23,6 +23,7 @@
```
7. [Setup a new cluster](new-cluster:new-cluster) inside it via Terraform

(hub-deployment-guide:cloud-accounts:gcp-quotas)=
## Checking quotas and requesting increases

Finally, we should check what quotas are enforced on the project and increase them as necessary.
2 changes: 1 addition & 1 deletion docs/index.md
@@ -70,7 +70,7 @@ deployed occasionally as a specific addition.
howto/features/index.md
howto/bill.md
howto/custom-jupyterhub-image.md
howto/exam.md
howto/prepare-for-events/index.md
howto/manage-domains/index.md
howto/grafana-github-auth.md
howto/update-env.md
1 change: 1 addition & 0 deletions docs/topic/infrastructure/cluster-design.md
@@ -77,6 +77,7 @@ On GKE clusters with network policy enforcement, we look to edit the
`calico-typha-horizontal-autoscaler` ConfigMap in `kube-system` to avoid scaling
up to two replicas unless there are very many nodes in the k8s cluster.

(topic:cluster-design:instance-type)=
### Our instance type choice

#### For nodes where core services will be scheduled on
1 change: 1 addition & 0 deletions docs/topic/resource-allocation.md
@@ -1,3 +1,4 @@
(topic:resource-allocation)=
# Resource Allocation on Profile Lists

This document lays out general guidelines on how to think about what goes into
