From d637124316f7a4ee45dfeecd0e21919cbbddd7b0 Mon Sep 17 00:00:00 2001 From: Georgiana Dolocan Date: Fri, 8 Dec 2023 18:57:06 +0200 Subject: [PATCH 01/12] Start an event preparation guide --- docs/howto/event-prep.md | 57 +++++++++++++++++++ docs/howto/features/dedicated-nodepool.md | 1 + .../cloud-accounts/new-aws-account.md | 1 + .../cloud-accounts/new-gcp-project.md | 1 + docs/index.md | 1 + docs/topic/infrastructure/cluster-design.md | 1 + docs/topic/resource-allocation.md | 1 + 7 files changed, 63 insertions(+) create mode 100644 docs/howto/event-prep.md diff --git a/docs/howto/event-prep.md b/docs/howto/event-prep.md new file mode 100644 index 0000000000..f03fcd7bcc --- /dev/null +++ b/docs/howto/event-prep.md @@ -0,0 +1,57 @@ +# Decide if the infrastructure needs preparation before an Event + +A hub's specific setup is usually optimized based on the day to day usage expectations. But because events provide a different pattern of usage, the infrastructure might need to be adjusted in order to accommodate the spikes in activity. + +The communities we serve have the responsibility to notify us about an event they have planned on a 2i2c hub [at least three weeks before](https://docs.2i2c.org/community/events/#notify-the-2i2c-team-about-the-event) the event will start. This should allow us enough time to plan and prepare the infrastructure for the event properly if needed. + +The events might vary in type, so the following list is not complete and does not cover all of them (yet). + +Most common event types are: + +1. Exams +2. Workshops + +## Event checklist + +Below are listed the main aspects to consider adjusting to prepare a hub for an event: + +### 1. Quotas + +We must ensure that the quotas from the cloud provider are high-enough to handle expected usage. It might be that the number of users attending the event is very big, or their expected resource usage is big, or both. Either way, we need to check the the existing quotas will accommodate the new numbers. + +- [AWS quota guide](hub-deployment-guide:cloud-accounts:aws-quotas) has information about how to check the quotas in an AWS project +- [GCP quota guide](hub-deployment-guide:cloud-accounts:aws-quotas) has information about how to check the quotas in a GCP quotas + +### 2. Dedicated nodepool - shared clusters + +If the hub that's having an event is running on a shared cluster, then we might want to consider putting it on a dedicated nodepool as that will help with cost isolation, scaling up/down effectively, not impacting other hub's users performance. + +Follow the guide at [](features:shared-cluster:dedicated-nodepool) in order to setup a dedicated nodepool before an event. + +### 3. Pre-warming + +There are two mechanisms that we can use to pre-warm a hub before an event: using node sharing via profile lists and by setting a minimum node count. + +```{note} +You can read more about what to consider when setting resource allocation options in profile lists in [](topic:resource-allocation). +``` + +Specifically for events, the node sharing benefits via profile lists vs. setting a minimum node count are: + +- it shouldn't require modifying terraform/eks code in order to change the underlying cluster architecture thanks to our [](topic:cluster-design:instance-type) that should cover most usage needs +- we can setup the infrastructure a few days before the event by opening a PR, and then just merge it as close to the event as possible. 
Deploying an infrastructure change for an event a few days before isn't as costly as starting "x" nodes before, which required an engineer to be available to make terraform changes as close to the event as possible due to costs +- the instructors are empowered to "pre-warm" the hub by starting notebook servers on nodes they wish to have ready. + +```{warning} +However, for some communities that don't already use profile lists, setting up one just before an event might be confusing, we might want to consider setting up a minimum node count in this case. +``` + +#### 3.1. Using node sharing + +```{important} +Currently, this is the recommended way to prepare a hub before an event if the hub uses profile lists already. +``` + +#### 3.2. By setting a minimum node count for the autoscaler on a specific node pool + + diff --git a/docs/howto/features/dedicated-nodepool.md b/docs/howto/features/dedicated-nodepool.md index 79bc5b82b2..e41bd41f40 100644 --- a/docs/howto/features/dedicated-nodepool.md +++ b/docs/howto/features/dedicated-nodepool.md @@ -1,3 +1,4 @@ +(features:shared-cluster:dedicated-nodepool)= # Setup a dedicated nodepool for a hub on a shared cluster Some hubs on shared clusters require dedicated nodepools, for a few reasons: diff --git a/docs/hub-deployment-guide/cloud-accounts/new-aws-account.md b/docs/hub-deployment-guide/cloud-accounts/new-aws-account.md index 16d21355e0..7e0000c3c2 100644 --- a/docs/hub-deployment-guide/cloud-accounts/new-aws-account.md +++ b/docs/hub-deployment-guide/cloud-accounts/new-aws-account.md @@ -43,6 +43,7 @@ More information on these terms can be found in [](cloud-access:aws). You have successfully created a new AWS account and connected it to our AWS Organization's Management Account! Now, [setup a new cluster](new-cluster:aws) inside it via Terraform. +(hub-deployment-guide:cloud-accounts:aws-quotas)= ## Checking quotas and requesting increases Cloud providers like AWS require their users to request a _Service Quota diff --git a/docs/hub-deployment-guide/cloud-accounts/new-gcp-project.md b/docs/hub-deployment-guide/cloud-accounts/new-gcp-project.md index 5353636ddb..1efccf79dd 100644 --- a/docs/hub-deployment-guide/cloud-accounts/new-gcp-project.md +++ b/docs/hub-deployment-guide/cloud-accounts/new-gcp-project.md @@ -23,6 +23,7 @@ ``` 7. [Setup a new cluster](new-cluster:new-cluster) inside it via Terraform +(hub-deployment-guide:cloud-accounts:gcp-quotas)= ## Checking quotas and requesting increases Finally, we should check what quotas are enforced on the project and increase them as necessary. diff --git a/docs/index.md b/docs/index.md index 5ed3dcadce..1f48f031b6 100644 --- a/docs/index.md +++ b/docs/index.md @@ -71,6 +71,7 @@ howto/features/index.md howto/bill.md howto/custom-jupyterhub-image.md howto/exam.md +howto/event-prep.md howto/manage-domains/index.md howto/grafana-github-auth.md howto/update-env.md diff --git a/docs/topic/infrastructure/cluster-design.md b/docs/topic/infrastructure/cluster-design.md index cae71f8c06..39f77ccc72 100644 --- a/docs/topic/infrastructure/cluster-design.md +++ b/docs/topic/infrastructure/cluster-design.md @@ -77,6 +77,7 @@ On GKE clusters with network policy enforcement, we look to edit the `calico-typha-horizontal-autoscaler` ConfigMap in `kube-system` to avoid scaling up to two replicas unless there are very many nodes in the k8s cluster. 
+(topic:cluster-design:instance-type)= ### Our instance type choice #### For nodes where core services will be scheduled on diff --git a/docs/topic/resource-allocation.md b/docs/topic/resource-allocation.md index 479faeaa51..3f6a94dc36 100644 --- a/docs/topic/resource-allocation.md +++ b/docs/topic/resource-allocation.md @@ -1,3 +1,4 @@ +(topic:resource-allocation)= # Resource Allocation on Profile Lists This document lays out general guidelines on how to think about what goes into From 4e9a0128f919031257c162eb65bc6b101b236963 Mon Sep 17 00:00:00 2001 From: Georgiana Dolocan Date: Mon, 11 Dec 2023 17:13:33 +0200 Subject: [PATCH 02/12] Add a checklsit for pre-warming --- docs/howto/event-prep.md | 80 +++++++++++++++++++++++++++++----------- 1 file changed, 59 insertions(+), 21 deletions(-) diff --git a/docs/howto/event-prep.md b/docs/howto/event-prep.md index f03fcd7bcc..7aa9ffd708 100644 --- a/docs/howto/event-prep.md +++ b/docs/howto/event-prep.md @@ -1,46 +1,58 @@ # Decide if the infrastructure needs preparation before an Event -A hub's specific setup is usually optimized based on the day to day usage expectations. But because events provide a different pattern of usage, the infrastructure might need to be adjusted in order to accommodate the spikes in activity. +A hub's specific setup is usually optimized based on the day to day usage expectations. But because events usually imply a different usage pattern, the infrastructure might need to be adjusted in order to accommodate the spikes in activity. The communities we serve have the responsibility to notify us about an event they have planned on a 2i2c hub [at least three weeks before](https://docs.2i2c.org/community/events/#notify-the-2i2c-team-about-the-event) the event will start. This should allow us enough time to plan and prepare the infrastructure for the event properly if needed. -The events might vary in type, so the following list is not complete and does not cover all of them (yet). - -Most common event types are: - -1. Exams -2. Workshops +The events might vary in type, so the following list is not complete and does not cover all of them (yet) Most common event types are exams, workshops etc. ## Event checklist -Below are listed the main aspects to consider adjusting to prepare a hub for an event: +Below are listed the main aspects to consider adjusting on a hub to prepare it for an event: -### 1. Quotas +### 1. Check the quotas We must ensure that the quotas from the cloud provider are high-enough to handle expected usage. It might be that the number of users attending the event is very big, or their expected resource usage is big, or both. Either way, we need to check the the existing quotas will accommodate the new numbers. -- [AWS quota guide](hub-deployment-guide:cloud-accounts:aws-quotas) has information about how to check the quotas in an AWS project -- [GCP quota guide](hub-deployment-guide:cloud-accounts:aws-quotas) has information about how to check the quotas in a GCP quotas +```{tip} +- follow the [AWS quota guide](hub-deployment-guide:cloud-accounts:aws-quotas) for information about how to check the quotas in an AWS project +- follow [GCP quota guide](hub-deployment-guide:cloud-accounts:aws-quotas) for information about how to check the quotas in a GCP project +``` -### 2. Dedicated nodepool - shared clusters +### 2. 
Consider dedicated nodepools on shared clusters If the hub that's having an event is running on a shared cluster, then we might want to consider putting it on a dedicated nodepool as that will help with cost isolation, scaling up/down effectively, not impacting other hub's users performance. +```{tip} Follow the guide at [](features:shared-cluster:dedicated-nodepool) in order to setup a dedicated nodepool before an event. +``` -### 3. Pre-warming +### 3. Pre-warm the hub to reduce wait times -There are two mechanisms that we can use to pre-warm a hub before an event: using node sharing via profile lists and by setting a minimum node count. +There are two mechanisms that we can use to pre-warm a hub before an event: + - making sure some **nodes are ready** when users arrive + + This can be done using node sharing via profile lists or by setting a minimum node count. + ```{note} + You can read more about what to consider when setting resource allocation options in profile lists in [](topic:resource-allocation). + ``` + + - the user **image is not huge**, otherwise pre-pulling it must be considered -```{note} -You can read more about what to consider when setting resource allocation options in profile lists in [](topic:resource-allocation). -``` Specifically for events, the node sharing benefits via profile lists vs. setting a minimum node count are: -- it shouldn't require modifying terraform/eks code in order to change the underlying cluster architecture thanks to our [](topic:cluster-design:instance-type) that should cover most usage needs -- we can setup the infrastructure a few days before the event by opening a PR, and then just merge it as close to the event as possible. Deploying an infrastructure change for an event a few days before isn't as costly as starting "x" nodes before, which required an engineer to be available to make terraform changes as close to the event as possible due to costs -- the instructors are empowered to "pre-warm" the hub by starting notebook servers on nodes they wish to have ready. + - **no `terraform/eks` infrastructure changes** + + they shouldn't require modifying terraform/eks code in order to change the underlying cluster architecture thanks to [](topic:cluster-design:instance-type) that should cover most usage needs + + - **more cost flexibility** + + we can setup the infrastructure a few days before the event by opening a PR, and then just merge it as close to the event as possible. Deploying an infrastructure change for an event a few days before isn't as costly as starting "x" nodes before, which required an engineer to be available to make terraform changes as close to the event as possible due to costs + + - **less engineering intervention needed** + + the instructors are empowered to "pre-warm" the hub by starting notebook servers on nodes they wish to have ready. ```{warning} However, for some communities that don't already use profile lists, setting up one just before an event might be confusing, we might want to consider setting up a minimum node count in this case. @@ -49,9 +61,35 @@ However, for some communities that don't already use profile lists, setting up o #### 3.1. Using node sharing ```{important} -Currently, this is the recommended way to prepare a hub before an event if the hub uses profile lists already. +Currently, this is the recommended way to prepare a hub before an event if the hub uses profile lists. ``` +Assuming this hub already has a profile list, before an event, you should check the following: + +1. 
**Information is avalailable** + + Make sure the information in the event GitHub issue was filled in, especially the number of expected users before an event and their expected resource needs (if that can be known by the community beforehand). + +2. **Given the current setup, calculate** + + - how many users will fit on a node? + - how many nodes will be necessary during the event? + +3. **Check some rules** + + With the numbers you got, check the following general rules are respected: + + - **Startup time** + - have at least `3-4 people on a node` but [no more than ~100]( https://kubernetes.io/docs/setup/best-practices/cluster-large/#:~:text=No%20more%20than%20110%20pods,more%20than%20300%2C000%20total%20containers) as few users per node cause longer startup times + - `no more than 30% of the users waiting for a node` to come up + - For events, we wish to enforce memory constraints that can easily be observed and understood. We might want to consider having an oversubscription factor of 1. + With this setup, when the limit is reached, the process inside container will be killed and typically in this situation, the kernel dies. + + +3. **Tilt the balance towards reducing server startup time** + +https://infrastructure.2i2c.org/topic/resource-allocation/#factors-to-balance + #### 3.2. By setting a minimum node count for the autoscaler on a specific node pool From 1fb316fc80159506bfb50ed316c50bd83e36082b Mon Sep 17 00:00:00 2001 From: Georgiana Dolocan Date: Tue, 12 Dec 2023 16:53:32 +0200 Subject: [PATCH 03/12] Add more details about oversubsctiption --- docs/howto/event-prep.md | 39 +++++++++++++++++++++++++++++---------- 1 file changed, 29 insertions(+), 10 deletions(-) diff --git a/docs/howto/event-prep.md b/docs/howto/event-prep.md index 7aa9ffd708..56529b51e5 100644 --- a/docs/howto/event-prep.md +++ b/docs/howto/event-prep.md @@ -4,7 +4,7 @@ A hub's specific setup is usually optimized based on the day to day usage expect The communities we serve have the responsibility to notify us about an event they have planned on a 2i2c hub [at least three weeks before](https://docs.2i2c.org/community/events/#notify-the-2i2c-team-about-the-event) the event will start. This should allow us enough time to plan and prepare the infrastructure for the event properly if needed. -The events might vary in type, so the following list is not complete and does not cover all of them (yet) Most common event types are exams, workshops etc. +The events might vary in type, so the following list is not complete and does not cover all of them (yet). Most common event types are exams, workshops etc. ## Event checklist @@ -66,25 +66,44 @@ Currently, this is the recommended way to prepare a hub before an event if the h Assuming this hub already has a profile list, before an event, you should check the following: -1. **Information is avalailable** +1. **Information is available** Make sure the information in the event GitHub issue was filled in, especially the number of expected users before an event and their expected resource needs (if that can be known by the community beforehand). 2. **Given the current setup, calculate** - - how many users will fit on a node? - - how many nodes will be necessary during the event? + - x = how many users will fit on a node? 3. 
**Check some rules** - With the numbers you got, check the following general rules are respected: + Check that `x` respects the following general rules: - - **Startup time** - - have at least `3-4 people on a node` but [no more than ~100]( https://kubernetes.io/docs/setup/best-practices/cluster-large/#:~:text=No%20more%20than%20110%20pods,more%20than%20300%2C000%20total%20containers) as few users per node cause longer startup times - - `no more than 30% of the users waiting for a node` to come up - - For events, we wish to enforce memory constraints that can easily be observed and understood. We might want to consider having an oversubscription factor of 1. - With this setup, when the limit is reached, the process inside container will be killed and typically in this situation, the kernel dies. + - **Minimize startup time** + - have at least `3-4 people on a node` as few users per node cause longer startup times + - `no more than 30% of the users waiting for a node` to come up, but [no more than ~100]( https://kubernetes.io/docs/setup/best-practices/cluster-large/#:~:text=No%20more%20than%20110%20pods,more%20than%20300%2C000%20total%20containers) + + ```{admonition} Action to take + :class: tip + + If `x` doesn't respect the rules above, you should adjust the instance type. + ``` + + - **Don't oversubscribe resources** + + The oversubscription factor is how much larger a limit is than the actual request (aka, the minimum guaranteed amount of a resource that is reserved for a container). When this factor is greater, then a more efficient node packing can be achieved because usually most users don't use resources up to their limit, and more users can fit on a node. + + However, a bigger oversubscription factor also means that the users that use more resources than they are guaranteed can get their kernels killed or CPU throttled at some other times, based on what other users are doing. This inconsistent behavior is confusing to end users and the hub, so we should try and avoid this during events. + + ````{admonition} Action to take + :class: tip + + If the hub is setup so that the oversubscription factor of memory is greater than 1, you should consider changing it. For this you can use the deployer script by passing it the instance type where the pods will be scheduled on, in this example is `n2-highmem-4` and pick the choice(s) that will be used during the event based on expected usage. + + ```bash + deployer generate resource-allocation choices n2-highmem-4 + ``` + ```` 3. **Tilt the balance towards reducing server startup time** From e5309131a28a2d4e22fb8f7521ad8317ca424794 Mon Sep 17 00:00:00 2001 From: Georgiana Dolocan Date: Tue, 12 Dec 2023 18:35:54 +0200 Subject: [PATCH 04/12] Restructure the content a bit morE --- docs/conf.py | 1 + docs/howto/event-prep.md | 106 +++++++++++++++++++++++---------------- 2 files changed, 63 insertions(+), 44 deletions(-) diff --git a/docs/conf.py b/docs/conf.py index 9f93872895..c6fb4840e6 100644 --- a/docs/conf.py +++ b/docs/conf.py @@ -17,6 +17,7 @@ "sphinx_design", "sphinxcontrib.mermaid", "sphinxcontrib.jquery", + "sphinx_togglebutton", ] intersphinx_mapping = { diff --git a/docs/howto/event-prep.md b/docs/howto/event-prep.md index 56529b51e5..0f2f826334 100644 --- a/docs/howto/event-prep.md +++ b/docs/howto/event-prep.md @@ -2,7 +2,9 @@ A hub's specific setup is usually optimized based on the day to day usage expectations. 
But because events usually imply a different usage pattern, the infrastructure might need to be adjusted in order to accommodate the spikes in activity. +```{important} The communities we serve have the responsibility to notify us about an event they have planned on a 2i2c hub [at least three weeks before](https://docs.2i2c.org/community/events/#notify-the-2i2c-team-about-the-event) the event will start. This should allow us enough time to plan and prepare the infrastructure for the event properly if needed. +``` The events might vary in type, so the following list is not complete and does not cover all of them (yet). Most common event types are exams, workshops etc. @@ -10,11 +12,12 @@ The events might vary in type, so the following list is not complete and does no Below are listed the main aspects to consider adjusting on a hub to prepare it for an event: -### 1. Check the quotas +### 1. Quotas We must ensure that the quotas from the cloud provider are high-enough to handle expected usage. It might be that the number of users attending the event is very big, or their expected resource usage is big, or both. Either way, we need to check the the existing quotas will accommodate the new numbers. -```{tip} +```{admonition} Action to take +:class: tip - follow the [AWS quota guide](hub-deployment-guide:cloud-accounts:aws-quotas) for information about how to check the quotas in an AWS project - follow [GCP quota guide](hub-deployment-guide:cloud-accounts:aws-quotas) for information about how to check the quotas in a GCP project ``` @@ -23,92 +26,107 @@ We must ensure that the quotas from the cloud provider are high-enough to handle If the hub that's having an event is running on a shared cluster, then we might want to consider putting it on a dedicated nodepool as that will help with cost isolation, scaling up/down effectively, not impacting other hub's users performance. -```{tip} +```{admonition} Action to take +:class: tip Follow the guide at [](features:shared-cluster:dedicated-nodepool) in order to setup a dedicated nodepool before an event. ``` ### 3. Pre-warm the hub to reduce wait times There are two mechanisms that we can use to pre-warm a hub before an event: - - making sure some **nodes are ready** when users arrive + +- making sure some **nodes are ready** when users arrive This can be done using node sharing via profile lists or by setting a minimum node count. + ```{note} You can read more about what to consider when setting resource allocation options in profile lists in [](topic:resource-allocation). ``` - - the user **image is not huge**, otherwise pre-pulling it must be considered - + ```{admonition} Expand this to find out the benefits of node sharing via profile lists + :class: dropdown + Specifically for events, the node sharing benefits via profile lists vs. setting a minimum node count are: -Specifically for events, the node sharing benefits via profile lists vs. 
setting a minimum node count are: + - **no `terraform/eks` infrastructure changes** - - **no `terraform/eks` infrastructure changes** + they shouldn't require modifying terraform/eks code in order to change the underlying cluster architecture thanks to [](topic:cluster-design:instance-type) that should cover most usage needs - they shouldn't require modifying terraform/eks code in order to change the underlying cluster architecture thanks to [](topic:cluster-design:instance-type) that should cover most usage needs + - **more cost flexibility** - - **more cost flexibility** + we can setup the infrastructure a few days before the event by opening a PR, and then just merge it as close to the event as possible. Deploying an infrastructure change for an event a few days before isn't as costly as starting "x" nodes before, which required an engineer to be available to make terraform changes as close to the event as possible due to costs - we can setup the infrastructure a few days before the event by opening a PR, and then just merge it as close to the event as possible. Deploying an infrastructure change for an event a few days before isn't as costly as starting "x" nodes before, which required an engineer to be available to make terraform changes as close to the event as possible due to costs + - **less engineering intervention needed** - - **less engineering intervention needed** + the instructors are empowered to "pre-warm" the hub by starting notebook servers on nodes they wish to have ready. + ``` - the instructors are empowered to "pre-warm" the hub by starting notebook servers on nodes they wish to have ready. +- the user **image is not huge**, otherwise pre-pulling it must be considered -```{warning} -However, for some communities that don't already use profile lists, setting up one just before an event might be confusing, we might want to consider setting up a minimum node count in this case. -``` -#### 3.1. Using node sharing +#### 3.1. Node sharing via profile lists ```{important} -Currently, this is the recommended way to prepare a hub before an event if the hub uses profile lists. +Currently, this is the recommended way to handle an event on a hub. However, for some communities that don't already use profile lists, setting up one just before an event might be confusing, we might want to consider setting up a minimum node count in this case. ``` +During events, we want to tilt the balance towards reducing server startup time. The docs at [](topic:resource-allocation) have more information about all the factors that should be considered during resource allocation. + Assuming this hub already has a profile list, before an event, you should check the following: 1. **Information is available** Make sure the information in the event GitHub issue was filled in, especially the number of expected users before an event and their expected resource needs (if that can be known by the community beforehand). -2. **Given the current setup, calculate** +2. **Given the current setup, calculate how many users will fit on a node?** - - x = how many users will fit on a node? + Check that the current number of users/node respects the following general event wishlist. -3. **Check some rules** +3. 
**Minimize startup time** - Check that `x` respects the following general rules: + - have at least `3-4 people on a node` as few users per node cause longer startup times, but [no more than ~100]( https://kubernetes.io/docs/setup/best-practices/cluster-large/#:~:text=No%20more%20than%20110%20pods,more%20than%20300%2C000%20total%20containers) + - don't have more than 30% of the users waiting for a node to come up - - **Minimize startup time** + ```{admonition} Action to take + :class: tip - - have at least `3-4 people on a node` as few users per node cause longer startup times - - `no more than 30% of the users waiting for a node` to come up, but [no more than ~100]( https://kubernetes.io/docs/setup/best-practices/cluster-large/#:~:text=No%20more%20than%20110%20pods,more%20than%20300%2C000%20total%20containers) - - ```{admonition} Action to take - :class: tip - - If `x` doesn't respect the rules above, you should adjust the instance type. - ``` + If the current number of users per node doesn't respect the rules above, you should adjust the instance type so that it does. + ``` - - **Don't oversubscribe resources** +4. **Don't oversubscribe resources** + The oversubscription factor is how much larger a limit is than the actual request (aka, the minimum guaranteed amount of a resource that is reserved for a container). When this factor is greater, then a more efficient node packing can be achieved because usually most users don't use resources up to their limit, and more users can fit on a node. - The oversubscription factor is how much larger a limit is than the actual request (aka, the minimum guaranteed amount of a resource that is reserved for a container). When this factor is greater, then a more efficient node packing can be achieved because usually most users don't use resources up to their limit, and more users can fit on a node. + However, a bigger oversubscription factor also means that the users that use more resources than they are guaranteed can get their kernels killed or CPU throttled at some other times, based on what other users are doing. This inconsistent behavior is confusing to end users and the hub, so we should try and avoid this during events. - However, a bigger oversubscription factor also means that the users that use more resources than they are guaranteed can get their kernels killed or CPU throttled at some other times, based on what other users are doing. This inconsistent behavior is confusing to end users and the hub, so we should try and avoid this during events. + ````{admonition} Action to take + :class: tip - ````{admonition} Action to take - :class: tip + For an event, you should consider an oversubscription factor of 1. For this you can use the deployer script by passing it the instance type where the pods will be scheduled on (in this example is `n2-highmem-4`), then from its output, pick the choice(s) that will be used during the event based on expected usage. - If the hub is setup so that the oversubscription factor of memory is greater than 1, you should consider changing it. For this you can use the deployer script by passing it the instance type where the pods will be scheduled on, in this example is `n2-highmem-4` and pick the choice(s) that will be used during the event based on expected usage. 
+ ```bash + deployer generate resource-allocation choices n2-highmem-4 + ``` + ```` - ```bash - deployer generate resource-allocation choices n2-highmem-4 - ``` - ```` +````{warning} +Note that if you are changing the instance type, you should also consider re-writing the allocation options, especially if you are going with a smaller machine than the original one. -3. **Tilt the balance towards reducing server startup time** +```bash +deployer generate resource-allocation choices n2-highmem-4 +``` +```` -https://infrastructure.2i2c.org/topic/resource-allocation/#factors-to-balance +#### 3.2. Setting a minimum node count on a specific node pool + TODO -#### 3.2. By setting a minimum node count for the autoscaler on a specific node pool +#### 3.3. Pre-pulling the image + TODO. Relevant discussions: + - https://github.com/2i2c-org/infrastructure/issues/2541 + - https://github.com/2i2c-org/infrastructure/pull/3313 + - https://github.com/2i2c-org/infrastructure/pull/3341 +```{important} +To get a deeper understanding of the resource allocation topic, you can read up these issues: +- https://github.com/2i2c-org/infrastructure/issues/2121 +- +``` \ No newline at end of file From 6cdbf9ad2c6dd162ec0be85e39a12f110b6dc491 Mon Sep 17 00:00:00 2001 From: Georgiana Dolocan Date: Tue, 12 Dec 2023 18:46:15 +0200 Subject: [PATCH 05/12] Move warning higher --- docs/howto/event-prep.md | 31 ++++++++++++++++--------------- 1 file changed, 16 insertions(+), 15 deletions(-) diff --git a/docs/howto/event-prep.md b/docs/howto/event-prep.md index 0f2f826334..16d436c789 100644 --- a/docs/howto/event-prep.md +++ b/docs/howto/event-prep.md @@ -86,11 +86,16 @@ Assuming this hub already has a profile list, before an event, you should check - have at least `3-4 people on a node` as few users per node cause longer startup times, but [no more than ~100]( https://kubernetes.io/docs/setup/best-practices/cluster-large/#:~:text=No%20more%20than%20110%20pods,more%20than%20300%2C000%20total%20containers) - don't have more than 30% of the users waiting for a node to come up - ```{admonition} Action to take + ````{admonition} Action to take :class: tip If the current number of users per node doesn't respect the rules above, you should adjust the instance type so that it does. + Note that if you are changing the instance type, you should also consider re-writing the allocation options, especially if you are going with a smaller machine than the original one. + + ```bash + deployer generate resource-allocation choices ``` + ```` 4. **Don't oversubscribe resources** The oversubscription factor is how much larger a limit is than the actual request (aka, the minimum guaranteed amount of a resource that is reserved for a container). When this factor is greater, then a more efficient node packing can be achieved because usually most users don't use resources up to their limit, and more users can fit on a node. @@ -107,26 +112,22 @@ Assuming this hub already has a profile list, before an event, you should check ``` ```` -````{warning} -Note that if you are changing the instance type, you should also consider re-writing the allocation options, especially if you are going with a smaller machine than the original one. - -```bash -deployer generate resource-allocation choices n2-highmem-4 -``` -```` #### 3.2. Setting a minimum node count on a specific node pool - TODO +TODO #### 3.3. Pre-pulling the image - TODO. 
Relevant discussions: - - https://github.com/2i2c-org/infrastructure/issues/2541 - - https://github.com/2i2c-org/infrastructure/pull/3313 - - https://github.com/2i2c-org/infrastructure/pull/3341 +TODO. Relevant discussions: +- https://github.com/2i2c-org/infrastructure/issues/2541 +- https://github.com/2i2c-org/infrastructure/pull/3313 +- https://github.com/2i2c-org/infrastructure/pull/3341 ```{important} -To get a deeper understanding of the resource allocation topic, you can read up these issues: +To get a deeper understanding of the resource allocation topic, you can read up these issues and documentation pieces: - https://github.com/2i2c-org/infrastructure/issues/2121 -- +- https://github.com/2i2c-org/infrastructure/pull/3030 +- https://github.com/2i2c-org/infrastructure/issues/3132 +- https://github.com/2i2c-org/infrastructure/issues/3293 +- https://infrastructure.2i2c.org/topic/resource-allocation/#factors-to-balance ``` \ No newline at end of file From cdc6d04058d8dfc051fed8fed4bcb8aeb33501f4 Mon Sep 17 00:00:00 2001 From: Georgiana Date: Wed, 13 Dec 2023 15:41:11 +0200 Subject: [PATCH 06/12] Fix missing word Co-authored-by: Sarah Gibson <44771837+sgibson91@users.noreply.github.com> --- docs/howto/event-prep.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/howto/event-prep.md b/docs/howto/event-prep.md index 16d436c789..ebc6f0e991 100644 --- a/docs/howto/event-prep.md +++ b/docs/howto/event-prep.md @@ -19,7 +19,7 @@ We must ensure that the quotas from the cloud provider are high-enough to handle ```{admonition} Action to take :class: tip - follow the [AWS quota guide](hub-deployment-guide:cloud-accounts:aws-quotas) for information about how to check the quotas in an AWS project -- follow [GCP quota guide](hub-deployment-guide:cloud-accounts:aws-quotas) for information about how to check the quotas in a GCP project +- follow the [GCP quota guide](hub-deployment-guide:cloud-accounts:aws-quotas) for information about how to check the quotas in a GCP project ``` ### 2. Consider dedicated nodepools on shared clusters From 7a3aa93ce7d2a59f47a970e08b2f0fdfa0eb88d2 Mon Sep 17 00:00:00 2001 From: Georgiana Date: Wed, 13 Dec 2023 15:41:28 +0200 Subject: [PATCH 07/12] Reword for clarity Co-authored-by: Sarah Gibson <44771837+sgibson91@users.noreply.github.com> --- docs/howto/event-prep.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/howto/event-prep.md b/docs/howto/event-prep.md index ebc6f0e991..95594b2b47 100644 --- a/docs/howto/event-prep.md +++ b/docs/howto/event-prep.md @@ -24,7 +24,7 @@ We must ensure that the quotas from the cloud provider are high-enough to handle ### 2. Consider dedicated nodepools on shared clusters -If the hub that's having an event is running on a shared cluster, then we might want to consider putting it on a dedicated nodepool as that will help with cost isolation, scaling up/down effectively, not impacting other hub's users performance. +If the hub that's having an event is running on a shared cluster, then we might want to consider putting it on a dedicated nodepool as that will help with cost isolation, scaling up/down effectively, and avoid impacting other hub's users performance. 
```{admonition} Action to take :class: tip From c8c426a4cadaa870d0e8f640ba84eb7a4164f31b Mon Sep 17 00:00:00 2001 From: Georgiana Date: Wed, 13 Dec 2023 16:01:47 +0200 Subject: [PATCH 08/12] Turn todo into a warning Co-authored-by: Sarah Gibson <44771837+sgibson91@users.noreply.github.com> --- docs/howto/event-prep.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/docs/howto/event-prep.md b/docs/howto/event-prep.md index 95594b2b47..2ade863933 100644 --- a/docs/howto/event-prep.md +++ b/docs/howto/event-prep.md @@ -114,7 +114,9 @@ Assuming this hub already has a profile list, before an event, you should check #### 3.2. Setting a minimum node count on a specific node pool -TODO +```{warning} +This section is a Work in Progress! +``` #### 3.3. Pre-pulling the image TODO. Relevant discussions: From 4e292cc53edcd68029977722b37db8d6f1d797a4 Mon Sep 17 00:00:00 2001 From: Georgiana Dolocan Date: Wed, 13 Dec 2023 16:04:55 +0200 Subject: [PATCH 09/12] Turn todo into a wip warning --- docs/howto/event-prep.md | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/docs/howto/event-prep.md b/docs/howto/event-prep.md index 2ade863933..69bd15ef67 100644 --- a/docs/howto/event-prep.md +++ b/docs/howto/event-prep.md @@ -119,12 +119,16 @@ This section is a Work in Progress! ``` #### 3.3. Pre-pulling the image -TODO. Relevant discussions: + +```{warning} +This section is a Work in Progress! +``` + +Relevant discussions: - https://github.com/2i2c-org/infrastructure/issues/2541 - https://github.com/2i2c-org/infrastructure/pull/3313 - https://github.com/2i2c-org/infrastructure/pull/3341 - ```{important} To get a deeper understanding of the resource allocation topic, you can read up these issues and documentation pieces: - https://github.com/2i2c-org/infrastructure/issues/2121 From 6f94462d731ac00e315271d47fa6c1d6789b2eff Mon Sep 17 00:00:00 2001 From: Georgiana Dolocan Date: Thu, 14 Dec 2023 15:34:59 +0200 Subject: [PATCH 10/12] Add resource allocation script example --- docs/howto/event-prep.md | 79 +++++++++++++++++++++++++++++++++++++--- 1 file changed, 74 insertions(+), 5 deletions(-) diff --git a/docs/howto/event-prep.md b/docs/howto/event-prep.md index 69bd15ef67..dd40bedce6 100644 --- a/docs/howto/event-prep.md +++ b/docs/howto/event-prep.md @@ -84,6 +84,7 @@ Assuming this hub already has a profile list, before an event, you should check 3. **Minimize startup time** - have at least `3-4 people on a node` as few users per node cause longer startup times, but [no more than ~100]( https://kubernetes.io/docs/setup/best-practices/cluster-large/#:~:text=No%20more%20than%20110%20pods,more%20than%20300%2C000%20total%20containers) + - don't have more than 30% of the users waiting for a node to come up ````{admonition} Action to take @@ -92,26 +93,94 @@ Assuming this hub already has a profile list, before an event, you should check If the current number of users per node doesn't respect the rules above, you should adjust the instance type so that it does. Note that if you are changing the instance type, you should also consider re-writing the allocation options, especially if you are going with a smaller machine than the original one. - ```bash + ```{code-block} deployer generate resource-allocation choices ``` ```` 4. **Don't oversubscribe resources** + The oversubscription factor is how much larger a limit is than the actual request (aka, the minimum guaranteed amount of a resource that is reserved for a container). 
When this factor is greater, then a more efficient node packing can be achieved because usually most users don't use resources up to their limit, and more users can fit on a node. However, a bigger oversubscription factor also means that the users that use more resources than they are guaranteed can get their kernels killed or CPU throttled at some other times, based on what other users are doing. This inconsistent behavior is confusing to end users and the hub, so we should try and avoid this during events. - ````{admonition} Action to take + `````{admonition} Action to take :class: tip - For an event, you should consider an oversubscription factor of 1. For this you can use the deployer script by passing it the instance type where the pods will be scheduled on (in this example is `n2-highmem-4`), then from its output, pick the choice(s) that will be used during the event based on expected usage. + For an event, you should consider an oversubscription factor of 1. + + - if the instance type remains unchanged, then just adjust the limit to match the memory guarantee if not already the case + + - if the instance type also changes, then you can use the `deployer generate resource-allocation` command, passing it the new instance type and optionally the number of choices. + + You can then use its output to: + - either replace all allocation options with the ones for the new node type + - or pick the choice(s) that will be used during the event based on expected usage and just don't show the others - ```bash - deployer generate resource-allocation choices n2-highmem-4 + ````{admonition} Example + For example, if the community expects to only use ~3GB of memory during an event, and no other users are expected to use the hub for the duration of the event, then you can choose to only make available that one option. + + Assuming they had 4 options on a `n2-highmem-2` machine and we wish to move them on a `n2-highmem-4` for the event, we could run: + + ```{code-block} + deployer generate resource-allocation choices n2-highmem-4 --num-allocations 4 + ``` + + which will output: + + ```{code-block} + # pick this option to present the single ~3GB memory option for the event + mem_3_4: + display_name: 3.4 GB RAM, upto 3.485 CPUs + kubespawner_override: + mem_guarantee: 3662286336 + mem_limit: 3662286336 + cpu_guarantee: 0.435625 + cpu_limit: 3.485 + node_selector: + node.kubernetes.io/instance-type: n2-highmem-4 + default: true + mem_6_8: + display_name: 6.8 GB RAM, upto 3.485 CPUs + kubespawner_override: + mem_guarantee: 7324572672 + mem_limit: 7324572672 + cpu_guarantee: 0.87125 + cpu_limit: 3.485 + node_selector: + node.kubernetes.io/instance-type: n2-highmem-4 + (...2 more options) + ``` + And we would have this in the profileList configuration: + ```{code-block} + profileList: + - display_name: Workshop + description: Workshop environment + default: true + kubespawner_override: + image: python:6ee57a9 + profile_options: + requests: + display_name: Resource Allocation + choices: + mem_3_4: + display_name: 3.4 GB RAM, upto 3.485 CPUs + kubespawner_override: + mem_guarantee: 3662286336 + mem_limit: 3662286336 + cpu_guarantee: 0.435625 + cpu_limit: 3.485 + node_selector: + node.kubernetes.io/instance-type: n2-highmem-4 ``` ```` + ````{warning} + The `deployer generate resource-allocation`: + - cam only generate options where guarantees (requests) equal limits! + - supports the instance types located in `node-capacity-info.json` file + ```` + ````` #### 3.2. 
Setting a minimum node count on a specific node pool ```{warning} From 090336b9dacbdffd67dbf4a4f78e8147a9df8285 Mon Sep 17 00:00:00 2001 From: Georgiana Dolocan Date: Thu, 14 Dec 2023 17:09:57 +0200 Subject: [PATCH 11/12] Put exams and event prep into the same dir --- docs/howto/{ => prepare-for-events}/event-prep.md | 0 docs/howto/{ => prepare-for-events}/exam.md | 0 2 files changed, 0 insertions(+), 0 deletions(-) rename docs/howto/{ => prepare-for-events}/event-prep.md (100%) rename docs/howto/{ => prepare-for-events}/exam.md (100%) diff --git a/docs/howto/event-prep.md b/docs/howto/prepare-for-events/event-prep.md similarity index 100% rename from docs/howto/event-prep.md rename to docs/howto/prepare-for-events/event-prep.md diff --git a/docs/howto/exam.md b/docs/howto/prepare-for-events/exam.md similarity index 100% rename from docs/howto/exam.md rename to docs/howto/prepare-for-events/exam.md From ac9e3760fc1f72fe8c7ddeef98eb22e375927234 Mon Sep 17 00:00:00 2001 From: Georgiana Dolocan Date: Thu, 14 Dec 2023 17:40:35 +0200 Subject: [PATCH 12/12] Crete a separate section for events --- docs/howto/prepare-for-events/event-prep.md | 24 ++++++--------------- docs/howto/prepare-for-events/index.md | 15 +++++++++++++ docs/index.md | 3 +-- 3 files changed, 23 insertions(+), 19 deletions(-) create mode 100644 docs/howto/prepare-for-events/index.md diff --git a/docs/howto/prepare-for-events/event-prep.md b/docs/howto/prepare-for-events/event-prep.md index dd40bedce6..4cdb4e98e2 100644 --- a/docs/howto/prepare-for-events/event-prep.md +++ b/docs/howto/prepare-for-events/event-prep.md @@ -1,18 +1,8 @@ -# Decide if the infrastructure needs preparation before an Event - -A hub's specific setup is usually optimized based on the day to day usage expectations. But because events usually imply a different usage pattern, the infrastructure might need to be adjusted in order to accommodate the spikes in activity. - -```{important} -The communities we serve have the responsibility to notify us about an event they have planned on a 2i2c hub [at least three weeks before](https://docs.2i2c.org/community/events/#notify-the-2i2c-team-about-the-event) the event will start. This should allow us enough time to plan and prepare the infrastructure for the event properly if needed. -``` - -The events might vary in type, so the following list is not complete and does not cover all of them (yet). Most common event types are exams, workshops etc. - -## Event checklist +# Event infrastructure preparation checklist Below are listed the main aspects to consider adjusting on a hub to prepare it for an event: -### 1. Quotas +## 1. Quotas We must ensure that the quotas from the cloud provider are high-enough to handle expected usage. It might be that the number of users attending the event is very big, or their expected resource usage is big, or both. Either way, we need to check the the existing quotas will accommodate the new numbers. @@ -22,7 +12,7 @@ We must ensure that the quotas from the cloud provider are high-enough to handle - follow the [GCP quota guide](hub-deployment-guide:cloud-accounts:aws-quotas) for information about how to check the quotas in a GCP project ``` -### 2. Consider dedicated nodepools on shared clusters +## 2. 
Consider dedicated nodepools on shared clusters If the hub that's having an event is running on a shared cluster, then we might want to consider putting it on a dedicated nodepool as that will help with cost isolation, scaling up/down effectively, and avoid impacting other hub's users performance. @@ -31,7 +21,7 @@ If the hub that's having an event is running on a shared cluster, then we might Follow the guide at [](features:shared-cluster:dedicated-nodepool) in order to setup a dedicated nodepool before an event. ``` -### 3. Pre-warm the hub to reduce wait times +## 3. Pre-warm the hub to reduce wait times There are two mechanisms that we can use to pre-warm a hub before an event: @@ -63,7 +53,7 @@ There are two mechanisms that we can use to pre-warm a hub before an event: - the user **image is not huge**, otherwise pre-pulling it must be considered -#### 3.1. Node sharing via profile lists +### 3.1. Node sharing via profile lists ```{important} Currently, this is the recommended way to handle an event on a hub. However, for some communities that don't already use profile lists, setting up one just before an event might be confusing, we might want to consider setting up a minimum node count in this case. @@ -182,12 +172,12 @@ Assuming this hub already has a profile list, before an event, you should check ```` ````` -#### 3.2. Setting a minimum node count on a specific node pool +### 3.2. Setting a minimum node count on a specific node pool ```{warning} This section is a Work in Progress! ``` -#### 3.3. Pre-pulling the image +### 3.3. Pre-pulling the image ```{warning} This section is a Work in Progress! diff --git a/docs/howto/prepare-for-events/index.md b/docs/howto/prepare-for-events/index.md new file mode 100644 index 0000000000..78e243250c --- /dev/null +++ b/docs/howto/prepare-for-events/index.md @@ -0,0 +1,15 @@ +# Manage events on 2i2c hubs + +A hub's specific setup is usually optimized based on the day to day usage expectations. But because events usually imply a different usage pattern, the infrastructure might need to be adjusted in order to accommodate the spikes in activity. + +```{important} +The communities we serve have the responsibility to notify us about an event they have planned on a 2i2c hub [at least three weeks before](https://docs.2i2c.org/community/events/#notify-the-2i2c-team-about-the-event) the event will start. This should allow us enough time to plan and prepare the infrastructure for the event properly if needed. +``` + +The events might vary in type, so the following list is not complete and does not cover all of them (yet). Most common event types are exams, workshops etc. + +```{toctree} +:maxdepth: 2 +event-prep.md +exam.md +``` diff --git a/docs/index.md b/docs/index.md index 1f48f031b6..c946599a54 100644 --- a/docs/index.md +++ b/docs/index.md @@ -70,8 +70,7 @@ deployed occasionally as a specific addition. howto/features/index.md howto/bill.md howto/custom-jupyterhub-image.md -howto/exam.md -howto/event-prep.md +howto/prepare-for-events/index.md howto/manage-domains/index.md howto/grafana-github-auth.md howto/update-env.md
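
For the two pre-warming approaches still marked as a Work in Progress above (setting a minimum node count and pre-pulling the user image), a likely starting point is the JupyterHub Helm chart's placeholder and pre-puller settings. The sketch below is illustrative only: the option names are standard `jupyterhub` chart (z2jh) values, but whether and where they belong in a given hub's values files, and what replica count makes sense, are assumptions that need to be checked against the hub's actual config and the expected event size.

```yaml
# Illustrative sketch only -- these are standard z2jh (jupyterhub Helm chart)
# options; their exact placement in a hub's values files is an assumption.
jupyterhub:
  scheduling:
    # Placeholders only make room for real users if they can be preempted,
    # so pod priority is enabled together with them.
    podPriority:
      enabled: true
    userPlaceholder:
      enabled: true
      # Enough low-priority placeholder pods to keep roughly one spare node
      # warm; tune this to the number of users expected to arrive at once.
      replicas: 4
  prePuller:
    # Pull the user image onto existing nodes when the deploy happens...
    hook:
      enabled: true
    # ...and onto every new node the autoscaler brings up during the event.
    continuous:
      enabled: true
```

Placeholder pods only keep nodes warm if the scheduler can evict them the moment a real user server needs the space, which is why pod priority is enabled alongside them; the pre-puller settings address the image-size concern noted earlier without requiring instructors to start servers manually.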