A load shedding tool kit for Go.
- loadshed
Load shedding is the practice of reducing work in a system that is near overload in order to the protect the system from failure. This project presents a tool kit or reference implementation of load shedding for Go. I call this a tool kit because it contains common and re-usable components that can be configured or arranged in a system rather than a monolithic tool. Load shedding must always be tailored to a specific system and there are no safe defaults. This project aims to offer a structured approach to implementing your own load shedding.
Load shedding is usually determined by a collection of rules or policies rather than a single factor. The decision making can be separated in to two categories: deterministic and probabilistic. Deterministic rules may be static or dynamic in nature and always result in a true or false indicator for shedding load. Probabilistic policies use estimations or predictions along with an aspect of chance or randomness to select requests to shed. Both deterministic and probabilistic rules are often used together to create a functioning system.
This tool kit comes with structured support for deterministic rules, probabilistic policies, and request categorization that can be used to prioritize traffic.
The main interface of the project is a type called Shedder
that arranges load
shedding rules into a single, easier to use container:
import "github.com/kevinconway/loadshed/v2"
shedder := loadshed.NewShedder()
err := shedder.Do(ctx, myAction)
var rejectionInfo ErrRejection
if err != nil && errors.As(err, &rejectionInfo) {
fmt.Println(rejectionInfo)
}
The Shedder
can be configured with any number of deterministic rules and
probabilistic policies. If the Shedder
determines that it should shed load
then the action given to Do()
is not performed and an error is returned that
details the reasons for the rejection.
Shedder
can be used directly or see the stdlib/net/http
package for an
example of how it can be more seamlessly integrated into a system as middleware.
Probabilistic polices result in load shedding based on a target rate. For example, a system may be in a state where it needs to reject 50% of requests or 80% of requests to prevent failure. The most challenging aspect of probabilistic load shedding is determining the rejection rate which must be tailored to each specific system and operation.
This project takes an opinionated and structured approach to defining rejection rates. Rejection rates are calculated as a function of the system's likelihood of failure. The likelihood of failure is calculated as a function of a specific resource's capacity utilization.
Inherent in the concept of load shedding is the concept of capacity, or the finite availability of a resource that is consumed when doing work. A capacity in this tool kit is modeled as:
type Capacity interface {
Name(ctx context.Context) string
Usage(ctx context.Context) float32
}
All metrics used in probabilistic load shedding policies must be mapped to some upper limit so that their usage can be reported as a value between 0 and 1, representing the percent utilization. Some metrics have a natural capacity, such as CPU or memory, that can be used directly. Others such as max concurrency or queue depth involve an arbitrarily defined limit based on the system's constraints. For example, a queue depth capacity might look like:
type CapacityQueueDepth[T any] struct {
Queue []T
Limit int
}
func(self *CapacityQueueDepth) Name(ctx context.Context) string {
return "QUEUE DEPTH"
}
func(self *CapacityQueueDepth) Usage(ctx context.Context) float32 {
return float32(len(self.Queue)) / float32(self.Limit)
}
The limit value would be set based on either the design of the system or a value discovered through testing.
This project contains some pre-built capacity implementations for max concurrency, error rate, landing rate, and latency or execution time.
Once a metric is reported as a percent utilization then the next step is to calculate the likelihood that the system or next request will fail due to resource exhaustion. Failure probabilities are modeled as:
type FailureProbability interface {
Capacity
Likelihood(ctx context.Context) float32
}
The likelihood calculation is highly dependent on the specific details of a system and should be determined through extensive testing. For example, if a test demonstrates that a system has an increasing risk of catastrophic failure once the CPU usage exceeds 80% and that risk increases with each point of utilization beyond 80% then a potential calculation could be:
type CPUFailureProbability struct {
*CPUCapacity
}
func (self *CPUFailureProbability) Likelihood(ctx context.Context) float32 {
usage := self.CPUCapacity.Usage(ctx)
if usage < .8 {
// No chance of failure due to CPU when under 80% utilization
return 0.0
}
return (usage - .8) / .2 // Linear interpolation between .8 and 1.0
}
The example calculates a zero percent chance of failure for any usage value below 80% and then a linearly increasing risk proportional to the proximity of the usage to 100% when over 80%. Note that this type of calculation does not fit for all systems. For systems where a linear progression is desirable then this formula can be applied to capacities using:
loadshed.NewFailureProbabilityCurveLinear(cap Capacity, lowerThreshold float32, upperThreshold float32, exponent float32)
For any calculation, the goal is to convert a percent utilization into a percent chance of failure.
The final piece of probabilistic policies is to calculate a rejection rate based on the likelihood of failure. Rejection rates are modeled as:
type RejectionRate interface {
FailureProbability
Rate(ctx context.Context) float32
}
A rate calculation requires detailed knowledge of the system and there is no
safe default formula. In systems where resources are consumed equally by all
requests then it may be possible to use the failure probability, directly, as
the rejection rate. This use case is covered by the NewRejectionRateCurveIdentity
method that wraps any FailureProbability
in a RejectionRate
that simply
returns the Likelihood()
value.
Deterministic rules shed traffic based on binary decision making. This decision making is not necessarily simple or static and may include any amount of complexity, reference to external state, or even dynamically adjusted values within a system. The primary difference between deterministic rules and probabilistic policies is that deterministic rules come to a true/false conclusion rather than a rate or chance value. Deterministic rules are modeled as:
type Rule interface {
Name(ctx context.Context) string
Reject(ctx context.Context) bool
}
The Shedder
applies all deterministic rules before applying any rejection
rates. There are no pre-built or included rules in the project. Examples of
deterministic policies to integrate include rate limiting, advanced queue
management, and quota enforcement.
A common practice for load shedding is consider the priority, or classification,
of a request and apply different rules based on that priority. The Shedder
can be given an optional classifier in the form of:
type Classification string
type Classifier interface {
Classify(ctx context.Context) Classification
}
A classification is a string identifier of the class or priority of the request. This project does not include a standard set of classifications so that they can be tailored to different use cases.
Note that the classifier is only given the request context and not a specific
request type, such as *http.Request
, so that the tool kit is more broadly
applicable to different kinds of systems. As a consequence of this choice, any
identifying information of a request that is required to determine the
classification must be set in the context. For example, here is how priority
might be set based on a target HTTP URL:
func URLExtractingMiddleware(h http.Handler) http.Handler {
return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
ctx := r.Context()
ctx = context.WithValue(ctxKey, r.URL.Path)
h.ServeHTTP(w, r.WithContext(ctx))
})
}
const (
PriorityHigh loadshed.Classification = "HIGH"
PriorityLow loadshed.Classification = "LOW"
)
type URLClassifier struct {}
func (*URLClassifier) Classify(ctx context.Context) Classification {
path := ctx.Value(ctxKey)
if path == nil {
return PriorityLow
}
if strings.HasPrefix(path.(string), "/v2/important/endpoint") {
return PriorityHigh
}
return PriorityLow
}
shedder := loadshed.NewShedder(
loadshed.OptionShedderClassifier(&URLClassifier{}),
)
var FinalHandler = URLExtractingMiddleware(
loadshedhttp.NewMiddleware(shedder)( // from the stdlib/net/http sub-package
originalHandler,
),
)
The primary purpose of classifying requests is to create a priority hierarchy of
requests that are shed at different rates. For example, a system may have LOW
,
NORMAL
, HIGH
, and CRITICAL
classifications. As the system approaches a
likelihood of failure then it will start by rejecting only LOW
priority
requests. If the failure probability continues to increase then it will begin
to reject NORMAL
requests, and so on.
To support this use case the project includes a
NewRejectionRateCurveByClassification
constructor that converts a failure
probability into a rejection rate using a unique converting function for each
classification.
As both an example integration and a helper for a common case, this project
includes pre-build Shedder
integrations for the standard library HTTP handler
and client interfaces.
The server integration operates as a middleware for http.Handler
types and
should work with any HTTP mux or framework that targets the standard library
types.
import (
"github.com/kevinconway/loadshed/v2"
loadshedhttp "github.com/kevinconway/loadshed/v2/stdlib/net/http"
)
shedder := loadshed.NewShedder()
handler = loadshedhttp.NewMiddleware(shedder)(
handler,
)
The default behavior is to respond with a 503 Service Unavailable
response
code when a request is rejected due to load shedding. This can be modified using
HandlerOptionCallback
to set a callback that is executed when a request is
rejected. The callback takes the form of an http.Handler
and is allowed to
perform any logic and respond with any code. Callbacks can also use
FromHandlerContext
to get the rejection details to, for example, log the
reason why the request was rejected.
If one of the capacities used for load shedding is an error rate then
HandlerOptionErrCodes
must be used when constructing the middleware. This
option defines which HTTP status codes are considered errors from the
perspective of an error rate. For example, a set of error codes might include
5xx
but exclude 4xx
.
The client integration works very similarly to the server integration except
that it targets the http.RoundTripper
interface used by the http.Client
.
import (
"github.com/kevinconway/loadshed/v2"
loadshedhttp "github.com/kevinconway/loadshed/v2/stdlib/net/http"
)
shedder := loadshed.NewShedder()
transport := http.DefaultTransport.Clone()
transport = loadshedhttp.NewTransportMiddleware(shedder)(transport)
client := &http.Client{
Transport: transport,
}
When a request is rejected due to load shedding then the client calls will
return an instance of loadshed.ErrReject
which contains the details behind
why it was rejected. Similar to the server, a custom callback can be installed
to handle the rejection error using TransportOptionCallback
.
If one of the capacities used for load shedding is an error rate then
TransportOptionErrorCodes
must be used when constructing the middleware. This
option defines which HTTP status codes are considered errors from the
perspective of an error rate.
go get github.com/kevinconway/loadshed/v2
This project has no hard dependencies on any build tools other than Go. You
should be able to run go test
for any changes and see the results.
If you prefer, the project includes a Makefile with the following rules:
update
- Update dependencies in the go.mod file.bin
- Download all optional build and test tools.fmt
- Rungoimports
on all Go source files.test
- Run all tests and create a test coverage record.coverage
- Generate a series of coverage reports from test records.
For bugs or performance improvements, I welcome pull requests, issues, or
comments. If you make a pull request then please be sure to add tests and run
make fmt
.
For new features, please start a discussion first by creating an issue and explaining the intended change.
This project includes an integration for the standard library HTTP tools. If
there are other, obvious standard library tools that would benefit from load
shedding then I'd accept an addition to the stdlib
sub-package. However, I
won't maintain integrations with 3rd party tools in this repository.
I was part of the original team that built this library for a previous employer. The project was originally published as bitbucket.org/stride/loadshed and was later transferred to github.com/asecurityteam/loadshed. Since then, the company and team priorities have changed and github.com/asecurityteam/loadshed has been archived.
I have new use cases for this library so I'm maintaining this fork.
For convenience, I have created a v1.2.0
tag that matches the last published
release of github.com/asecurityteam/loadshed. The only difference is that I have
updated the module path to github.com/kevinconway/loadshed
and replaced the
github.com/asecurityteam/rolling
dependency with
github.com/kevinconway/rolling
which I have also forked for the same reasons
as stated above.. You should be able to replace
github.com/asecurityteam/loadshed
with github.com/kevinconway/loadshed
in
either your source code or your go.mod using a replace
directive to pull from
here instead.
There is no particular advantage to doing this, today, unless it's part of a gradual migration to v2 (the version documented in this file). I do not plan on porting any bug fixes or performance improvements to v1.
This project is licensed under the Apache 2.0 license. See LICENSE.txt or http://www.apache.org/licenses/LICENSE-2.0 for the full terms.
This project is forked from https://github.com/asecurityteam/loadshed. Though, the majority of the current content is actually a modified form of a public, but unmerged, branch of the project from https://bitbucket.org/atlassian/loadshed/src/dev-2.0/. The original project's copyright attribution and license terms are:
Copyright @ 2017 Atlassian Pty Ltd
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
All files are marked with SPDX tags that both attribute the original copyright as well as identify the author(s) of any significant changes to those files.