Refactor SLO calculations #91
Conversation
* Simplify the main logic by using absolute replica counts instead of relative.
* Rename `get_new_worker_count` to `get_target_replica_count`.
* Move `max_replicas` and `min_replicas` a level higher in the config. The limits are generic enough to be handled by the main logic.
* Do not call the Kubernetes autoscaling API if no change is needed.

Fixes #38
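To make the description concrete, here is a minimal sketch of the flow this PR describes, pieced together from the diff hunks quoted below; `autoscale_once` and `current_replicas` are made-up names for illustration, not part of the actual change, and `adjust_scale` is the project's existing helper.

```python
import math

def get_target_replica_count(pending, running, args):
    # How many tasks one replica can finish within the SLO window.
    tasks_per_replica = math.floor(args["slo_seconds"] / args["avg_task_duration"])
    outstanding = pending + running
    # Absolute replica count needed to clear the outstanding work in time.
    needed_replicas = math.ceil(outstanding / tasks_per_replica)
    # Do not scale down while tasks are pending (discussed further below).
    return max(needed_replicas, running)

def autoscale_once(api, cfg, pending, running, current_replicas):
    target_replicas = get_target_replica_count(pending, running, cfg["autoscale"]["args"])
    # The min/max limits now live one level higher in the config.
    target_replicas = max(min(target_replicas, cfg["autoscale"]["max_replicas"]),
                          cfg["autoscale"]["min_replicas"])
    if target_replicas == current_replicas:
        return  # no change needed, skip the autoscaling API request
    adjust_scale(api, target_replicas, cfg["deployment_namespace"], cfg["deployment_name"])
```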
This looks good, but should probably have a pair of eyes on it that didn't have a hand in writing it.
While trying to work through the logic here, I wondered a little bit whether there was existing literature on this that we could crib from, rather than deriving things from first principles (which is what I've been doing).
capacity = cfg["autoscale"]["args"]["max_replicas"] - running | ||
log_env["capacity"] = capacity | ||
max_replicas = cfg["autoscale"]["max_replicas"] | ||
min_replicas = cfg["autoscale"]["min_replicas"] |
This appears to duplicate the value above (both as a local and in the log).
Bah, good catch. Will fix it.
adjust_scale(api, target_replicas, cfg["deployment_namespace"], cfg["deployment_name"])
logger.info("Calculating target replica count", extra=log_env)
target_replicas = get_target_replica_count(pending, running, cfg["autoscale"]["args"])
target_replicas = max(min(target_replicas, max_replicas), min_replicas)
I wonder if this calculation deserves a comment or a function of its own.
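One possible shape for that, just as a sketch (the helper name `clamp_replicas` is made up):

```python
def clamp_replicas(target, min_replicas, max_replicas):
    """Keep the suggested replica count within the configured limits."""
    return max(min(target, max_replicas), min_replicas)

# at the call site:
target_replicas = clamp_replicas(target_replicas, min_replicas, max_replicas)
```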
# = running + ((pending - (running*(tasks_per_replica-1))) / tasks_per_replica)
# = running + ((pending + running - running*tasks_per_replica) / tasks_per_replica)
# = running + ((pending + running) / tasks_per_replica - running)
# = (pending + running) / tasks_per_replica
I think this algebra is useful for showing that the new and old calculations are equivalent; it is not clear to me how useful it is on its own (particularly given that it refers to variable names that only exist in the old logic).
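For what it's worth, a quick numeric check (with made-up inputs) that the old and new expressions agree before rounding:

```python
import math

pending, running, tasks_per_replica = 10, 20, 5

old = running + (pending - running * (tasks_per_replica - 1)) / tasks_per_replica
new = (pending + running) / tasks_per_replica

assert math.isclose(old, new)  # both evaluate to 6.0
```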
# How many tasks a replica can process within our tolerance period
new_tasks_per_replica = math.floor(args["slo_seconds"] / args["avg_task_duration"])
tasks_per_replica = math.floor(args["slo_seconds"] / args["avg_task_duration"])
# how many tasks can be covered by the running replicas, assuming they are
# busy and can only take new tasks after they are done with the current one
I think this comment needs updating.
needed_replicas = math.ceil(outstanding / tasks_per_replica)
# Do not scale down in case we have pending, because some workers will
# receive SIGUSR1 and won't take any tasks after that.
needed_replicas = max(needed_replicas, running)
Gah. I lost a long comment I had on this line, suggesting we get rid of the clamping here, as `needed_replicas` is already an overestimate when `needed_replicas < running` (assuming the workers complete the tasks that they are running before shutting down), so clamping it makes our estimate even worse.
It makes a lot of sense to scale down when you have the capacity, but I'm worried about the situation when the polling interval is shorter than the task duration.
For example, with an average task duration of 60 and an SLO of 300, we have the following situation, assuming that some of the running tasks are very close to their end and those workers get shut down, the rest are not (so no new tasks get picked up), and the polling happens very often.
poll 1: pending: 10, running: 20 -> suggest to scale down to 6
poll 2: pending: 10, running: 6 -> suggest to scale down to 4
poll 3: pending: 10, running: 4 -> suggest to scale down to 3
poll 4: pending: 10, running: 3 -> noop
Also, with this simulation it makes sense to go to 3 in the first run, hmmm. Maybe I should rethink the approach.
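For reference, a throwaway script that reproduces the polls above, assuming pending stays flat and running drops straight to the previous suggestion between polls:

```python
import math

def suggest(pending, running, tasks_per_replica):
    # Target without the max(needed_replicas, running) clamp.
    return math.ceil((pending + running) / tasks_per_replica)

pending, running = 10, 20
tasks_per_replica = math.floor(300 / 60)  # SLO 300s, avg task duration 60s -> 5

for poll in range(1, 5):
    target = suggest(pending, running, tasks_per_replica)
    action = "noop" if target >= running else f"suggest to scale down to {target}"
    print(f"poll {poll}: pending: {pending}, running: {running} -> {action}")
    running = min(running, target)
```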
So, the value I arrived at for hitting our SLO with the least slack in the case we are discussing was `pending / (tasks_per_replica - 1)`, which gives the value 3 you have. That said, I don't think scaling down slightly slower than is optimal is a problem, as we end up finishing the work faster.
That last argument could also be used to argue for keeping the current clamping you have. If we did that, we should adjust the comment here to say that we don't scale down in order to rush to finish, rather than what is there currently.
However, in practice, given that `tasks_per_replica` (at least in prod) is always 2, any of the values will lead to finishing in about the same time (since `needed_replicas >= pending` for any of the calculations). That suggests we should use the simplest logic, which I think is dropping any special handling of the `needed_replicas < running` case (either the clamping, or using an alternative formula).
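To illustrate the comparison (with `tasks_per_replica = 2` and made-up pending/running counts), here is how the three options line up:

```python
import math

pending, running = 10, 20
tasks_per_replica = 2  # what we effectively always have in prod

plain       = math.ceil((pending + running) / tasks_per_replica)  # 15
clamped     = max(plain, running)                                 # 20
alternative = math.ceil(pending / (tasks_per_replica - 1))        # 10

# All three values are >= pending, which is the point above about
# finishing the outstanding work in about the same time.
print(plain, clamped, alternative)
```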