Refactor SLO calculations #91

Closed
wants to merge 1 commit
Conversation

Contributor
@rail rail commented Apr 22, 2020

* Simplify the main logic by using absolute replica counts instead of
  relative.
* Rename `get_new_worker_count` to `get_target_replica_count`.
* Move `max_replicas` and `min_replicas` a level higher in the config.
  The limits are generic enough to be handled by the main logic.
* Do not call the Kubernetes autoscaling API if no change is needed.
  Fixes #38 (Noop API request should be skipped); see the sketch below.
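
A minimal sketch of how these pieces might fit together, based on the identifiers visible in the diff below. The exact wiring is assumed here, not spelled out in the PR: `adjust_scale` is passed in as a callable to keep the sketch self-contained, and the noop check compares against `running` as a stand-in for the deployment's current replica count.

import math


def get_target_replica_count(pending, running, args):
    # How many tasks one replica can finish within the SLO window.
    tasks_per_replica = math.floor(args["slo_seconds"] / args["avg_task_duration"])
    outstanding = pending + running
    return math.ceil(outstanding / tasks_per_replica)


def autoscale(api, cfg, pending, running, adjust_scale):
    # The limits now live a level higher in the config and are applied here,
    # in the main logic, rather than inside the calculation itself.
    max_replicas = cfg["autoscale"]["max_replicas"]
    min_replicas = cfg["autoscale"]["min_replicas"]

    target_replicas = get_target_replica_count(pending, running, cfg["autoscale"]["args"])
    target_replicas = max(min(target_replicas, max_replicas), min_replicas)

    # Skip the Kubernetes autoscaling API call when nothing would change (#38).
    # Comparing against `running` is an assumption of this sketch.
    if target_replicas == running:
        return target_replicas

    adjust_scale(api, target_replicas, cfg["deployment_namespace"], cfg["deployment_name"])
    return target_replicas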
@rail rail self-assigned this Apr 22, 2020
@rail rail requested a review from tomprince April 22, 2020 04:16
Contributor
@tomprince tomprince left a comment

This looks good, but should probably have a pair of eyes on it from someone who didn't have a hand in writing it.


While trying to work through the logic here, I wondered a little bit if there was existing literature on this that we could crib from, rather than deriving things from first principles (which is what I've been doing).

capacity = cfg["autoscale"]["args"]["max_replicas"] - running
log_env["capacity"] = capacity
max_replicas = cfg["autoscale"]["max_replicas"]
min_replicas = cfg["autoscale"]["min_replicas"]
Contributor
This appears to duplicate the value above (both as a local and in the log).

Contributor Author
Bah, good catch. Will fix it.

adjust_scale(api, target_replicas, cfg["deployment_namespace"], cfg["deployment_name"])
logger.info("Calculating target replica count", extra=log_env)
target_replicas = get_target_replica_count(pending, running, cfg["autoscale"]["args"])
target_replicas = max(min(target_replicas, max_replicas), min_replicas)
Contributor
I wonder if this calculation deserves a comment or a function.

# = running + ((pending - (running*(tasks_per_replica-1))) / tasks_per_replica)
# = running + ((pending + running - running*tasks_per_replica) / tasks_per_replica)
# = running + ((pending + running) / tasks_per_replica - running)
# = (pending + running) / tasks_per_replica
Contributor
I think this algebra is useful for showing that the new and old calculations are equivalent; it is not clear to me how useful it is on its own (particularly given that it refers to variable names that only exist in the old logic).
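
A quick numeric spot check of that equivalence, with made-up numbers and with the floor/ceil rounding of the real code left out, as in the algebra above:

pending, running, tasks_per_replica = 10, 20, 5  # hypothetical values

# Old shape: running replicas plus whatever extra is needed for the work that
# the running replicas cannot absorb within the SLO window.
old = running + (pending - running * (tasks_per_replica - 1)) / tasks_per_replica

# New shape: all outstanding work divided by per-replica throughput.
new = (pending + running) / tasks_per_replica

assert old == new == 6.0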

# How many tasks a replica can process within our tolerance period
new_tasks_per_replica = math.floor(args["slo_seconds"] / args["avg_task_duration"])
tasks_per_replica = math.floor(args["slo_seconds"] / args["avg_task_duration"])
# how many tasks can be covered by the running replicas, assuming they are
# busy and can only take new tasks after they are done with the current one
Contributor
I think this comment needs updating.

needed_replicas = math.ceil(outstanding / tasks_per_replica)
# Do not scale down in case we have pending, because some workers will
# receive SIGUSR1 and won't take any tasks after that.
needed_replicas = max(needed_replicas, running)
Contributor
Gah. I lost a long comment I had on this line, suggesting we get rid of the clamping here, as needed_replicas is already an overestimate when needed_replicas < running (assuming the workers complete the tasks that they are running before shutting down), so clamping it makes our estimate even worse.

Contributor Author
It makes a lot of sense to scale down when you have the capacity, but I'm worried about the situation when the polling interval is shorter than the task duration.

For example, with avg 60 and SLO 300, we have the following situation. Assume that some of the running tasks are very close to their end and those workers get shut down, some are not, so no new tasks get picked up, and the polling happens very often.

poll 1: pending: 10, running: 20 -> suggest to scale down to 6
poll 2: pending: 10, running: 6 -> suggest to scale down to 4
poll 3: pending: 10, running: 4 -> suggest to scale down to 3
poll 4: pending: 10, running: 3 -> noop

Also, with this simulation it makes sense to go to 3 in the first run, hmmm. Maybe I should rethink the approach.
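
A small script reproducing that simulation, assuming tasks_per_replica = floor(300 / 60) = 5, pending stuck at 10, the suggestion taken as the unclamped ceil((pending + running) / tasks_per_replica), and each suggested scale-down fully taking effect before the next poll:

import math

tasks_per_replica = math.floor(300 / 60)  # SLO 300, avg task duration 60

pending, running = 10, 20
for poll in range(1, 5):
    suggested = math.ceil((pending + running) / tasks_per_replica)
    action = "noop" if suggested == running else f"suggest to scale down to {suggested}"
    print(f"poll {poll}: pending: {pending}, running: {running} -> {action}")
    running = suggested  # the scale-down is assumed to complete before the next poll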

Contributor
So, the value I arrived at for hitting our SLO with the least slack in the case we are discussing was pending / (tasks_per_replica - 1), which gives the value 3 you have. That said, I don't think scaling down slightly slower than is optimal is a problem, as we end up finishing the work faster.

That last argument could also be used to argue for keeping the current clamping you have. If we did that, we should adjust the comment here to say that we don't scale down in order to rush to finish, rather than what is there currently.

However, in practice, given that tasks_per_replica (at least in prod) is always 2, any of the values will lead to finishing in about the same time (since needed_replicas >= pending for any of the calculations). That suggests we should use the simplest logic, which I think is dropping any special handling of the needed_replicas < running case (either the clamping, or using an alternative formula).
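
For illustration, the three candidates side by side with the prod-like tasks_per_replica = 2 and some hypothetical counts; all of them stay at or above pending, which is the point about finishing in roughly the same time:

import math

tasks_per_replica = 2          # the value used in prod, per the comment above
pending, running = 10, 20      # hypothetical scale-down situation

unclamped = math.ceil((pending + running) / tasks_per_replica)   # 15
clamped = max(unclamped, running)                                # 20 (current behaviour)
alternative = math.ceil(pending / (tasks_per_replica - 1))       # 10

assert min(unclamped, clamped, alternative) >= pending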

@rail rail closed this Aug 19, 2020