Add traffic_rule label to router metrics #280

Merged
merged 8 commits into from
Nov 8, 2022

Conversation

krithika369
Collaborator

@krithika369 krithika369 commented Nov 3, 2022

This PR introduces a new label traffic_rule to the following Prometheus metrics produced by the Turing router:

  • RouteRequestDurationMs
  • TuringComponentRequestDurationMs, where the component is route

E.g.:

mlp_turing_route_request_duration_ms_bucket{route="control",status="failure",traffic_rule="traffic-split-default",le="2"} 0
mlp_turing_route_request_duration_ms_count{route="control",status="success",traffic_rule="traffic-split-b"} 1

mlp_turing_turing_comp_request_duration_ms_bucket{component="route",status="success",traffic_rule="traffic-split-default",le="2"} 0
mlp_turing_route_request_duration_ms_count{route="route-b",status="success",traffic_rule="traffic-split-b"} 1

Changes

  • engines/router/missionctl/fiberapi/traffic_splitting_strategy.go - Set the traffic_rule label in SelectRoute to hold the matched traffic rule. This function is also refactored to pick only the first matched traffic rule because, realistically, we may not have enough time to execute multiple matches sequentially. This also simplifies the metrics logging for the matched rule (otherwise, when multiple rules match, we would need more intricate wiring within Fiber to know which rule was ultimately executed).
  • engines/router/missionctl/fiberapi/interceptors.go - Associate the traffic_rule label with the RouteRequestDurationMs metric.
  • engines/router/missionctl/mission_control.go, engines/router/missionctl/mission_control_upi.go - Associate the traffic_rule label with the TuringComponentRequestDurationMs metric for the route step.
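
A minimal sketch of the first-match behaviour described above. The `rule` type and the `selectRoute` function are hypothetical stand-ins for illustration, not the actual Fiber or Turing APIs. Rules are tested in definition order, the first match wins, and the name of the chosen rule becomes the value of the traffic_rule label. When nothing matches, the sketch falls back to the default route and uses its ID as the label value.

```go
package main

import (
	"errors"
	"fmt"
)

// rule is a hypothetical stand-in for a traffic-splitting rule: a name,
// the route it points to, and a predicate over the request.
type rule struct {
	name    string
	routeID string
	matches func(req map[string]string) bool
}

// selectRoute returns the selected route ID and the labels to attach to the
// route metrics. Rules are evaluated in order and only the first match is used.
func selectRoute(rules []rule, defaultRouteID string, req map[string]string) (string, map[string]string, error) {
	for _, r := range rules {
		if r.matches(req) {
			return r.routeID, map[string]string{"traffic_rule": r.name}, nil
		}
	}
	// No rule matched: fall back to the default route, if one is configured.
	if defaultRouteID != "" {
		return defaultRouteID, map[string]string{"traffic_rule": defaultRouteID}, nil
	}
	return "", nil, errors.New("no rule matched and no default route is configured")
}

func main() {
	// Both rules match a request with country=SG; only the first is applied.
	rules := []rule{
		{name: "traffic-split-a", routeID: "route-a", matches: func(req map[string]string) bool { return req["country"] == "SG" }},
		{name: "traffic-split-b", routeID: "route-b", matches: func(req map[string]string) bool { return req["country"] != "" }},
	}
	route, labels, err := selectRoute(rules, "control", map[string]string{"country": "SG"})
	fmt.Println(route, labels["traffic_rule"], err)
}
```

Returning the label alongside the route keeps the metric wiring simple: whichever component records the request duration only needs the single label value produced at selection time.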

Known issues

The current implementation in this PR (and the relevant Fiber PR) has a known limitation: Issue #281

This will be addressed separately, with other improvements to timeout scenarios.

TODO

@krithika369 krithika369 marked this pull request as draft November 3, 2022 14:46
@krithika369 krithika369 self-assigned this Nov 4, 2022
@krithika369 krithika369 changed the title Add traffic-rule label to router metrics Add traffic_rule label to router metrics Nov 4, 2022
@krithika369 krithika369 marked this pull request as ready for review November 4, 2022 02:31
@krithika369 krithika369 requested a review from a team November 4, 2022 02:31

// select primary route and fallbacks, based on the results of testing
// given request against traffic-splitting rules
for i := 0; i < len(results); i++ {
Contributor

I might have some misconception here,

I would have thought that, because of the orthogonality checks on the rules, there should always be at most one rule that can match a given request, and that this would be the reason for the removal (and probably previously too), instead of

we may not have enough time to (sequentially) execute multiple matches

I didn't catch the timing part, since to me there should only be one rule matched for any request.

Collaborator Author

Ahh you're right about your assumption but it's missing one detail - the idea of overlapping rules at different priorities (kind of similar to XP experiments). This is the relevant validation.

So, if there were 2 rules:

  • {"a": [1]}
  • {"a": [1], "b": [2]}

A request that matches rule 2 will also match rule 1. In this case, we simply follow the order in which the rules are defined to pick the applicable rule (in this example, rule 2 will never be applied because it comes after the broader rule).
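
The overlap can be sketched as follows, assuming a hypothetical `condition` type that maps a request field to its allowed values (this is not the actual Turing rule schema). A request with a=1 and b=2 satisfies both rules, so evaluating them in definition order always picks the broader first rule.

```go
package main

import "fmt"

// condition maps a request field to the values it may take (hypothetical shape).
type condition map[string][]int

// matches reports whether every field named in the condition is present in
// the request with one of the allowed values.
func (c condition) matches(req map[string]int) bool {
	for field, allowed := range c {
		v, ok := req[field]
		if !ok {
			return false
		}
		found := false
		for _, a := range allowed {
			if a == v {
				found = true
				break
			}
		}
		if !found {
			return false
		}
	}
	return true
}

// firstMatch returns the index of the first rule that matches, or -1 if none does.
func firstMatch(rules []condition, req map[string]int) int {
	for i, c := range rules {
		if c.matches(req) {
			return i
		}
	}
	return -1
}

func main() {
	rules := []condition{
		{"a": {1}},           // broader rule, defined first
		{"a": {1}, "b": {2}}, // narrower rule, shadowed by the first
	}
	req := map[string]int{"a": 1, "b": 2}
	fmt.Println(rules[0].matches(req), rules[1].matches(req), firstMatch(rules, req))
}
```

Defining the narrower rule before the broader one would let both be reachable under the same first-match evaluation.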

@leonlnj
Contributor

leonlnj commented Nov 7, 2022

Left a comment, the rest LGTM!

@krithika369
Collaborator Author

Thanks for the review, @leonlnj. @caraml-dev/turing-dev, if there are no other comments, could one of you hit the "approve" button to satisfy the CI ? :)

@deadlycoconuts
Contributor

LGTM! Thanks for this! 🚀

Collaborator

@terryyylim terryyylim left a comment

Left a clarifying question, the rest LGTM, thanks @krithika369 !

// Given request hasn't satisfied any of the rules configured on this routing strategy;
// check if default route exists.
if defaultRoute, exist := routes[s.DefaultRouteID]; exist {
	return defaultRoute, []fiber.Component{}, labels.WithLabel(TrafficRuleLabel, defaultRoute.ID()), nil
Collaborator

Originally, when we returned orderedRoutes[1:], that value was never used, right?

Collaborator Author

Yes that's correct.

@krithika369
Collaborator Author

Thanks for the reviews, all! Merging.

@krithika369 krithika369 merged commit 5341a40 into caraml-dev:main Nov 8, 2022