---
layout: post
title: Optimizing Tail Latency in a Heterogeneous Environment with Istio, Envoy, and a Custom Kubernetes Operator
---
- Abstract
- Introduction
- Identifying the Challenge
- Understanding the Problem
- Developing the Solution
- Results and Impact
- Conclusion
- Research and Community Engagement
- Future Work
- Acknowledgments
- Appendix: Implementation Details
This article details our approach to optimizing tail latency in Kubernetes using Istio, Envoy, and a custom Kubernetes operator. We identified performance disparities caused by hardware variations, which drove us to develop a solution that dynamically adjusts load-balancing weights based on real-time CPU metrics. We achieved significant reductions in tail latency, and our findings demonstrate the effectiveness of adaptive load-balancing strategies in improving microservices performance and reliability.
Running microservices in Kubernetes often involves dealing with various hardware generations and CPU architectures. In our infrastructure, we observed high tail latency in some of our services despite using Istio and Envoy as our service mesh. This article details our journey in identifying the root cause of this issue and implementing a custom solution using a Kubernetes operator to optimize tail latency. Tail latency optimization is crucial as it directly impacts user experience and system reliability.
We run multiple hardware generations and different CPU architectures within our Kubernetes clusters. Our service mesh, composed of Istio for the control plane and Envoy for the data plane, uses the LEAST_REQUEST load-balancing algorithm to distribute traffic across service endpoints. However, we noticed that certain services experienced significantly high tail latency. Upon investigation, we discovered that disparities in hardware capabilities were the main cause of this issue.
Tail latency refers to the latency experienced by the slowest requests, typically measured at the 95th, 99th, or 99.9th percentile. High tail latency can negatively impact user experience and indicate underlying performance bottlenecks. In our case, tail latency matters because it represents the worst-case scenario for our service response times.
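To make the percentile definitions concrete, here is a small illustrative sketch (not part of our operator) that computes p50/p95/p99 from a sample of request durations using the nearest-rank method; the sample values are made up.

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"time"
)

// percentile returns the nearest-rank percentile (p in (0,1]) of an
// already sorted slice of request durations.
func percentile(sorted []time.Duration, p float64) time.Duration {
	if len(sorted) == 0 {
		return 0
	}
	rank := int(math.Ceil(p*float64(len(sorted)))) - 1
	if rank < 0 {
		rank = 0
	}
	return sorted[rank]
}

func main() {
	// Made-up sample: most requests are fast, a few sit in the tail.
	latencies := []time.Duration{
		12 * time.Millisecond, 14 * time.Millisecond, 15 * time.Millisecond,
		16 * time.Millisecond, 18 * time.Millisecond, 21 * time.Millisecond,
		25 * time.Millisecond, 40 * time.Millisecond, 95 * time.Millisecond,
		230 * time.Millisecond,
	}
	sort.Slice(latencies, func(i, j int) bool { return latencies[i] < latencies[j] })

	for _, p := range []float64{0.50, 0.95, 0.99} {
		fmt.Printf("p%g = %v\n", p*100, percentile(latencies, p))
	}
}
```

Even though only one request in this sample is slow, the p99 value is dominated by it, which is exactly why the tail is where hardware disparities show up first.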
The default load balancing strategy in Envoy works well in homogeneous environments but struggles when hardware performance is uneven, leading to inefficient request distribution and high tail latency.
To address this problem, we developed a custom Kubernetes operator. The operator dynamically adjusts the load-balancing weights of the Envoy proxies through Istio's ServiceEntry custom resource. Here's how we implemented our solution:
We deployed a dedicated VictoriaMetrics TSDB to collect real-time CPU usage statistics for each pod. Our operator queries the VictoriaMetrics API to gather this data and calculates the average CPU usage for each service by aggregating the individual pod metrics.
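As a rough sketch of this step, the snippet below queries a Prometheus-compatible /api/v1/query endpoint (which VictoriaMetrics exposes) for per-pod CPU usage. The base URL, port, metric name, and label selector are illustrative assumptions, simplified compared to the recording rules shown later in this article; the actual operator's query logic differs.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
	"strconv"
)

// promResponse mirrors the parts of the Prometheus-compatible
// /api/v1/query response that we need.
type promResponse struct {
	Data struct {
		Result []struct {
			Metric map[string]string `json:"metric"`
			Value  [2]interface{}    `json:"value"` // [timestamp, "value as string"]
		} `json:"result"`
	} `json:"data"`
}

// podCPUUsage returns the current CPU usage (in cores) per pod for the given
// container name. Metric and labels here are simplified assumptions.
func podCPUUsage(baseURL, container string) (map[string]float64, error) {
	query := fmt.Sprintf(
		`sum(rate(container_cpu_usage_seconds_total{container=%q}[2m])) by (pod)`, container)

	resp, err := http.Get(baseURL + "/api/v1/query?query=" + url.QueryEscape(query))
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	var pr promResponse
	if err := json.NewDecoder(resp.Body).Decode(&pr); err != nil {
		return nil, err
	}

	usage := make(map[string]float64)
	for _, r := range pr.Data.Result {
		s, ok := r.Value[1].(string)
		if !ok {
			continue
		}
		if v, err := strconv.ParseFloat(s, 64); err == nil {
			usage[r.Metric["pod"]] = v
		}
	}
	return usage, nil
}

func main() {
	// Hypothetical in-cluster address of the VictoriaMetrics service.
	usage, err := podCPUUsage("http://victoria-metrics:8428", "sleep-lior-2")
	if err != nil {
		panic(err)
	}
	for pod, cores := range usage {
		fmt.Printf("%s: %.2f cores\n", pod, cores)
	}
}
```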
Based on the average CPU usage, the operator determines the "distance" of each pod's CPU usage from the average. Pods with CPU usage below the average are assigned higher weights, indicating they can handle more requests. Conversely, pods with higher-than-average CPU usage receive lower weights to prevent them from getting more requests and becoming bottlenecks.
The calculated weights are applied to the Envoy proxies via Istio's ServiceEntry resources. This dynamic adjustment ensures that request distribution takes each pod's real-time performance into account, thereby optimizing load balancing to reduce tail latency.
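The weighting itself can be sketched as follows. This is a simplified, hypothetical version of the heuristic (weight proportional to a pod's deviation from the service average, clamped to a minimum), not the operator's exact formula, and the final step of writing the weights into the ServiceEntry endpoints via the Istio client is omitted.

```go
package main

import "fmt"

// endpointWeights converts per-pod CPU usage (in cores) into relative
// load-balancing weights: pods below the service average get a weight above
// the base value, pods above it get a weight below it. The operator would
// then write these weights into the corresponding ServiceEntry endpoints so
// that Envoy skews traffic away from busy pods.
func endpointWeights(cpuByPod map[string]float64) map[string]uint32 {
	const (
		baseWeight = 100 // weight for a pod exactly at the average
		minWeight  = 1   // never starve a pod completely
	)

	if len(cpuByPod) == 0 {
		return nil
	}

	// Average CPU usage across the service's pods.
	var total float64
	for _, cpu := range cpuByPod {
		total += cpu
	}
	avg := total / float64(len(cpuByPod))

	weights := make(map[string]uint32, len(cpuByPod))
	for pod, cpu := range cpuByPod {
		if avg == 0 {
			weights[pod] = baseWeight
			continue
		}
		// Distance from the average as a fraction of the average:
		// positive for underutilized pods, negative for busy ones.
		distance := (avg - cpu) / avg
		w := baseWeight * (1 + distance)
		if w < minWeight {
			w = minWeight
		}
		weights[pod] = uint32(w)
	}
	return weights
}

func main() {
	// Made-up CPU readings (cores) for three pods of one service.
	cpu := map[string]float64{"pod-a": 0.2, "pod-b": 0.7, "pod-c": 2.1}
	for pod, w := range endpointWeights(cpu) {
		fmt.Printf("%s -> weight %d\n", pod, w)
	}
}
```

In this toy example the lightly loaded pod ends up with roughly twice the weight of the average pod, while the overloaded one is clamped near the minimum, which mirrors the behaviour described above.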
Fig 1: High Level Design
Before diving into the detailed metrics, here's a summary of the key improvements achieved through our optimization:
- Total CPU usage reduced by 20%
- P99 latency decreased by nearly 50%
- More balanced request distribution across pods
These improvements demonstrate significant enhancements in resource utilization, response times, and load distribution.
To evaluate the impact of our optimization strategy, we conducted extensive testing using a set of 15 Nginx pods, each executing a Lua script to calculate different Fibonacci numbers. This setup introduced variability in compute load, reflecting our heterogeneous environment.
We used Fortio to generate load at a rate of 1,500 requests per second (rps). The Nginx pods were configured to calculate Fibonacci numbers ranging from 25 to 29, creating varying levels of CPU usage. Here's the breakdown of our pod setup:
- Pods calculating Fibonacci number for 25:
  - sleep-lior-2-6794d4cfdc-2gs9b
  - sleep-lior-2-6794d4cfdc-6r6lg
  - sleep-lior-2-6794d4cfdc-rvmd2
- Pods calculating Fibonacci number for 26:
  - sleep-lior-2-6794d4cfdc-jgxqg
  - sleep-lior-2-6794d4cfdc-stjzd
- Pods calculating Fibonacci number for 27:
  - sleep-lior-2-6794d4cfdc-7rrwr
  - sleep-lior-2-6794d4cfdc-gv856
  - sleep-lior-2-6794d4cfdc-jz462
  - sleep-lior-2-6794d4cfdc-kr64w
  - sleep-lior-2-6794d4cfdc-kxhwx
  - sleep-lior-2-6794d4cfdc-m2xcx
  - sleep-lior-2-6794d4cfdc-p594m
  - sleep-lior-2-6794d4cfdc-qnlnl
  - sleep-lior-2-6794d4cfdc-tffd9
- Pod calculating Fibonacci number for 29:
  - sleep-lior-2-6794d4cfdc-mp8sn
This distribution of Fibonacci calculations across pods simulates a heterogeneous environment where different nodes have varying computational capabilities. The pods calculating lower Fibonacci numbers (25 and 26) represent faster or less loaded nodes, while those calculating higher numbers (27 and especially 29) represent slower or more heavily loaded nodes.
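For a sense of scale, naive recursive Fibonacci grows its work by roughly the golden ratio (about 1.6x) per increment of n, so computing fib(29) costs around seven times as much CPU as fib(25). The sketch below is a Go stand-in for that workload, purely for illustration; the actual test pods ran the Lua script referenced in the appendix.

```go
package main

import (
	"fmt"
	"time"
)

// fib is the naive exponential-time Fibonacci used purely to burn CPU.
// Each increment of n multiplies the number of recursive calls by roughly
// the golden ratio (~1.6), so fib(29) does about 7x the work of fib(25).
func fib(n int) int {
	if n < 2 {
		return n
	}
	return fib(n-1) + fib(n-2)
}

func main() {
	for _, n := range []int{25, 26, 27, 29} {
		start := time.Now()
		result := fib(n)
		fmt.Printf("fib(%d) = %d in %v\n", n, result, time.Since(start))
	}
}
```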
- Total CPU Usage: ~10 CPUs for all pods
Fig 2: Total CPU usage before optimization:
sum(node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate{container!="POD", container=~"sleep-lior-2"})
- CPU Usage Range: 2.2 CPUs (highest pod) to 0.2 CPUs (lowest pod)
Fig 3: Per-pod CPU usage before optimization:
sum(node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate{container!="POD", container="sleep-lior-2"}) by (pod)
- Service Response Time:
Fig 4: Service latencies before optimization:
- histogram_quantile(0.50, sum(rate(istio_request_duration_milliseconds_bucket{reporter="destination",destination_canonical_service="sleep-lior-2"}[2m])) by (le,destination_canonical_service))
- histogram_quantile(0.90, sum(rate(istio_request_duration_milliseconds_bucket{reporter="destination",destination_canonical_service="sleep-lior-2"}[2m])) by (le,destination_canonical_service))
- histogram_quantile(0.95, sum(rate(istio_request_duration_milliseconds_bucket{reporter="destination",destination_canonical_service="sleep-lior-2"}[2m])) by (le,destination_canonical_service))
- histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket{reporter="destination",destination_canonical_service="sleep-lior-2"}[2m])) by (le,destination_canonical_service))
Fig 5: Per-pod p50 latency before optimization:
histogram_quantile(0.5, sum(rate(istio_request_duration_milliseconds_bucket{reporter="destination",destination_canonical_service="sleep-lior-2",request_protocol="http",response_code=~"2.*",pod=~"sleep-lior-2.*"}[2m])) by (le,pod))
- Per pod p90 Latency: 38ms (ranging from 100ms to 10ms) ![alt text](images/per-pod-p90-before.png) Fig 6: Per-pod p90 latency before optimization:
histogram_quantile(0.9, sum(rate(istio_request_duration_milliseconds_bucket{reporter="destination",destination_canonical_service="sleep-lior-2",request_protocol="http",response_code=~"2.*",pod=~"sleep-lior-2.*"}[2m])) by (le,pod))
- Per pod p95 Latency: 47ms (ranging from 170ms to 17ms) ![alt text](images/per-pod-p95-before.png) Fig 7: Per-pod p95 latency before optimization:
histogram_quantile(0.95, sum(rate(istio_request_duration_milliseconds_bucket{reporter="destination",destination_canonical_service="sleep-lior-2",request_protocol="http",response_code=~"2.*",pod=~"sleep-lior-2.*"}[2m])) by (le,pod))
- Per pod p99 Latency: 93ms (ranging from 234ms to 23ms) ![alt text](images/per-pod-p99-before.png) Fig 8: Per-pod p99 latency before optimization:
histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket{reporter="destination",destination_canonical_service="sleep-lior-2",request_protocol="http",response_code=~"2.*",pod=~"sleep-lior-2.*"}[2m])) by (le,pod))
- Per pod Request Rate: 100 requests per second (uniform)
Fig 9: Per-pod request rate before optimization:
sum(rate(istio_requests_total{container!="POD",destination_canonical_service=~"sleep-lior-2",pod=~"sleep-lior-2.*"}[2m])) by (pod)
- Total CPU Usage: Decreased to 8 CPUs
Fig 10: Total CPU usage after optimization:
sum(node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate{container!="POD", container=~"sleep-lior-2"})
- CPU Usage Range: 0.6 CPUs (highest pod) to 0.45 CPUs (lowest pod)
Fig 11: Per-pod CPU usage after optimization:
sum(node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate{container!="POD", container="sleep-lior-2"}) by (pod)
- Service Response Time:
Fig 12: Service latencies after optimization:
- histogram_quantile(0.50, sum(rate(istio_request_duration_milliseconds_bucket{reporter="destination",destination_canonical_service="sleep-lior-2"}[2m])) by (le,destination_canonical_service))
- histogram_quantile(0.90, sum(rate(istio_request_duration_milliseconds_bucket{reporter="destination",destination_canonical_service="sleep-lior-2"}[2m])) by (le,destination_canonical_service))
- histogram_quantile(0.95, sum(rate(istio_request_duration_milliseconds_bucket{reporter="destination",destination_canonical_service="sleep-lior-2"}[2m])) by (le,destination_canonical_service))
- histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket{reporter="destination",destination_canonical_service="sleep-lior-2"}[2m])) by (le,destination_canonical_service))
- Per pod p90 Latency: 24ms (ranging from 46ms to 21ms) ![alt text](images/per-pod-p90.png) Fig 14: Per-pod p90 latency after optimization:
histogram_quantile(0.9, sum(rate(istio_request_duration_milliseconds_bucket{reporter="destination",destination_canonical_service="sleep-lior-2",request_protocol="http",response_code=~"2.*",pod=~"sleep-lior-2.*"}[2m])) by (le,pod))
- Per pod p95 Latency: 33ms (ranging from 50ms to 23ms) ![alt text](images/per-pod-p95.png) Fig 15: Per-pod p95 latency after optimization:
histogram_quantile(0.95, sum(rate(istio_request_duration_milliseconds_bucket{reporter="destination",destination_canonical_service="sleep-lior-2",request_protocol="http",response_code=~"2.*",pod=~"sleep-lior-2.*"}[2m])) by (le,pod))
- Per pod p99 Latency: 47ms (ranging from 92ms to 24ms) ![alt text](images/per-pod-p99.png) Fig 16: Per-pod p99 latency after optimization:
histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket{reporter="destination",destination_canonical_service="sleep-lior-2",request_protocol="http",response_code=~"2.*",pod=~"sleep-lior-2.*"}[2m])) by (le,pod))
- Per pod Request Rate: Adjusted dynamically, ranging from 25 rps to 224 rps
Fig 17: Per-pod request rate after optimization:
sum(rate(istio_requests_total{container!="POD",destination_canonical_service=~"sleep-lior-2",pod=~"sleep-lior-2.*"}[2m])) by (pod)
The optimization demonstrated significant performance improvements:
- CPU Usage Reduction: Total usage decreased from 10 CPUs to 8 CPUs, indicating more efficient resource utilization.
- Latency Reductions: Significant improvements across all percentiles, with p99 latency nearly halved.
- Balanced Load Distribution: Request rates adjusted dynamically, ensuring faster pods handle more requests and slower pods handle fewer, contributing to lower latencies and balanced resource usage.
These improvements have real-world implications for user experience and system efficiency. The reduction in tail latency means that even the slowest 1% of requests are now processed twice as fast, leading to a more consistent and responsive user experience. The more efficient CPU utilization allows for better resource allocation, potentially reducing infrastructure costs or allowing for higher overall throughput with the same resources.
By focusing on CPU metrics and dynamically adjusting load balancing weights, we optimized the performance of our microservices running in a heterogeneous hardware environment. This approach, facilitated by a custom Kubernetes operator and leveraging Istio and Envoy, enabled us to reduce tail latency and improve overall system reliability significantly.
Our experience demonstrates that adapting load-balancing strategies to account for hardware variability can overcome performance disparities and create a more responsive and robust microservices architecture. This approach has broader implications for the industry, particularly for organizations managing diverse infrastructure or transitioning between hardware generations.
Our journey began with extensive research, which led us to an article detailing Google's innovative methods for addressing similar issues. This discovery was transformative, affirming that the limitations of least-connections load balancing are a common challenge. Google developed an internal mechanism called Prequal, which optimizes load balancing by minimizing real-time latency and requests-in-flight (RIF), signals that Envoy's built-in load balancers do not use.
Before developing our Kubernetes operator, we engaged with the community to explore existing solutions. This approach provided valuable insights and saved time. For example, during our tests, we encountered a bug that the community resolved in less than 24 hours, demonstrating the power of collaborative problem-solving.
We raised an issue on Istio's GitHub repository (istio/istio#50968) and witnessed a swift response from the community, highlighting the importance of collaboration in open-source projects.
Our Kubernetes operator is running in production and performing well. We've successfully implemented the first step of balancing CPU resources, and it's effective so far. Moving forward, our plans include:
- Monitoring and Iteration: Continuously monitoring performance and making necessary adjustments.
- Exploring Additional Metrics: Considering other metrics, such as memory usage or network latency, for finer-grained load balancing.
- Community Collaboration: Working with Istio and Envoy communities to contribute our findings and improvements back to the open-source projects.
We believe our approach can serve as a blueprint for others facing similar challenges in heterogeneous Kubernetes environments, and we look forward to further optimizations and community contributions.
We would like to express our gratitude to the Istio and Envoy communities for their invaluable support and quick response to our queries. Special thanks to the VictoriaMetrics team for their high-performance monitoring solution that made our real-time metrics collection possible. We also appreciate the contributions of all team members involved in this project, whose dedication and expertise were crucial to its success.
For detailed implementation of our Fibonacci calculator used in the Nginx pods, please refer to our Lua script.
For complete codebase and additional implementation details, please visit our GitHub repository: Istio-adaptive-least-request