503s and _No_route_to_host errors due to routing to non-existent Endpoints #4685

Open
coro opened this issue Nov 8, 2024 · 26 comments · May be fixed by #4754 or #4767
@coro (Contributor) commented Nov 8, 2024

Description:
We have been seeing many 503 errors when connecting to a Service with a lot of Pod churn.

We also saw in the logs that in these cases, the upstream_host that Envoy was attempting to connect to belonged to Pods that no longer existed in the cluster. Some of these Pods had been terminated over 50 minutes earlier.

Repro steps:
Simple setup of Gateway (AWS NLB) -> HTTPRoute -> Service pointing to a Deployment with a lot of Pod churn.
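For the churn itself, a loop along these lines is enough (a rough sketch; backend is a placeholder Deployment name):

while true
do
  # replace all Pods and wait for the new ones to become ready
  kubectl rollout restart deployment/backend
  kubectl rollout status deployment/backend
  sleep 30
done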

Environment:
Gateway: v1.2.1 (also seen on v1.1.3) (not seen on v1.1.1)
Envoy: v1.32.1 (also seen on v1.31.1) (not seen on v1.31.0)
EKS cluster v1.29

Logs:

{
  "start_time": "2024-11-08T13:54:35.382Z",
  "method": "POST",
  "x-envoy-origin-path": "/v1/models/custom-model:predict",
  "protocol": "HTTP/1.1",
  "response_code": "503",
  "response_flags": "UF",
  "response_code_details": "upstream_reset_before_response_started{remote_connection_failure|delayed_connect_error:_No_route_to_host}",
  "connection_termination_details": "-",
  "upstream_transport_failure_reason": "delayed_connect_error:_No_route_to_host",
  "bytes_received": "494",
  "bytes_sent": "165",
  "duration": "3060",
  "x-envoy-upstream-service-time": "-",
  "x-forwarded-for": "10.0.99.64",
  "user-agent": "python-requests/2.32.3",
  "x-request-id": "3dbc6bcc-79b1-4ec1-855c-1569b69f5416",
  ":authority": "kserve-detection.prod.signal",
  "upstream_host": "10.0.101.243:8080",
  "upstream_cluster": "httproute/default/detection/rule/0",
  "upstream_local_address": "-",
  "downstream_local_address": "10.0.101.92:10080",
  "downstream_remote_address": "10.0.99.64:41562",
  "requested_server_name": "-",
  "route_name": "httproute/default/detection/rule/0/match/0/kserve-detection_prod_signal"
}

The generated cluster config for this route was:

   "dynamic_active_clusters": [
    {
     "version_info": "a998a3b0100c28a29ff3ee662cb01dc45463f0260c9a6a50107dcc12a45938fc",
     "cluster": {
      "@type": "type.googleapis.com/envoy.config.cluster.v3.Cluster",
      "name": "httproute/default/detection/rule/0",
      "type": "EDS",
      "eds_cluster_config": {
       "eds_config": {
        "ads": {},
        "resource_api_version": "V3"
       },
       "service_name": "httproute/default/detection/rule/0"
      },
      "connect_timeout": "10s",
      "per_connection_buffer_limit_bytes": 32768,
      "lb_policy": "LEAST_REQUEST",
      "circuit_breakers": {
       "thresholds": [
        {
         "max_retries": 1024
        }
       ]
      },
      "dns_lookup_family": "V4_ONLY",
      "outlier_detection": {},
      "common_lb_config": {
       "locality_weighted_lb_config": {}
      },
      "ignore_health_on_host_removal": true
     },
     "last_updated": "2024-11-08T14:29:21.557Z"
    },
...
coro added the triage label Nov 8, 2024
@arkodg (Contributor) commented Nov 8, 2024

Hey @coro, we haven't seen such issues yet. Can you run kubectl get endpointslice -l <service label> -o yaml as well as egctl config envoy-proxy endpoint multiple times over a few seconds, so we can better understand the consistency issue?
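For example, something like this rough sketch would capture both views once a second (<service-label> and <envoy-pod> are placeholders for your Service's label and the Envoy proxy pod):

for i in $(seq 1 30)
do
  # snapshot the Kubernetes view and the Envoy view side by side
  kubectl get endpointslice -l <service-label> -o yaml > endpointslice-$i.yaml
  egctl config envoy-proxy endpoint -n kube-system <envoy-pod> > envoy-endpoints-$i.txt
  sleep 1
done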

To avoid this, you can set up retries and passive health checks; here's an example:

apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: retry-for-route
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      name: backend
  retry:
    numRetries: 5
    perRetry:
      backOff:
        baseInterval: 100ms
        maxInterval: 10s
      timeout: 250ms
    retryOn:
      httpStatusCodes:
        - 500
      triggers:
        - connect-failure
        - retriable-status-codes
  healthCheck:
    passive:
      baseEjectionTime: 10s
      interval: 2s
      maxEjectionPercent: 100
      consecutive5XxErrors: 1 
      consecutiveGatewayErrors: 0
      consecutiveLocalOriginFailures: 1
      splitExternalLocalOriginErrors: false

@coro (Contributor, Author) commented Nov 11, 2024

Thanks for the suggestion @arkodg! On the endpointslice front, anecdotally I can tell you that we did not see the IPs of the non-existent Pods in the EndpointSlices, but we will check again with that and the egctl output and get back to you.

It's worth mentioning that this was on a KServe InferenceService being scaled by a KEDA autoscaler, so perhaps there is some kind of race condition between KEDA's control of the EndpointSlice and EG's reading of it. Again, we will take a look and let you know.

@arkodg (Contributor) commented Nov 11, 2024

cc @dprotaso

@coro (Contributor, Author) commented Nov 12, 2024

@arkodg Just a quick update, we can confirm that this does not happen with EG v1.1.1 / Envoy v1.31.0. Rolling back seems to have resolved this issue. We haven't had a moment to safely reproduce this and get your additional debug info just yet, so we will let you know again when we have that.

@coro (Contributor, Author) commented Nov 12, 2024

cc @evilr00t @sam-burrell

@sam-burrell (Contributor) commented:

We can confirm that after a Pod is terminated, egctl config envoy-proxy endpoint for the route shows that the Pod is no longer in the list of lbEndpoints; there is no mention of the Pod's IP (healthy or otherwise).

Yet after that point in time we still get logs from Envoy saying upstream_reset_before_response_started{remote_connection_failure|delayed_connect_error:_Connection_refused} for the upstream IP of a Pod that is not in Envoy's list of lbEndpoints.

There seems to be some routing within Envoy that is still trying to reach a dead/non-existent Pod. Perhaps this is cached, or perhaps there is a long-lived connection still open or being reused?

This is continuing from the same environment as @coro (we work on the same team).
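One way to check whether the proxy still knows the stale host or holds open connections to it (a sketch, assuming the Envoy admin interface is on its default 127.0.0.1:19000; <envoy-pod> and <stale-pod-ip> are placeholders):

kubectl port-forward -n kube-system <envoy-pod> 19000:19000 &
# Per-host stats from the admin interface; lines look like
# <cluster>::<host:port>::cx_active::<n> and show whether the host is still tracked and has live connections.
curl -s http://localhost:19000/clusters | grep <stale-pod-ip>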

@evilr00t commented Nov 13, 2024

Hi!

Looks like we were able to identify the issue.

We've started comparing proxy endpoints with egctl config envoy-proxy endpoint -n kube-system envoy_pod, and since v1.1.3 (we've checked 1.1.1 and 1.1.2) there seems to be a regression.

After the upgrade, the controller no longer updates its proxy endpoints - only at startup - which, as you can imagine, causes a lot of issues in a very dynamic Kubernetes world.

The Kubernetes EndpointSlice is being updated, but the endpoints on the proxy have been static since the gateway started. A rollout restart updates the endpoints, but that is not what we want.

We compared different tags, and I think this is what caused the regression, although I'm not 100% sure - just giving you a hint. (Those changes were cherry-picked from 1.2.0 into 1.1.3, where we also noticed the same issue.)

#4336
https://github.com/envoyproxy/gateway/pull/4337/files

@arkodg (Contributor) commented Nov 14, 2024

Unsure how the status updater could affect EndpointSlice reconciliation.
cc @zhaohuabing

@ovaldi commented Nov 14, 2024

We’re also facing the same issue.

Downgrading EG from v1.2.1 to v1.1.2 resolved the issue.

@ligol commented Nov 14, 2024

We're getting the same error on our side too; rolling back to v1.1.2 seems to have fixed the issue for us as well.

@zhaohuabing (Member) commented Nov 14, 2024

@ligol @ovaldi @evilr00t @sam-burrell @coro

Is there an easy way to reproduce this? I tried modifying the replicas of the Deployment to increase/decrease the number of Pods, but couldn't reproduce the issue.

Also, are there any errors/warnings in the EG/Envoy logs when this happens?

Can you also try
v1.1.3+ envoyproxy/envoy:distroless-v1.31.2
or
v1.1.2 envoyproxy/envoy:distroless-v1.31.3?

Looks like the only significant change within v1.1.3 is the upgrade of envoy to v1.31.3.

@ligol commented Nov 14, 2024

I will try to test the different combinations tomorrow, when I'll have some traffic on the preprod environment; otherwise I don't really have another way to reproduce this.

@sam-burrell (Contributor) commented Nov 15, 2024

These are all the combinations we have tested

✔️ Working: Gateway 1.1.1 + Envoy image envoyproxy/envoy:distroless-v1.31.1
✔️ Working: Gateway 1.1.2 + Envoy image envoyproxy/envoy:distroless-v1.31.2
✔️ Working: Gateway 1.1.2 + Envoy image envoyproxy/envoy:distroless-v1.31.3
🛑 Not working: Gateway 1.1.3 + Envoy image envoyproxy/envoy:distroless-v1.31.1
🛑 Not working: Gateway 1.1.3 + Envoy image envoyproxy/envoy:distroless-v1.31.2
🛑 Not working: Gateway 1.2.1 + Envoy image envoyproxy/envoy:distroless-v1.32.1

All on EKS cluster v1.29.8

@sam-burrell (Contributor) commented Nov 15, 2024

The following shell output is how we are defining "Not Working".
We get results like this after randomly killing a few Pods:

EndpointSlice FOOBAR_ENDPOINT endpoints:                       ['10.0.100.160', '10.0.100.32', '10.0.102.251', '10.0.103.11', '10.0.104.40', '10.0.106.44', '10.0.110.230', '10.0.111.196', '10.0.96.13', '10.0.98.131']
Selected endpoints for FOOBAR_ROUTE on FOOBAR_GATEWAY_POD:     ['10.0.100.160', '10.0.100.32', '10.0.102.251', '10.0.103.11', '10.0.104.40', '10.0.106.44', '10.0.110.230', '10.0.111.196', '10.0.96.13', '10.0.98.131']
Selected endpoints for FOOBAR_ROUTE on FOOBAR_GATEWAY_POD:     ['10.0.102.251', '10.0.106.41', '10.0.106.44', '10.0.110.230', '10.0.111.196', '10.0.96.116', '10.0.96.13', '10.0.97.194', '10.0.98.131']
Selected endpoints for FOOBAR_ROUTE on FOOBAR_GATEWAY_POD:     ['10.0.102.251', '10.0.106.41', '10.0.106.44', '10.0.110.230', '10.0.111.196', '10.0.96.116', '10.0.96.13', '10.0.97.194', '10.0.98.131']

Some gateway pods have IPs that don't exist in the EndpointSlice.
IP 10.0.106.41 is not in the EndpointSlice and returns a 503 when Envoy routes to it.

The following is the script we are using to test this:

import subprocess
import json
from datetime import datetime
from pprint import pprint

HTTP_ROUTE = "HTTP_ROUTE"
ENDPOINT_SLICE = "ENDPOINT_SLICE"
GATEWAY_POD_PREFIX = "GATEWAY_POD_PREFIX"

# Define the command to get the EndpointSlice
command = ["kubectl", "get", "endpointslice", "-o", "json", ENDPOINT_SLICE]
result = subprocess.run(command, capture_output=True, text=True)
endpointslice_output = result.stdout
endpointslice_data = json.loads(endpointslice_output)

# Collect addresses of endpoints that are both ready and serving
ENDPOINT_SLICE_endpoints = [
    endpoint["addresses"][0]
    for endpoint in endpointslice_data["endpoints"]
    if endpoint["conditions"]["ready"] and endpoint["conditions"]["serving"]
]

ENDPOINT_SLICE_endpoints = sorted(ENDPOINT_SLICE_endpoints)
print(f"EndpointSlice {ENDPOINT_SLICE} endpoints: \t\t\t\t\t\t\t\t\t\t{ENDPOINT_SLICE_endpoints}")


command = ["kubectl", "get", "pods", "-n", "kube-system", "-o", "json"]
result = subprocess.run(command, capture_output=True, text=True)
pods_output = result.stdout

# Parse the JSON output to get the pod names
pods_data = json.loads(pods_output)
envoy_pods = [
    pod["metadata"]["name"]
    for pod in pods_data["items"]
    if pod["metadata"]["name"].startswith(GATEWAY_POD_PREFIX)
]

all_selected_endpoints = []

for envoy_pod in envoy_pods:
    # Define the command to run
    command = [
        "egctl", "config", "envoy-proxy", "endpoint", "-n", "kube-system",
        envoy_pod
    ]

    # Run the command and capture the output
    result = subprocess.run(command, capture_output=True, text=True)
    output = result.stdout

    # Parse the JSON output
    data = json.loads(output)

    # Extract the required information
    endpoints = data["kube-system"][envoy_pod]["dynamicEndpointConfigs"]
    selected_endpoints = [
        lb_endpoint["endpoint"]["address"]["socketAddress"]["address"]
        for ep in endpoints
        if ep["endpointConfig"]["clusterName"] == HTTP_ROUTE
        for lb_endpoint in ep["endpointConfig"]["endpoints"][0]["lbEndpoints"]
    ]
    selected_endpoints = sorted(selected_endpoints)
    print(f"Selected endpoints for {HTTP_ROUTE} on {envoy_pod}: \t{selected_endpoints}")

@sam-burrell (Contributor) commented:

We have turned on debug logs for the envoy-gateway controller and see nothing obvious.
We do not see any logs in the envoy-gateway controller matching "error|warning|fatal|Error".
We do not see any logs in the envoy-gateway pods other than the 503s.

@sam-burrell (Contributor) commented:

Is there anything we can do to assist with this?

arkodg added the kind/bug and cherrypick/release-v1.2.2 labels Nov 19, 2024
@zhaohuabing (Member) commented Nov 20, 2024

I tried the setup below with both 1.2.1 and 1.1.2.

Constantly create and delete pods with:

while true
do
kubectl patch deployment backend -p '{"spec":{"replicas":100}}'
sleep 10
kubectl patch deployment backend -p '{"spec":{"replicas":1}}'
sleep 10
done

And access the Gateway with

while true
do
status_code=`curl -o /dev/null -s -w "%{http_code}\n"  -HHost:www.example.com http://172.18.0.200`
if [ "$status_code" -eq 503 ]; then
  echo "503"
fi
done

I got some 503 errors with both 1.2.1 and 1.1.2. After I stopped creating and deleting Pods, the 503 errors ceased for both versions. There doesn't seem to be any noticeable difference in behavior between 1.2.1 and 1.1.2.

There's probably something missing in my configuration, which is why I can't reproduce this. @sam-burrell @ligol can you help me reproduce it?

@zhaohuabing (Member) commented:

> The following shell output is how we are defining "Not Working".
> We get results like this after randomly killing a few Pods.

I think it's acceptable for the endpoints in Envoy's xDS to temporarily differ from the k8s EndpointSlice; this delay occurs because Envoy Gateway (EG) needs time to propagate the changes to Envoy. However, if an endpoint still exists in Envoy for a while after being deleted from the EndpointSlice, that indicates a problem.
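A quick way to tell a propagation delay apart from a genuinely stale endpoint (a sketch; it assumes the JSON access log format shown at the top of the issue, and <envoy-pod> and <endpointslice> are placeholders) is to pull the upstream_host of each 503 from the proxy logs and compare it with the current EndpointSlice:

# upstream hosts that served 503s in the last 10 minutes
kubectl logs -n kube-system <envoy-pod> --since=10m \
  | grep '"response_code": "503"' \
  | grep -o '"upstream_host": "[^"]*"' | sort | uniq -c

# addresses currently in the EndpointSlice
kubectl get endpointslice <endpointslice> -o jsonpath='{range .endpoints[*]}{.addresses[0]}{"\n"}{end}'

If a 503 upstream_host stays absent from the EndpointSlice for minutes, that looks like the stale-endpoint case rather than propagation delay.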

@zhaohuabing (Member) commented:

@ligol @ovaldi @evilr00t @sam-burrell @coro Could you please check whether #4754 fixes this?

@dromadaire54 commented Nov 21, 2024

@zhaohuabing we are going to try whether the revert fixes the issue.

@zhaohuabing (Member) commented Nov 22, 2024

> @zhaohuabing we are going to try whether the revert fixes the issue.

@dromadaire54 Please also try #4767 if you have a stable reproducible env. Thanks!

@evilr00t commented:

Just checked the latest v0.0.0-latest and this bug still happens - only one of the proxies is updated during scale-up / scale-down.

Screenshot included: [Screenshot 2024-11-22 at 10:34:59 AM]

@zirain (Contributor) commented Nov 22, 2024

> Just checked the latest v0.0.0-latest and this bug still happens - only one of the proxies is updated during scale-up / scale-down.
>
> Screenshot included: [Screenshot 2024-11-22 at 10:34:59 AM]

It's not merged yet, so you need to build it yourself.

@arkodg (Contributor) commented Nov 22, 2024

@zhaohuabing can you provide a custom image from your Docker Hub repo (with the commit reverted) that others can try?

@zhaohuabing (Member) commented Nov 23, 2024

I prefer #4767, because reverting in 4755 would cause a regression of another issue.

@zhaohuabing (Member) commented Nov 23, 2024

Update: please use zhaohuabing/gateway-dev:e406088d7 for #4767. The new image contains the fix for e2e tests.
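One way to try it (a sketch, assuming a default install with the envoy-gateway Deployment in the envoy-gateway-system namespace; adjust the namespace and names if your install differs):

# swap the controller image to the dev build and wait for the rollout
kubectl -n envoy-gateway-system set image deployment/envoy-gateway envoy-gateway=zhaohuabing/gateway-dev:e406088d7
kubectl -n envoy-gateway-system rollout status deployment/envoy-gateway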
