Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix: frequent 503 errors when connecting to a Service experiencing high Pod churn #4754

Merged
merged 5 commits into from
Nov 27, 2024

Conversation

zhaohuabing
Copy link
Member

@zhaohuabing zhaohuabing commented Nov 21, 2024

Reverts #4337
Fixes: #4685

Release Note: Yes

To avoid regression of the bug fixed by #4337, a temporary buffer has been added to store events received before the Updater is enabled.
These events will be sent to the update channel once the Updater is enabled.

Tested the fix in my local env:

➜  k -n envoy-gateway-system logs envoy-gateway-7f7d68bd-q74m6 |grep "status update"                        
2024-11-26T05:56:03.461Z        INFO    provider        kubernetes/status_updater.go:185        received a status update while disabled, storing for later      {"runner": "provider", "event": {
"name":"ha-gateway","namespace":"gateway-upgrade-infra"}}
2024-11-26T05:56:03.764Z        INFO    provider        kubernetes/status_updater.go:185        received a status update while disabled, storing for later      {"runner": "provider", "event": {
"name":"upgrade"}}                                                                              
2024-11-26T05:56:03.767Z        INFO    provider        kubernetes/status_updater.go:185        received a status update while disabled, storing for later      {"runner": "provider", "event": {
"name":"http-backend-eg-upgrade","namespace":"gateway-upgrade-infra"}}
2024-11-26T05:56:03.767Z        INFO    provider        kubernetes/status_updater.go:185        received a status update while disabled, storing for later      {"runner": "provider", "event": {
"name":"ha-gateway","namespace":"gateway-upgrade-infra"}}
2024-11-26T05:56:28.146Z        INFO    provider        kubernetes/status_updater.go:139        started status update handler   {"runner": "provider"}
2024-11-26T05:56:28.146Z        INFO    provider        kubernetes/status_updater.go:197        sending stored status update    {"runner": "provider", "event": {"name":"ha-gateway","namespace":
"gateway-upgrade-infra"}}       
2024-11-26T05:56:28.146Z        INFO    provider        kubernetes/status_updater.go:197        sending stored status update    {"runner": "provider", "event": {"name":"upgrade"}}
2024-11-26T05:56:28.146Z        INFO    provider        kubernetes/status_updater.go:197        sending stored status update    {"runner": "provider", "event": {"name":"http-backend-eg-upgrade"
,"namespace":"gateway-upgrade-infra"}}  
2024-11-26T05:56:28.146Z        INFO    provider        kubernetes/status_updater.go:197        sending stored status update    {"runner": "provider", "event": {"name":"ha-gateway","namespace":
"gateway-upgrade-infra"}}       

If anyone wants to verify this PR, you can do so by using this image zhaohuabing/gateway-dev:52e6d8577

Copy link

codecov bot commented Nov 21, 2024

Codecov Report

Attention: Patch coverage is 60.60606% with 13 lines in your changes missing coverage. Please review.

Project coverage is 65.58%. Comparing base (48a0310) to head (3b9ba84).
Report is 4 commits behind head on main.

Files with missing lines Patch % Lines
internal/provider/kubernetes/status_updater.go 60.60% 12 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4754      +/-   ##
==========================================
- Coverage   65.60%   65.58%   -0.02%     
==========================================
  Files         211      211              
  Lines       31961    31989      +28     
==========================================
+ Hits        20968    20981      +13     
- Misses       9753     9766      +13     
- Partials     1240     1242       +2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@zhaohuabing zhaohuabing force-pushed the revert-4337-fix-missing-status-update branch 3 times, most recently from a2d7838 to 52e6d85 Compare November 26, 2024 06:22
Signed-off-by: Huabing Zhao <[email protected]>
@zhaohuabing zhaohuabing force-pushed the revert-4337-fix-missing-status-update branch from 52e6d85 to 07c9337 Compare November 26, 2024 06:33
@arkodg arkodg requested review from a team November 27, 2024 01:25
@zirain
Copy link
Contributor

zirain commented Nov 27, 2024

should we change the PR title?

@arkodg
Copy link
Contributor

arkodg commented Nov 27, 2024

should we change the PR title?

yeah +1, because this PR does more than just revert

@zhaohuabing zhaohuabing changed the title Revert "fix: some status updates are discarded by the status updater" Fix: frequent 503 errors when connecting to a Service experiencing high Pod churn Nov 27, 2024
@zirain
Copy link
Contributor

zirain commented Nov 27, 2024

/retest

@zirain
Copy link
Contributor

zirain commented Nov 27, 2024

follow up: need a test to make sure refactor won't broken again.

@zhaohuabing zhaohuabing merged commit 8ec3095 into main Nov 27, 2024
24 checks passed
@zhaohuabing zhaohuabing deleted the revert-4337-fix-missing-status-update branch November 27, 2024 02:54
zhaohuabing added a commit to zhaohuabing/gateway that referenced this pull request Nov 27, 2024
…gh Pod churn (envoyproxy#4754)

* Revert "fix: some status updates are discarded by the status updater (envoyproxy#4337)"

This reverts commit 14830c7.

Signed-off-by: Huabing Zhao <[email protected]>

* store update events and process it later

Signed-off-by: Huabing Zhao <[email protected]>

* rename method

Signed-off-by: Huabing Zhao <[email protected]>

* add release note

Signed-off-by: Huabing Zhao <[email protected]>

---------

Signed-off-by: Huabing Zhao <[email protected]>
zhaohuabing added a commit that referenced this pull request Nov 27, 2024
* fix: tcp listener is rejected when no route attached (#4681)

* fix: tcp listener is rejected when no route attached

Signed-off-by: Huabing Zhao <[email protected]>

* change cluter name

Signed-off-by: Huabing Zhao <[email protected]>

* fix listener connection limit test

Signed-off-by: Huabing Zhao <[email protected]>

* fix listener connetcp keepalive  test

Signed-off-by: Huabing Zhao <[email protected]>

* fix tcp endpoint stats test

Signed-off-by: Huabing Zhao <[email protected]>

* fix tcp-route-enable-req-resp-sizes-stats

Signed-off-by: Huabing Zhao <[email protected]>

* fix extensionpolicy-tcp-udp-http test

Signed-off-by: Huabing Zhao <[email protected]>

* fix lint

Signed-off-by: Huabing Zhao <[email protected]>

---------

Signed-off-by: Huabing Zhao <[email protected]>
(cherry picked from commit f99c36c)
Signed-off-by: Huabing Zhao <[email protected]>

* fix: remove backendrefs validation (#4705)

* remove backendrefs validation

Signed-off-by: Huabing Zhao <[email protected]>

* add tests

Signed-off-by: Huabing Zhao <[email protected]>

* add tests

Signed-off-by: Huabing Zhao <[email protected]>

---------

Signed-off-by: Huabing Zhao <[email protected]>
Co-authored-by: zirain <[email protected]>
(cherry picked from commit 5068698)
Signed-off-by: Huabing Zhao <[email protected]>

* fix: translator reports errors for existing clusters and secretes (#4707)

* fix: existing clusters and secretes

Signed-off-by: Huabing Zhao <[email protected]>

* fix cluster index for SP

Signed-off-by: Huabing Zhao <[email protected]>

* minor change

Signed-off-by: Huabing Zhao <[email protected]>

* minor change

Signed-off-by: Huabing Zhao <[email protected]>

* minor change

Signed-off-by: Huabing Zhao <[email protected]>

* minor change

Signed-off-by: Huabing Zhao <[email protected]>

* fix lint

Signed-off-by: Huabing Zhao <[email protected]>

* add comment

Signed-off-by: Huabing Zhao <[email protected]>

* remove index

Signed-off-by: Huabing Zhao <[email protected]>

* fix lint

Signed-off-by: Huabing Zhao <[email protected]>

---------

Signed-off-by: Huabing Zhao <[email protected]>

* xds: always use `::` and `IPv4Compact` for dynamic listener (#4743)

* enable IPv4Compact

Signed-off-by: zirain <[email protected]>

* fix xds test

Signed-off-by: zirain <[email protected]>

* release-notes

Signed-off-by: zirain <[email protected]>

* nit

Signed-off-by: zirain <[email protected]>

* gen

Signed-off-by: zirain <[email protected]>

---------

Signed-off-by: zirain <[email protected]>
(cherry picked from commit 78da42c)
Signed-off-by: Huabing Zhao <[email protected]>

* Fix: frequent 503 errors when connecting to a Service experiencing high Pod churn (#4754)

* Revert "fix: some status updates are discarded by the status updater (#4337)"

This reverts commit 14830c7.

Signed-off-by: Huabing Zhao <[email protected]>

* store update events and process it later

Signed-off-by: Huabing Zhao <[email protected]>

* rename method

Signed-off-by: Huabing Zhao <[email protected]>

* add release note

Signed-off-by: Huabing Zhao <[email protected]>

---------

Signed-off-by: Huabing Zhao <[email protected]>

* xds: use V4_PREFERRED dnsLookupFamily by default (#4745)

* use Cluster_V4_PREFERRED

Signed-off-by: zirain <[email protected]>

* release notes

Signed-off-by: zirain <[email protected]>

---------

Signed-off-by: zirain <[email protected]>

---------

Signed-off-by: Huabing Zhao <[email protected]>
Signed-off-by: zirain <[email protected]>
Co-authored-by: zirain <[email protected]>
guydc pushed a commit to guydc/gateway that referenced this pull request Nov 27, 2024
…gh Pod churn (envoyproxy#4754)

* Revert "fix: some status updates are discarded by the status updater (envoyproxy#4337)"

This reverts commit 14830c7.

Signed-off-by: Huabing Zhao <[email protected]>

* store update events and process it later

Signed-off-by: Huabing Zhao <[email protected]>

* rename method

Signed-off-by: Huabing Zhao <[email protected]>

* add release note

Signed-off-by: Huabing Zhao <[email protected]>

---------

Signed-off-by: Huabing Zhao <[email protected]>
(cherry picked from commit 8ec3095)
Signed-off-by: Guy Daich <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

503s and _No_route_to_host errors due to routing to non-existent Endpoints
3 participants