
gRPC health check may say the server is unhealthy even if it's responding successfully to GetSystemInfo #5015

Closed
josh-berry opened this issue Oct 20, 2023 · 12 comments · Fixed by #5069

Comments

@josh-berry

For full context, see the discussion on this PR: temporalio/cli#368 (comment)

Expected Behavior

If GetSystemInfo returns successfully, the gRPC health check should also pass.

Actual Behavior

For a period of up to about 1 second after GetSystemInfo succeeds, the gRPC health check may fail (returning NOT_SERVING), falsely indicating that gRPC is down when it's not.

This was causing frequent intermittent failures in the CLI CI/CD pipeline until we worked around it in temporalio/cli#368.

Steps to Reproduce the Problem

  1. Launch the server
  2. Immediately try to connect to it using the Go SDK. (The Go SDK will wait for a successful GetSystemInfo response before returning a client object to the caller.)
  3. Once the Go SDK returns a client object, immediately use the client object to perform a health check of the server.
  4. Intermittently, the health check will fail.

Specifications

  • Version: 1.22.0
  • Platform: Seen on all platforms
@dnr
Member

dnr commented Oct 23, 2023

  1. It looks like the CLI isn't using the service-specific health check (it's passing an empty request, i.e. the empty string as the service name). I think it should check the health of WorkflowService specifically.

  2. I actually disagree that this is incorrect behavior. Just because one RPC returns successfully doesn't mean that the service is healthy. The health indication might be looking at other things not relevant to that particular RPC method. The implication should go the other way: if the health check returns success, then all RPCs should return successfully (assuming valid requests, etc.). I.e., trust the health check over GetSystemInfo.

@josh-berry
Author

@dnr Thanks, makes sense. @cretz FYI in case we need to change what the SDKs are doing.

@cretz
Member

cretz commented Oct 25, 2023

I actually disagree this is incorrect behavior. Just because one rpc returns successfully doesn't mean that the service is healthy.

It was decided when GetSystemInfo was created that it would replace health checks (temporalio/sdk-go#706), but maybe we weren't clear enough about its purpose at the time. We don't want to force every gRPC client to make two RPC calls on every client connection just for info + health.

Can we choose not to start the gRPC server until it will be healthy? Same for our HTTP API and any other user-facing thing: if we can avoid serving until we can properly serve, that would be ideal. If we are going to serve even though we're not ready, can we fail the GetSystemInfo check specifically, so that every client doesn't have to make two calls just because we're responding to calls on a not-yet-healthy server?

@dnr
Member

dnr commented Oct 25, 2023

We're talking from different perspectives: from the point of view of an SDK talking to a properly configured Temporal cluster, GetSystemInfo should be enough. But from the point of view of a single frontend process answering RPCs that come from a load balancer or some kind of ingress controller, it wouldn't even expect to get external requests until it returns success to a health check.

The problem is: what if you have an external client talking directly to a frontend without a load balancer? Then there are these mismatched expectations. I.e., I agree external clients should not need to do health checks, but some layer does.

One option is to say the server is fine as-is, and if you (i.e. a test environment) are starting a "bare" server, not in some kind of managed environment with controlled ingress, then you need to act like a load balancer and do a health check. But that's for the test environment to do, not the SDK.

Alternatively, if we do decide we want to change the server: I don't think we want to delay starting the gRPC server until it's healthy (where healthy == has joined membership successfully), for the reasons explained in #4510. (Also note that membership has no way to say "join, but not ready for requests yet".) It's true that that issue is mainly talking about history, and frontend has different constraints since its requests are mostly (but not entirely) external. So maybe we could swap the order for frontend, but that feels inconsistent. It's probably better to do the other thing: block GetSystemInfo on membership. That's slightly worrying, since in theory it could be useful to call it on a frontend that hasn't joined membership yet, but it's probably fine in practice.

@cretz
Member

cretz commented Oct 25, 2023

What can we do to make a single call on client connection to check health and get system info? We're only concerned with server startup health in this case.

It's probably better to do the other thing, block GetSystemInfo on membership.

I'd accept just failing it (or any call) when the server is not yet healthy on startup. We're not talking about server health changing after startup; we just need to be unable to make this call successfully against a not-yet-healthy endpoint on startup.

@dnr
Member

dnr commented Oct 25, 2023

What is "we"? My point is that there are multiple layers with different responsibilities. The thing that starts the server should do a health check before it tells anything to connect to the server, and the SDK should do GetSystemInfo.

@cretz
Member

cretz commented Oct 26, 2023

"We" as in Temporal company. Is there anything Temporal as a company can do to make sure a single call from an SDK on client connection both ensures server is ready to accept requests and returns system info?

The thing that starts the server should do a health check before it tells anything to connect to the server, and the SDK should do GetSystemInfo.

The latter is already done, but I fear the former asks too much of users: they'd have to alter their SDK code to do additional health checks on client connection if, say, they are starting their server somewhere else. We don't even need to do this for every call: can we alter GetSystemInfo to do that health check at the top and error if it doesn't pass? I can submit the PR here if needed.

@dnr
Member

dnr commented Oct 27, 2023

How many bits of code are there out there that start a server for testing? There are a few libraries that everyone calls, right?

@cretz
Member

cretz commented Oct 27, 2023

We cannot know how many users start a server and then a client. It is very common in CI, for example, to just start a server container and do work. The goal is to not allow an SDK client connection (e.g. client.Dial) to return successfully if the server is not ready. How that server starts is unrelated.

@dnr
Member

dnr commented Oct 27, 2023

Ok, we discussed among the team and decided to make all frontend methods return an error if not healthy.

@cretz
Member

cretz commented Oct 30, 2023

Thanks! Just GetSystemInfo works for us, but this approach is even better assuming y'all don't mind the extra checks/interceptor. Now load balancers and retrying clients and such can eagerly know their call will not succeed. I do not believe there is any rush here.

@dnr
Member

dnr commented Nov 3, 2023

I tried this in #5069. One potential issue I noticed is that Dial will return an error immediately, without retries, if it gets Unavailable (which is what it'll get from a frontend in the window before it's healthy). Is that expected/desired?

@dnr dnr closed this as completed in #5069 Dec 12, 2023
dnr added a commit that referenced this issue Dec 12, 2023
**What changed?**
Add an interceptor to return Unavailable to WorkflowService methods
until the frontend considers itself "healthy", which currently means
"membership is initialized".

**Why?**
Fixes #5015 

**How did you test it?**
mostly manually

**Potential risks**
This adds a window of time where frontend can now return Unavailable
where previously it might have succeeded or returned a different error
code. Specifically note that client.Dial in go sdk (at least) will fail
fast on this error and the caller will need to retry.

---------

Co-authored-by: Tim Deeb-Swihart <[email protected]>