
gRPC health check may say the server is unhealthy even if it's responding successfully to GetSystemInfo #5015

Closed
josh-berry opened this issue Oct 20, 2023 · 12 comments · Fixed by #5069

Comments

@josh-berry

For full context, see the discussion on this PR: temporalio/cli#368 (comment)

Expected Behavior

If GetSystemInfo returns successfully, the gRPC health check should also pass.

Actual Behavior

For a period of up to about 1 second after GetSystemInfo succeeds, the gRPC health check may fail (returning NOT_SERVING), falsely indicating that gRPC is down when it's not.

This was causing frequent intermittent failures in the CLI CI/CD pipeline until we worked around it in temporalio/cli#368.

Steps to Reproduce the Problem

  1. Launch the server
  2. Immediately try to connect to it using the Go SDK. (The Go SDK will wait for a successful GetSystemInfo response before returning a client object to the caller.)
  3. Once the Go SDK returns a client object, immediately use the client object to perform a health check of the server.
  4. Intermittently, the health check will fail.

Specifications

  • Version: 1.22.0
  • Platform: Seen on all platforms
@dnr
Member

dnr commented Oct 23, 2023

  1. It looks like the CLI isn't using the service-specific health check (it's passing an empty request, i.e. the empty string as the service name). I think it should check the health of WorkflowService specifically.

  2. I actually disagree that this is incorrect behavior. Just because one RPC returns successfully doesn't mean that the service is healthy. The health indication might be looking at other things not relevant to that particular RPC method. The implication should go the other way: if the health check returns success, then all RPCs should return successfully (assuming valid requests, etc.). I.e., trust the health check over GetSystemInfo.

@josh-berry
Author

@dnr Thanks, makes sense. @cretz FYI in case we need to change what the SDKs are doing.

@cretz
Member

cretz commented Oct 25, 2023

I actually disagree this is incorrect behavior. Just because one rpc returns successfully doesn't mean that the service is healthy.

It was decided when GetSystemInfo was created that it would replace health checks (temporalio/sdk-go#706), but maybe we weren't clear enough about its purpose at the time. We don't want to force every gRPC client to make two RPC calls on every client connection just for info + health.

Can we choose not to start the gRPC server until it will be healthy? Same for our HTTP API and any other user-facing thing: if we can avoid serving until we can properly serve, that would be ideal. If we are going to serve even though we're not ready, can we fail the GetSystemInfo check specifically, so that every client doesn't have to make two calls just because we're responding to calls on a not-yet-healthy server?

@dnr
Member

dnr commented Oct 25, 2023

We're talking from different perspectives: from the point of view of an SDK talking to a properly configured Temporal cluster, GetSystemInfo should be enough. But from the point of view of a single frontend process answering RPCs that come from a load balancer or some kind of ingress controller, it wouldn't even expect to get external requests until it returns success to a health check.

The problem is: what if you have an external client talking directly to a frontend without a load balancer? Then there are these mismatched expectations. I.e., I agree external clients should not need to do health checks, but some layer does.

One option is to say the server is fine as-is, and if you (i.e. a test environment) are starting a "bare" server, not in some kind of managed environment with controlled ingress, then you need to act like a load balancer and do a health check. But that's for the test environment to do, not the SDK.

Alternatively, if we do decide we want to change the server: I don't think we want to delay starting the gRPC server until it's healthy (where healthy == has joined membership successfully), for the reasons explained in #4510. (Also note that membership has no way to say "join, but not ready for requests yet".) It's true that that issue is mainly talking about history, and frontend has different constraints since its requests are mostly (but not entirely) external. So maybe we could swap the order for frontend, but that feels inconsistent. It's probably better to do the other thing: block GetSystemInfo on membership. That's slightly worrying, since in theory it could be useful to call it on a frontend that hasn't joined membership yet, but it's probably fine in practice.

@cretz
Member

cretz commented Oct 25, 2023

What can we do to make a single call on client connection to check health and get system info? We're only concerned with server startup health in this case.

It's probably better to do the other thing, block GetSystemInfo on membership.

I'd accept just failing it (or any call) when the server is not yet healthy on startup. We're not talking about server health changing after startup; we just need to be unable to make this call successfully against a not-yet-healthy endpoint on startup.

@dnr
Member

dnr commented Oct 25, 2023

What is "we"? My point is that there are multiple layers with different responsibilities. The thing that starts the server should do a health check before it tells anything to connect to the server, and the SDK should do GetSystemInfo.

@cretz
Member

cretz commented Oct 26, 2023

"We" as in Temporal company. Is there anything Temporal as a company can do to make sure a single call from an SDK on client connection both ensures server is ready to accept requests and returns system info?

The thing that starts the server should do a health check before it tells anything to connect to the server, and the SDK should do GetSystemInfo.

The latter is already done, but I fear the former asks too much of users: they'd have to alter their SDK code to do additional health checks on client connection if, say, they are starting their server somewhere else. We don't even need to do this for every call: can we alter GetSystemInfo to do that health check at the top and error if it doesn't pass? I can submit the PR here if needed.

@dnr
Member

dnr commented Oct 27, 2023

How many bits of code are there out there that start a server for testing? There are a few libraries that everyone calls, right?

@cretz
Member

cretz commented Oct 27, 2023

We cannot know how many users start a server and then a client. It is very common in CI, for example, to just start a server container and do work. The goal is to not allow an SDK client connection (e.g. client.Dial) to return successfully if the server is not ready. How that server starts is unrelated.

@dnr
Member

dnr commented Oct 27, 2023

Ok, we discussed among the team and decided to make all frontend methods return an error if not healthy.

@cretz
Member

cretz commented Oct 30, 2023

Thanks! Just GetSystemInfo works for us, but this approach is even better assuming y'all don't mind the extra checks/interceptor. Now load balancers and retrying clients and such can eagerly know their call will not succeed. I do not believe there is any rush here.

@dnr
Member

dnr commented Nov 3, 2023

I tried this in #5069. One potential issue I noticed is that Dial will return an error immediately, without retries, if it gets Unavailable (which is what it'll get from a frontend in the window before it's healthy). Is that expected/desired?

@dnr dnr closed this as completed in #5069 Dec 12, 2023
dnr added a commit that referenced this issue Dec 12, 2023
**What changed?**
Add an interceptor to return Unavailable to WorkflowService methods
until the frontend considers itself "healthy", which currently means
"membership is initialized".

**Why?**
Fixes #5015 

**How did you test it?**
mostly manually

**Potential risks**
This adds a window of time where frontend can now return Unavailable
where previously it might have succeeded or returned a different error
code. Specifically note that client.Dial in go sdk (at least) will fail
fast on this error and the caller will need to retry.

---------

Co-authored-by: Tim Deeb-Swihart <[email protected]>