gRPC health check may say the server is unhealthy even if it's responding successfully to GetSystemInfo #5015
Comments
It was decided when … Can we choose not to start the gRPC server until it will be healthy? Same for our HTTP API and any other user-facing thing: not serving until we can properly serve would be ideal. If we are going to serve even though we're not ready to serve, can we fail the call?
We're talking from different perspectives: from the POV of an SDK talking to a properly configured Temporal cluster, GetSystemInfo should be enough. But from the POV of a single frontend process answering RPCs that come from a load balancer or some kind of ingress controller, it wouldn't even expect to get external requests until it returns success to a health check. The problem is: what if you have an external client talking directly to a frontend without a load balancer? Then there are these mismatched expectations. I.e. I agree external clients should not need to do health checks, but some layer does.

One option is to say the server is fine as-is, and if you (i.e. a test environment) are starting a "bare" server, not in some kind of managed environment with controlled ingress, then you need to act like a load balancer and do a health check. But it's for the test environment to do, not the SDK.

Alternatively, if we do decide we want to change the server: I don't think we want to delay starting the gRPC server until after it's healthy (healthy == joins membership successfully), for the reasons explained in #4510. (Also note that membership has no way to say "join, but not ready for requests yet".) It's true that that's mainly talking about history, and frontend has different constraints since its requests are mostly (but not entirely) external. So maybe we could swap the order for frontend, but that feels inconsistent. It's probably better to do the other thing: block GetSystemInfo on membership. That's slightly worrying, since in theory it could be useful to call that on a frontend that hasn't joined membership yet, but it's probably fine in practice.
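(As an illustration of the "the layer that starts the server acts like a load balancer" option above: below is a minimal sketch using the standard gRPC health-checking protocol in Go. The address and the service name `temporal.api.workflowservice.v1.WorkflowService` are assumptions about a local dev setup, not something prescribed in this thread.)

```go
// Minimal sketch: a test environment polling the standard gRPC health
// endpoint before letting anything connect to the server.
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

// waitUntilServing polls the gRPC health service until it reports SERVING
// or the deadline passes.
func waitUntilServing(addr, service string, deadline time.Duration) error {
	conn, err := grpc.Dial(addr, grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		return err
	}
	defer conn.Close()

	health := healthpb.NewHealthClient(conn)
	var lastErr error
	for start := time.Now(); time.Since(start) < deadline; time.Sleep(100 * time.Millisecond) {
		ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
		resp, err := health.Check(ctx, &healthpb.HealthCheckRequest{Service: service})
		cancel()
		if err != nil {
			lastErr = err
			continue
		}
		if resp.GetStatus() == healthpb.HealthCheckResponse_SERVING {
			return nil
		}
		lastErr = fmt.Errorf("health check returned %v", resp.GetStatus())
	}
	return lastErr
}

func main() {
	// Address and service name are assumptions for a local dev server.
	err := waitUntilServing("127.0.0.1:7233",
		"temporal.api.workflowservice.v1.WorkflowService", 30*time.Second)
	if err != nil {
		log.Fatal(err)
	}
	log.Println("frontend is serving; safe to point clients at it")
}
```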
What can we do to make a single call on client connection to check health and get system info? We're only concerned with server startup health in this case.
I'd accept just failing it (or any call) when the server is not yet healthy on startup. I'm not talking about server health changing after startup; we just need to not be able to make this call successfully against an unhealthy endpoint on startup.
What is "we"? My point is that there are multiple layers with different responsibilities. The thing that starts the server should do a health check before it tells anything to connect to the server, and the SDK should do GetSystemInfo.
"We" as in Temporal company. Is there anything Temporal as a company can do to make sure a single call from an SDK on client connection both ensures server is ready to accept requests and returns system info?
Latter is already done, but I fear the former is asking too much of users to alter their SDK code to do additional health checks upon client connection if, say, they are starting their server somewhere else. Don't even need to make this for every call, can we alter |
How many bits of code are there out there that start a server for testing? There are a few libraries that everyone calls, right?
We cannot know how many users start a server and then a client. It is very common in CI, for example, to just start a server container and do work. The goal is to not allow an SDK client connection (e.g. …) to succeed against an endpoint that is not yet healthy.
OK, we discussed among the team and decided to make all frontend methods return an error if not healthy.
Thanks! Just …
I tried this in #5069. One potential issue I noticed is that Dial will return an error immediately, without retries, if it gets Unavailable (which is what it'll get from a frontend in that window before it's healthy). Is that expected/desired?
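(For reference, a minimal sketch of what a caller-side workaround could look like, assuming the Go SDK's client.Dial fails fast during that pre-healthy window; the retry policy here is illustrative, not something either project prescribes.)

```go
// Minimal sketch: retry client.Dial while the server is still starting up,
// since Dial fails fast instead of retrying when it sees Unavailable.
package main

import (
	"log"
	"time"

	"go.temporal.io/sdk/client"
)

// dialWithRetry keeps retrying client.Dial until it succeeds or the deadline
// passes. A real implementation might inspect the error and only retry on
// Unavailable-style errors.
func dialWithRetry(opts client.Options, deadline time.Duration) (client.Client, error) {
	var lastErr error
	for start := time.Now(); time.Since(start) < deadline; time.Sleep(200 * time.Millisecond) {
		c, err := client.Dial(opts)
		if err == nil {
			return c, nil
		}
		lastErr = err
	}
	return nil, lastErr
}

func main() {
	c, err := dialWithRetry(client.Options{HostPort: "127.0.0.1:7233"}, 10*time.Second)
	if err != nil {
		log.Fatal(err)
	}
	defer c.Close()
	log.Println("connected")
}
```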
**What changed?**
Add an interceptor that returns Unavailable for WorkflowService methods until the frontend considers itself "healthy", which currently means "membership is initialized".

**Why?**
Fixes #5015

**How did you test it?**
Mostly manually.

**Potential risks**
This adds a window of time where the frontend can now return Unavailable where previously it might have succeeded or returned a different error code. Specifically, note that client.Dial in the Go SDK (at least) will fail fast on this error and the caller will need to retry.

Co-authored-by: Tim Deeb-Swihart <[email protected]>
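(The PR itself is the authoritative implementation; purely to illustrate the general shape of the technique it describes, here is a hedged sketch of a gRPC unary server interceptor that rejects WorkflowService calls with Unavailable until a readiness flag flips. The flag, the method-prefix check, and the registration shown are assumptions, not the actual server code.)

```go
// Minimal sketch of the technique (not the actual Temporal server code):
// a unary interceptor that returns Unavailable for WorkflowService methods
// until a readiness flag is set, e.g. once membership has initialized.
package main

import (
	"context"
	"strings"
	"sync/atomic"

	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// healthy would be flipped to true once the frontend finishes initializing.
var healthy atomic.Bool

func notHealthyInterceptor(
	ctx context.Context,
	req interface{},
	info *grpc.UnaryServerInfo,
	handler grpc.UnaryHandler,
) (interface{}, error) {
	// Gate only external WorkflowService methods; leave health checks and
	// other registered services alone.
	if strings.HasPrefix(info.FullMethod, "/temporal.api.workflowservice.v1.WorkflowService/") &&
		!healthy.Load() {
		return nil, status.Error(codes.Unavailable, "frontend is not healthy yet")
	}
	return handler(ctx, req)
}

func main() {
	// Register the interceptor on the gRPC server.
	_ = grpc.NewServer(grpc.ChainUnaryInterceptor(notHealthyInterceptor))
}
```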
For full context, see the discussion on this PR: temporalio/cli#368 (comment)
Expected Behavior
If GetSystemInfo returns successfully, the gRPC health check should also pass.
Actual Behavior
For a period of up to about 1 second after GetSystemInfo succeeds, the gRPC health check may fail (returning NOT_SERVING), falsely indicating that gRPC is down when it's not.
This was causing frequent intermittent failures (such as this one) in the CLI CI/CD pipeline until we worked around it in temporalio/cli#368.
Steps to Reproduce the Problem
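One way the window can be observed, sketched below under assumptions about the local address and the health-check service name: poll GetSystemInfo until it first succeeds, then immediately issue a gRPC health check and watch for NOT_SERVING.

```go
// Minimal sketch, assuming a local frontend at 127.0.0.1:7233 and the
// health-check service name shown below (an assumption, not confirmed here).
package main

import (
	"context"
	"log"
	"time"

	"go.temporal.io/api/workflowservice/v1"
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

func main() {
	conn, err := grpc.Dial("127.0.0.1:7233", grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	wf := workflowservice.NewWorkflowServiceClient(conn)
	health := healthpb.NewHealthClient(conn)
	ctx := context.Background()

	// Poll until GetSystemInfo first succeeds...
	for {
		if _, err := wf.GetSystemInfo(ctx, &workflowservice.GetSystemInfoRequest{}); err == nil {
			break
		}
		time.Sleep(50 * time.Millisecond)
	}

	// ...then immediately check gRPC health; during the reported window this
	// can come back NOT_SERVING even though GetSystemInfo just succeeded.
	resp, err := health.Check(ctx, &healthpb.HealthCheckRequest{
		Service: "temporal.api.workflowservice.v1.WorkflowService", // assumed service name
	})
	log.Printf("health status: %v, err: %v", resp.GetStatus(), err)
}
```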
Specifications