UNAVAILABLE is not a fatal error #1901

Merged: 1 commit merged from nanonyme/remote-execution into apache:master on Apr 16, 2024

Conversation

@nanonyme (Contributor)

Related to #1845

@gtristan (Contributor)

This does not look like it will trigger the expected retries, as may be the case with 6205b41, referenced in #1845.

I'm also skeptical about the retry loop below; it appears we can end up in a situation where our REAPI call loops forever without any client-side timeout or limit on the number of retries (although that may be considered orthogonal to this).

@nanonyme (Contributor, Author)

I don't see how 6205b41 would retry in any case where this MR does not. So are you saying they are both wrong?

@nanonyme (Contributor, Author)

The change here is that this will return last_operation instead of throwing an exception on UNAVAILABLE. last_operation may be None, in which case the while loop exits.
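
For reference, a simplified sketch of the loop in question (this is not the actual BuildStream code; the helper names and the proto import path are illustrative):

```python
import grpc

# Illustrative import path for the generated REAPI protos; the real path
# depends on how the protos are generated in the project.
from build.bazel.remote.execution.v2 import remote_execution_pb2


def _consume_stream(response_stream):
    """Consume an Execute()/WaitExecution() response stream.

    Returns the last Operation seen, or None if the stream failed with
    UNAVAILABLE before yielding anything. With this patch, UNAVAILABLE is
    swallowed instead of being raised as a fatal error.
    """
    last_operation = None
    try:
        for operation in response_stream:
            last_operation = operation
    except grpc.RpcError as e:
        if e.code() != grpc.StatusCode.UNAVAILABLE:
            raise
    return last_operation


def run_remote_execution(stub, execute_request):
    last_operation = _consume_stream(stub.Execute(execute_request))

    # Resume the operation stream until the operation completes. If an
    # UNAVAILABLE error left last_operation as None, the loop simply exits.
    while last_operation is not None and not last_operation.done:
        wait_request = remote_execution_pb2.WaitExecutionRequest(
            name=last_operation.name
        )
        last_operation = _consume_stream(stub.WaitExecution(wait_request))

    return last_operation
```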

@gtristan (Contributor)

The change here is that this will return last_operation instead of throwing an exception on UNAVAILABLE. last_operation may be None, in which case the while loop exits.

Ah you are correct, my bad.

Instead, it appears @abderrahim's patch 6205b41 is redundant inasmuch as it adds an explicit return where one should not be required.

@gtristan (Contributor)

Currently our REAPI testing is very limited: we only test against a specific version of buildgrid, and we don't have anything resembling fault injection (i.e. we are not simulating heavy-load situations in the REAPI services in order to observe how buildstream deals with quirky situations).

In this light, I would be very happy to merge this based on someone at least manually testing their build and confirming that it improves things without introducing obvious problems (but unfortunately: #1845 (comment)).

Other reviewers are welcome to green-light this patch if they have high confidence based only on the API and specifications.

@nanonyme (Contributor, Author) commented Mar 19, 2024

Yeah, gRPC failures do indeed mostly show up in freedesktop-sdk testing, and it's not just RE. My impression from the issue @abderrahim filed is that this is a critical bug which renders BuildStream 2's REAPI support completely unusable for any non-trivial usage.

@juergbi (Contributor) commented Mar 19, 2024

With a reliable network connection to an available RE server, the client should never get the UNAVAILABLE error, even with non-trivial usage. However, we should indeed retry on UNAVAILABLE to improve operation when a server is temporarily unavailable for any reason.

The retry logic is far from optimal, with and without this patch. E.g., there is no retry if the first response of Execute() or WaitExecution() fails. And there is no retry delay with backoff or a maximum number of attempts. If the initial Execute() response is received but there is an error later on, it tries to immediately resume the operation stream by calling WaitExecution(). If that fails, it will give up (as last_operation will be None). If there is a temporary network or server issue, it is likely that the immediate retry fails as well.

That said, even the immediate retry on UNAVAILABLE likely helps in some cases. The only thing I can think of where this patch makes anything worse is that it might not report the "Failed contacting remote execution server" error in cases where UNAVAILABLE is effectively fatal (i.e., either it already failed before the first response of Execute(), in which case there will be no retry, or the immediate retry fails as well). If last_operation is None, we should still raise that error, at least as long as we don't retry in that case.
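
For illustration, a bounded retry with backoff along the lines described above could look roughly like this (a sketch only; the limits are made up and none of this is part of the patch under review):

```python
import time

import grpc

# Illustrative limits, not taken from the patch or from BuildStream.
MAX_ATTEMPTS = 5
INITIAL_DELAY = 0.5  # seconds, doubled after each failed attempt


def consume_with_backoff(make_stream):
    """Retry a streaming REAPI call on UNAVAILABLE with capped backoff.

    make_stream is a zero-argument callable that (re)issues the Execute()
    or WaitExecution() call and returns the response iterator. Once an
    operation name is known, it should resume via WaitExecution() rather
    than re-running Execute().
    """
    delay = INITIAL_DELAY
    last_operation = None
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            for operation in make_stream():
                last_operation = operation
            return last_operation
        except grpc.RpcError as e:
            if e.code() != grpc.StatusCode.UNAVAILABLE or attempt == MAX_ATTEMPTS:
                raise
            # Back off before retrying instead of giving up immediately.
            time.sleep(delay)
            delay *= 2
```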

@nanonyme (Contributor, Author)

There is no such thing as a "reliable connection" when you start scaling gRPC. This has been proven over and over again with freedesktop-sdk: things seem to work fine for a while, then you add a dozen BuildStream clients connecting to the same server and suddenly they don't. This is evident from the fixes to both BuildStream 1 and BuildStream 2 for this UNAVAILABLE case; it has already needed fixing for the artifact and source cache code paths. REAPI is apparently just more or less unused, so it hasn't yet been brought to the production-ready state the other use cases have.

I do agree the patch is primitive and most likely won't cover all cases. Considering UNAVAILABLE a fatal error (as the original code did) is simply wrong, though.

@nanonyme (Contributor, Author)

That said, even the immediate retry on UNAVAILABLE likely helps in some cases. The only thing I can think of where this patch makes anything worse is that it might not report the "Failed contacting remote execution server" error in cases where UNAVAILABLE is effectively fatal (i.e., either it already failed before the first response of Execute(), in which case there will be no retry, or the immediate retry fails as well). If last_operation is None, we should still raise that error, at least as long as we don't retry in that case.

Can't we detect that it failed before the first Execute() from the fact that running_operation is None?
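
In other words, something like the following check (a sketch; the function name and exception type are placeholders for whatever BuildStream actually uses, and the variable naming follows the discussion rather than the code):

```python
def ensure_operation(last_operation):
    # If the stream failed before Execute() produced a response, or the
    # immediate WaitExecution() retry failed as well, there is nothing to
    # resume, so surface the error instead of exiting the loop silently.
    if last_operation is None:
        raise RuntimeError("Failed contacting remote execution server")
    return last_operation
```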

@abderrahim force-pushed the nanonyme/remote-execution branch from 472f90f to c80d9d5 on April 4, 2024 at 13:29
@abderrahim (Contributor) left a comment

Since this is equivalent to 6205b41, and that one has improved things, let's merge this. Even if it's not ideal, it should be an improvement.

@abderrahim merged commit 0bd7059 into apache:master on Apr 16, 2024
17 checks passed