UNAVAILABLE is not a fatal error #1901

Merged: 1 commit merged from nanonyme/remote-execution into apache:master on Apr 16, 2024

Conversation

@nanonyme (Contributor)

Related to #1845

@gtristan (Contributor)

This does not look like it will trigger the expected retries, as may be the case with 6205b41, referenced in #1845.

I'm also skeptical about the retry loop below; it appears we can end up in a situation where our REAPI call loops forever without any client-side timeout or limit on the number of retries (although that may be considered orthogonal to this).

@nanonyme (Contributor, Author)

I don't see how 6205b41 would retry in any case where this MR does not. So are you saying they are both wrong?

@nanonyme (Contributor, Author)

The change here is that this will return last_operation instead of throwing an exception on UNAVAILABLE. last_operation may be None, in which case the while loop exits.
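
For reference, a simplified sketch of the loop in question (this is not the actual BuildStream code; the helper names and the proto import path are illustrative):

```python
import grpc

# Illustrative import path for the generated REAPI protos; the real path
# depends on how the protos are generated in the project.
from build.bazel.remote.execution.v2 import remote_execution_pb2


def _consume_stream(response_stream):
    """Consume an Execute()/WaitExecution() response stream.

    Returns the last Operation seen, or None if the stream failed with
    UNAVAILABLE before yielding anything. With this patch, UNAVAILABLE is
    swallowed instead of being raised as a fatal error.
    """
    last_operation = None
    try:
        for operation in response_stream:
            last_operation = operation
    except grpc.RpcError as e:
        if e.code() != grpc.StatusCode.UNAVAILABLE:
            raise
    return last_operation


def run_remote_execution(stub, execute_request):
    last_operation = _consume_stream(stub.Execute(execute_request))

    # Resume the operation stream until the operation completes. If an
    # UNAVAILABLE error left last_operation as None, the loop simply exits.
    while last_operation is not None and not last_operation.done:
        wait_request = remote_execution_pb2.WaitExecutionRequest(
            name=last_operation.name
        )
        last_operation = _consume_stream(stub.WaitExecution(wait_request))

    return last_operation
```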

@gtristan (Contributor)

The change here is that this will return last_operation instead of throwing an exception on UNAVAILABLE. last_operation may be None, in which case the while loop exits.

Ah you are correct, my bad.

Instead, it appears @abderrahim's patch 6205b41 is redundant inasmuch as it adds an explicit return where one should not be required.

@gtristan (Contributor)

Currently our REAPI testing is very limited: we only test against a specific version of buildgrid, and we don't have anything resembling fault injection (i.e. we are not simulating heavy-load situations in the REAPI services in order to observe how buildstream deals with quirky situations).

In this light, I would be very happy to merge this based on someone at least manually testing their build and confirming that it improves things without introducing obvious problems (but unfortunately: #1845 (comment)).

Other reviewers are welcome to green-light this patch if they have high confidence based only on the API and specifications.

@nanonyme (Contributor, Author) commented Mar 19, 2024

Yeah, gRPC failures do indeed mostly show up in freedesktop-sdk testing, and it's not just RE. My impression from the issue @abderrahim filed is that this is a critical bug which renders BuildStream 2's REAPI support completely unusable for any non-trivial usage.

@juergbi (Contributor) commented Mar 19, 2024

With a reliable network connection to an available RE server, the client should never get the UNAVAILABLE error, even with non-trivial usage. However, we should indeed retry on UNAVAILABLE to improve operation when a server is temporarily unavailable for any reason.

The retry logic is far from optimal, with and without this patch. E.g., there is no retry if the first response of Execute() or WaitExecution() fails. And there is no retry delay with backoff or a maximum number of attempts. If the initial Execute() response is received but there is an error later on, it tries to immediately resume the operation stream by calling WaitExecution(). If that fails, it will give up (as last_operation will be None). If there is a temporary network or server issue, it is likely that the immediate retry fails as well.

That said, even the immediate retry on UNAVAILABLE likely helps in some cases. The only thing I can think of where this patch makes anything worse is that it might not report the "Failed contacting remote execution server" error in cases where UNAVAILABLE is effectively fatal (i.e., either it already failed before the first response of Execute(), in which case there will be no retry, or the immediate retry fails as well). If last_operation is None, we should still raise that error, at least as long as we don't retry in that case.
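
For illustration, a bounded retry with backoff along the lines described above could look roughly like this (a sketch only; the limits are made up and none of this is part of the patch under review):

```python
import time

import grpc

# Illustrative limits, not taken from the patch or from BuildStream.
MAX_ATTEMPTS = 5
INITIAL_DELAY = 0.5  # seconds, doubled after each failed attempt


def consume_with_backoff(make_stream):
    """Retry a streaming REAPI call on UNAVAILABLE with capped backoff.

    make_stream is a zero-argument callable that (re)issues the Execute()
    or WaitExecution() call and returns the response iterator. Once an
    operation name is known, it should resume via WaitExecution() rather
    than re-running Execute().
    """
    delay = INITIAL_DELAY
    last_operation = None
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            for operation in make_stream():
                last_operation = operation
            return last_operation
        except grpc.RpcError as e:
            if e.code() != grpc.StatusCode.UNAVAILABLE or attempt == MAX_ATTEMPTS:
                raise
            # Back off before retrying instead of giving up immediately.
            time.sleep(delay)
            delay *= 2
```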

@nanonyme (Contributor, Author)

There is no such thing as a "reliable connection" when you start scaling gRPC. This has been proven over and over again with freedesktop-sdk: things seem to work fine for a while, then you add a dozen BuildStream clients connecting to the same server and suddenly they don't. This is evident from the fixes to both BuildStream 1 and BuildStream 2 for this UNAVAILABLE case; it has already needed fixing for the artifact and source cache code paths. REAPI is apparently just more or less unused, so it hasn't yet been brought to the production-ready state the other use cases have.

I do agree the patch is primitive and most likely won't cover all cases. Considering UNAVAILABLE a fatal error (as the original code did) is simply wrong, though.

@nanonyme (Contributor, Author)

That said, even the immediate retry on UNAVAILABLE likely helps in some cases. The only thing I can think of where this patch makes anything worse is that it might not report the "Failed contacting remote execution server" error in cases where UNAVAILABLE is effectively fatal (i.e., either it already failed before the first response of Execute(), in which case there will be no retry, or the immediate retry fails as well). If last_operation is None, we should still raise that error, at least as long as we don't retry in that case.

Can't we detect that it failed before the first Execute() from the fact that running_operation is None?
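
In other words, something like the following check (a sketch; the function name and exception type are placeholders for whatever BuildStream actually uses, and the variable naming follows the discussion rather than the code):

```python
def ensure_operation(last_operation):
    # If the stream failed before Execute() produced a response, or the
    # immediate WaitExecution() retry failed as well, there is nothing to
    # resume, so surface the error instead of exiting the loop silently.
    if last_operation is None:
        raise RuntimeError("Failed contacting remote execution server")
    return last_operation
```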

@abderrahim force-pushed the nanonyme/remote-execution branch from 472f90f to c80d9d5 on April 4, 2024 at 13:29
@abderrahim (Contributor) left a comment

Since this is equivalent to 6205b41, and that one has improved things, let's merge this. Even if it's not ideal, it should be an improvement.

@abderrahim merged commit 0bd7059 into apache:master on Apr 16, 2024
17 checks passed