-
Notifications
You must be signed in to change notification settings - Fork 30
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
UNAVAILABLE is not a fatal error #1901
Conversation
This does not look like it will trigger the expected retries, as may be the case in 6205b41 as referenced in #1845. I'm also skeptical about the retry loop below, it appears that we can encounter a situation where our REAPI call is going to loop forever without any client side timeout or limit of number of retries (however that may be considered orthogonaly to this). |
I don't see how 6205b41 would retry in any case where this MR does not. So are you saying they are both wrong? |
The change here is that this will return last_operation instead of throwing exception upon UNAVAILABLE. last_operation may be None in which case the while loop exits. |
Ah you are correct, my bad. Instead it appears @abderrahim's patch 6205b41 is redundant inasmuch as it adds an explicit return where one should not be required. |
Currently our REAPI testing is very limited, we only test against buildgrid of a specific version, and we don't have anything resembling fault injection (i.e. we are not simulating heavy load situations in the REAPI services in order to observe how buildstream deals with quirky situations). In this light, I would be very happy to merge this based on someone at least manually testing their build and confirming this improves things without introducing obvious problems (but unfortunately: #1845 (comment)) Other reviewers can please feel free to green light this patch if they have high confidence based only on the API and specifications. |
Yeah, typically grpc failures do indeed mostly show up from freedesktop-sdk testing, it's not just RE. My impression from issue @abderrahim filed is that it's a critical bug which renders BuildStream 2 REAPI completely unusable with any non-trivial usage. |
With a reliable network connection to an available RE server, the client should never get the UNAVAILABLE error, even with non-trivial usage. However, we should indeed retry on UNAVAILABLE to improve operation when a server is temporarily unavailable for any reason. The retry logic is far from optimal, with and without this patch. E.g., there is no retry if the first response of That said, even the immediate retry on UNAVAILABLE likely helps in some cases. The only thing, that I can think of, where this patch makes anything worse might be to not report the "Failed contacting remote execution server" in case of UNAVAILABLE being fatal (i.e., either it failed already before the first response of |
There is no such thing as "reliable connection" when you start scaling grpc. This has been proved over and over again with freedesktop-sdk. Things seem to work fine for a while, then you add a dozen BuildStream clients connecting to same server and suddenly they don't. This is evident by fixes both to BuildStream 1 and BuildStream 2 for this UNAVAILABLE case, it has needed fixing already into artifact and source cache cases. REAPI is just apparently more or less unused so it hasn't been fixed into good enough state for production like the other use cases yet. |
Can't we detect that it failed before first execute that running_operation is None? |
472f90f
to
c80d9d5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this is equivalent to 6205b41, and that one has improved things let's merge this. Even if it's not ideal, it should be an improvement.
Related to #1845