-
Notifications
You must be signed in to change notification settings - Fork 99
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Log errors for failed requests #7296
Conversation
Signed-off-by: karishma-chawla <[email protected]>
Radius functional test overview
Click here to see the list of tools in the current test run
Test Status⌛ Building Radius and pushing container images for functional tests... |
Can you explain this in more detail, especially the part about errors being propagated vs not propagated? Are there cases we return a failure from an async operation and it's not logged? /cc @youngbupark |
Async errors aren't returned. I had linked the documentation https://github.com/radius-project/radius/blob/main/docs/contributing/contributing-code/contributing-code-control-plane/logging.md?plain=1#L45. It should be updated if it isn't providing clear understanding.
Yes. Current implementation expects separate explicit error logging if an async operation fails, which is the reason why log statements like these are spread across our code base. Current approach is unreliable because it clearly has left us with logging gaps, and this is not the first time it has prevented us from debugging, past example here. I'm sure @nithyatsu and @youngbupark can add more historical context on this choice of implementation. |
I'm trying to figure out whether we've got the wrong design or whether this is "just" a bug 🐛 FWIW, I wrote the guidance on logging - so you can rest the blame on me if this is the wrong idea. I'd be happy to spend some time discussing and coming up with better guidance. I've found that error handling in Go is usually the most challenging thing for people to get right. In the case we're talking about, I'd call this region of the code the place "where the error is handled". Is the problem that there are two ways to represent a failure?
I'm pretty nervous about relying on "constructors" for any kind of important logic. Go is more like C than it is Java, there's no requirement to use a "constructor" and lots of alternative ways to initialize a struct. |
Does the region of code you linked handle this scenario as well where the error returned from the controller is nil with a failed result?
The problem is that the errors aren't getting logged. There is no way to get the stack trace for debugging. If the root cause identified here is not correct, then this needs further investigation on why we are dropping logs for certain requests. |
This log is also odd https://github.com/radius-project/radius/blob/main/pkg/armrpc/asyncoperation/worker/worker.go#L260. Why is it always hardcoding |
Logging error for that case is important to understand server-side issue because retry failure could be caused by server side problem. That's why we not only return the error to client but also record the error on server-side log. |
It is intended.
logger.Info("Operation returned", "success", result.Error == nil, "code", code, "provisioningState", result.ProvisioningState(), "err", err) If the operation is succeeded, result.Error must be nil. When logger processes this, it will be represented as JSON format log like below. The arguments from
|
@@ -266,8 +265,7 @@ func (w *AsyncRequestProcessWorker) runOperation(ctx context.Context, message *q | |||
if !errors.Is(asyncReqCtx.Err(), context.Canceled) { | |||
if err != nil { | |||
armErr := extractError(err) | |||
result.SetFailed(armErr, false) | |||
logger.Error(err, "Operation Failed") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is required for server side debugging. If we remove this, it is hard to debug only with server log. e.g. when we debug the problem with server log, it is hard to know which step has failed because there is no indication. Operation Failed will be the indicator at which step has been failed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not being removed in this PR, but being abstracted out as I also had explained in the PR description.
@@ -40,20 +44,24 @@ func NewCanceledResult(message string) Result { | |||
} | |||
|
|||
// NewFailedResult creates a new Result object with the given error details and sets the failed flag to true. | |||
func NewFailedResult(err v1.ErrorDetails) Result { | |||
func NewFailedResult(ctx context.Context, err error, errorDetails v1.ErrorDetails) Result { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the benefit to have err arg here? The initial idea is that we want to have a detail error description specified by Controller instead of just golang error. So we can deliver the detail error to clients.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can see how it is consumed by this function to understand the reasoning behind adding it in parameters here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not think printing out error in this helper is the right approach. If we add this approach, then we need to add the similar code to cancelled case as well. We might not want to have the same logging code here and there. I think we need to first figure it out the root cause about why error is not logged in main worker loop before making this change.
JFYI, all these helpers in this package are to create ctrl.Result
with ARM compatible error like we use rest.Response
in frontend controller. In the design of controllers, we want to have the consistent controller interfaces between frontend and backend controllers. In short, when we want to return the detail error response, we can return rest.Response
for frontend and ctrl.Result
for backend. If the error is internal error or hard to be categorized, controller can return nil, err
. (err is go error), which is not recommended. Let's chat offline if you need more info here.
- Frontend controller interface
radius/pkg/armrpc/frontend/controller/controller.go
Lines 103 to 106 in c77088f
type Controller interface { | |
// Run executes the operation. | |
Run(ctx context.Context, w http.ResponseWriter, req *http.Request) (rest.Response, error) | |
} |
- Backend controller interface
Run(ctx context.Context, request *Request) (Result, error) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not think printing out error in this helper is the right approach. If we add this approach, then we need to add the similar code to cancelled case as well. We might not want to have the same logging code here and there.
I do think this should be added to cancelled requests, the same problem exists there. I intentionally left it out of the first iteration of this PR to agree on the approach and then extend it to cancelled.
I think we need to first figure it out the root cause about why error is not logged in main worker loop before making this change.
I did call out the why in other comments. Here is one problem I called out #7296 (comment), another here: #7296 (comment). Can you provide some context on why the original design was implemented expecting each controller to log the error?
all these helpers in this package are to create ctrl.Result with ARM compatible error like we use rest.Response in frontend controller. In the design of controllers, we want to have the consistent controller interfaces between frontend and backend controllers. In short, when we want to return the detail error response, we can return rest.Response for frontend and ctrl.Result for backend. If the error is internal error or hard to be categorized, controller can return nil, err. (err is go error), which is not recommended.
I understand that. This PR is not changing controller interfaces or the result, it is purely abstracting out logging into where the result is set so that it is guaranteed to be logged. If NewFailedResult/SetFailed is called, there should be a corresponding error that is causing it to get to this state. If the same goal can be achieved by moving the logging to async worker, that's fine by me, as long as we have a shared component/piece of code that guarantees logging, and isn't leaving it to human memory.
Let's use the next design meeting to understand and figure out the path forward on this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the error was coming from the following code in the original issue,
return "", fmt.Errorf("could not find API version for type %q, type was not found", id.Type()) |
All errors related to bicep should be caught in the following code first and logged before reaching worker's error handling code.
radius/pkg/recipes/driver/bicep.go
Lines 212 to 226 in c77088f
err = d.ResourceClient.Delete(ctx, id) | |
if err != nil { | |
if attempt <= d.options.DeleteRetryCount { | |
logger.V(ucplog.LevelInfo).Error(err, "attempt failed", "delay", d.options.DeleteRetryDelaySeconds) | |
time.Sleep(time.Duration(d.options.DeleteRetryDelaySeconds) * time.Second) | |
continue | |
} | |
return recipes.NewRecipeError(recipes.RecipeDeletionFailed, err.Error(), "", recipes.GetErrorDetails(err)) | |
} | |
// If the err is nil, then the resource is deleted successfully | |
logger.V(ucplog.LevelInfo).Info(fmt.Sprintf("Deleted output resource: %q", id)) | |
return nil | |
} |
But I could not find this error logs in the log files. It means that the request wasn't processed by worker.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the test that failed doesn't use a recipe and won't go inside the driver code, which is why the log above isn't present. Looking through the logs and all the resources that the failed test creates, it failed during deletion of the container resource and the error in the test failure is likely coming from here. I'm doing further investigation to see why the error wasn't logged.
The trace id that I see in the logs for deletion is different from the trace id in the functional test failure logs - so that's another area which needs further digging.
{"severity":"info","timestamp":"2024-03-04T21:30:59.433Z","name":"radius.radiusasyncworker","caller":"deployment/deploymentprocessor.go:341","message":"Deleting output resource: LocalID: KubernetesRole, resource type: \"Provider: kubernetes, Type: rbac.authorization.k8s.io/Role\"\n","serviceName":"radius","version":"edge","hostName":"applications-rp-7b774d7494-k6dv2","resourceId":"/planes/radius/local/resourcegroups/kind-radius/providers/Applications.Core/containers/dapr-sts-manual-ctnr","operationId":"489b1b1a-36dc-4219-bbd5-48ef3d23e90d","operationType":"APPLICATIONS.CORE/CONTAINERS|DELETE","dequeueCount":1,"traceId":"6cde20fc5dcbe244c35799b5ff94fea6","spanId":"35daee63ef2a07aa"}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return r | ||
} | ||
|
||
// SetFailed sets the error response with Failed status. | ||
func (r *Result) SetFailed(err v1.ErrorDetails, requeue bool) { | ||
func (r *Result) SetFailed(ctx context.Context, err error, errorDetails v1.ErrorDetails, requeue bool) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is being used on L57.
Well that's the problem I'm fixing in this PR. Errors aren't being logged today preventing us from being able to effectively debug issues. We have seen many instances of this, here is an example for your reference #7264 |
The problem with this code that I'm highlighting is that this log isn't exercised only for success scenarios, the current implementation updates the value for |
@@ -40,20 +44,24 @@ func NewCanceledResult(message string) Result { | |||
} | |||
|
|||
// NewFailedResult creates a new Result object with the given error details and sets the failed flag to true. | |||
func NewFailedResult(err v1.ErrorDetails) Result { | |||
func NewFailedResult(ctx context.Context, err error, errorDetails v1.ErrorDetails) Result { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not think printing out error in this helper is the right approach. If we add this approach, then we need to add the similar code to cancelled case as well. We might not want to have the same logging code here and there. I think we need to first figure it out the root cause about why error is not logged in main worker loop before making this change.
JFYI, all these helpers in this package are to create ctrl.Result
with ARM compatible error like we use rest.Response
in frontend controller. In the design of controllers, we want to have the consistent controller interfaces between frontend and backend controllers. In short, when we want to return the detail error response, we can return rest.Response
for frontend and ctrl.Result
for backend. If the error is internal error or hard to be categorized, controller can return nil, err
. (err is go error), which is not recommended. Let's chat offline if you need more info here.
- Frontend controller interface
radius/pkg/armrpc/frontend/controller/controller.go
Lines 103 to 106 in c77088f
type Controller interface { | |
// Run executes the operation. | |
Run(ctx context.Context, w http.ResponseWriter, req *http.Request) (rest.Response, error) | |
} |
- Backend controller interface
Run(ctx context.Context, request *Request) (Result, error) |
func (r *Result) SetFailed(err v1.ErrorDetails, requeue bool) { | ||
func (r *Result) SetFailed(ctx context.Context, err error, errorDetails v1.ErrorDetails, requeue bool) { | ||
logger := ucplog.FromContextOrDiscard(ctx) | ||
logger.Error(err, fmt.Sprintf("Operation Failed: %s", errorDetails.Message)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I mentioned above, if we want to log out for the failure, cancellation is also failure. we need to put the same code there. We might not want to have the same code here and there. So we have logged the failure or status in main worker loop. I think we first need to figure it out why the current logging is not working as expected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we first need to figure it out why the current logging is not working as expected.
Some details on why it is not working is here: #7297 (comment). Current design expects each controller to individually log an error, which isn't reliable as it requires specific knowledge and clearly hasn't worked well with misses in multiple places. The approach in this PR is one way to fix it. How would you recommend fixing this gap? If the root cause is something else according to your understanding, then please add details on that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we first need to figure it out why the current logging is not working as expected.
Some details on why it is not working is here: #7297 (comment). Current design expects each controller to individually log an error, which isn't reliable as it requires specific knowledge and clearly hasn't worked well with misses in multiple places.
To be clear, we have logged error individually in the main worker loop, not in controller. When controller creates Result, it needs to be expected to return Result and error. then the following code must log out the error. This the centralized logging part for all controllers.
radius/pkg/armrpc/asyncoperation/worker/worker.go
Lines 253 to 260 in c77088f
result, err := asyncCtrl.Run(asyncReqCtx, asyncReq) | |
code := "" | |
if result.Error != nil { | |
code = result.Error.Code | |
} | |
logger.Info("Operation returned", "success", result.Error == nil, "code", code, "provisioningState", result.ProvisioningState(), "err", err) |
The approach in this PR is one way to fix it. How would you recommend fixing this gap? If the root cause is something else according to your understanding, then please add details on that.
I think the PR is the workaround. We first need to understand the root cause and then have the right fix here. in this comments - #7297 (comment), we might not be able to repro the same issue.
I do not know what happens. but it might have numerous reasons. For example, RP could crash before logging. Another hypothesis is that when workflow step collects the logs, the log might be missing. Functional tests save the RP logs by using kubectl logs, which is not reliable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@youngbupark Let me ask this to build a better understanding of the current state -
-
Can you share link to where the code is that logs the error caught here?
-
Is this log statement unnecessary https://github.com/radius-project/radius/blob/main/pkg/portableresources/backend/controller/createorupdateresource.go#L100?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@youngbupark Let me ask this to build a better understanding of the current state -
- Can you share link to where the code is that logs the error caught here?
This is a backend controller. asyncCtrl.Run calls func (c *DeleteResource[P, T]) Run(ctx context.Context, request *ctrl.Request) (ctrl.Result, error) {
.
result, err := asyncCtrl.Run(asyncReqCtx, asyncReq) |
For additional code flow, I recommend to debug the code with the debugger.
- Is this log statement unnecessary https://github.com/radius-project/radius/blob/main/pkg/portableresources/backend/controller/createorupdateresource.go#L100?
This is a server-side code. It may or may not. This is the error logging in controller. If controller author thinks it is useful with the better message and print out the error earlier, they can add it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@youngbupark Let me ask this to build a better understanding of the current state -
- Can you share link to where the code is that logs the error caught here?
This is a backend controller. asyncCtrl.Run calls
func (c *DeleteResource[P, T]) Run(ctx context.Context, request *ctrl.Request) (ctrl.Result, error) {
.
result, err := asyncCtrl.Run(asyncReqCtx, asyncReq) For additional code flow, I recommend to debug the code with the debugger.
I'm aware of where it is called from. I'm asking where is the error logged for the scenario I linked above.
- Is this log statement unnecessary https://github.com/radius-project/radius/blob/main/pkg/portableresources/backend/controller/createorupdateresource.go#L100?
This is a server-side code. It may or may not. This is the error logging in controller. If controller author thinks it is useful with the better message and print out the error earlier, they can add it.
My question is - is this redundant?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@youngbupark Let me ask this to build a better understanding of the current state -
- Can you share link to where the code is that logs the error caught here?
This is a backend controller. asyncCtrl.Run calls
func (c *DeleteResource[P, T]) Run(ctx context.Context, request *ctrl.Request) (ctrl.Result, error) {
.
result, err := asyncCtrl.Run(asyncReqCtx, asyncReq) For additional code flow, I recommend to debug the code with the debugger.
I'm aware of where it is called from. I'm asking where is the error logged for the scenario I linked above.
There are two places:
logger.Info("Operation returned", "success", result.Error == nil, "code", code, "provisioningState", result.ProvisioningState(), "err", err) |
logger.Error(err, "Operation Failed") |
And I strongly recommend to debug the code flow to understand the details.
- Is this log statement unnecessary https://github.com/radius-project/radius/blob/main/pkg/portableresources/backend/controller/createorupdateresource.go#L100?
This is a server-side code. It may or may not. This is the error logging in controller. If controller author thinks it is useful with the better message and print out the error earlier, they can add it.
My question is - is this redundant?
As I said, "It may or may not". In general, If the log is useful for debugging the server, then it should not be redundant. From what I read, it is more readable than just printing out pure go error. I do not think it is redundant. To understand why, I would reach out to yetkin who contribute to this code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we are logging the same error twice, then that is redundant imo. It was added originally because the error wasn't getting logged anywhere else (same problem that this PR is trying to solve).
Then I would change this code only to use logger.Error for error case and use logger.Info for normal scenario. |
I'm not following this. Are you agreeing that there is a problem with the code I linked above? |
I do not have the strong opinion to use logger.Error and logger.Info separately or not here. If you think using the different logger helper method is right for error case, it might be good to add logger.Error to this code instead of changing the other codes including NewFailedResult helper. BTW, I dug it out the original issue, please see my comment - #7297 (comment) |
Ok I will look into fixing it in async worker. Will sync up with you offline on these questions I have for my clarity #7296 (comment) |
There are multiple problems that clearly need further investigation (missing trace id, logs) so closing this PR at this point. I'll look into the next steps and either create follow up issues or PR. |
func (r *Result) SetFailed(err v1.ErrorDetails, requeue bool) { | ||
func (r *Result) SetFailed(ctx context.Context, err error, errorDetails v1.ErrorDetails, requeue bool) { | ||
logger := ucplog.FromContextOrDiscard(ctx) | ||
logger.Error(err, fmt.Sprintf("Operation Failed: %s", errorDetails.Message)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another downside to add logging here is that whenever backend controller calls this method, it will log out. Basically, backend controller devs wants to print out log when the error is ready. All these helpers is to construct and set the failure reason, not print out the log. adding logging here doesn't seem right..
Description
We currently lack error logging upon failure for many operations which hinders our ability to debug failures. This gap exists because errors that aren't propagated back up (async operations) aren't logged automatically. The current state of the code is inconsistent, with some operations logging errors while others do not.
To address this, the proposed change is to move error logging to the setFailed method. By doing so, we ensure that error logging is not reliant on every individual operation to log errors.
Type of change
Fixes: #7297