Log errors for failed requests #7296

kachawla · 2024-03-09T02:36:12Z

Description

We currently lack error logging upon failure for many operations which hinders our ability to debug failures. This gap exists because errors that aren't propagated back up (async operations) aren't logged automatically. The current state of the code is inconsistent, with some operations logging errors while others do not.

To address this, the proposed change is to move error logging to the setFailed method. By doing so, we ensure that error logging is not reliant on every individual operation to log errors.

Type of change

This pull request fixes a bug in Radius and has an approved issue (issue link required).

Fixes: #7297

Signed-off-by: karishma-chawla <[email protected]>

radius-functional-tests · 2024-03-09T02:39:43Z

Radius functional test overview

🔍 Go to test action run

Name	Value
Repository	kachawla/radius
Commit ref	`84d4ca3`
Unique ID	5473055b59
Image tag	pr-5473055b59

Click here to see the list of tools in the current test run

gotestsum 1.10.0
KinD: v0.20.0
Dapr: 1.12.0
Azure KeyVault CSI driver: 1.4.2
Azure Workload identity webhook: 1.1.0
Bicep recipe location ghcr.io/radius-project/dev/test/functional/shared/recipes/<name>:pr-5473055b59
Terraform recipe location http://tf-module-server.radius-test-tf-module-server.svc.cluster.local/<name>.zip (in cluster)
applications-rp test image location: ghcr.io/radius-project/dev/applications-rp:pr-5473055b59
controller test image location: ghcr.io/radius-project/dev/controller:pr-5473055b59
ucp test image location: ghcr.io/radius-project/dev/ucpd:pr-5473055b59
deployment-engine test image location: ghcr.io/radius-project/deployment-engine:latest

Test Status

⌛ Building Radius and pushing container images for functional tests...
✅ Container images build succeeded
⌛ Publishing Bicep Recipes for functional tests...
✅ Recipe publishing succeeded
⌛ Starting kubernetes functional tests...
⌛ Starting msgrp functional tests...
⌛ Starting daprrp functional tests...
⌛ Starting shared functional tests...
⌛ Starting samples functional tests...
⌛ Starting cli functional tests...
⌛ Starting ucp functional tests...
⌛ Starting datastoresrp functional tests...
✅ kubernetes functional tests succeeded
✅ msgrp functional tests succeeded
✅ ucp functional tests succeeded
✅ daprrp functional tests succeeded
✅ samples functional tests succeeded
✅ cli functional tests succeeded
✅ shared functional tests succeeded
✅ datastoresrp functional tests succeeded

rynowak · 2024-03-09T21:46:25Z

This gap exists because errors that aren't propagated back up (async operations) aren't logged automatically. The current state of the code is inconsistent, with some operations logging errors while others do not.

Can you explain this in more detail, especially the part about errors being propagated vs not propagated?

Are there cases we return a failure from an async operation and it's not logged?

/cc @youngbupark

kachawla · 2024-03-10T21:14:56Z

This gap exists because errors that aren't propagated back up (async operations) aren't logged automatically. The current state of the code is inconsistent, with some operations logging errors while others do not.

Can you explain this in more detail, especially the part about errors being propagated vs not propagated?

Async errors aren't returned. I had linked the documentation https://github.com/radius-project/radius/blob/main/docs/contributing/contributing-code/contributing-code-control-plane/logging.md?plain=1#L45. It should be updated if it isn't providing clear understanding.

Are there cases we return a failure from an async operation and it's not logged?

Yes. Current implementation expects separate explicit error logging if an async operation fails, which is the reason why log statements like these are spread across our code base. Current approach is unreliable because it clearly has left us with logging gaps, and this is not the first time it has prevented us from debugging, past example here. I'm sure @nithyatsu and @youngbupark can add more historical context on this choice of implementation.

rynowak · 2024-03-10T21:34:22Z

I'm trying to figure out whether we've got the wrong design or whether this is "just" a bug 🐛 FWIW, I wrote the guidance on logging - so you can rest the blame on me if this is the wrong idea. I'd be happy to spend some time discussing and coming up with better guidance. I've found that error handling in Go is usually the most challenging thing for people to get right.

In the case we're talking about, I'd call this region of the code the place "where the error is handled".

Is the problem that there are two ways to represent a failure?

A go error.
A controller.Result with a failure status?

I'm pretty nervous about relying on "constructors" for any kind of important logic. Go is more like C than it is Java, there's no requirement to use a "constructor" and lots of alternative ways to initialize a struct.

kachawla · 2024-03-10T21:51:07Z

I'm trying to figure out whether we've got the wrong design or whether this is "just" a bug 🐛 FWIW, I wrote the guidance on logging - so you can rest the blame on me if this is the wrong idea. I'd be happy to spend some time discussing and coming up with better guidance. I've found that error handling in Go is usually the most challenging thing for people to get right.

In the case we're talking about, I'd call this region of the code the place "where the error is handled".

Does the region of code you linked handle this scenario as well where the error returned from the controller is nil with a failed result?

Is the problem that there are two ways to represent a failure?

The problem is that the errors aren't getting logged. There is no way to get the stack trace for debugging. If the root cause identified here is not correct, then this needs further investigation on why we are dropping logs for certain requests.

kachawla · 2024-03-11T16:18:15Z

I'm trying to figure out whether we've got the wrong design or whether this is "just" a bug 🐛 FWIW, I wrote the guidance on logging - so you can rest the blame on me if this is the wrong idea. I'd be happy to spend some time discussing and coming up with better guidance. I've found that error handling in Go is usually the most challenging thing for people to get right.
In the case we're talking about, I'd call this region of the code the place "where the error is handled".

Does the region of code you linked handle this scenario as well where the error returned from the controller is nil with a failed result?

Is the problem that there are two ways to represent a failure?

The problem is that the errors aren't getting logged. There is no way to get the stack trace for debugging. If the root cause identified here is not correct, then this needs further investigation on why we are dropping logs for certain requests.

This log is also odd https://github.com/radius-project/radius/blob/main/pkg/armrpc/asyncoperation/worker/worker.go#L260. Why is it always hardcoding result.Error to nil? @youngbupark do you have context on what was the motivation here or is it a bug? If it is a bug then I think updating it will be a better approach over what I'm doing in this PR.

youngbupark · 2024-03-12T20:43:44Z

This gap exists because errors that aren't propagated back up (async operations) aren't logged automatically. The current state of the code is inconsistent, with some operations logging errors while others do not.

Can you explain this in more detail, especially the part about errors being propagated vs not propagated?

Async errors aren't returned. I had linked the documentation https://github.com/radius-project/radius/blob/main/docs/contributing/contributing-code/contributing-code-control-plane/logging.md?plain=1#L45. It should be updated if it isn't providing clear understanding.

Are there cases we return a failure from an async operation and it's not logged?

Yes. Current implementation expects separate explicit error logging if an async operation fails, which is the reason why log statements like these are spread across our code base. Current approach is unreliable because it clearly has left us with logging gaps, and this is not the first time it has prevented us from debugging, past example here. I'm sure @nithyatsu and @youngbupark can add more historical context on this choice of implementation.

Logging error for that case is important to understand server-side issue because retry failure could be caused by server side problem. That's why we not only return the error to client but also record the error on server-side log.

youngbupark · 2024-03-12T20:58:22Z

I'm trying to figure out whether we've got the wrong design or whether this is "just" a bug 🐛 FWIW, I wrote the guidance on logging - so you can rest the blame on me if this is the wrong idea. I'd be happy to spend some time discussing and coming up with better guidance. I've found that error handling in Go is usually the most challenging thing for people to get right.
In the case we're talking about, I'd call this region of the code the place "where the error is handled".

Does the region of code you linked handle this scenario as well where the error returned from the controller is nil with a failed result?

Is the problem that there are two ways to represent a failure?

The problem is that the errors aren't getting logged. There is no way to get the stack trace for debugging. If the root cause identified here is not correct, then this needs further investigation on why we are dropping logs for certain requests.

This log is also odd https://github.com/radius-project/radius/blob/main/pkg/armrpc/asyncoperation/worker/worker.go#L260. Why is it always hardcoding result.Error to nil? @youngbupark do you have context on what was the motivation here or is it a bug? If it is a bug then I think updating it will be a better approach over what I'm doing in this PR.

It is intended.

radius/pkg/armrpc/asyncoperation/worker/worker.go

Line 260 in 02b7b20

    
           logger.Info("Operation returned", "success", result.Error == nil, "code", code, "provisioningState", result.ProvisioningState(), "err", err)

logger.Info("Operation returned", "success", result.Error == nil, "code", code, "provisioningState", result.ProvisioningState(), "err", err)

If the operation is succeeded, result.Error must be nil. When logger processes this, it will be represented as JSON format log like below. The arguments from "success" until err are treated as a log dimension or attribution. Having such attributions such as success/code/provisioningState help indexing for search backend.

{"severity":"info","timestamp":"2024-03-12T20:43:42.002Z","name":"radius.radiusasyncworker","caller":"worker/worker.go:260","message":"Operation returned","serviceName":"radius","version":"edge","hostName":"applications-rp-bf77896d9-rfnk7","resourceId":"/planes/radius/local/resourcegroups/kind-radius/providers/Applications.Core/containers/backendcontainerdns","operationId":"64d729f1-2936-48eb-aabe-9a5fcaee2626","operationType":"APPLICATIONS.CORE/CONTAINERS|PUT","dequeueCount":1,"traceId":"9684bed9d6f01c4db253ab973274dfd1","spanId":"a15c41cd88fea0e1","success":true,"code":"","provisioningState":"Succeeded","err":null}

youngbupark · 2024-03-12T21:04:42Z

pkg/armrpc/asyncoperation/worker/worker.go

@@ -266,8 +265,7 @@ func (w *AsyncRequestProcessWorker) runOperation(ctx context.Context, message *q
 		if !errors.Is(asyncReqCtx.Err(), context.Canceled) {
 			if err != nil {
 				armErr := extractError(err)
-				result.SetFailed(armErr, false)
-				logger.Error(err, "Operation Failed")


This is required for server side debugging. If we remove this, it is hard to debug only with server log. e.g. when we debug the problem with server log, it is hard to know which step has failed because there is no indication. Operation Failed will be the indicator at which step has been failed.

This is not being removed in this PR, but being abstracted out as I also had explained in the PR description.

youngbupark · 2024-03-12T21:09:50Z

pkg/armrpc/asyncoperation/controller/result.go

@@ -40,20 +44,24 @@ func NewCanceledResult(message string) Result {
 }

 // NewFailedResult creates a new Result object with the given error details and sets the failed flag to true.
-func NewFailedResult(err v1.ErrorDetails) Result {
+func NewFailedResult(ctx context.Context, err error, errorDetails v1.ErrorDetails) Result {


What's the benefit to have err arg here? The initial idea is that we want to have a detail error description specified by Controller instead of just golang error. So we can deliver the detail error to clients.

You can see how it is consumed by this function to understand the reasoning behind adding it in parameters here.

I do not think printing out error in this helper is the right approach. If we add this approach, then we need to add the similar code to cancelled case as well. We might not want to have the same logging code here and there. I think we need to first figure it out the root cause about why error is not logged in main worker loop before making this change.

JFYI, all these helpers in this package are to create ctrl.Result with ARM compatible error like we use rest.Response in frontend controller. In the design of controllers, we want to have the consistent controller interfaces between frontend and backend controllers. In short, when we want to return the detail error response, we can return rest.Response for frontend and ctrl.Result for backend. If the error is internal error or hard to be categorized, controller can return nil, err. (err is go error), which is not recommended. Let's chat offline if you need more info here.

Frontend controller interface

radius/pkg/armrpc/frontend/controller/controller.go

Lines 103 to 106 in c77088f

type Controller interface {

// Run executes the operation.

Run(ctx context.Context, w http.ResponseWriter, req *http.Request) (rest.Response, error)

}

Backend controller interface

radius/pkg/armrpc/asyncoperation/controller/controller.go

Line 50 in c77088f

Run(ctx context.Context, request *Request) (Result, error)

I do not think printing out error in this helper is the right approach. If we add this approach, then we need to add the similar code to cancelled case as well. We might not want to have the same logging code here and there.

I do think this should be added to cancelled requests, the same problem exists there. I intentionally left it out of the first iteration of this PR to agree on the approach and then extend it to cancelled.

I think we need to first figure it out the root cause about why error is not logged in main worker loop before making this change.

I did call out the why in other comments. Here is one problem I called out #7296 (comment), another here: #7296 (comment). Can you provide some context on why the original design was implemented expecting each controller to log the error?

all these helpers in this package are to create ctrl.Result with ARM compatible error like we use rest.Response in frontend controller. In the design of controllers, we want to have the consistent controller interfaces between frontend and backend controllers. In short, when we want to return the detail error response, we can return rest.Response for frontend and ctrl.Result for backend. If the error is internal error or hard to be categorized, controller can return nil, err. (err is go error), which is not recommended.

I understand that. This PR is not changing controller interfaces or the result, it is purely abstracting out logging into where the result is set so that it is guaranteed to be logged. If NewFailedResult/SetFailed is called, there should be a corresponding error that is causing it to get to this state. If the same goal can be achieved by moving the logging to async worker, that's fine by me, as long as we have a shared component/piece of code that guarantees logging, and isn't leaving it to human memory.

Let's use the next design meeting to understand and figure out the path forward on this.

If the error was coming from the following code in the original issue,

radius/pkg/portableresources/processors/resourceclient.go

Line 279 in c77088f

return "", fmt.Errorf("could not find API version for type %q, type was not found", id.Type())

All errors related to bicep should be caught in the following code first and logged before reaching worker's error handling code.

radius/pkg/recipes/driver/bicep.go

Lines 212 to 226 in c77088f

err = d.ResourceClient.Delete(ctx, id)

if err != nil {

if attempt <= d.options.DeleteRetryCount {

logger.V(ucplog.LevelInfo).Error(err, "attempt failed", "delay", d.options.DeleteRetryDelaySeconds)

time.Sleep(time.Duration(d.options.DeleteRetryDelaySeconds) * time.Second)

continue

}

return recipes.NewRecipeError(recipes.RecipeDeletionFailed, err.Error(), "", recipes.GetErrorDetails(err))

}

// If the err is nil, then the resource is deleted successfully

logger.V(ucplog.LevelInfo).Info(fmt.Sprintf("Deleted output resource: %q", id))

return nil

}

But I could not find this error logs in the log files. It means that the request wasn't processed by worker.

So the test that failed doesn't use a recipe and won't go inside the driver code, which is why the log above isn't present. Looking through the logs and all the resources that the failed test creates, it failed during deletion of the container resource and the error in the test failure is likely coming from here. I'm doing further investigation to see why the error wasn't logged.

The trace id that I see in the logs for deletion is different from the trace id in the functional test failure logs - so that's another area which needs further digging.

{"severity":"info","timestamp":"2024-03-04T21:30:59.433Z","name":"radius.radiusasyncworker","caller":"deployment/deploymentprocessor.go:341","message":"Deleting output resource: LocalID: KubernetesRole, resource type: \"Provider: kubernetes, Type: rbac.authorization.k8s.io/Role\"\n","serviceName":"radius","version":"edge","hostName":"applications-rp-7b774d7494-k6dv2","resourceId":"/planes/radius/local/resourcegroups/kind-radius/providers/Applications.Core/containers/dapr-sts-manual-ctnr","operationId":"489b1b1a-36dc-4219-bbd5-48ef3d23e90d","operationType":"APPLICATIONS.CORE/CONTAINERS|DELETE","dequeueCount":1,"traceId":"6cde20fc5dcbe244c35799b5ff94fea6","spanId":"35daee63ef2a07aa"}

I do not think that it happened while deleting secret as part of container deletion. Please see the following log:

It went further after deleting secret resource.

youngbupark · 2024-03-12T21:10:08Z

pkg/armrpc/asyncoperation/controller/result.go

 	return r
 }

 // SetFailed sets the error response with Failed status.
-func (r *Result) SetFailed(err v1.ErrorDetails, requeue bool) {
+func (r *Result) SetFailed(ctx context.Context, err error, errorDetails v1.ErrorDetails, requeue bool) {


This is being used on L57.

kachawla · 2024-03-17T21:16:02Z

This gap exists because errors that aren't propagated back up (async operations) aren't logged automatically. The current state of the code is inconsistent, with some operations logging errors while others do not.

Can you explain this in more detail, especially the part about errors being propagated vs not propagated?

Async errors aren't returned. I had linked the documentation https://github.com/radius-project/radius/blob/main/docs/contributing/contributing-code/contributing-code-control-plane/logging.md?plain=1#L45. It should be updated if it isn't providing clear understanding.

Are there cases we return a failure from an async operation and it's not logged?

Yes. Current implementation expects separate explicit error logging if an async operation fails, which is the reason why log statements like these are spread across our code base. Current approach is unreliable because it clearly has left us with logging gaps, and this is not the first time it has prevented us from debugging, past example here. I'm sure @nithyatsu and @youngbupark can add more historical context on this choice of implementation.

Logging error for that case is important to understand server-side issue because retry failure could be caused by server side problem. That's why we not only return the error to client but also record the error on server-side log.

Well that's the problem I'm fixing in this PR. Errors aren't being logged today preventing us from being able to effectively debug issues. We have seen many instances of this, here is an example for your reference #7264

kachawla · 2024-03-17T21:18:56Z

I'm trying to figure out whether we've got the wrong design or whether this is "just" a bug 🐛 FWIW, I wrote the guidance on logging - so you can rest the blame on me if this is the wrong idea. I'd be happy to spend some time discussing and coming up with better guidance. I've found that error handling in Go is usually the most challenging thing for people to get right.
In the case we're talking about, I'd call this region of the code the place "where the error is handled".

Does the region of code you linked handle this scenario as well where the error returned from the controller is nil with a failed result?

Is the problem that there are two ways to represent a failure?

The problem is that the errors aren't getting logged. There is no way to get the stack trace for debugging. If the root cause identified here is not correct, then this needs further investigation on why we are dropping logs for certain requests.

This log is also odd https://github.com/radius-project/radius/blob/main/pkg/armrpc/asyncoperation/worker/worker.go#L260. Why is it always hardcoding result.Error to nil? @youngbupark do you have context on what was the motivation here or is it a bug? If it is a bug then I think updating it will be a better approach over what I'm doing in this PR.

It is intended.

radius/pkg/armrpc/asyncoperation/worker/worker.go

Line 260 in 02b7b20

logger.Info("Operation returned", "success", result.Error == nil, "code", code, "provisioningState", result.ProvisioningState(), "err", err)
logger.Info("Operation returned", "success", result.Error == nil, "code", code, "provisioningState", result.ProvisioningState(), "err", err)
If the operation is succeeded, result.Error must be nil. When logger processes this, it will be represented as JSON format log like below. The arguments from "success" until err are treated as a log dimension or attribution. Having such attributions such as success/code/provisioningState help indexing for search backend.
{"severity":"info","timestamp":"2024-03-12T20:43:42.002Z","name":"radius.radiusasyncworker","caller":"worker/worker.go:260","message":"Operation returned","serviceName":"radius","version":"edge","hostName":"applications-rp-bf77896d9-rfnk7","resourceId":"/planes/radius/local/resourcegroups/kind-radius/providers/Applications.Core/containers/backendcontainerdns","operationId":"64d729f1-2936-48eb-aabe-9a5fcaee2626","operationType":"APPLICATIONS.CORE/CONTAINERS|PUT","dequeueCount":1,"traceId":"9684bed9d6f01c4db253ab973274dfd1","spanId":"a15c41cd88fea0e1","success":true,"code":"","provisioningState":"Succeeded","err":null}

The problem with this code that I'm highlighting is that this log isn't exercised only for success scenarios, the current implementation updates the value for code here in case of an error, and then adds it here with the impression that the request has succeeded.

youngbupark · 2024-03-18T20:42:35Z

pkg/armrpc/asyncoperation/controller/result.go

@@ -40,20 +44,24 @@ func NewCanceledResult(message string) Result {
 }

 // NewFailedResult creates a new Result object with the given error details and sets the failed flag to true.
-func NewFailedResult(err v1.ErrorDetails) Result {
+func NewFailedResult(ctx context.Context, err error, errorDetails v1.ErrorDetails) Result {


I do not think printing out error in this helper is the right approach. If we add this approach, then we need to add the similar code to cancelled case as well. We might not want to have the same logging code here and there. I think we need to first figure it out the root cause about why error is not logged in main worker loop before making this change.

JFYI, all these helpers in this package are to create ctrl.Result with ARM compatible error like we use rest.Response in frontend controller. In the design of controllers, we want to have the consistent controller interfaces between frontend and backend controllers. In short, when we want to return the detail error response, we can return rest.Response for frontend and ctrl.Result for backend. If the error is internal error or hard to be categorized, controller can return nil, err. (err is go error), which is not recommended. Let's chat offline if you need more info here.

Frontend controller interface

radius/pkg/armrpc/frontend/controller/controller.go

Lines 103 to 106 in c77088f

type Controller interface {

// Run executes the operation.

Run(ctx context.Context, w http.ResponseWriter, req *http.Request) (rest.Response, error)

}

Backend controller interface

radius/pkg/armrpc/asyncoperation/controller/controller.go

Line 50 in c77088f

Run(ctx context.Context, request *Request) (Result, error)

youngbupark · 2024-03-18T20:44:09Z

pkg/armrpc/asyncoperation/controller/result.go

-func (r *Result) SetFailed(err v1.ErrorDetails, requeue bool) {
+func (r *Result) SetFailed(ctx context.Context, err error, errorDetails v1.ErrorDetails, requeue bool) {
+	logger := ucplog.FromContextOrDiscard(ctx)
+	logger.Error(err, fmt.Sprintf("Operation Failed: %s", errorDetails.Message))


As I mentioned above, if we want to log out for the failure, cancellation is also failure. we need to put the same code there. We might not want to have the same code here and there. So we have logged the failure or status in main worker loop. I think we first need to figure it out why the current logging is not working as expected.

I think we first need to figure it out why the current logging is not working as expected.

Some details on why it is not working is here: #7297 (comment). Current design expects each controller to individually log an error, which isn't reliable as it requires specific knowledge and clearly hasn't worked well with misses in multiple places. The approach in this PR is one way to fix it. How would you recommend fixing this gap? If the root cause is something else according to your understanding, then please add details on that.

I think we first need to figure it out why the current logging is not working as expected.

Some details on why it is not working is here: #7297 (comment). Current design expects each controller to individually log an error, which isn't reliable as it requires specific knowledge and clearly hasn't worked well with misses in multiple places.

To be clear, we have logged error individually in the main worker loop, not in controller. When controller creates Result, it needs to be expected to return Result and error. then the following code must log out the error. This the centralized logging part for all controllers.

radius/pkg/armrpc/asyncoperation/worker/worker.go

Lines 253 to 260 in c77088f

result, err := asyncCtrl.Run(asyncReqCtx, asyncReq)

code := ""

if result.Error != nil {

code = result.Error.Code

}

logger.Info("Operation returned", "success", result.Error == nil, "code", code, "provisioningState", result.ProvisioningState(), "err", err)

The approach in this PR is one way to fix it. How would you recommend fixing this gap? If the root cause is something else according to your understanding, then please add details on that.

I think the PR is the workaround. We first need to understand the root cause and then have the right fix here. in this comments - #7297 (comment), we might not be able to repro the same issue.

I do not know what happens. but it might have numerous reasons. For example, RP could crash before logging. Another hypothesis is that when workflow step collects the logs, the log might be missing. Functional tests save the RP logs by using kubectl logs, which is not reliable.

@youngbupark Let me ask this to build a better understanding of the current state -

Can you share link to where the code is that logs the error caught here?

Is this log statement unnecessary https://github.com/radius-project/radius/blob/main/pkg/portableresources/backend/controller/createorupdateresource.go#L100?

@youngbupark Let me ask this to build a better understanding of the current state -

Can you share link to where the code is that logs the error caught here?

This is a backend controller. asyncCtrl.Run calls func (c *DeleteResource[P, T]) Run(ctx context.Context, request *ctrl.Request) (ctrl.Result, error) {.

radius/pkg/armrpc/asyncoperation/worker/worker.go

Line 253 in 2a681be

result, err := asyncCtrl.Run(asyncReqCtx, asyncReq)

For additional code flow, I recommend to debug the code with the debugger.

Is this log statement unnecessary https://github.com/radius-project/radius/blob/main/pkg/portableresources/backend/controller/createorupdateresource.go#L100?

This is a server-side code. It may or may not. This is the error logging in controller. If controller author thinks it is useful with the better message and print out the error earlier, they can add it.

@youngbupark Let me ask this to build a better understanding of the current state -

Can you share link to where the code is that logs the error caught here?

This is a backend controller. asyncCtrl.Run calls func (c *DeleteResource[P, T]) Run(ctx context.Context, request *ctrl.Request) (ctrl.Result, error) {.

radius/pkg/armrpc/asyncoperation/worker/worker.go

Line 253 in 2a681be

result, err := asyncCtrl.Run(asyncReqCtx, asyncReq)

For additional code flow, I recommend to debug the code with the debugger.

I'm aware of where it is called from. I'm asking where is the error logged for the scenario I linked above.

Is this log statement unnecessary https://github.com/radius-project/radius/blob/main/pkg/portableresources/backend/controller/createorupdateresource.go#L100?

This is a server-side code. It may or may not. This is the error logging in controller. If controller author thinks it is useful with the better message and print out the error earlier, they can add it.

My question is - is this redundant?

@youngbupark Let me ask this to build a better understanding of the current state -

Can you share link to where the code is that logs the error caught here?

This is a backend controller. asyncCtrl.Run calls func (c *DeleteResource[P, T]) Run(ctx context.Context, request *ctrl.Request) (ctrl.Result, error) {.

radius/pkg/armrpc/asyncoperation/worker/worker.go

Line 253 in 2a681be

result, err := asyncCtrl.Run(asyncReqCtx, asyncReq)

For additional code flow, I recommend to debug the code with the debugger.

I'm aware of where it is called from. I'm asking where is the error logged for the scenario I linked above.

There are two places:

radius/pkg/armrpc/asyncoperation/worker/worker.go

Line 260 in 2a681be

logger.Info("Operation returned", "success", result.Error == nil, "code", code, "provisioningState", result.ProvisioningState(), "err", err)

radius/pkg/armrpc/asyncoperation/worker/worker.go

Line 270 in 2a681be

logger.Error(err, "Operation Failed")

And I strongly recommend to debug the code flow to understand the details.

Is this log statement unnecessary https://github.com/radius-project/radius/blob/main/pkg/portableresources/backend/controller/createorupdateresource.go#L100?

This is a server-side code. It may or may not. This is the error logging in controller. If controller author thinks it is useful with the better message and print out the error earlier, they can add it.

My question is - is this redundant?

As I said, "It may or may not". In general, If the log is useful for debugging the server, then it should not be redundant. From what I read, it is more readable than just printing out pure go error. I do not think it is redundant. To understand why, I would reach out to yetkin who contribute to this code.

If we are logging the same error twice, then that is redundant imo. It was added originally because the error wasn't getting logged anywhere else (same problem that this PR is trying to solve).

youngbupark · 2024-03-18T21:17:49Z

The problem with this code that I'm highlighting is that this log isn't exercised only for success scenarios, the current implementation updates the value for code here in case of an error, and then adds it here with the impression that the request has succeeded.

Then I would change this code only to use logger.Error for error case and use logger.Info for normal scenario.

kachawla · 2024-03-18T21:28:52Z

The problem with this code that I'm highlighting is that this log isn't exercised only for success scenarios, the current implementation updates the value for code here in case of an error, and then adds it here with the impression that the request has succeeded.

Then I would change this code only to use logger.Error for error case and use logger.Info for normal scenario.

I'm not following this. Are you agreeing that there is a problem with the code I linked above?

youngbupark · 2024-03-18T21:42:53Z

The problem with this code that I'm highlighting is that this log isn't exercised only for success scenarios, the current implementation updates the value for code here in case of an error, and then adds it here with the impression that the request has succeeded.

Then I would change this code only to use logger.Error for error case and use logger.Info for normal scenario.

I'm not following this. Are you agreeing that there is a problem with the code I linked above?

I do not have the strong opinion to use logger.Error and logger.Info separately or not here. If you think using the different logger helper method is right for error case, it might be good to add logger.Error to this code instead of changing the other codes including NewFailedResult helper.

BTW, I dug it out the original issue, please see my comment - #7297 (comment)

kachawla · 2024-03-18T22:17:34Z

The problem with this code that I'm highlighting is that this log isn't exercised only for success scenarios, the current implementation updates the value for code here in case of an error, and then adds it here with the impression that the request has succeeded.

Then I would change this code only to use logger.Error for error case and use logger.Info for normal scenario.

I'm not following this. Are you agreeing that there is a problem with the code I linked above?

I do not have the strong opinion to use logger.Error and logger.Info separately or not here. If you think using the different logger helper method is right for error case, it might be good to add logger.Error to this code instead of changing the other codes including NewFailedResult helper.

BTW, I dug it out the original issue, please see my comment - #7297 (comment)

Ok I will look into fixing it in async worker. Will sync up with you offline on these questions I have for my clarity #7296 (comment)

kachawla · 2024-03-21T16:50:09Z

There are multiple problems that clearly need further investigation (missing trace id, logs) so closing this PR at this point. I'll look into the next steps and either create follow up issues or PR.

youngbupark · 2024-03-21T16:51:14Z

pkg/armrpc/asyncoperation/controller/result.go

-func (r *Result) SetFailed(err v1.ErrorDetails, requeue bool) {
+func (r *Result) SetFailed(ctx context.Context, err error, errorDetails v1.ErrorDetails, requeue bool) {
+	logger := ucplog.FromContextOrDiscard(ctx)
+	logger.Error(err, fmt.Sprintf("Operation Failed: %s", errorDetails.Message))


Another downside to add logging here is that whenever backend controller calls this method, it will log out. Basically, backend controller devs wants to print out log when the error is ready. All these helpers is to construct and set the failure reason, not print out the log. adding logging here doesn't seem right..

Log error for failed requests

84d4ca3

Signed-off-by: karishma-chawla <[email protected]>

kachawla requested review from a team as code owners March 9, 2024 02:36

kachawla temporarily deployed to functional-tests March 9, 2024 02:39 — with GitHub Actions Inactive

kachawla mentioned this pull request Mar 9, 2024

Scheduled long running test failed - Run ID: 8150844893 #7264

Closed

youngbupark reviewed Mar 12, 2024

View reviewed changes

youngbupark suggested changes Mar 18, 2024

View reviewed changes

kachawla closed this Mar 21, 2024

youngbupark reviewed Mar 21, 2024

View reviewed changes

	type Controller interface {
	// Run executes the operation.
	Run(ctx context.Context, w http.ResponseWriter, req *http.Request) (rest.Response, error)
	}

	err = d.ResourceClient.Delete(ctx, id)
	if err != nil {
	if attempt <= d.options.DeleteRetryCount {
	logger.V(ucplog.LevelInfo).Error(err, "attempt failed", "delay", d.options.DeleteRetryDelaySeconds)
	time.Sleep(time.Duration(d.options.DeleteRetryDelaySeconds) * time.Second)
	continue
	}

	return recipes.NewRecipeError(recipes.RecipeDeletionFailed, err.Error(), "", recipes.GetErrorDetails(err))
	}

	// If the err is nil, then the resource is deleted successfully
	logger.V(ucplog.LevelInfo).Info(fmt.Sprintf("Deleted output resource: %q", id))
	return nil
	}

	result, err := asyncCtrl.Run(asyncReqCtx, asyncReq)

	code := ""
	if result.Error != nil {
	code = result.Error.Code
	}

	logger.Info("Operation returned", "success", result.Error == nil, "code", code, "provisioningState", result.ProvisioningState(), "err", err)

Log errors for failed requests #7296

Log errors for failed requests #7296

Conversation

kachawla commented Mar 9, 2024 • edited Loading

Description

Type of change

radius-functional-tests bot commented Mar 9, 2024 • edited Loading

Radius functional test overview

Test Status

rynowak commented Mar 9, 2024

kachawla commented Mar 10, 2024 • edited Loading

rynowak commented Mar 10, 2024 • edited Loading

kachawla commented Mar 10, 2024

kachawla commented Mar 11, 2024 • edited Loading

youngbupark commented Mar 12, 2024

youngbupark commented Mar 12, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

kachawla Mar 18, 2024 • edited Loading

Choose a reason for hiding this comment

youngbupark Mar 19, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

kachawla Mar 17, 2024 • edited Loading

Choose a reason for hiding this comment

kachawla commented Mar 17, 2024

kachawla commented Mar 17, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

kachawla Mar 18, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

youngbupark Mar 21, 2024 • edited Loading

Choose a reason for hiding this comment

kachawla Mar 21, 2024 • edited Loading

Choose a reason for hiding this comment

youngbupark commented Mar 18, 2024

kachawla commented Mar 18, 2024

youngbupark commented Mar 18, 2024 • edited Loading

kachawla commented Mar 18, 2024

kachawla commented Mar 21, 2024

youngbupark Mar 21, 2024 • edited Loading

Choose a reason for hiding this comment

kachawla commented Mar 9, 2024 •

edited

Loading

radius-functional-tests bot commented Mar 9, 2024 •

edited

Loading

kachawla commented Mar 10, 2024 •

edited

Loading

rynowak commented Mar 10, 2024 •

edited

Loading

kachawla commented Mar 11, 2024 •

edited

Loading

kachawla Mar 18, 2024 •

edited

Loading

youngbupark Mar 19, 2024 •

edited

Loading

kachawla Mar 17, 2024 •

edited

Loading

kachawla Mar 18, 2024 •

edited

Loading

youngbupark Mar 21, 2024 •

edited

Loading

kachawla Mar 21, 2024 •

edited

Loading

youngbupark commented Mar 18, 2024 •

edited

Loading

youngbupark Mar 21, 2024 •

edited

Loading