Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Log errors for failed requests #7296

Closed
wants to merge 1 commit into from

Conversation

kachawla
Copy link
Contributor

@kachawla kachawla commented Mar 9, 2024

Description

We currently lack error logging upon failure for many operations which hinders our ability to debug failures. This gap exists because errors that aren't propagated back up (async operations) aren't logged automatically. The current state of the code is inconsistent, with some operations logging errors while others do not.

To address this, the proposed change is to move error logging to the setFailed method. By doing so, we ensure that error logging is not reliant on every individual operation to log errors.

Type of change

  • This pull request fixes a bug in Radius and has an approved issue (issue link required).

Fixes: #7297

Signed-off-by: karishma-chawla <[email protected]>
@kachawla kachawla requested review from a team as code owners March 9, 2024 02:36
@kachawla kachawla temporarily deployed to functional-tests March 9, 2024 02:39 — with GitHub Actions Inactive
@radius-functional-tests
Copy link

radius-functional-tests bot commented Mar 9, 2024

Radius functional test overview

🔍 Go to test action run

Name Value
Repository kachawla/radius
Commit ref 84d4ca3
Unique ID 5473055b59
Image tag pr-5473055b59
Click here to see the list of tools in the current test run
  • gotestsum 1.10.0
  • KinD: v0.20.0
  • Dapr: 1.12.0
  • Azure KeyVault CSI driver: 1.4.2
  • Azure Workload identity webhook: 1.1.0
  • Bicep recipe location ghcr.io/radius-project/dev/test/functional/shared/recipes/<name>:pr-5473055b59
  • Terraform recipe location http://tf-module-server.radius-test-tf-module-server.svc.cluster.local/<name>.zip (in cluster)
  • applications-rp test image location: ghcr.io/radius-project/dev/applications-rp:pr-5473055b59
  • controller test image location: ghcr.io/radius-project/dev/controller:pr-5473055b59
  • ucp test image location: ghcr.io/radius-project/dev/ucpd:pr-5473055b59
  • deployment-engine test image location: ghcr.io/radius-project/deployment-engine:latest

Test Status

⌛ Building Radius and pushing container images for functional tests...
✅ Container images build succeeded
⌛ Publishing Bicep Recipes for functional tests...
✅ Recipe publishing succeeded
⌛ Starting kubernetes functional tests...
⌛ Starting msgrp functional tests...
⌛ Starting daprrp functional tests...
⌛ Starting shared functional tests...
⌛ Starting samples functional tests...
⌛ Starting cli functional tests...
⌛ Starting ucp functional tests...
⌛ Starting datastoresrp functional tests...
✅ kubernetes functional tests succeeded
✅ msgrp functional tests succeeded
✅ ucp functional tests succeeded
✅ daprrp functional tests succeeded
✅ samples functional tests succeeded
✅ cli functional tests succeeded
✅ shared functional tests succeeded
✅ datastoresrp functional tests succeeded

@rynowak
Copy link
Contributor

rynowak commented Mar 9, 2024

This gap exists because errors that aren't propagated back up (async operations) aren't logged automatically. The current state of the code is inconsistent, with some operations logging errors while others do not.

Can you explain this in more detail, especially the part about errors being propagated vs not propagated?

Are there cases we return a failure from an async operation and it's not logged?

/cc @youngbupark

@kachawla
Copy link
Contributor Author

kachawla commented Mar 10, 2024

This gap exists because errors that aren't propagated back up (async operations) aren't logged automatically. The current state of the code is inconsistent, with some operations logging errors while others do not.

Can you explain this in more detail, especially the part about errors being propagated vs not propagated?

Async errors aren't returned. I had linked the documentation https://github.com/radius-project/radius/blob/main/docs/contributing/contributing-code/contributing-code-control-plane/logging.md?plain=1#L45. It should be updated if it isn't providing clear understanding.

Are there cases we return a failure from an async operation and it's not logged?

Yes. Current implementation expects separate explicit error logging if an async operation fails, which is the reason why log statements like these are spread across our code base. Current approach is unreliable because it clearly has left us with logging gaps, and this is not the first time it has prevented us from debugging, past example here. I'm sure @nithyatsu and @youngbupark can add more historical context on this choice of implementation.

@rynowak
Copy link
Contributor

rynowak commented Mar 10, 2024

I'm trying to figure out whether we've got the wrong design or whether this is "just" a bug 🐛 FWIW, I wrote the guidance on logging - so you can rest the blame on me if this is the wrong idea. I'd be happy to spend some time discussing and coming up with better guidance. I've found that error handling in Go is usually the most challenging thing for people to get right.

In the case we're talking about, I'd call this region of the code the place "where the error is handled".

Is the problem that there are two ways to represent a failure?

  • A go error.
  • A controller.Result with a failure status?

I'm pretty nervous about relying on "constructors" for any kind of important logic. Go is more like C than it is Java, there's no requirement to use a "constructor" and lots of alternative ways to initialize a struct.

@kachawla
Copy link
Contributor Author

I'm trying to figure out whether we've got the wrong design or whether this is "just" a bug 🐛 FWIW, I wrote the guidance on logging - so you can rest the blame on me if this is the wrong idea. I'd be happy to spend some time discussing and coming up with better guidance. I've found that error handling in Go is usually the most challenging thing for people to get right.

In the case we're talking about, I'd call this region of the code the place "where the error is handled".

Does the region of code you linked handle this scenario as well where the error returned from the controller is nil with a failed result?

Is the problem that there are two ways to represent a failure?

The problem is that the errors aren't getting logged. There is no way to get the stack trace for debugging. If the root cause identified here is not correct, then this needs further investigation on why we are dropping logs for certain requests.

@kachawla
Copy link
Contributor Author

kachawla commented Mar 11, 2024

I'm trying to figure out whether we've got the wrong design or whether this is "just" a bug 🐛 FWIW, I wrote the guidance on logging - so you can rest the blame on me if this is the wrong idea. I'd be happy to spend some time discussing and coming up with better guidance. I've found that error handling in Go is usually the most challenging thing for people to get right.
In the case we're talking about, I'd call this region of the code the place "where the error is handled".

Does the region of code you linked handle this scenario as well where the error returned from the controller is nil with a failed result?

Is the problem that there are two ways to represent a failure?

The problem is that the errors aren't getting logged. There is no way to get the stack trace for debugging. If the root cause identified here is not correct, then this needs further investigation on why we are dropping logs for certain requests.

This log is also odd https://github.com/radius-project/radius/blob/main/pkg/armrpc/asyncoperation/worker/worker.go#L260. Why is it always hardcoding result.Error to nil? @youngbupark do you have context on what was the motivation here or is it a bug? If it is a bug then I think updating it will be a better approach over what I'm doing in this PR.

@youngbupark
Copy link

This gap exists because errors that aren't propagated back up (async operations) aren't logged automatically. The current state of the code is inconsistent, with some operations logging errors while others do not.

Can you explain this in more detail, especially the part about errors being propagated vs not propagated?

Async errors aren't returned. I had linked the documentation https://github.com/radius-project/radius/blob/main/docs/contributing/contributing-code/contributing-code-control-plane/logging.md?plain=1#L45. It should be updated if it isn't providing clear understanding.

Are there cases we return a failure from an async operation and it's not logged?

Yes. Current implementation expects separate explicit error logging if an async operation fails, which is the reason why log statements like these are spread across our code base. Current approach is unreliable because it clearly has left us with logging gaps, and this is not the first time it has prevented us from debugging, past example here. I'm sure @nithyatsu and @youngbupark can add more historical context on this choice of implementation.

Logging error for that case is important to understand server-side issue because retry failure could be caused by server side problem. That's why we not only return the error to client but also record the error on server-side log.

@youngbupark
Copy link

I'm trying to figure out whether we've got the wrong design or whether this is "just" a bug 🐛 FWIW, I wrote the guidance on logging - so you can rest the blame on me if this is the wrong idea. I'd be happy to spend some time discussing and coming up with better guidance. I've found that error handling in Go is usually the most challenging thing for people to get right.
In the case we're talking about, I'd call this region of the code the place "where the error is handled".

Does the region of code you linked handle this scenario as well where the error returned from the controller is nil with a failed result?

Is the problem that there are two ways to represent a failure?

The problem is that the errors aren't getting logged. There is no way to get the stack trace for debugging. If the root cause identified here is not correct, then this needs further investigation on why we are dropping logs for certain requests.

This log is also odd https://github.com/radius-project/radius/blob/main/pkg/armrpc/asyncoperation/worker/worker.go#L260. Why is it always hardcoding result.Error to nil? @youngbupark do you have context on what was the motivation here or is it a bug? If it is a bug then I think updating it will be a better approach over what I'm doing in this PR.

It is intended.

logger.Info("Operation returned", "success", result.Error == nil, "code", code, "provisioningState", result.ProvisioningState(), "err", err)

logger.Info("Operation returned", "success", result.Error == nil, "code", code, "provisioningState", result.ProvisioningState(), "err", err)

If the operation is succeeded, result.Error must be nil. When logger processes this, it will be represented as JSON format log like below. The arguments from "success" until err are treated as a log dimension or attribution. Having such attributions such as success/code/provisioningState help indexing for search backend.

{"severity":"info","timestamp":"2024-03-12T20:43:42.002Z","name":"radius.radiusasyncworker","caller":"worker/worker.go:260","message":"Operation returned","serviceName":"radius","version":"edge","hostName":"applications-rp-bf77896d9-rfnk7","resourceId":"/planes/radius/local/resourcegroups/kind-radius/providers/Applications.Core/containers/backendcontainerdns","operationId":"64d729f1-2936-48eb-aabe-9a5fcaee2626","operationType":"APPLICATIONS.CORE/CONTAINERS|PUT","dequeueCount":1,"traceId":"9684bed9d6f01c4db253ab973274dfd1","spanId":"a15c41cd88fea0e1","success":true,"code":"","provisioningState":"Succeeded","err":null}

@@ -266,8 +265,7 @@ func (w *AsyncRequestProcessWorker) runOperation(ctx context.Context, message *q
if !errors.Is(asyncReqCtx.Err(), context.Canceled) {
if err != nil {
armErr := extractError(err)
result.SetFailed(armErr, false)
logger.Error(err, "Operation Failed")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is required for server side debugging. If we remove this, it is hard to debug only with server log. e.g. when we debug the problem with server log, it is hard to know which step has failed because there is no indication. Operation Failed will be the indicator at which step has been failed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not being removed in this PR, but being abstracted out as I also had explained in the PR description.

@@ -40,20 +44,24 @@ func NewCanceledResult(message string) Result {
}

// NewFailedResult creates a new Result object with the given error details and sets the failed flag to true.
func NewFailedResult(err v1.ErrorDetails) Result {
func NewFailedResult(ctx context.Context, err error, errorDetails v1.ErrorDetails) Result {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the benefit to have err arg here? The initial idea is that we want to have a detail error description specified by Controller instead of just golang error. So we can deliver the detail error to clients.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can see how it is consumed by this function to understand the reasoning behind adding it in parameters here.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not think printing out error in this helper is the right approach. If we add this approach, then we need to add the similar code to cancelled case as well. We might not want to have the same logging code here and there. I think we need to first figure it out the root cause about why error is not logged in main worker loop before making this change.

JFYI, all these helpers in this package are to create ctrl.Result with ARM compatible error like we use rest.Response in frontend controller. In the design of controllers, we want to have the consistent controller interfaces between frontend and backend controllers. In short, when we want to return the detail error response, we can return rest.Response for frontend and ctrl.Result for backend. If the error is internal error or hard to be categorized, controller can return nil, err. (err is go error), which is not recommended. Let's chat offline if you need more info here.

  • Frontend controller interface

type Controller interface {
// Run executes the operation.
Run(ctx context.Context, w http.ResponseWriter, req *http.Request) (rest.Response, error)
}

  • Backend controller interface

Run(ctx context.Context, request *Request) (Result, error)

Copy link
Contributor Author

@kachawla kachawla Mar 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not think printing out error in this helper is the right approach. If we add this approach, then we need to add the similar code to cancelled case as well. We might not want to have the same logging code here and there.

I do think this should be added to cancelled requests, the same problem exists there. I intentionally left it out of the first iteration of this PR to agree on the approach and then extend it to cancelled.

I think we need to first figure it out the root cause about why error is not logged in main worker loop before making this change.

I did call out the why in other comments. Here is one problem I called out #7296 (comment), another here: #7296 (comment). Can you provide some context on why the original design was implemented expecting each controller to log the error?

all these helpers in this package are to create ctrl.Result with ARM compatible error like we use rest.Response in frontend controller. In the design of controllers, we want to have the consistent controller interfaces between frontend and backend controllers. In short, when we want to return the detail error response, we can return rest.Response for frontend and ctrl.Result for backend. If the error is internal error or hard to be categorized, controller can return nil, err. (err is go error), which is not recommended.

I understand that. This PR is not changing controller interfaces or the result, it is purely abstracting out logging into where the result is set so that it is guaranteed to be logged. If NewFailedResult/SetFailed is called, there should be a corresponding error that is causing it to get to this state. If the same goal can be achieved by moving the logging to async worker, that's fine by me, as long as we have a shared component/piece of code that guarantees logging, and isn't leaving it to human memory.

Let's use the next design meeting to understand and figure out the path forward on this.

Copy link

@youngbupark youngbupark Mar 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the error was coming from the following code in the original issue,

return "", fmt.Errorf("could not find API version for type %q, type was not found", id.Type())

All errors related to bicep should be caught in the following code first and logged before reaching worker's error handling code.

err = d.ResourceClient.Delete(ctx, id)
if err != nil {
if attempt <= d.options.DeleteRetryCount {
logger.V(ucplog.LevelInfo).Error(err, "attempt failed", "delay", d.options.DeleteRetryDelaySeconds)
time.Sleep(time.Duration(d.options.DeleteRetryDelaySeconds) * time.Second)
continue
}
return recipes.NewRecipeError(recipes.RecipeDeletionFailed, err.Error(), "", recipes.GetErrorDetails(err))
}
// If the err is nil, then the resource is deleted successfully
logger.V(ucplog.LevelInfo).Info(fmt.Sprintf("Deleted output resource: %q", id))
return nil
}

But I could not find this error logs in the log files. It means that the request wasn't processed by worker.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the test that failed doesn't use a recipe and won't go inside the driver code, which is why the log above isn't present. Looking through the logs and all the resources that the failed test creates, it failed during deletion of the container resource and the error in the test failure is likely coming from here. I'm doing further investigation to see why the error wasn't logged.

The trace id that I see in the logs for deletion is different from the trace id in the functional test failure logs - so that's another area which needs further digging.

{"severity":"info","timestamp":"2024-03-04T21:30:59.433Z","name":"radius.radiusasyncworker","caller":"deployment/deploymentprocessor.go:341","message":"Deleting output resource: LocalID: KubernetesRole, resource type: \"Provider: kubernetes, Type: rbac.authorization.k8s.io/Role\"\n","serviceName":"radius","version":"edge","hostName":"applications-rp-7b774d7494-k6dv2","resourceId":"/planes/radius/local/resourcegroups/kind-radius/providers/Applications.Core/containers/dapr-sts-manual-ctnr","operationId":"489b1b1a-36dc-4219-bbd5-48ef3d23e90d","operationType":"APPLICATIONS.CORE/CONTAINERS|DELETE","dequeueCount":1,"traceId":"6cde20fc5dcbe244c35799b5ff94fea6","spanId":"35daee63ef2a07aa"}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not think that it happened while deleting secret as part of container deletion. Please see the following log:
image

It went further after deleting secret resource.

return r
}

// SetFailed sets the error response with Failed status.
func (r *Result) SetFailed(err v1.ErrorDetails, requeue bool) {
func (r *Result) SetFailed(ctx context.Context, err error, errorDetails v1.ErrorDetails, requeue bool) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here.

Copy link
Contributor Author

@kachawla kachawla Mar 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is being used on L57.

@kachawla
Copy link
Contributor Author

This gap exists because errors that aren't propagated back up (async operations) aren't logged automatically. The current state of the code is inconsistent, with some operations logging errors while others do not.

Can you explain this in more detail, especially the part about errors being propagated vs not propagated?

Async errors aren't returned. I had linked the documentation https://github.com/radius-project/radius/blob/main/docs/contributing/contributing-code/contributing-code-control-plane/logging.md?plain=1#L45. It should be updated if it isn't providing clear understanding.

Are there cases we return a failure from an async operation and it's not logged?

Yes. Current implementation expects separate explicit error logging if an async operation fails, which is the reason why log statements like these are spread across our code base. Current approach is unreliable because it clearly has left us with logging gaps, and this is not the first time it has prevented us from debugging, past example here. I'm sure @nithyatsu and @youngbupark can add more historical context on this choice of implementation.

Logging error for that case is important to understand server-side issue because retry failure could be caused by server side problem. That's why we not only return the error to client but also record the error on server-side log.

Well that's the problem I'm fixing in this PR. Errors aren't being logged today preventing us from being able to effectively debug issues. We have seen many instances of this, here is an example for your reference #7264

@kachawla
Copy link
Contributor Author

I'm trying to figure out whether we've got the wrong design or whether this is "just" a bug 🐛 FWIW, I wrote the guidance on logging - so you can rest the blame on me if this is the wrong idea. I'd be happy to spend some time discussing and coming up with better guidance. I've found that error handling in Go is usually the most challenging thing for people to get right.
In the case we're talking about, I'd call this region of the code the place "where the error is handled".

Does the region of code you linked handle this scenario as well where the error returned from the controller is nil with a failed result?

Is the problem that there are two ways to represent a failure?

The problem is that the errors aren't getting logged. There is no way to get the stack trace for debugging. If the root cause identified here is not correct, then this needs further investigation on why we are dropping logs for certain requests.

This log is also odd https://github.com/radius-project/radius/blob/main/pkg/armrpc/asyncoperation/worker/worker.go#L260. Why is it always hardcoding result.Error to nil? @youngbupark do you have context on what was the motivation here or is it a bug? If it is a bug then I think updating it will be a better approach over what I'm doing in this PR.

It is intended.

logger.Info("Operation returned", "success", result.Error == nil, "code", code, "provisioningState", result.ProvisioningState(), "err", err)

logger.Info("Operation returned", "success", result.Error == nil, "code", code, "provisioningState", result.ProvisioningState(), "err", err)

If the operation is succeeded, result.Error must be nil. When logger processes this, it will be represented as JSON format log like below. The arguments from "success" until err are treated as a log dimension or attribution. Having such attributions such as success/code/provisioningState help indexing for search backend.

{"severity":"info","timestamp":"2024-03-12T20:43:42.002Z","name":"radius.radiusasyncworker","caller":"worker/worker.go:260","message":"Operation returned","serviceName":"radius","version":"edge","hostName":"applications-rp-bf77896d9-rfnk7","resourceId":"/planes/radius/local/resourcegroups/kind-radius/providers/Applications.Core/containers/backendcontainerdns","operationId":"64d729f1-2936-48eb-aabe-9a5fcaee2626","operationType":"APPLICATIONS.CORE/CONTAINERS|PUT","dequeueCount":1,"traceId":"9684bed9d6f01c4db253ab973274dfd1","spanId":"a15c41cd88fea0e1","success":true,"code":"","provisioningState":"Succeeded","err":null}

The problem with this code that I'm highlighting is that this log isn't exercised only for success scenarios, the current implementation updates the value for code here in case of an error, and then adds it here with the impression that the request has succeeded.

@@ -40,20 +44,24 @@ func NewCanceledResult(message string) Result {
}

// NewFailedResult creates a new Result object with the given error details and sets the failed flag to true.
func NewFailedResult(err v1.ErrorDetails) Result {
func NewFailedResult(ctx context.Context, err error, errorDetails v1.ErrorDetails) Result {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not think printing out error in this helper is the right approach. If we add this approach, then we need to add the similar code to cancelled case as well. We might not want to have the same logging code here and there. I think we need to first figure it out the root cause about why error is not logged in main worker loop before making this change.

JFYI, all these helpers in this package are to create ctrl.Result with ARM compatible error like we use rest.Response in frontend controller. In the design of controllers, we want to have the consistent controller interfaces between frontend and backend controllers. In short, when we want to return the detail error response, we can return rest.Response for frontend and ctrl.Result for backend. If the error is internal error or hard to be categorized, controller can return nil, err. (err is go error), which is not recommended. Let's chat offline if you need more info here.

  • Frontend controller interface

type Controller interface {
// Run executes the operation.
Run(ctx context.Context, w http.ResponseWriter, req *http.Request) (rest.Response, error)
}

  • Backend controller interface

Run(ctx context.Context, request *Request) (Result, error)

func (r *Result) SetFailed(err v1.ErrorDetails, requeue bool) {
func (r *Result) SetFailed(ctx context.Context, err error, errorDetails v1.ErrorDetails, requeue bool) {
logger := ucplog.FromContextOrDiscard(ctx)
logger.Error(err, fmt.Sprintf("Operation Failed: %s", errorDetails.Message))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I mentioned above, if we want to log out for the failure, cancellation is also failure. we need to put the same code there. We might not want to have the same code here and there. So we have logged the failure or status in main worker loop. I think we first need to figure it out why the current logging is not working as expected.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we first need to figure it out why the current logging is not working as expected.

Some details on why it is not working is here: #7297 (comment). Current design expects each controller to individually log an error, which isn't reliable as it requires specific knowledge and clearly hasn't worked well with misses in multiple places. The approach in this PR is one way to fix it. How would you recommend fixing this gap? If the root cause is something else according to your understanding, then please add details on that.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we first need to figure it out why the current logging is not working as expected.

Some details on why it is not working is here: #7297 (comment). Current design expects each controller to individually log an error, which isn't reliable as it requires specific knowledge and clearly hasn't worked well with misses in multiple places.

To be clear, we have logged error individually in the main worker loop, not in controller. When controller creates Result, it needs to be expected to return Result and error. then the following code must log out the error. This the centralized logging part for all controllers.

result, err := asyncCtrl.Run(asyncReqCtx, asyncReq)
code := ""
if result.Error != nil {
code = result.Error.Code
}
logger.Info("Operation returned", "success", result.Error == nil, "code", code, "provisioningState", result.ProvisioningState(), "err", err)

The approach in this PR is one way to fix it. How would you recommend fixing this gap? If the root cause is something else according to your understanding, then please add details on that.

I think the PR is the workaround. We first need to understand the root cause and then have the right fix here. in this comments - #7297 (comment), we might not be able to repro the same issue.

I do not know what happens. but it might have numerous reasons. For example, RP could crash before logging. Another hypothesis is that when workflow step collects the logs, the log might be missing. Functional tests save the RP logs by using kubectl logs, which is not reliable.

Copy link
Contributor Author

@kachawla kachawla Mar 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@youngbupark Let me ask this to build a better understanding of the current state -

  1. Can you share link to where the code is that logs the error caught here?

  2. Is this log statement unnecessary https://github.com/radius-project/radius/blob/main/pkg/portableresources/backend/controller/createorupdateresource.go#L100?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@youngbupark Let me ask this to build a better understanding of the current state -

  1. Can you share link to where the code is that logs the error caught here?

This is a backend controller. asyncCtrl.Run calls func (c *DeleteResource[P, T]) Run(ctx context.Context, request *ctrl.Request) (ctrl.Result, error) {.

result, err := asyncCtrl.Run(asyncReqCtx, asyncReq)

For additional code flow, I recommend to debug the code with the debugger.

  1. Is this log statement unnecessary https://github.com/radius-project/radius/blob/main/pkg/portableresources/backend/controller/createorupdateresource.go#L100?

This is a server-side code. It may or may not. This is the error logging in controller. If controller author thinks it is useful with the better message and print out the error earlier, they can add it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@youngbupark Let me ask this to build a better understanding of the current state -

  1. Can you share link to where the code is that logs the error caught here?

This is a backend controller. asyncCtrl.Run calls func (c *DeleteResource[P, T]) Run(ctx context.Context, request *ctrl.Request) (ctrl.Result, error) {.

result, err := asyncCtrl.Run(asyncReqCtx, asyncReq)

For additional code flow, I recommend to debug the code with the debugger.

I'm aware of where it is called from. I'm asking where is the error logged for the scenario I linked above.

  1. Is this log statement unnecessary https://github.com/radius-project/radius/blob/main/pkg/portableresources/backend/controller/createorupdateresource.go#L100?

This is a server-side code. It may or may not. This is the error logging in controller. If controller author thinks it is useful with the better message and print out the error earlier, they can add it.

My question is - is this redundant?

Copy link

@youngbupark youngbupark Mar 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@youngbupark Let me ask this to build a better understanding of the current state -

  1. Can you share link to where the code is that logs the error caught here?

This is a backend controller. asyncCtrl.Run calls func (c *DeleteResource[P, T]) Run(ctx context.Context, request *ctrl.Request) (ctrl.Result, error) {.

result, err := asyncCtrl.Run(asyncReqCtx, asyncReq)

For additional code flow, I recommend to debug the code with the debugger.

I'm aware of where it is called from. I'm asking where is the error logged for the scenario I linked above.

There are two places:

logger.Info("Operation returned", "success", result.Error == nil, "code", code, "provisioningState", result.ProvisioningState(), "err", err)

logger.Error(err, "Operation Failed")

And I strongly recommend to debug the code flow to understand the details.

  1. Is this log statement unnecessary https://github.com/radius-project/radius/blob/main/pkg/portableresources/backend/controller/createorupdateresource.go#L100?

This is a server-side code. It may or may not. This is the error logging in controller. If controller author thinks it is useful with the better message and print out the error earlier, they can add it.

My question is - is this redundant?

As I said, "It may or may not". In general, If the log is useful for debugging the server, then it should not be redundant. From what I read, it is more readable than just printing out pure go error. I do not think it is redundant. To understand why, I would reach out to yetkin who contribute to this code.

Copy link
Contributor Author

@kachawla kachawla Mar 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are logging the same error twice, then that is redundant imo. It was added originally because the error wasn't getting logged anywhere else (same problem that this PR is trying to solve).

@youngbupark
Copy link

The problem with this code that I'm highlighting is that this log isn't exercised only for success scenarios, the current implementation updates the value for code here in case of an error, and then adds it here with the impression that the request has succeeded.

Then I would change this code only to use logger.Error for error case and use logger.Info for normal scenario.

@kachawla
Copy link
Contributor Author

The problem with this code that I'm highlighting is that this log isn't exercised only for success scenarios, the current implementation updates the value for code here in case of an error, and then adds it here with the impression that the request has succeeded.

Then I would change this code only to use logger.Error for error case and use logger.Info for normal scenario.

I'm not following this. Are you agreeing that there is a problem with the code I linked above?

@youngbupark
Copy link

youngbupark commented Mar 18, 2024

The problem with this code that I'm highlighting is that this log isn't exercised only for success scenarios, the current implementation updates the value for code here in case of an error, and then adds it here with the impression that the request has succeeded.

Then I would change this code only to use logger.Error for error case and use logger.Info for normal scenario.

I'm not following this. Are you agreeing that there is a problem with the code I linked above?

I do not have the strong opinion to use logger.Error and logger.Info separately or not here. If you think using the different logger helper method is right for error case, it might be good to add logger.Error to this code instead of changing the other codes including NewFailedResult helper.

BTW, I dug it out the original issue, please see my comment - #7297 (comment)

@kachawla
Copy link
Contributor Author

The problem with this code that I'm highlighting is that this log isn't exercised only for success scenarios, the current implementation updates the value for code here in case of an error, and then adds it here with the impression that the request has succeeded.

Then I would change this code only to use logger.Error for error case and use logger.Info for normal scenario.

I'm not following this. Are you agreeing that there is a problem with the code I linked above?

I do not have the strong opinion to use logger.Error and logger.Info separately or not here. If you think using the different logger helper method is right for error case, it might be good to add logger.Error to this code instead of changing the other codes including NewFailedResult helper.

BTW, I dug it out the original issue, please see my comment - #7297 (comment)

Ok I will look into fixing it in async worker. Will sync up with you offline on these questions I have for my clarity #7296 (comment)

@kachawla
Copy link
Contributor Author

There are multiple problems that clearly need further investigation (missing trace id, logs) so closing this PR at this point. I'll look into the next steps and either create follow up issues or PR.

@kachawla kachawla closed this Mar 21, 2024
func (r *Result) SetFailed(err v1.ErrorDetails, requeue bool) {
func (r *Result) SetFailed(ctx context.Context, err error, errorDetails v1.ErrorDetails, requeue bool) {
logger := ucplog.FromContextOrDiscard(ctx)
logger.Error(err, fmt.Sprintf("Operation Failed: %s", errorDetails.Message))
Copy link

@youngbupark youngbupark Mar 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another downside to add logging here is that whenever backend controller calls this method, it will log out. Basically, backend controller devs wants to print out log when the error is ready. All these helpers is to construct and set the failure reason, not print out the log. adding logging here doesn't seem right..

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Log Errors for Failed Requests
3 participants