
Log Errors for Failed Requests #7297

Closed
kachawla opened this issue Mar 9, 2024 · 7 comments
Labels: bug (Something is broken or not working as expected), triaged (This issue has been reviewed and triaged)

@kachawla
Contributor

kachawla commented Mar 9, 2024

Problem

We currently lack error logging on failure for many operations, which hinders our ability to debug failures. This gap exists because errors that aren't propagated back up the call stack (as with async operations) aren't logged automatically. The current state of the code is inconsistent: some operations log errors while others do not.

Here is an example where an error was not logged: #7264

Proposed Solution

To address this, the proposed change is to move error logging into the setFailed method. That way, error logging does not rely on every individual operation remembering to log its own errors.

AB#11425
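
For illustration, a minimal sketch of what centralizing error logging in a setFailed-style helper could look like. The names and signatures here (OperationStatus, the update callback, the logr wiring) are assumptions made for the sketch, not the actual Radius types:

    // A minimal sketch of the proposed direction: log the error inside a
    // setFailed-style helper so logging no longer depends on each operation
    // controller remembering to do it. All names below are illustrative.
    package worker

    import (
        "context"

        "github.com/go-logr/logr"
    )

    // OperationStatus is a stand-in for the terminal state recorded for an
    // async operation.
    type OperationStatus struct {
        State string
        Error string
    }

    // setFailed marks the operation as failed and logs the error in one place.
    func setFailed(ctx context.Context, operationID string, opErr error,
        update func(context.Context, string, OperationStatus) error) error {
        logger := logr.FromContextOrDiscard(ctx)

        // Centralized logging: every failed request passes through here,
        // whether or not the operation controller logged the error itself.
        logger.Error(opErr, "async operation failed", "operationID", operationID)

        status := OperationStatus{State: "Failed", Error: opErr.Error()}
        return update(ctx, operationID, status)
    }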

@kachawla kachawla added the bug Something is broken or not working as expected label Mar 9, 2024
@radius-triage-bot

👋 @kachawla Thanks for filing this bug report.

A project maintainer will review this report and get back to you soon. If you'd like immediate help troubleshooting, please visit our Discord server.

For more information on our triage process please visit our triage overview

@nithyatsu
Contributor

nithyatsu commented Mar 11, 2024

@kachawla, is the issue easy to reproduce? If so, could you please share the steps with me and I will look into this. I went through the PR's approach of propagating the error; I am trying to understand why the logs get dropped.

@kachawla
Contributor Author

@kachawla, is the issue easy to reproduce? If so, could you please share the steps with me and I will look into this. I went through the PR's approach of propagating the error; I am trying to understand why the logs get dropped.

@nithyatsu thanks for looking into this. Here is an example of where it happened: #7264. If we hit an error on the delete path for portable resources, we will run into this. The gap exists for core RP resource creation/deletion as well, so it should also be reproducible by injecting an error on any of those paths. It is basically the same kind of issue as what you fixed for create/update requests for portable resources here: #6276
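
For illustration, one hypothetical way to inject such an error on a delete path and check whether it gets logged. The controller name, Result type, and Run signature below are assumptions, not the actual Radius controller interfaces:

    // A hypothetical repro: make an async delete controller always fail, then
    // check whether anything in the async worker logs the error before the
    // operation is marked as failed. Names and signatures are placeholders.
    package repro

    import (
        "context"
        "errors"
    )

    // Result is a placeholder for whatever an async controller returns.
    type Result struct{}

    // DeletePortableResource stands in for a portable-resource delete controller.
    type DeletePortableResource struct{}

    // Run returns an injected error so the missing-log behavior can be observed.
    func (c *DeletePortableResource) Run(ctx context.Context, request string) (Result, error) {
        return Result{}, errors.New("injected failure: checking whether the error is logged on the delete path")
    }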

@sylvainsf sylvainsf added the triaged This issue has been reviewed and triaged label Mar 11, 2024
@radius-triage-bot

👍 We've reviewed this issue and have agreed to add it to our backlog. Please subscribe to this issue for notifications; we'll provide updates when we pick it up.

We also welcome community contributions! If you would like to pick this item up sooner and submit a pull request, please visit our contribution guidelines and assign this to yourself by commenting "/assign" on this issue.

For more information on our triage process please visit our triage overview

@kachawla kachawla self-assigned this Mar 11, 2024
@youngbupark

This is the error log from the functional tests. It is coming from the k8s API server and SDK. Unless we can get all logs from the k8s-related pods, it is hard to figure out why. It is also difficult to repro. I am sure that Radius was working as expected, but the k8s API server was not. If we could see a similar issue in the long-running test, we would be able to get more details from Log Analytics, because the long-running test logs all k8s-related pods.

2024/03/05 04:16:27 Start streaming Kubernetes logs - Pod dapr-frontend-5448d95787-8dhhg is in state: Failed
    cli.go:418: [rad] Error: {
    cli.go:418: [rad]   "code": "Internal",
    cli.go:418: [rad]   "message": "could not find API version for type \"core/Secret\": unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request"

@kachawla
Contributor Author

kachawla commented Mar 18, 2024

This is the error log from the functional tests. It is coming from the k8s API server and SDK. Unless we can get all logs from the k8s-related pods, it is hard to figure out why. It is also difficult to repro. I am sure that Radius was working as expected, but the k8s API server was not. If we could see a similar issue in the long-running test, we would be able to get more details from Log Analytics, because the long-running test logs all k8s-related pods.

2024/03/05 04:16:27 Start streaming Kubernetes logs - Pod dapr-frontend-5448d95787-8dhhg is in state: Failed
    cli.go:418: [rad] Error: {
    cli.go:418: [rad]   "code": "Internal",
    cli.go:418: [rad]   "message": "could not find API version for type \"core/Secret\": unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request"

I had already seen this log :). I added it here since the logging done by the functional tests was available: #7264 (comment). But if you search through the RP logs (attached in this comment: #7264 (comment)), this error isn't logged, which is a gap that I have seen in other places as well and am trying to figure out. I'm sure we can reproduce the logging issue by injecting any client-side error and confirming whether logging is really being skipped. I agree that the particular error we saw in the functional test failure isn't easily reproducible.

@kachawla
Contributor Author

kachawla commented Mar 27, 2024

Fixed this in #7370 and created #7369 with suggestions on further improvements in the async worker.
