Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Progress Info in Asyncworker #34

Closed
wants to merge 1 commit into from
Closed
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
130 changes: 130 additions & 0 deletions architecture/2023-12-asyncworker-progressInfo.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,130 @@
# Title

* **Status**: Pending
* **Author**: Vinaya Damle (@vinayada1)

## Overview

Currently async worker has no knowledge of the operation progress till the operation is complete. If the operation times out, there is no mechanism in the current design that any information about the operation progress so far can be communicated to the user. As a result, the user only sees the error message that the operation timed out.
Copy link
Contributor

@rynowak rynowak Dec 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's the ability to indicate the progress as a percentage, but I agree no real way to provide user-facing messages.

I think the ability to indicate progress is separate from the question of: "What error message do users see when an operation fails?"

This sounds like a useful feature, but we should have the debate about whether to use it for this.


The proposal here is to add a mechanism so that the running operation can send out progress information to the async worker. As a result, when the deployment operation eventually times out, along with the timed out message, the user will also be able to see the operation progress so far in the error message which could help in troubleshooting the deployment failure.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this idea is pretty cool overall.

I'm curious what happened to the original idea? https://github.com/radius-project/design-notes/pull/22/files

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was trying to implement one of those in that design doc - detect readiness failures by looking at pod events and looks like we need this to be implemented to actually communicate the events to the user. Without this, even though the code can detect the failures, the user still sees - operation timed out.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without this, even though the code can detect the failures, the user still sees - operation timed out.

Why is that the case? If the code inside the container operation implemented the timeout, it can control how the error reporting functions.


## Terms and definitions

AsyncWorker - The current architecture which assigns workers to process queued long-running jobs asynchronously.

## Objectives

> **Issue Reference:** https://github.com/radius-project/radius/issues/6835

### Goals
- Enable reporting of operation progress in async worker

### Non-Goals
- Modify reporting of operation progress in CLI
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can say that this is out of scope of this design doc, but we haven't delivered any value to users until you can see progress in the CLI or other UX.


### User scenarios (optional)

#### User story 1

The readiness probe for the application pod fails. This is a non-terminal failure condition. The current code keeps waiting for the deployment to complete and eventually the deployment times out. The error message to the user is "Deployment timed out after xxx seconds".

If we implement https://github.com/radius-project/radius/issues/6284 and look at pod events, we can detect the readiness probe failure. However, we still need a way to report this information to the user.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have an example of what this would look like? eg: if we deployed a container what set of progress messages would we create?

Would this become the default behavior for all of our async operations? eg: when an operation times out we'd build an error message from the progress updates?


#### User story 2

## Design

In the current design, the following is used to track operation status:-

```
// AsyncOperationStatus represents an OperationStatus resource.
type AsyncOperationStatus struct {
// Id represents the async operation id.
ID string `json:"id,omitempty"`

// Name represents the async operation name and is usually set to the async operation id.
Name string `json:"name,omitempty"`

// Status represents the provisioning state of the resource.
Status ProvisioningState `json:"status,omitempty"`

// StartTime represents the async operation start time.
StartTime time.Time `json:"startTime,omitempty"`

// EndTime represents the async operation end time.
EndTime *time.Time `json:"endTime,omitempty"`

// Error represents the error occurred during provisioning.
Error *ErrorDetails `json:"error,omitempty"`
}
```

### Design details

We will add the following field to AsyncOperationStatus to report the operation progress:-

```
// AsyncOperationStatus represents an OperationStatus resource.
type AsyncOperationStatus struct {
....

// ProgressEvents represents the progress of the async operation.
// The events are represented as a map of key-value pairs where the key is
// the event type such as Info, Error and the value is the event message.
ProgressEvents map[string]string `json:"progressEvents,omitempty"`

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does client gets this ProgressEvents? What's the protocol to get this error as client perspective? In ARM, operationstatuses api is the interface to get the status of async operation.

Therefore, I do not think we need to add new property for process event. The existing errors.details array property is enough to capture all in-progress error event. See this error format

	// Error represents the error occurred during provisioning.
	Error *ErrorDetails `json:"error,omitempty"`

In terms of implementation, we need to pass additional error channel to async worker controller to allow the controller to update the live error. then consumer side can keep adding the error into error.Details array

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does that work in the case of success? My concern is that clients will be confused by the presence of the error field for a success case.

Copy link

@youngbupark youngbupark Dec 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

image

Ah make sense. Then, we can consider using operations array property. What do you think ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where are we going to get the ProgressEvents from? Do we need a formatter for this so that the end user can see it properly?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't aware there was an existing API for this. If there is an existing API, we should do another iteration with the proposed usage.


....
}
```

In the ProgressEvents map, the key represents the event type - Info or Error and the value represents the message string. An info event could be something like "All containers for pod are ready". The readiness probe failure described earlier could be reported as an error event.

When the operation times out, with this change, the user should now be able to see an error message similar to:-
"Operation (APPLICATIONS.CORE/CONTAINERS|PUT) has timed out because it was processing longer than xx s. Progress events:
Info - Container ctnr-bad-readiness is running but not ready
Error - Container failed readiness probe. Reason: Unhealthy, Message: Readiness probe failed: Get \"http://10.244.0.10:5000/bad\": dial tcp 10.244.0.10:5000: connect: connection refused"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does order matter?

The way that this is defined there's a single message allowed for each severity level. That's not straightforward to understand. A design that's more like a log (sequence of events) is what I would expect.


### API design (if applicable)

NA
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely applicable :) this whole item is API design.


## Alternatives considered

The following field in the AsyncOperationStatus is not really being used right now:-

```
// Error represents the error occurred during provisioning.
Error *ErrorDetails `json:"error,omitempty"`
```
We could repurpose it to report failures on the async operation. However, adding an explicit field on the status to report progress seems like a better option to report even informational progress and not just errors.

## Test plan

<!--
Include the test plan to validate the features including the areas that
need functional tests.

Describe any functionality that will create new testing challenges:
- New dependencies
- External assets that tests need to access
- Features that do I/O or change OS state and are thus hard to unit test
-->

## Security

NA


## Monitoring

NA

## Development plan

- Implement the design
- Unit tests

## Open issues

NA
Loading