feat(udf): support always_retry_on_network_error config for udf functions #15163

Merged: 11 commits into main from kwannoel/always-retry-failed-udf, Feb 27, 2024


kwannoel
Contributor

@kwannoel kwannoel commented Feb 20, 2024

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

Closes #15137.

The main issue is that when a UDF call fails, the function call returns NULL instead.
If the function call updates a key, e.g. for a join or group by, the state can become inconsistent.

As such, we provide an option to define a UDF that always retries on network errors, since we cannot tolerate the UDF failing non-deterministically in such cases.

Here are the changes:

  1. Introduce a new CreateFunctionWithOptions.
  2. Since it's an AST object, each of its fields is optional. This keeps it compatible with Display (otherwise, with no WITH options or only some options set, we would still display every field of CreateFunctionWithOptions; see the tests for examples). This also preserves the AST structure.
  3. Flatten CreateFunctionWithOptions into parameters of the Function and UserDefinedFunction definitions.
  4. Initialize the udf expr with always_retry_on_network_error.
  5. Add tests for display of CreateFunctionWithOptions.
  6. Add tests that compare MV contents with and without retry.
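
A minimal sketch of steps 1 to 3, assuming a single optional field and a `Display` impl that prints nothing when no option is set. The struct layout and the `flatten` helper are illustrative assumptions, not the definitions merged in this PR.

```rust
use std::fmt;

/// Hypothetical mirror of the AST node described above.
/// The real definition lives in the SQL parser and may differ.
#[derive(Debug, Clone, Default, PartialEq, Eq)]
pub struct CreateFunctionWithOptions {
    /// `None` means the user did not write this option at all.
    pub always_retry_on_network_error: Option<bool>,
}

impl fmt::Display for CreateFunctionWithOptions {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Collect only the options that were explicitly set.
        let mut options = Vec::new();
        if let Some(v) = self.always_retry_on_network_error {
            options.push(format!("always_retry_on_network_error = {v}"));
        }
        if options.is_empty() {
            // No WITH clause in the input, so none in the output.
            return Ok(());
        }
        write!(f, "WITH ( {} )", options.join(", "))
    }
}

/// Hypothetical flattening step: turn the optional AST field into the
/// plain parameter stored on the function definition (assumed default: false).
pub fn flatten(options: &CreateFunctionWithOptions) -> bool {
    options.always_retry_on_network_error.unwrap_or(false)
}
```

Keeping the field as `Option<bool>` lets the printed statement round-trip: an unset option prints as an empty string rather than as `always_retry_on_network_error = false`.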

Checklist

  • I have written necessary rustdoc comments
  • I have added necessary unit tests and integration tests
  • I have added test labels as necessary. See details.
  • I have added fuzzing tests or opened an issue to track them. (Optional, recommended for new SQL features Sqlsmith: Sql feature generation #7934).
  • My PR contains breaking changes. (If it deprecates some features, please create a tracking issue to remove them in the future).
  • All checks passed in ./risedev check (or alias, ./risedev c)
  • My PR changes performance-critical code. (Please run macro/micro-benchmarks and show the results.)
  • My PR contains critical fixes that are necessary to be merged into the latest release. (Please check out the details)

Documentation

  • My PR needs documentation updates. (Please use the Release note section below to summarize the impact on users)

Release note

If this PR includes changes that directly affect users or other significant modifications relevant to the community, kindly draft a release note to provide a concise summary of these changes. Please prioritize highlighting the impact these changes will have on users.

Users can now do:

CREATE FUNCTION ... WITH ( always_retry_on_network_error = true );

This means calls to that function will always be retried on network errors.

Note that the entire stream graph will be blocked while the UDF server is offline, until it comes back online.
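
To illustrate why the stream graph blocks, here is a rough sketch of the retry semantics, assuming a simplified error type. `UdfError` and `call_with_always_retry` are hypothetical names for illustration only; the actual client code presumably also waits between attempts.

```rust
/// Hypothetical, simplified error type for a UDF call.
#[derive(Debug, PartialEq)]
enum UdfError {
    /// The UDF server could not be reached.
    Network,
    /// Any other failure, e.g. the function itself raised an error.
    Other(String),
}

/// Retry `call` for as long as it fails with a network error.
/// Returns only on success or on a non-network error, so the caller
/// stays blocked while the server is unreachable.
fn call_with_always_retry<T>(
    mut call: impl FnMut() -> Result<T, UdfError>,
) -> Result<T, UdfError> {
    loop {
        match call() {
            // Assumption: a real implementation would back off here.
            Err(UdfError::Network) => continue,
            // Success and non-network errors are returned as-is.
            other => return other,
        }
    }
}
```

Only network errors are retried in this sketch; any other error is still surfaced to the caller immediately.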

Member

@xxchan xxchan left a comment


qq: What's the relationship with #15171?

@kwannoel
Contributor Author

kwannoel commented Feb 22, 2024

> qq: What's the relationship with #15171?

This PR makes it configurable as a WITH option.

The other PR just hardcodes it to always retry. The reason is that I want to introduce as little code as possible into the user's cluster to avoid breakage, and their UDFs must always retry.

@kwannoel
Contributor Author

Can @xxchan and @wangrunji0408 PTAL? I want to include this in 1.7. It's a breaking change, so users should be able to upgrade their UDFs ASAP, since they have to recreate all dependent MVs.

@kwannoel
Contributor Author

You can review the code first; there is a separate failure caused by some migration issues.

Member

@xxchan xxchan left a comment


Generally LGTM. I'm not sure what the best way is, but I think offering this option doesn't hurt.

@@ -126,7 +128,12 @@ impl UserDefinedFunction {
             UdfImpl::JavaScript(runtime) => runtime.call(&self.identifier, &input)?,
             UdfImpl::External(client) => {
                 let disable_retry_count = self.disable_retry_count.load(Ordering::Relaxed);
-                let result = if disable_retry_count != 0 {
+                let result = if self.always_retry_on_network_error {
Member


Should we put all the disable_retry_count-related stuff in the else branch?

Contributor Author


Hmm no difference to me? Seems to add more nesting.

Member

@xxchan xxchan Feb 27, 2024


I think mixing them together can lead to confusion. Why do we still update disable_retry_count when always_retry_on_network_error is set?

If nesting doesn't look good, we can add something like call_with_ disable_retry_count...

Contributor Author


Ah I didn't get what you meant originally. Now I do. Updated it.
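
For readers following this thread, the separation being asked for could look roughly like the sketch below: the always-retry path never reads or updates disable_retry_count, and the counter logic lives in its own helper. The `Udf` struct, the `Strategy` enum, the helper names, and the meaning of each non-always-retry branch are assumptions for illustration, not the code that was merged.

```rust
use std::sync::atomic::{AtomicU8, Ordering};

/// Hypothetical, trimmed-down stand-in for UserDefinedFunction.
struct Udf {
    always_retry_on_network_error: bool,
    /// Counter type is an assumption; the diff only shows an atomic load.
    disable_retry_count: AtomicU8,
}

/// Hypothetical labels for how a call would be issued.
#[derive(Debug, PartialEq)]
enum Strategy {
    AlwaysRetry,
    RetryDisabled,
    DefaultRetry,
}

impl Udf {
    fn pick_strategy(&self) -> Strategy {
        if self.always_retry_on_network_error {
            // The counter is neither read nor updated on this path.
            Strategy::AlwaysRetry
        } else {
            self.pick_with_disable_retry_count()
        }
    }

    /// All counter-related logic is kept in one place.
    fn pick_with_disable_retry_count(&self) -> Strategy {
        let disable_retry_count = self.disable_retry_count.load(Ordering::Relaxed);
        if disable_retry_count != 0 {
            // Assumption: a non-zero counter means retries are skipped for now.
            Strategy::RetryDisabled
        } else {
            Strategy::DefaultRetry
        }
    }
}
```

Splitting the helper out keeps the top-level branch flat, which addresses the nesting concern raised earlier in the thread.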

Contributor

@wangrunji0408 wangrunji0408 left a comment


LGTM!

src/expr/udf/src/external.rs (outdated, resolved)
@kwannoel kwannoel requested a review from xxchan February 27, 2024 06:55
@kwannoel kwannoel force-pushed the kwannoel/always-retry-failed-udf branch from 45e305a to 44bd236 Compare February 27, 2024 07:00
@kwannoel kwannoel enabled auto-merge February 27, 2024 07:39
@kwannoel kwannoel added this pull request to the merge queue Feb 27, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Feb 27, 2024
@lmatz lmatz added this pull request to the merge queue Feb 27, 2024
Merged via the queue into main with commit de3696f Feb 27, 2024
30 of 31 checks passed
Labels
ci/run-backwards-compat-tests Run backwards compatibility tests in your PR. type/feature
Development

Successfully merging this pull request may close these issues.

Add always retry config for UDF
5 participants