Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature] Scalar UDF support #222

Merged
merged 45 commits into from
Sep 23, 2024
Merged

Conversation

taniabogatsch
Copy link
Collaborator

@taniabogatsch taniabogatsch commented May 31, 2024

This is a draft PR. I built with duckdb's feature branch, as scalar UDFs are not yet part of the released C API. The test failures are related to that.

Because of that, and because this PR depends on #219, there are a lot of file changes in this PR, as well as separate libraries. They can all be ignored. The relevant files are listed below.

  • scalar_udf.go
  • scalar_udf_test.go

@JAicewizard, I solved passing types differently than you did in #201. What do you think? Is passing SQL type names a sensitive idea? I also tried to solely use the DataChunk API draft's functions.

In later PRs, we can extend this with ExecuteColumn.

Here is the example included in this PR.

type scalarUDF struct {
	err error
}

func (udf *scalarUDF) Config() ScalarFunctionConfig {
	return ScalarFunctionConfig{
		InputTypes: []string{"INT", "INT"},
		ResultType: "INT",
	}
}

func (udf *scalarUDF) ExecuteRow(args []driver.Value) (any, error) {
	if len(args) != 2 {
		return nil, errors.New("expected two values")
	}
	val := args[0].(int32) + args[1].(int32)
	return val, nil
}

func (udf *scalarUDF) SetError(err error) {
	udf.err = err
}

func TestScalarUDFPrimitive(t *testing.T) {
	db, err := sql.Open("duckdb", "")
	require.NoError(t, err)

	c, err := db.Conn(context.Background())
	require.NoError(t, err)

	var udf scalarUDF
	err = RegisterScalarUDF(c, "my_sum", &udf)

	var msg int
	row := db.QueryRow(`SELECT my_sum(10, 42) AS msg`)
	require.NoError(t, row.Scan(&msg))
	require.Equal(t, 52, msg)
	require.NoError(t, db.Close())
}

@JAicewizard
Copy link
Contributor

JAicewizard commented May 31, 2024

@JAicewizard, I solved passing types differently than you did in #201. What do you think? Is passing SQL type names a sensitive idea?

I like the idea, however I think it might be difficult to handle composite types. Handling these would require a parser for to go from "struct{x:INT, y:DOUBLE}" to a duckdb type. Personally I also prefer to have any errors be returned when "creating" the types, since then creation of the error is very close to where it problem lies. Another thing to think about is what to do with aliases that duckdb uses, do we want to mirror those, and if we do how will we keep the list updated? (and do we change the name when duckdb does?)

@JAicewizard
Copy link
Contributor

On a seperate note, I like how few methods you need to implement, I think the tablefunctions are very complex to implement. This is kind of inherent to the type of function, but I still dislike it

scalar_udf.go Outdated Show resolved Hide resolved
@taniabogatsch
Copy link
Collaborator Author

On a seperate note, I like how few methods you need to implement, I think the tablefunctions are very complex to implement. This is kind of inherent to the type of function, but I still dislike it

Yes, your PR was a great help to write this! And I noticed the same, there is significantly more logic around the table UDFs.

I like the idea, however I think it might be difficult to handle composite types. Handling these would require a parser for to go from "struct{x:INT, y:DOUBLE}" to a duckdb type.

Another thing to think about is what to do with aliases that duckdb uses, do we want to mirror those, and if we do how will we keep the list updated? (and do we change the name when duckdb does?)

The parser is a fair argument against this. We currently have func logicalTypeName(lt C.duckdb_logical_type) string {...}, which does a bit of that, but it would have to be much improved. Additionally, the second argument is a strong one for me. If we stick to the go types and go-duckdb types, then they match the casts from any in the execution function. Maybe we can draft a Type interface in a separate PR, though? As it is a general concept that we want to introduce for UDFs... And it should integrate well with the existing exposed types... 🤔

Personally I also prefer to have any errors be returned when "creating" the types, since then creation of the error is very close to where it problem lies.

This is independent of which strategy we decide on for the types, no? We can return an error when registering the UDF, if we detect invalid types.

@taniabogatsch taniabogatsch added the feature [feature] request or PR label Jun 6, 2024
@JAicewizard
Copy link
Contributor

This is independent of which strategy we decide on for the types, no? We can return an error when registering the UDF, if we detect invalid types.

true, but for, for example, table UDFs, the types returned types are only determined at bind, it would turn into a generic sql error. But this is definitely debatable, its just my opinion. A seperate PR could be appropriate, I don't have much time due to exams next week. Feel free to copy my implementation if you want, also keep in mind we can add new "constructors" whenever we want, if some duckdb API comes out to parse a string into a type we can do this. (we can even make type an interface)

taniabogatsch and others added 3 commits September 9, 2024 13:08
# Conflicts:
#	Makefile
#	appender.go
#	data_chunk.go
#	deps/darwin_amd64/libduckdb.a
#	deps/darwin_arm64/libduckdb.a
#	deps/freebsd_amd64/libduckdb.a
#	deps/linux_amd64/libduckdb.a
#	deps/linux_arm64/libduckdb.a
#	errors.go
#	types.go
#	vector.go
# Conflicts:
#	Makefile
#	deps/darwin_amd64/libduckdb.a
#	deps/darwin_arm64/libduckdb.a
#	deps/freebsd_amd64/libduckdb.a
#	deps/linux_amd64/libduckdb.a
#	deps/linux_arm64/libduckdb.a
#	errors.go
#	types.go
@taniabogatsch
Copy link
Collaborator Author

@JAicewizard, somehow this PR broke along the way.
The pointer I now receive in the callback does not longer contain the correct handle.
f1d079a

I also tried pinning (with runtime.Pinner), without success. Maybe I am missing something obvious? Any help would be appreciated.

@taniabogatsch
Copy link
Collaborator Author

taniabogatsch commented Sep 11, 2024

Never mind, it is most likely a bug introduced somewhere in the C API, as reverting to an older duckdb build (seems) to fix it. I.e., the same code works with b63142c.
EDIT: Not a bug, I missed the changes in duckdb/duckdb#12663.

@JAicewizard
Copy link
Contributor

Haha Yeah debugging these kind of issues it not fun. Nice that it is fixed. I will look for similar changes in duckdb for the table UDFs when rebasing my branch.

Copy link
Contributor

@JAicewizard JAicewizard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks very nice, it looks a bit messy with all the type_info changes in here as well, but I think this looks good to be merged if the comments are addressed.

Optionally you could add parallel and chunk APIs as well, but I don't think it has any unfixable implications besides what I already mentioned.

vector.go Outdated Show resolved Hide resolved
scalar_udf.go Outdated Show resolved Hide resolved
scalar_udf.go Outdated Show resolved Hide resolved
@taniabogatsch
Copy link
Collaborator Author

Thanks for your review!

Optionally you could add parallel and chunk APIs as well, but I don't think it has any unfixable implications besides what I already mentioned.

I am unsure what you mean with the parallel API, I need to check with your table UDF PR again.
I'll add the chunk API 👍 Even though we are still missing helper functions like SetColumn.

@taniabogatsch
Copy link
Collaborator Author

I also merged the type interface changes in main, so hopefully, it is less messy now.

@JAicewizard
Copy link
Contributor

I am unsure what you mean with the parallel API, I need to check with your table UDF PR again.

I cannot find the scalar function API online at all, so I can't check ATM, but table functions can specify their max threads that they can execute on, and executing on more than one thread of course requires being aware of this and handling local vs global state (thus I implemented them as different types). I don't know if something similar exists for scalar functions

I'll add the chunk API 👍 Even though we are still missing helper functions like SetColumn.

Ah yeah I see the problem, I currently use chunk.SetValue(i, 0, d.count) in the chunk variant, so I still set the individual values. This is surprisingly fast, so I don't see a reason we would need a SetColumn right now, although it is of course good to implement this in the future.

@taniabogatsch
Copy link
Collaborator Author

This is surprisingly fast, so I don't see a reason we would need a SetColumn right now, although it is of course good to implement this in the future.

Yes, I achieved speed-ups on this with #254.
We hold the memory pointers in the vector, so SetValue performers direct access to them without too much function call overhead, etc. It is not quite at the level of vectorized execution over the chunk, but I am also happy with the performance.

@taniabogatsch
Copy link
Collaborator Author

I cannot find the scalar function API online at all, so I can't check ATM, but table functions can specify their max threads that they can execute on, and executing on more than one thread of course requires being aware of this and handling local vs global state (thus I implemented them as different types). I don't know if something similar exists for scalar functions

Ah yes, I've been working with duckdb.h directly, which contains the functions.
I opened an issue here to add the scalar functions to the documentation: duckdb/duckdb-web#3679.

@JAicewizard
Copy link
Contributor

The table function documentation is also broken ATM, I will open an issue about that too.

But I just realised that scalar functions probably don't need any state at all, so there probably is no max threads for scalar functions.

@taniabogatsch
Copy link
Collaborator Author

taniabogatsch commented Sep 19, 2024

Duckdb internally parallelizes scalar function execution, and each chunk is independently executed (on different threads, if available). So, indeed, we do not need states here.

I went over your feedback and pushed some changes.

  • Clean-up pass.
  • Documentation of exported structs.
  • Scalar function example(s).

Could you give another review (after these remaining steps)?

scalar_udf.go Outdated Show resolved Hide resolved
scalar_udf.go Outdated Show resolved Hide resolved
scalar_udf.go Outdated Show resolved Hide resolved
scalar_udf.go Show resolved Hide resolved
scalar_udf.go Outdated Show resolved Hide resolved
scalar_udf.go Outdated Show resolved Hide resolved
@taniabogatsch taniabogatsch marked this pull request as ready for review September 19, 2024 13:39
@taniabogatsch
Copy link
Collaborator Author

Alright, I've implemented the review feedback (thanks!), and from my side, this should be ready to go in.
What do you think @JAicewizard?

@taniabogatsch taniabogatsch merged commit aa90038 into marcboeker:main Sep 23, 2024
4 checks passed
@taniabogatsch taniabogatsch deleted the scalar branch September 23, 2024 15:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature [feature] request or PR
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants