[Feature] Scalar UDF support #222

taniabogatsch · 2024-05-31T13:50:24Z

This is a draft PR. I built with duckdb's feature branch, as scalar UDFs are not yet part of the released C API. The test failures are related to that.

Because of that, and because this PR depends on #219, there are a lot of file changes in this PR, as well as separate libraries. They can all be ignored. The relevant files are listed below.

scalar_udf.go
scalar_udf_test.go

@JAicewizard, I solved passing types differently than you did in #201. What do you think? Is passing SQL type names a sensible idea? I also tried to solely use the DataChunk API draft's functions.

In later PRs, we can extend this with ExecuteColumn.

Here is the example included in this PR.

type scalarUDF struct {
	err error
}

func (udf *scalarUDF) Config() ScalarFunctionConfig {
	return ScalarFunctionConfig{
		InputTypes: []string{"INT", "INT"},
		ResultType: "INT",
	}
}

func (udf *scalarUDF) ExecuteRow(args []driver.Value) (any, error) {
	if len(args) != 2 {
		return nil, errors.New("expected two values")
	}
	val := args[0].(int32) + args[1].(int32)
	return val, nil
}

func (udf *scalarUDF) SetError(err error) {
	udf.err = err
}

func TestScalarUDFPrimitive(t *testing.T) {
	db, err := sql.Open("duckdb", "")
	require.NoError(t, err)

	c, err := db.Conn(context.Background())
	require.NoError(t, err)

	var udf scalarUDF
	err = RegisterScalarUDF(c, "my_sum", &udf)
	require.NoError(t, err)

	var msg int
	row := db.QueryRow(`SELECT my_sum(10, 42) AS msg`)
	require.NoError(t, row.Scan(&msg))
	require.Equal(t, 52, msg)
	require.NoError(t, c.Close())
	require.NoError(t, db.Close())
}

JAicewizard · 2024-05-31T22:19:06Z

@JAicewizard, I solved passing types differently than you did in #201. What do you think? Is passing SQL type names a sensible idea?

I like the idea, however I think it might be difficult to handle composite types. Handling these would require a parser to go from "struct{x:INT, y:DOUBLE}" to a duckdb type. Personally I also prefer to have any errors be returned when "creating" the types, since then creation of the error is very close to where the problem lies. Another thing to think about is what to do with aliases that duckdb uses, do we want to mirror those, and if we do how will we keep the list updated? (and do we change the name when duckdb does?)
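To illustrate the "errors at creation" point, here is a minimal sketch of constructor-style type descriptions. All names (`typeDesc`, `field`, `newPrimitive`, `newStruct`) are hypothetical and are not part of go-duckdb or of this PR; the sketch only shows how a composite type could be assembled without a string parser, with invalid input rejected where the type is built.

```go
package main

import (
	"errors"
	"fmt"
)

// typeDesc is a hypothetical, simplified type description.
// It is not the go-duckdb API.
type typeDesc struct {
	name   string
	fields []field // only set for STRUCT
}

// field is one named member of a hypothetical STRUCT description.
type field struct {
	name string
	typ  typeDesc
}

// newPrimitive returns a description of a primitive type such as INT or DOUBLE.
func newPrimitive(name string) (typeDesc, error) {
	if name == "" {
		return typeDesc{}, errors.New("empty type name")
	}
	return typeDesc{name: name}, nil
}

// newStruct builds a STRUCT description and reports invalid input immediately.
func newStruct(fields ...field) (typeDesc, error) {
	if len(fields) == 0 {
		return typeDesc{}, errors.New("STRUCT needs at least one field")
	}
	seen := make(map[string]bool, len(fields))
	for _, f := range fields {
		if f.name == "" {
			return typeDesc{}, errors.New("STRUCT field with empty name")
		}
		if seen[f.name] {
			return typeDesc{}, fmt.Errorf("duplicate STRUCT field %q", f.name)
		}
		seen[f.name] = true
	}
	return typeDesc{name: "STRUCT", fields: fields}, nil
}

func main() {
	intT, _ := newPrimitive("INT")
	dblT, _ := newPrimitive("DOUBLE")

	// Valid composite type: no string parsing involved.
	st, err := newStruct(field{"x", intT}, field{"y", dblT})
	fmt.Println(st.name, len(st.fields), err)

	// Invalid composite type: the error surfaces at creation.
	_, err = newStruct(field{"x", intT}, field{"x", dblT})
	fmt.Println(err)
}
```

With this shape, the caller sees the error on the line that builds the type, rather than later as a generic SQL error.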

JAicewizard · 2024-05-31T22:23:08Z

On a separate note, I like how few methods you need to implement. I think the table functions are very complex to implement. This is kind of inherent to the type of function, but I still dislike it.

taniabogatsch · 2024-06-06T08:23:34Z

On a separate note, I like how few methods you need to implement. I think the table functions are very complex to implement. This is kind of inherent to the type of function, but I still dislike it.

Yes, your PR was a great help to write this! And I noticed the same, there is significantly more logic around the table UDFs.

I like the idea, however I think it might be difficult to handle composite types. Handling these would require a parser to go from "struct{x:INT, y:DOUBLE}" to a duckdb type.

Another thing to think about is what to do with aliases that duckdb uses, do we want to mirror those, and if we do how will we keep the list updated? (and do we change the name when duckdb does?)

The parser is a fair argument against this. We currently have func logicalTypeName(lt C.duckdb_logical_type) string {...}, which does a bit of that, but it would have to be much improved. Additionally, the second argument is a strong one for me. If we stick to the go types and go-duckdb types, then they match the casts from any in the execution function. Maybe we can draft a Type interface in a separate PR, though? As it is a general concept that we want to introduce for UDFs... And it should integrate well with the existing exposed types... 🤔

Personally I also prefer to have any errors be returned when "creating" the types, since then creation of the error is very close to where the problem lies.

This is independent of which strategy we decide on for the types, no? We can return an error when registering the UDF, if we detect invalid types.
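As a rough illustration of returning an error at registration time, the sketch below validates SQL type names before anything is registered. The `scalarFunctionConfig` struct only mirrors the shape of the draft `ScalarFunctionConfig` from the example above; `knownTypes` and `validateConfig` are hypothetical names, and the allowlist is deliberately tiny and not a mirror of duckdb's type names or aliases.

```go
package main

import (
	"fmt"
	"strings"
)

// scalarFunctionConfig mirrors the shape of the draft ScalarFunctionConfig.
type scalarFunctionConfig struct {
	InputTypes []string
	ResultType string
}

// knownTypes is a deliberately tiny, hypothetical allowlist for illustration.
var knownTypes = map[string]bool{
	"INT":     true,
	"BIGINT":  true,
	"DOUBLE":  true,
	"VARCHAR": true,
}

// validateConfig rejects unknown type names before any query runs.
func validateConfig(cfg scalarFunctionConfig) error {
	for i, name := range cfg.InputTypes {
		if !knownTypes[strings.ToUpper(name)] {
			return fmt.Errorf("input type %d: unknown type name %q", i, name)
		}
	}
	if !knownTypes[strings.ToUpper(cfg.ResultType)] {
		return fmt.Errorf("result type: unknown type name %q", cfg.ResultType)
	}
	return nil
}

func main() {
	ok := scalarFunctionConfig{InputTypes: []string{"INT", "INT"}, ResultType: "INT"}
	bad := scalarFunctionConfig{InputTypes: []string{"INT", "INTT"}, ResultType: "INT"}

	fmt.Println(validateConfig(ok))
	fmt.Println(validateConfig(bad))
}
```

A registration function could call such a check first and return its error directly, independent of whether types are passed as strings or as dedicated type values.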

JAicewizard · 2024-06-06T09:33:35Z

This is independent of which strategy we decide on for the types, no? We can return an error when registering the UDF, if we detect invalid types.

True, but for table UDFs, for example, the returned types are only determined at bind time, so it would turn into a generic SQL error. But this is definitely debatable; it's just my opinion. A separate PR could be appropriate. I don't have much time due to exams next week. Feel free to copy my implementation if you want. Also keep in mind that we can add new "constructors" whenever we want: if some duckdb API comes out to parse a string into a type, we can do this. (We can even make the type an interface.)

taniabogatsch · 2024-09-11T16:27:04Z

@JAicewizard, somehow this PR broke along the way.
The pointer I now receive in the callback no longer contains the correct handle.
f1d079a

I also tried pinning (with runtime.Pinner), without success. Maybe I am missing something obvious? Any help would be appreciated.

taniabogatsch · 2024-09-11T17:11:42Z

Never mind, it is most likely a bug introduced somewhere in the C API, as reverting to an older duckdb build seems to fix it. I.e., the same code works with b63142c.
EDIT: Not a bug, I missed the changes in duckdb/duckdb#12663.

JAicewizard · 2024-09-12T12:43:41Z

Haha, yeah, debugging these kinds of issues is not fun. Nice that it is fixed. I will look for similar changes in duckdb for the table UDFs when rebasing my branch.

JAicewizard

This looks very nice. It is a bit messy with all the type_info changes in here as well, but I think this is good to be merged once the comments are addressed.

Optionally you could add parallel and chunk APIs as well, but I don't think it has any unfixable implications besides what I already mentioned.

taniabogatsch · 2024-09-18T11:20:02Z

Thanks for your review!

Optionally you could add parallel and chunk APIs as well, but I don't think it has any unfixable implications besides what I already mentioned.

I am unsure what you mean by the parallel API, I need to check with your table UDF PR again.
I'll add the chunk API 👍 Even though we are still missing helper functions like SetColumn.

taniabogatsch · 2024-09-18T11:20:55Z

I also merged the type interface changes in main, so hopefully, it is less messy now.

JAicewizard · 2024-09-18T11:39:51Z

I am unsure what you mean by the parallel API, I need to check with your table UDF PR again.

I cannot find the scalar function API online at all, so I can't check ATM, but table functions can specify their max threads that they can execute on, and executing on more than one thread of course requires being aware of this and handling local vs global state (thus I implemented them as different types). I don't know if something similar exists for scalar functions

I'll add the chunk API 👍 Even though we are still missing helper functions like SetColumn.

Ah yeah I see the problem, I currently use chunk.SetValue(i, 0, d.count) in the chunk variant, so I still set the individual values. This is surprisingly fast, so I don't see a reason we would need a SetColumn right now, although it is of course good to implement this in the future.

taniabogatsch · 2024-09-18T11:44:58Z

This is surprisingly fast, so I don't see a reason we would need a SetColumn right now, although it is of course good to implement this in the future.

Yes, I achieved speed-ups on this with #254.
We hold the memory pointers in the vector, so SetValue performs direct access to them without too much function call overhead, etc. It is not quite at the level of vectorized execution over the chunk, but I am also happy with the performance.
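The difference between row-wise writes and a bulk helper can be sketched as follows. `int32Vector` and its methods are hypothetical simplifications for illustration; they are not the go-duckdb vector or DataChunk types, and `SetColumn` here only shows what such a helper might look like.

```go
package main

import "fmt"

// int32Vector is a hypothetical, simplified column: it keeps a typed view of
// its backing memory so that writes are plain slice stores.
type int32Vector struct {
	data []int32
}

// SetValue writes a single row directly into the backing memory.
func (v *int32Vector) SetValue(row int, val int32) {
	v.data[row] = val
}

// SetColumn is a hypothetical bulk helper that copies a whole column at once.
// It returns the number of values written.
func (v *int32Vector) SetColumn(vals []int32) int {
	return copy(v.data, vals)
}

func main() {
	vec := &int32Vector{data: make([]int32, 4)}

	// Row-wise: one call per value.
	for i := range vec.data {
		vec.SetValue(i, int32(i*10))
	}
	fmt.Println(vec.data)

	// Column-wise: one call for all values.
	vec.SetColumn([]int32{1, 2, 3, 4})
	fmt.Println(vec.data)
}
```

When the per-row write is just a store into memory that the vector already holds, the remaining cost is mostly the call itself, which is consistent with the row-wise path being fast enough for now.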

taniabogatsch · 2024-09-18T11:47:28Z

I cannot find the scalar function API online at all, so I can't check ATM, but table functions can specify their max threads that they can execute on, and executing on more than one thread of course requires being aware of this and handling local vs global state (thus I implemented them as different types). I don't know if something similar exists for scalar functions

Ah yes, I've been working with duckdb.h directly, which contains the functions.
I opened an issue here to add the scalar functions to the documentation: duckdb/duckdb-web#3679.

JAicewizard · 2024-09-18T12:05:39Z

The table function documentation is also broken ATM, I will open an issue about that too.

But I just realised that scalar functions probably don't need any state at all, so there probably is no max threads for scalar functions.

taniabogatsch · 2024-09-19T09:38:56Z

Duckdb internally parallelizes scalar function execution, and each chunk is independently executed (on different threads, if available). So, indeed, we do not need states here.
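The idea that independently executed chunks need no shared state can be sketched in plain Go. This is only an analogy using goroutines: `addRow` and `executeChunks` are hypothetical names, and the sketch does not reflect how duckdb schedules its threads.

```go
package main

import (
	"fmt"
	"sync"
)

// addRow is a stateless row function: its result depends only on its inputs.
func addRow(a, b int32) int32 { return a + b }

// executeChunks runs addRow over every chunk in its own goroutine. Each
// goroutine writes only to its own output slot, so no locking is needed.
func executeChunks(chunks [][][2]int32) [][]int32 {
	out := make([][]int32, len(chunks))
	var wg sync.WaitGroup
	for i, chunk := range chunks {
		wg.Add(1)
		go func(i int, chunk [][2]int32) {
			defer wg.Done()
			res := make([]int32, len(chunk))
			for r, row := range chunk {
				res[r] = addRow(row[0], row[1])
			}
			out[i] = res
		}(i, chunk)
	}
	wg.Wait()
	return out
}

func main() {
	chunks := [][][2]int32{
		{{10, 42}, {1, 2}},
		{{3, 4}},
	}
	fmt.Println(executeChunks(chunks))
}
```

Because the row function reads nothing but its arguments, the order and the thread on which chunks run do not affect the result.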

I went over your feedback and pushed some changes.

Clean-up pass.
Documentation of exported structs.
Scalar function example(s).

Could you give another review (after these remaining steps)?

taniabogatsch · 2024-09-19T13:40:34Z

Alright, I've implemented the review feedback (thanks!), and from my side, this should be ready to go in.
What do you think @JAicewizard?

taniabogatsch and others added 11 commits May 24, 2024 13:09

use duckdb's feature branch

f5aaa94

Re-build static libraries

6da3dbd

trigger tests

274e029

trigger tests

0687448

initial commit towards a data chunk api

032239d

towards exposing an amazing DataChunk

a231352

Merge branch 'data-chunks' into scalar

23522f6

initial scalar UDF support

4bf7c62

more initialisation and primitive getter

cc2a104

Merge branch 'data-chunks' into scalar

8781142

remove capi example

b63142c

JAicewizard reviewed May 31, 2024

taniabogatsch added the feature [feature] request or PR label Jun 6, 2024

taniabogatsch and others added 3 commits September 9, 2024 13:08

Merge branch 'main' into scalar

dc9ec28

# Conflicts: # Makefile # appender.go # data_chunk.go # deps/darwin_amd64/libduckdb.a # deps/darwin_arm64/libduckdb.a # deps/freebsd_amd64/libduckdb.a # deps/linux_amd64/libduckdb.a # deps/linux_arm64/libduckdb.a # errors.go # types.go # vector.go

merge fixes

6c48a4a

Re-build static libraries

f2ee037

taniabogatsch mentioned this pull request Sep 10, 2024

[Feature] Type interface #272

Merged

taniabogatsch added 4 commits September 11, 2024 15:36

Merge branch 'type-interface' into scalar

6924aae

# Conflicts: # Makefile # deps/darwin_amd64/libduckdb.a # deps/darwin_arm64/libduckdb.a # deps/freebsd_amd64/libduckdb.a # deps/linux_amd64/libduckdb.a # deps/linux_arm64/libduckdb.a # errors.go # types.go

update code to current code base

8b51aba

changes related to type info

905bc5e

trying to get the handle to work

f1d079a

taniabogatsch added 2 commits September 12, 2024 10:26

fix simple example

49f7b7a

test all types in scalar UDFs

1833f7d

JAicewizard reviewed Sep 18, 2024

started to add feedback

9f8546d

taniabogatsch mentioned this pull request Sep 18, 2024

[C API] Missing scalar function documentation duckdb/duckdb-web#3679

Open

make executor extensible

354897a

JAicewizard reviewed Sep 19, 2024

JAicewizard reviewed Sep 19, 2024

taniabogatsch added 4 commits September 19, 2024 13:45

review feedback and documentation

7fd79ca

tidy tests, add pinner, other nits

493f151

nits

11cb4ac

add tests

172a22b

taniabogatsch marked this pull request as ready for review September 19, 2024 13:39

taniabogatsch force-pushed the scalar branch from c7b599c to 172a22b Compare September 19, 2024 15:00

taniabogatsch mentioned this pull request Sep 20, 2024

Add support for create_function() #267

Closed

taniabogatsch added 3 commits September 23, 2024 14:07

Merge branch 'main' into scalar

147fdfa

# Conflicts: # data_chunk.go # vector.go

formatter

5e287cd

uuid cast

5b389f0

taniabogatsch requested a review from JAicewizard September 23, 2024 14:30

taniabogatsch added 3 commits September 23, 2024 16:55

create udf utility file

7906697

Merge branch 'main' into scalar

847098b

nit

649f3c9

taniabogatsch merged commit aa90038 into marcboeker:main Sep 23, 2024
4 checks passed

taniabogatsch deleted the scalar branch September 23, 2024 15:20
