Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Table function UDFs #201

Merged
merged 20 commits into from
Oct 16, 2024
Merged

Table function UDFs #201

merged 20 commits into from
Oct 16, 2024

Conversation

JAicewizard
Copy link
Contributor

This implements table function UDFs.

The function takes in an interface, from which it detects the type of the parameters and return values.
This interface also has a method used to generate new rows

The interface is quite simple, but not ideal. There is an issue that we need to obtain the type of the UDF from the user-input, but we also don't want the user to specify the duckdb types manually. This solution allows the user to return arbitrary types, and they will be matched to duckdb types automatically.

Another limitation is that the value interface only allows retrieving string and int64 types. This is only a limitation for the arguments, however it is a large limitation.

@JAicewizard
Copy link
Contributor Author

PS, dont merge yet, I would like a review but c allocations are handled sloppily, and often not de-allocated. Also more return types can (and should) be implemented still

@marcboeker
Copy link
Owner

@JAicewizard Thanks for the contribution. I think it makes sense to also add a lot of test cases for this feature.

@ajzo90
Copy link
Contributor

ajzo90 commented Apr 28, 2024

Great initiative! Based on this contribution I have experimented with a different API for my own needs. In particular it support predicate pushdown, concurrent scanners, and exposes an API to work with table states.

udf.go
example

@JAicewizard
Copy link
Contributor Author

@ajzo90 Those are some great changes, one thing I didn't want to do however was add a bunch of Chunk and Vector APIs to this library. I agree that they would be great, I think leaving this PR to as un-vectorised, we cal always add a vectorised version later.

In the design of the Chunk and Vectorapis it would be wise to also think of a way we might be able to vectorise the Appender API at the same time. (The scanner type for example looks a lot like something that could be used in the appender).
The first thing I will do it create benchmarks to see what the bottle-neck is, my suspicion is that it is the conversion to any for every value, which should be removable.

I will however definitely look at how you did predicatepushdown, I do not really know what this actually does so I didn't touch it.

@JAicewizard
Copy link
Contributor Author

@JAicewizard Thanks for the contribution. I think it makes sense to also add a lot of test cases for this feature.

I will, and a benchmark as well. This APi is very much not optimal, but I am glad you're open to this PR :)

@JAicewizard
Copy link
Contributor Author

I implemented a generic interface for the vector type which allows me to speed up the UDF by 3x, as it no longer needs to push the values into an interface (at least for primitive types).

However it adds some new functions in appender_vector.go, so I would like a quick look at this code. (note that go does not allow generic parameters on methods).

If all is well I can implement some of the features from ajzos branch to make this feature complete!

@JAicewizard
Copy link
Contributor Author

JAicewizard commented May 12, 2024

todo list:

  • named parameters
  • max threads
  • testing of projection pushdown
  • making sure all allocations are freed

@JAicewizard
Copy link
Contributor Author

I force-pushed to cleanup the history. This PR ended up growing a bit out of hand, especially due to the wanting a safe API for setting the types of columns, without forcing the user to pass us a value. I think the current version is nice, as it the duckdb logical type is fully managed by go-duckdb, while also giving users a nice API.

One thing I am not entirely happy with is the integration with the existing vector type, however I did not feel the need to implement a fully custom one just for table functions. The added code allows for setting values without having to turn them into an any. This significantly speeds up table functions, as converting a value into any does a heap allocation which is expensive.

I think this ready for a review.

@taniabogatsch
Copy link
Collaborator

Hi @JAicewizard, with the release of duckdb 0.10.3, we now also expose scalar UDFs in the C API: duckdb/duckdb#11786. I am currently looking into adding support for this to go-duckdb. Your PR is a great help! 😊 I noticed that there are many functions that both implementations could benefit from, like your changes to type.go. I am wondering on how to best organise efforts to avoid writing similar code twice. 🤔 Maybe I could work off of your PR, or maybe you have other suggestions?

@taniabogatsch
Copy link
Collaborator

I am also open to finalising this PR first, and then starting on the scalar UDFs. In that case, I can give a thorough review, or open a PR to your PR with suggestions? Let me know what you think!

@taniabogatsch
Copy link
Collaborator

Ignore part of my comments, and sorry for any confusion. The scalar UDFs are on our feature branch, so not yet part of the release. However, they will be in the not-too-far future, haha. So I'll just draft the scalar UDFs, and review this PR with that in mind. 😄

@JAicewizard
Copy link
Contributor Author

haha I was confused about scalar UDFs in 10.3 already. I indeed wrote much of the code in a way that in the future scalar UDFs would also be able to benefit from the infrastructure. Thats also why I put the Type and Row in separate files.

I was also thinking about doing scalar UDFs, but you can take that on if you want. I think much of the code can be copied/used for inspiration, but I'm afraid it will be difficult to make much of the callback code shared. Most code is handling parameters and returned values which can't be shared.

After this PR I will start working on another one adding vectorised table UDFs, do you have any ideas on how to implement a nice and fast way to represent a duckdb vector? I was thinking maybe something with a method SetValues[T Any](data []T) which sets the first len(data) values to the specified ones, and clears the others. But I am open for ideas!

A review would definitely be highly appreciated, I can review your code as well once its done!

type.go Outdated Show resolved Hide resolved
@JAicewizard
Copy link
Contributor Author

@taniabogatsch I renamed the files to udtf*.go so that you can create new ones.
It might also be nice for scalar UDFs to be able to extract the type information from the function being passed like in python.

Copy link
Contributor

@ajzo90 ajzo90 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice job with the UDF implementation.
I left a few comments related to projection pushdown and threading. In my opinion it make sense to wait with these (performance related) features until we know where we want to go with vectorised APIs.

udf.go Outdated Show resolved Hide resolved
udf.go Outdated Show resolved Hide resolved
@JAicewizard
Copy link
Contributor Author

Thank you for the review! I will add some documentation as well soon-ish since that is very much still missing.

@JAicewizard
Copy link
Contributor Author

I rebased and forcepushed to integrate datachunk changes. Also added a datachunk API for tablefunctions.

@JAicewizard
Copy link
Contributor Author

@taniabogatsch Do you have write access? if so could you run CI for this? I think I fixed all the issues in CI, but I would like to make sure.

Also do you know what the next steps are to getting this merged?

@taniabogatsch
Copy link
Collaborator

@JAicewizard, I ran the CI. Regarding the next steps to getting this merged, ideally, we have the DataChunk support in place. But I'll have another look at the current state, with the changes from main, and maybe we can merge it sooner.

Copy link
Collaborator

@taniabogatsch taniabogatsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey! I just had another more detailed look at the PR. This is going in a great direction. 😄 I think we still need to iron out two main things before we add the 'finishing touches' to the PR.

  1. Changes to vector and the custom functions there. See my comment on the respective file.
  2. Discuss the Type approach.

Other remarks.

  • Could you also run gofumpt -l -w . on your PR for more uniform formatting?
  • It is nice to see the increased number of tests.
  • Much improved documentation in udtf.go 👍

examples/udf/udf.go Outdated Show resolved Hide resolved
udtf.go Outdated Show resolved Hide resolved
Makefile Outdated Show resolved Hide resolved
data_chunk.go Outdated Show resolved Hide resolved
data_chunk.go Outdated Show resolved Hide resolved
errors.go Outdated Show resolved Hide resolved
errors.go Outdated Show resolved Hide resolved
vector.go Outdated Show resolved Hide resolved
vector.go Outdated Show resolved Hide resolved
type.go Outdated Show resolved Hide resolved
Copy link
Contributor Author

@JAicewizard JAicewizard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made some changes I have not pushed yet, ill push them soon.

@JAicewizard
Copy link
Contributor Author

Thanks for the PR, I added a parallel example. While doing so I realised the parallel test was completely wrong and parallel functions did not work at all, so I fixed those too.

@JAicewizard
Copy link
Contributor Author

Uhh oops, I mean review instead of PR

Copy link
Collaborator

@taniabogatsch taniabogatsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've left more comments, but this is probably almost done. Most of my comments are about test coverage.
It's good to hear that you found and fixed that bug in the parallel execution. 😄

udtf_test.go Outdated Show resolved Hide resolved
udtf.go Outdated Show resolved Hide resolved
udtf.go Outdated Show resolved Hide resolved
examples/table_udf_basic/main.go Outdated Show resolved Hide resolved
examples/table_udf_parallel/main.go Outdated Show resolved Hide resolved
@JAicewizard
Copy link
Contributor Author

I started adding more tests, and I found one really odd error. On the latest commit, if you switch around the arguments to SetValue for the chunk test, it returns an error as it should. However, then later duckdb creates a SIGSEGV and I don't really know why. Somehow the chunk seems to be corrupt or something? However I don't see how that could happen

Copy link
Contributor Author

@JAicewizard JAicewizard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I forgot to press the submit button last time, so there are 6 comments of which I don't know how relevant thy still are. But I fixed all the issues you pointed out

examples/table_udf/main.go Show resolved Hide resolved
examples/table_udf/main.go Outdated Show resolved Hide resolved
examples/table_udf/main.go Outdated Show resolved Hide resolved
examples/table_udf/main.go Outdated Show resolved Hide resolved
udf_utils.go Outdated Show resolved Hide resolved
udtf.go Outdated Show resolved Hide resolved
udtf_test.go Outdated Show resolved Hide resolved
udtf.go Outdated Show resolved Hide resolved
@taniabogatsch
Copy link
Collaborator

Hi @JAicewizard - thanks for the changes! If you don't mind, then I'll go ahead, run some formatting and spell checking, and some minor renaming, and nits. And merge it afterwards?

@taniabogatsch
Copy link
Collaborator

I started adding more tests, and I found one really odd error. On the latest commit, if you switch around the arguments to SetValue for the chunk test, it returns an error as it should. However, then later duckdb creates a SIGSEGV and I don't really know why. Somehow the chunk seems to be corrupt or something? However I don't see how that could happen

I'll also have a look at this 👍

# Conflicts:
#	deps/darwin_amd64/libduckdb.a
#	deps/darwin_arm64/libduckdb.a
#	errors.go
@taniabogatsch
Copy link
Collaborator

JAicewizard#2

@taniabogatsch taniabogatsch merged commit 7c41608 into marcboeker:main Oct 16, 2024
6 checks passed
@taniabogatsch
Copy link
Collaborator

@JAicewizard, many thanks for all your work on this! 🥳

@JAicewizard
Copy link
Contributor Author

It was a long road, but a lot of things had to happen for this to be mergable :)

@taniabogatsch
Copy link
Collaborator

Haha yes! :D

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature [feature] request or PR
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants