
API draft to create and scan data chunks #219

Closed (13 commits)
Conversation

@taniabogatsch (Collaborator) commented May 27, 2024

After thinking about some of the discussions in #201, I've decided to draft a PR for creating and scanning data chunks in the go-duckdb driver. Most importantly, we should decide how the dataChunk and vector types should look and which functions we want for them. Ideally, all data chunk and vector operations should go through functions in the new files data_chunk.go and vector.go.

Maybe I went in a completely wrong direction here; what do you think, @ajzo90 @JAicewizard?

Questions

  • Which functions should be exposed? This is relevant to support table UDFs as discussed here.
  • The biggest overhead in the Appender is the function calls via the callbacks and the type casting. @JAicewizard made changes to the vector functions in appender_vector.go to avoid type casting overhead. How could this be ported to this PR to improve efficiency without breaking the Appender? See comment here.
    • A TODO for me is to look at this again tomorrow and see how we can avoid `any` in the Appender.

FIXMEs and future work

  • Add support for AppendColumn or similar to append values in a columnar approach. See @ajzo90's usage here.
    There is also a comment regarding adding this.

I was thinking of something like a method SetValues[T any](data []T), which sets the first len(data) values to the specified ones and clears the others; see the sketch after this list. But I am open to ideas!

  • Add support for scanning a data chunk.
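A minimal sketch of that SetValues idea, written as a package-level function because Go methods cannot have type parameters; the rawData helper and setNull method are hypothetical and not part of this PR:

// Sketch only: rawData views the column's full-capacity memory as a []T,
// and setNull marks an entry as NULL; both are hypothetical helpers.
func SetValues[T any](chunk *DataChunk, colIdx int, data []T) error {
	col, err := rawData[T](chunk, colIdx)
	if err != nil {
		return err
	}
	// Set the first len(data) values to the specified ones...
	n := copy(col, data)
	// ...and clear (set to NULL) the remaining entries.
	for i := n; i < len(col); i++ {
		chunk.setNull(colIdx, i)
	}
	return nil
}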

Other comments

The function takes in an interface, from which it detects the type of the parameters and return values.

This sounds very similar to some of the type inference in the Appender.

In the design of the Chunk and Vector APIs, it would be wise to also think of a way we might be able to vectorise the Appender API at the same time. (The scanner type, for example, looks a lot like something that could be used in the appender.)

I only found this comment. Maybe you could point me to the scanner type?

// Use as the `Scanner` type for any composite types (maps, lists, structs)
type Composite[T any] struct {
	t T
}

More future TODOs

  • GetMetaData, see table UDF PR.
  • Add all templated functions.
  • Projection pushdown.
  • Support for columnar setters and getters. I removed them from this PR for scope.
// SetColumn sets the column to val, where val is a slice []T. T is the type of the column.
func (chunk *DataChunk) SetColumn(colIdx int, val any) error {
	// TODO
	return errNotImplemented
}

// GetColumn returns a slice []T containing the column values. T is the type of the column.
func (chunk *DataChunk) GetColumn(colIdx int) (any, error) {
	// TODO
	return nil, errNotImplemented
}

// GetColumnData returns a pointer to the underlying data of a column.
func (chunk *DataChunk) GetColumnData(colIdx int) (unsafe.Pointer, error) {
	// TODO
	return nil, errNotImplemented
}
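For SetColumn, one possible direction is to infer the column type once per call and dispatch to a specialized setter. The following sketch assumes a hypothetical cached dataPtr (an unsafe.Pointer to the vector's underlying data) on each column; it is illustrative only, not part of this PR:

// Sketch only: the columns field and dataPtr are hypothetical.
func (chunk *DataChunk) SetColumn(colIdx int, vals any) error {
	// One type switch per column instead of one cast per row.
	switch v := vals.(type) {
	case []int64:
		return chunk.setColumnInt64(colIdx, v)
	default:
		return errNotImplemented
	}
}

func (chunk *DataChunk) setColumnInt64(colIdx int, vals []int64) error {
	// View the vector's contiguous memory as a Go slice without copying.
	data := unsafe.Slice((*int64)(chunk.columns[colIdx].dataPtr), len(vals))
	copy(data, vals)
	return nil
}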

@taniabogatsch (Collaborator, Author) commented May 27, 2024

One more thought is that we could keep the current slower row-at-a-time chunk creation and only consider performance when adding the columnar chunk creation. For the columnar chunk creation, we'll only have to infer the type once and can provide dedicated functions for specific (common) types.

Even the appender could be extended with a function AppendColumn(s).
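A hypothetical shape for that, assuming the Appender keeps a reference to the chunk it is currently filling (the currentChunk field name is illustrative):

// Sketch only: currentChunk is a hypothetical field on the Appender.
func (a *Appender) AppendColumn(colIdx int, vals any) error {
	return a.currentChunk.SetColumn(colIdx, vals)
}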

@ajzo90 (Contributor) commented May 27, 2024

Thanks. This is definitely in the right direction. Adding some notes from my perspective:

I have a rough extension of sql.Rows that works on vectors and can serve as inspiration for usage. I don't think the interface is suitable for this library as-is, since it should return the DataChunk instead, but it shows how I'm currently fetching vectors. I can add some test files to show example usage if you think it makes sense.

I found myself implementing a few customised functions (StringVecHash, StringVecDict, etc., mostly to avoid memory allocations/string conversions). Most of these would not be necessary once scalar function UDFs are available, but making the data chunk/vectors easy to extend for similar cases would be nice anyway. Note that my fork primarily covers the data types I need; it does not implement all data types.

Features I'd like to see supported later:

  • Serialization and deserialization for DataChunk and Vector, e.g. [Append]Serialize([]byte) ([]byte, error) and Deserialize([]byte) ([]byte, error). Perhaps the C-API can be extended to support DuckDB's in-memory serialization?
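One possible shape for those signatures, written as an interface that both DataChunk and Vector could implement (the interface name is illustrative):

// Sketch only: the interface name is illustrative.
type Serializable interface {
	// AppendSerialize appends the binary representation to buf and
	// returns the extended buffer.
	AppendSerialize(buf []byte) ([]byte, error)
	// Deserialize consumes one serialized value from buf and returns
	// the remaining bytes.
	Deserialize(buf []byte) ([]byte, error)
}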

@JAicewizard (Contributor)
I think this looks good, although I would prefer a different initialisation part, as it is currently appender-centric. Just renaming init to initFromTypes, for example. This would allow for a potential initFromDatachunk or similar for UDFs.

Also, there is no public API yet, right? I think that is the most important part. But this is going in the right direction for sure. I have no real comments at the moment; looks good!

There are still some open problems/questions at the moment, I think mostly the public API.

I would propose something like the following as the API:

GetValue(c, r int) (any, error)
SetValue(c, r int, val any) error
GetValues(c int) (any, error) // always returns a slice []T, where T is the type of the column
SetValues(c int, vals any) error // vals should be a slice []T, where T is the type of the column
GetData(c int) unsafe.Pointer
Columns() []ColumnMetaData // ColumnMetaData from my PR, could be moved/copied here

and other generic functions that sadly can't be methods:

ColumnGetValue[T any](chunk DataChunk, c, r int) (T, error)
ColumnSetValue[T any](chunk DataChunk, c, r int, val T) error
ColumnGetValues[T any](chunk DataChunk, c int) ([]T, error)
ColumnSetValues[T any](chunk DataChunk, c int, vals []T) error
ColumnGetData[T any](chunk DataChunk, c int) ([]T, error) // errors only if the type cannot be cast

Do you have any issues with or additions to this API? I think it covers most use cases. I am a little sad that methods can't have type parameters at the moment, but it is what it is.

Another question is whether or not we want to support projected columns at the data chunk level. I think there is a benefit to doing this. For table functions, it would be very useful for transparently implementing projection pushdown, which is very nice for the end user.
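For illustration, a generic function from the list above could wrap its non-generic counterpart with a single type assertion. This sketch assumes the GetValues method proposed earlier; errTypeMismatch is a hypothetical sentinel error:

// Sketch only: errTypeMismatch is a hypothetical sentinel error.
func ColumnGetValues[T any](chunk DataChunk, c int) ([]T, error) {
	vals, err := chunk.GetValues(c)
	if err != nil {
		return nil, err
	}
	typed, ok := vals.([]T)
	if !ok {
		return nil, errTypeMismatch
	}
	return typed, nil
}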

@taniabogatsch (Collaborator, Author)
I have a rough extension of sql.Rows that works on vectors and can serve as inspiration for usage.

Yes, I am conflicted between the special-casing for each type vs. the performance gain. I am not fully sure how Go performs if type casting happens on a columnar level. The current per-row casts could definitely be sped up by moving to more columnar approaches.
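To make the contrast concrete, a self-contained sketch (not from this PR) of one type assertion per row versus one per column:

// Sketch only, illustrating the two casting strategies.
func sumPerRow(rows []any) (sum int64) {
	for _, r := range rows {
		sum += r.(int64) // one type assertion per row
	}
	return sum
}

func sumColumnar(col any) (sum int64) {
	for _, v := range col.([]int64) { // one type assertion per column
		sum += v
	}
	return sum
}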

Just renaming init to initFromTypes, for example. This would allow for a potential initFromDatachunk or similar for UDFs.

Great suggestion!

Also, there is no public API yet, right? I think that is the most important part.

Yes, I was curious which functions to expose, and you gave the answer, haha. I've added some of them and will add the remaining ones in separate commits.

Do you have any issues with or additions to this API? I think it covers most use cases. I am a little sad that methods can't have type parameters at the moment, but it is what it is.

No, looks great. I am very sad about that, hahah. But yes, we work with what we have. :)

@JAicewizard (Contributor)

Yes, I am conflicted between the special-casing for each type vs. the performance gain. I am not fully sure how Go performs if type casting happens on a columnar level. The current per-row casts could definitely be sped up by moving to more columnar approaches.

Good point. We can leave out the generic functions for vector operations until someone looks into this further. I think the bottleneck would still be getting the data pointer from C, but this can be optimized away since it doesn't (really) change. Once this optimization is done, we can always look into it and potentially add generic functions.
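A sketch of that optimization, with getDataFromC standing in for the actual cgo call (the names are hypothetical):

// Sketch only: getDataFromC is a hypothetical wrapper around the cgo call.
type vector struct {
	dataPtr unsafe.Pointer // fetched once at init, reused by all getters
}

func (v *vector) initData() {
	v.dataPtr = getDataFromC(v) // single cgo crossing
}

func (v *vector) getInt64(rowIdx int) int64 {
	// No cgo call per row; index directly into the cached memory.
	return *(*int64)(unsafe.Add(v.dataPtr, rowIdx*8))
}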

@JAicewizard (Contributor)
(generic and/or specialized, applies to both)

@taniabogatsch added the feature label on Jun 6, 2024.
@JAicewizard (Contributor)
What is the progress on this? The merges from main make it difficult to see what changed. I feel like the UDF additions have stalled a bit and are starting to drag (might be partially my fault).

@taniabogatsch (Collaborator, Author) commented Jun 18, 2024

Hi @JAicewizard, no worries. I can update you on our current plan and roadmap for the DataChunk API.

  • Initial DataChunk API support (#233) to expose the DataChunk, without the aim of complete support yet.
    • Fixes and nits after merging.
  • Expand the Appender to support the same column types as the scanner. I'm mostly done; I just need to write a few more tests and open a PR.
  • Moving the scanner into the same DataChunk API to create the getter functions.
  • Expanding the DataChunk API with missing functionality for UDFs and faster columnar functions.

So, I expect it will still take a few weeks until the (table) UDFs are merged, but ideally, we'll then have a unified way of handling data chunks in place, plus ways to speed up performance.

@taniabogatsch (Collaborator, Author)

I'll close this in favor of #240, #237, and #233. Any further (exported) functions can be introduced with the related PRs. Specifically, the next step is to return to and finish #201.

@taniabogatsch deleted the data-chunks branch on July 2, 2024.