feat: support struct_pack function if duckdb enabled #68
ACTION NEEDED: Substrait follows the Conventional Commits specification. The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.
2bb268c to 06c7e91
@@ -460,11 +461,27 @@ def __lt__(self, obj) -> bool:
        i64=type_pb2.Type.I64(
            nullability=type_pb2.Type.Nullability.NULLABILITY_REQUIRED))),
}
SPARK_SUBSTRAIT_MAPPING_FOR_DUCKDB = {
    'struct': ExtensionFunction(
        '/functions_structs.yaml', 'struct_pack:any_str', type_pb2.Type(
This location isn't a standard Substrait extension as defined here: https://github.com/substrait-io/substrait/tree/main/extensions
The way to create structs on the fly in Substrait is with the Nested feature: https://github.com/substrait-io/substrait/blob/main/proto/substrait/algebra.proto#L915
The way I'd go about implementing this would be to catch an attempt to use this in spark_to_substrait (in convert_function) and then expand it to create the appropriate structure. That function would also be responsible for constructing the return type.
That said, I'm not sure how many backends implement the Nested expression feature. I will check with the DuckDB folks tomorrow to see if they have time to add it.
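As a rough illustration of the expansion described above, here is a minimal sketch that builds the Nested struct expression and its return type as plain dictionaries shaped like Substrait's JSON form. The helper names `expand_struct_call` and `field_ref` are hypothetical and are not part of this repository; a real change would build `algebra_pb2` and `type_pb2` messages inside `convert_function`, and the required nullability on the struct is an assumption.

```python
from typing import Any, Dict, List, Tuple

REQUIRED = "NULLABILITY_REQUIRED"


def field_ref(index: int) -> Dict[str, Any]:
    """Hypothetical helper: a direct reference to input column `index`."""
    return {
        "selection": {
            "directReference": {"structField": {"field": index}},
            "rootReference": {},
        }
    }


def expand_struct_call(
    field_exprs: List[Dict[str, Any]],
    field_types: List[Dict[str, Any]],
) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Hypothetical helper: expand a Spark `struct(...)` call.

    Returns a Nested struct expression (instead of an extension function
    call) together with the struct return type built from the field types.
    """
    if len(field_exprs) != len(field_types):
        raise ValueError("each struct field needs exactly one type")
    expression = {"nested": {"struct": {"fields": list(field_exprs)}}}
    # Assumption: the struct itself is non-nullable.
    return_type = {"struct": {"types": list(field_types), "nullability": REQUIRED}}
    return expression, return_type


# Example: struct(c_custkey, c_name) over the first two input columns.
expr, rtype = expand_struct_call(
    [field_ref(0), field_ref(1)],
    [{"i64": {"nullability": REQUIRED}}, {"string": {"nullability": REQUIRED}}],
)
```

Whether a given backend can consume the resulting plan is a separate question, as noted above.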
- Re:
This location isn't a standard Substrait extension
This was attempted based on the "struct_extract" usage here, and it worked 🤞:
'struct_extract': ExtensionFunction(
    '/functions_structs.yaml', 'struct_extract:any_str', type_pb2.Type(
- Re:
The way I'd go about implementing this would be to catch an attempt to use this in spark_to_substrait (in convert_function) and then expand it to create the appropriate structure. That function would also be responsible for constructing the return type.
I think I understand.
- Re:
I'm not sure how many backends implement the Nested expression feature
I think DuckDB does not support STRUCT logical types. So far I've only tried the Substrait DuckDB extension, to produce Substrait from a DuckDB-supported query. It fails with a "Not implemented error" for queries that use struct_pack, e.g.
CALL get_substrait_json("
SELECT
struct_pack(cust_name:=c_name, cust_key:=c_custkey) as test_struct
FROM
read_parquet('<base_path>/third_party/tpch/parquet/customer/*.parquet') LIMIT 10
")
Is there another good way to test production and consumption of Substrait from DuckDB queries?
- Re:
I will check with the DuckDB folks tomorrow to see if they have time to add it.
I'd be happy to subscribe to a thread, and maybe find time to help.
That said, I'm not sure how many backends implement the Nested expression feature. I will check with the DuckDB folks tomorrow to see if they have time to add it.
In case it helps, here's a snippet showing "nested-expression"-like support in DuckDB (used via the polars LazyFrame API):
import duckdb
import polars as pl

parquet_path = "<basepath>/third_party/tpch/parquet/customer"

# Build a struct column from two fields, then extract the fields again.
df = (
    pl.scan_parquet(parquet_path)
    .select(pl.col("c_custkey").alias("cust_key"), pl.col("c_name").alias("cust_name"))
    .select(pl.struct(pl.col("cust_key"), pl.col("cust_name")).alias("test_struct"))
    .select(pl.col("test_struct").struct.field("cust_key"), pl.col("test_struct").struct.field("cust_name"))
)

duckdb.sql("SELECT * from df limit 10").show()
Enables unresolved_function mappings for:
- struct_pack
- struct_extract
fixes #67