
feat: support struct_pack function if duckdb enabled #68

Closed

Conversation

@pat70 (Contributor) commented Aug 7, 2024

Enables `unresolved_function` mappings for:

- `struct_pack`, as shown below:

```python
from pyspark.sql.functions import col, struct

(get_customer_database(spark_session)
 .select(struct(col('c_custkey'), col('c_name')).alias('test_struct'))
 .show())
```

- and `struct_extract`, as shown below:

```python
from pyspark.sql.functions import col, struct

(get_customer_database(spark_session)
 .select(struct(col('c_custkey'), col('c_name')).alias('test_struct'))
 .select(col('test_struct').getField('c_name'))
 .show())
```

fixes #67

@pat70 pat70 marked this pull request as draft August 7, 2024 02:04
github-actions bot commented Aug 7, 2024

ACTION NEEDED

Substrait follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

@pat70 pat70 changed the title Spark function-mappings to DuckDB counterpart when appropriate feat: support struct_pack function if duckdb enabled Aug 7, 2024
@pat70 pat70 force-pushed the spark-mappings-if-duckdb-enabled branch from 2bb268c to 06c7e91 Compare August 7, 2024 02:07
```diff
@@ -460,11 +461,27 @@ def __lt__(self, obj) -> bool:
         i64=type_pb2.Type.I64(
             nullability=type_pb2.Type.Nullability.NULLABILITY_REQUIRED))),
 }
+SPARK_SUBSTRAIT_MAPPING_FOR_DUCKDB = {
+    'struct': ExtensionFunction(
+        '/functions_structs.yaml', 'struct_pack:any_str', type_pb2.Type(
```
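For context, a minimal sketch of how a per-backend mapping table like the one above might be consulted (the function and table names here are hypothetical illustrations, not the project's actual API):

```python
# Hypothetical lookup: prefer a DuckDB-specific mapping when the DuckDB
# backend is enabled, falling back to the generic Spark -> Substrait table.
SPARK_SUBSTRAIT_MAPPING = {"concat": "concat:str"}
SPARK_SUBSTRAIT_MAPPING_FOR_DUCKDB = {"struct": "struct_pack:any_str"}


def lookup_function(name: str, duckdb_enabled: bool) -> str:
    """Resolve a Spark function name to a Substrait function signature."""
    if duckdb_enabled and name in SPARK_SUBSTRAIT_MAPPING_FOR_DUCKDB:
        return SPARK_SUBSTRAIT_MAPPING_FOR_DUCKDB[name]
    return SPARK_SUBSTRAIT_MAPPING[name]


print(lookup_function("struct", duckdb_enabled=True))   # struct_pack:any_str
print(lookup_function("concat", duckdb_enabled=False))  # concat:str
```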
Contributor commented:

This location isn't a standard Substrait extension as defined here: https://github.com/substrait-io/substrait/tree/main/extensions

The way to create structs on the fly in Substrait is with the Nested feature: https://github.com/substrait-io/substrait/blob/main/proto/substrait/algebra.proto#L915

The way I'd go about implementing this would be to catch an attempt to use this in spark_to_substrait (in convert_function) and then expand it to create the appropriate structure. That function would also be responsible for constructing the return type.

That said, I'm not sure how many backends implement the Nested expression feature. I will check with the DuckDB folks tomorrow to see if they have time to add it.
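As a rough illustration of that approach (a sketch only: the message and field names follow Substrait's `algebra.proto` `Expression.Nested` message, but the helper functions and the plain-dict plan representation are hypothetical), a `struct()` call could be expanded into a `Nested.Struct` expression whose fields are the converted argument expressions:

```python
# Sketch: expand a Spark struct() call into a Substrait Nested.Struct
# expression, represented here as plain dicts mirroring algebra.proto.
# The helper names and dict representation are hypothetical.

def field_reference(index: int) -> dict:
    """A direct root field reference to input column `index`."""
    return {
        "selection": {
            "direct_reference": {"struct_field": {"field": index}},
            "root_reference": {},
        }
    }


def convert_struct_call(arg_indices: list) -> dict:
    """Expand struct(col_a, col_b, ...) into a Nested.Struct expression."""
    return {
        "nested": {
            "struct": {
                "fields": [field_reference(i) for i in arg_indices],
            }
        }
    }


# struct(c_custkey, c_name) over input columns 0 and 1
expr = convert_struct_call([0, 1])
print(len(expr["nested"]["struct"]["fields"]))  # 2
```

The return type would be computed in the same place, as a struct type whose children are the argument types.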

@pat70 (Contributor, author) commented Aug 7, 2024

- Re: "This location isn't a standard Substrait extension"

  This was attempted based on the existing `struct_extract` usage here, and it worked 🤞:

  ```python
  'struct_extract': ExtensionFunction(
      '/functions_structs.yaml', 'struct_extract:any_str', type_pb2.Type(
  ```

- Re: "The way I'd go about implementing this would be to catch an attempt to use this in spark_to_substrait (in convert_function) and then expand it to create the appropriate structure. That function would also be responsible for constructing the return type."

  I think I understand.

- Re: "I'm not sure how many backends implement the Nested expression feature"

  I think DuckDB does not support STRUCT logical types. I've only tried the Substrait-DuckDB extension so far, to produce Substrait from a DuckDB-supported query. It fails with a "Not implemented" error for queries with `struct_pack` usages, e.g.:

  ```sql
  CALL get_substrait_json("
  SELECT
      struct_pack(cust_name := c_name, cust_key := c_custkey) AS test_struct
  FROM
      read_parquet('<base_path>/third_party/tpch/parquet/customer/*.parquet')
  LIMIT 10
  ")
  ```

  Is there another good way to test production and consumption of Substrait from DuckDB queries?

- Re: "I will check with the DuckDB folks tomorrow to see if they have time to add it."

  I'd be happy to subscribe to a thread, and maybe find time to help.

@pthatte1-bb (Contributor) commented Aug 8, 2024

> That said, I'm not sure how many backends implement the Nested expression feature. I will check with the DuckDB folks tomorrow to see if they have time to add it.

In case it helps, here's a snippet showing "nested-expression"-like support in DuckDB (used via the polars LazyFrame API):

```python
import duckdb
import polars as pl

parquet_path = "<basepath>/third_party/tpch/parquet/customer"
df = (pl.scan_parquet(parquet_path)
      .select(pl.col("c_custkey").alias("cust_key"), pl.col("c_name").alias("cust_name"))
      .select(pl.struct(pl.col("cust_key"), pl.col("cust_name")).alias("test_struct"))
      .select(pl.col("test_struct").struct.field("cust_key"), pl.col("test_struct").struct.field("cust_name"))
      )
duckdb.sql("SELECT * from df limit 10").show()
```

@pat70 pat70 closed this Sep 20, 2024
Successfully merging this pull request may close these issues.

Support Spark's "struct" function