Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

docs: add guide to adding a new expression #422

Merged
merged 9 commits into from
May 20, 2024

Conversation

tshauck
Copy link
Contributor

@tshauck tshauck commented May 13, 2024

Which issue does this PR close?

Part of #370 (haven't added anything on cast or aggregate, mainly scalar func)

Rationale for this change

Make it easier for people to contribute new expressions.

What changes are included in this PR?

A new markdown file in the contributor guide.

How are these changes tested?

Not yet, I haven't tried building the docs. Will do before marking this PR ready.

Built the docs and reviewed the HTML.

@tshauck
Copy link
Contributor Author

tshauck commented May 15, 2024

Notes from comet community meeting:

  • Possible to get link for presentation slides and video?
  • CometSparkSessionExtension injects rules into spark session extension
    • E.g. replace spark scan with comet scan
    • Replace spark execution with comet execution class
  • QueryPlanSerde takes spark expression an serializes via proto
  • Cast/types support is not fully supported (e.g. Add(left, right, _) checks if supportedDataTypes(left.dataType).
  • operator.proto defines Operator
  • message Expr... need to update expr.proto (See Add for example)
    • left, right, return type
  • PhysicalPlanner is the physical planner struct
    • create_expr and create_plan are two most important parts
    • gets children from protobuf message and calls create_expr or create_plan to generate the physical plan
    • spark_expr.expr_struct is matched on then passed to appropriate part
    • Can sometimes use built-in support for expressions (e.g. is not null expression), this only works if the implement is compatible w/ spark (e.g. cast)
  • Adding a Cast
    • datafusion's PhysicalExpr is implemented for Cast.
  • Spark version differences (e.g. failOnError is a common change from 3.3 to 3.4+)
    • use shim layer to handle API differences
  • CometExpressionSuite is for testing
    • use checkSparkAnswerAndOperator

Questions:

  • Which expressions to support? Try to get support for the full expression.
    • 3.4 main difference is failOnError
    • Expect 4.0 to be reasonably difference
  • How to handle new expressions? Add new expression into datafusion if very common expression with standard behavior
  • Which other areas are important? Aiming for complete coverage... CAST is recently a hotbed of activity and high priority... tests are important
  • Unrelated to expressions: first release? Andy has started copying over the basic datafusion scripts to get out a source release, but ultimate goal is jar file with compiled code.

@andygrove
Copy link
Member

@tshauck video link is https://drive.google.com/file/d/1POU4lFAZfYwZR8zV1X2eoLiAmc1GDtSP/view?usp=sharing

@tshauck tshauck force-pushed the expression-docs branch from e8b18fb to 1f3e8d0 Compare May 17, 2024 03:51
@tshauck
Copy link
Contributor Author

tshauck commented May 17, 2024

I think this is ready for review. The first draft I wrote was pretty slanted towards scalar function expressions since that's what I did w/ unhex, but after the meeting on Wed, I made some updates so hopefully it's more general. I'm not 100% sure on the things I haven't touched yet, so certainly open to feedback on any of it.

Here's what it looks like included in the docs...

image

@tshauck tshauck marked this pull request as ready for review May 17, 2024 03:57
@andygrove
Copy link
Member

This is looking great @tshauck. Could you fix the merge conflict?

@tshauck tshauck force-pushed the expression-docs branch from a599c17 to fda09ed Compare May 19, 2024 17:28
@tshauck
Copy link
Contributor Author

tshauck commented May 19, 2024

Thanks, @andygrove -- I think the conflict is resolved now.

Copy link
Contributor

@kazuyukitanimura kazuyukitanimura left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM pending CI

docs/source/index.rst Outdated Show resolved Hide resolved
@tshauck
Copy link
Contributor Author

tshauck commented May 20, 2024

@andygrove Conflict should actually be fixed now, thanks @kazuyukitanimura

Copy link
Member

@andygrove andygrove left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thank you @tshauck

@andygrove andygrove merged commit 9acb3df into apache:main May 20, 2024
@tshauck tshauck deleted the expression-docs branch May 20, 2024 20:43
himadripal pushed a commit to himadripal/datafusion-comet that referenced this pull request Sep 7, 2024
* docs: add guide to adding a new expression

* docs: revise with presentation info

* docs: fix warning

* docs: fix header level

* docs: better info about handling datafusion udfs

* docs: grammar/typos/etc

* docs: clarify datafusion/datafusion comet path

* docs: clarify language about `isSpark32`

* docs: fix error

(cherry picked from commit 9acb3df)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants