-
Notifications
You must be signed in to change notification settings - Fork 168
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Create new datafusion-comet-spark-expr
crate containing Spark-compatible DataFusion expressions
#638
Conversation
native/spark-expr/Cargo.toml
Outdated
# under the License. | ||
|
||
[package] | ||
name = "datafusion-comet-expr" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wasn't sure whether to name this datafusion-spark-expr
or datafusion-comet-expr
. I think either could work.
I went with the latter because the versioning of this crate will be different from the version numbers used for other datafusion-*
crates from the core project and thought that may cause confusion.
I also think it is unlikely that we would move this crate to the core project because we would want the implementation and tests to live in the same repo and the tests depend on Apache Spark, which would be a huge burden to add to the core project.
cc @alamb in case you have any opinions on this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think datafusion-comet-expr
sounds good to me
You could get the best of both worlds like datafusion-comet-spark-expr
but that is probably only useful if people end up using this crate outside the context of comet
Maybe if this crate really becomes a great set of spark compatible functions, that would be a good time to consider moving it to the core repo and releasing it along side the others
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer datafusion-spark-expr
as it's a more general name so that the potential third party could find this crate more easily.
I agree that we may never move this crate to the core of the DataFusion project, but it's still possible we may want to move this crate to a datafusion-contrib repo once the supported expressions are stable and mature.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have been thinking more about this, and I also now think we should name it datafusion-spark-expr
because the functions are not specific to the Comet project. I will make that change.
I don't think we would ever want to move the code into datafusion-contrib
because that is outside of Apache Software Foundation governance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we would ever want to move the code into datafusion-contrib because that is outside of Apache Software Foundation governance.
Thanks for your explanation, I think it makes total sense and agree with that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like datafusion-comet-spark-expr
fwiw. It would be nice for all things versioned together to have the sama datafusion-comet prefix, and as the reason to use these exprs over DF’s inbuilt one is to align with spark, having that in the name would make sense to me as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have renamed the crate to datafusion-comet-spark-expr
.
@advancedxy @viirya @huaxingao what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks okay to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good to me.
I have renamed the crate to
datafusion-comet-spark-expr
.@advancedxy @viirya @huaxingao what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
0ed4d08
to
00751b0
Compare
native/spark-expr/src/abs.rs
Outdated
@@ -68,17 +71,15 @@ impl ScalarUDFImpl for CometAbsFunc { | |||
|
|||
fn invoke(&self, args: &[ColumnarValue]) -> Result<ColumnarValue, DataFusionError> { | |||
match self.inner_abs_func.invoke(args) { | |||
Err(DataFusionError::ArrowError(ArrowError::ComputeError(msg), trace)) | |||
Err(DataFusionError::ArrowError(ArrowError::ComputeError(msg), _trace)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why this change, because _trace
is never used?
We can simply change it to _
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, if we get an overflow error then we don't care about the trace. I have renamed it to _
.
native/spark-expr/Cargo.toml
Outdated
# under the License. | ||
|
||
[package] | ||
name = "datafusion-comet-expr" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer datafusion-spark-expr
as it's a more general name so that the potential third party could find this crate more easily.
I agree that we may never move this crate to the core of the DataFusion project, but it's still possible we may want to move this crate to a datafusion-contrib repo once the supported expressions are stable and mature.
datafusion-comet-expr
crate containing Spark-compatible DataFusion expressionsdatafusion-spark-expr
crate containing Spark-compatible DataFusion expressions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Pending CI passes.
BTW, I'm not familiar with the cargo related changes, so it would be great that others can take a look at that.
native/spark-expr/Cargo.toml
Outdated
datafusion-functions = { workspace = true } | ||
|
||
[lib] | ||
name = "datafusion_spark_expr" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I wonder why we don't have comet in the library name? Are we planning to move these to datafusion? I personally prefer to keep spark specified stuffs in Comet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a discussion about this at #638 (comment)
|
||
[package] | ||
name = "datafusion-spark-expr" | ||
description = "DataFusion expressions that emulate Apache Spark's behavior" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
description = "DataFusion expressions that emulate Apache Spark's behavior" | |
description = "DataFusion Comet expressions that emulate Apache Spark's behavior" |
datafusion-spark-expr
crate containing Spark-compatible DataFusion expressionsdatafusion-comet-spark-expr
crate containing Spark-compatible DataFusion expressions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry forgot to submit my review
native/spark-expr/Cargo.toml
Outdated
# under the License. | ||
|
||
[package] | ||
name = "datafusion-comet-expr" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think datafusion-comet-expr
sounds good to me
You could get the best of both worlds like datafusion-comet-spark-expr
but that is probably only useful if people end up using this crate outside the context of comet
Maybe if this crate really becomes a great set of spark compatible functions, that would be a good time to consider moving it to the core repo and releasing it along side the others
datafusion-functions = { workspace = true } | ||
|
||
[lib] | ||
name = "datafusion_comet_spark_expr" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice!
…-compatible DataFusion expressions (apache#638) * convert into workspace project * update GitHub actions * update Makefile * fix regression * update target path * update protobuf path in pom.xml * update more paths * add new datafusion-comet-expr crate * rename CometAbsFunc to Abs and add documentation * fix error message * improve error handling * update crate description * remove unused dep * address feedback * finish renaming crate * update README for datafusion-spark-expr * rename crate to datafusion-comet-spark-expr
Which issue does this PR close?
Part of #630
Also contains workaround for #642 that was discovered as part of this PR
Rationale for this change
This PR demonstrates how we can start moving expressions into a separate crate that other Rust projects can use, without having dependencies on JVM or Spark.
What changes are included in this PR?
abs
expression to new crate as ther first example. If this approach is acceptable, we can follow up with PRs to move the other functions overHow are these changes tested?
Existing tests + new tests