Minor: Generate the supported Spark builtin expression list into MD file #455
Conversation
This is very cool @comphead, but it looks like it is not detecting any of the aggregate functions that we support?
docs/spark_expressions_support.md
Outdated
- [ ] ifnull
- [ ] nanvl
- [x] nullif
- [ ] nvl
Hmm, it should be supported? It's essentially the same as coalesce, which is replaced during the analysis phase.
Maybe we should file an issue to track this kind of problem.
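As a quick sanity check, the analyzer rewrite is easy to observe (a minimal spark-shell sketch, not part of this PR, assuming an active SparkSession named `spark`):

```scala
// ifnull is a RuntimeReplaceable expression, so by the time the plan is
// optimized it has been rewritten to coalesce, and that is all Comet sees.
val df = spark.sql("SELECT ifnull(CAST(NULL AS int), 1)")
// The optimized plan shows coalesce(...), not ifnull(...)
println(df.queryExecution.optimizedPlan)
```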
### hash_funcs
- [ ] crc32
- [ ] hash
hash should be supported.
Ah, the example hash function passes an array, which isn't a supported type in Comet yet.
Maybe supporting nested types could be prioritized after we have full TPC-H and TPC-DS support.
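For illustration, this is Spark's own documented example for hash, and the fallback behavior described above is the assumption (sketch assumes an active SparkSession `spark` with Comet enabled):

```scala
// The documented example passes an array literal. Since arrays are not yet
// a supported type in Comet, the projection falls back to Spark and the
// coverage tool reports hash as unsupported.
val df = spark.sql("SELECT hash('Spark', array(123), 2)")
df.explain() // expectation: no Comet operators appear when arrays are involved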
@andygrove @advancedxy I fixed the test, implementing extra parsing and adding small manual tests where the parsing is complicated. I hope we now have a better picture.
Left some minor comments, otherwise LGTM.
Thanks @advancedxy, I fixed the flaws you mentioned. However, I'd like to make the refactoring you recommended in a follow-up PR; this PR is getting too large for review.
Of course, sounds good to me.
One minor comment, otherwise LGTM.
lgtm
- [x] regr_avgx
- [x] regr_avgy
- [x] regr_count
I don't think that we support these expressions
The test is exactly for year, but if only YEAR is supported, what are we supposed to show to the user? Not supported?
test("regr_avgx") {
Seq(false, true).foreach { dictionary =>
withSQLConf(
"parquet.enable.dictionary" -> dictionary.toString,
"spark.comet.exec.shuffle.enabled" -> "true",
CometConf.COMET_ENABLED.key -> "true",
CometConf.COMET_EXEC_ENABLED.key -> "true",
CometConf.COMET_SHUFFLE_ENFORCE_MODE_ENABLED.key -> "true",
CometConf.COMET_EXEC_ALL_OPERATOR_ENABLED.key -> "true",
) {
val table = "test"
withTable(table) {
sql(s"create table $table(a int, b int) using parquet")
sql(s"insert into $table VALUES (1, 2), (2, 2), (2, 3), (2, 4)")
checkSparkAnswerAndOperator(s"SELECT regr_avgx(a, b) FROM $table")
}
}
}
}
The `regr_avgx` test passes.
regr_avgx is supported by DataFusion:

```
> SELECT regr_avgx(1, 2);
+------------------------------+
| REGR_AVGX(Int64(1),Int64(2)) |
+------------------------------+
| 2.0                          |
+------------------------------+
```

So I think all is fair here.
docs/spark_expressions_support.md
Outdated
- [x] try_add
- [x] try_divide
- [x] try_multiply
- [x] try_subtract
I do not see any tests for these functions, and `planner.rs` seems to ignore the `fail_on_error` flag in the protobuf. If your tool says we support them, then we likely do support them, but not correctly. I will file an issue to look into this.
Actually, we already have an issue: #280
test("try_add") {
Seq(false, true).foreach { dictionary =>
withSQLConf(
"parquet.enable.dictionary" -> dictionary.toString,
"spark.comet.exec.shuffle.enabled" -> "true") {
val table = "test"
withTable(table) {
sql(s"create table $table(a int, b int) using parquet")
sql(s"insert into $table VALUES (1, 2)")
checkSparkAnswerAndOperator(s"SELECT try_add(a, b) FROM $table")
}
}
}
}
Such a test passes.
The plan is Comet:

```
== Physical Plan ==
*(1) ColumnarToRow
+- CometProject [try_add(a, b)#57], [(a#40 + b#41) AS try_add(a, b)#57]
   +- CometScan parquet spark_catalog.default.test[a#40,b#41] Batched: true, DataFilters: [], Format: CometParquet, Location: InMemoryFileIndex(1 paths)[file:/Users/ovoievodin/dev/prj/apple/ovoievodin/rust/arrow-datafusion-..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<a:int,b:int>
```
The answer is the same as Spark's, which is 3.
So we support try_* functions, and no fallback happens because we handle `case add @ Add(left, right, _)`, where the last argument is responsible for TRY mode. So there is no fallback, but the calculation is not correct when overflow happens. I'll move the try tests to manual.
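A hedged sketch of the overflow case that motivates moving the try_* tests to manual (table name and values are illustrative only, following the suite's test pattern):

```scala
// Spark's try_add returns NULL on int overflow, while a native Add that
// ignores the TRY/fail_on_error flag would wrap around instead.
val table = "overflow_test" // hypothetical table name
withTable(table) {
  sql(s"create table $table(a int, b int) using parquet")
  sql(s"insert into $table VALUES (2147483647, 1)") // Int.MaxValue + 1 overflows
  // Spark's answer is NULL; this check is where an implementation that
  // ignores TRY mode would produce a mismatching result.
  checkSparkAnswerAndOperator(s"SELECT try_add(a, b) FROM $table")
}
```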
- [x] current_catalog
- [x] current_database
- [x] current_schema
- [x] current_user
We don't have a native implementation for these methods. I am guessing that the Spark planner replaces these with literal values, and we then see a CometProject for those literal values.
That's correct. These expressions are replaced with literals after the analysis stage.
Yes, those are purely Spark execution. Are we planning to support them natively in Comet?
We cannot support these natively in Comet because these functions will never appear in the physical plan. Spark replaces them with literal values before this.
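This is straightforward to see in a spark-shell (a minimal sketch, assuming an active SparkSession named `spark`):

```scala
// The analyzer (the ReplaceCurrentLike rule in recent Spark versions)
// substitutes these functions with literal values, so the physical plan
// only ever contains a projection of constants.
val df = spark.sql("SELECT current_database(), current_user()")
// Expectation: Project [default AS current_database()#.., ... AS current_user()#..]
println(df.queryExecution.optimizedPlan)
```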
Sounds good, I'll remove them from the list.
- [ ] date_diff
- [ ] date_format
- [ ] date_from_unix_date
- [x] date_part
We only support `date_part` for YEAR.
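A hedged sketch of what that partial support means, following the suite's test pattern (table name and values are illustrative only):

```scala
// Only the YEAR field runs natively; other fields fall back to Spark,
// so the operator check would fail for them.
val table = "dates" // hypothetical table name
withTable(table) {
  sql(s"create table $table(d date) using parquet")
  sql(s"insert into $table VALUES (date'2024-01-01')")
  checkSparkAnswerAndOperator(s"SELECT date_part('YEAR', d) FROM $table")
  // date_part('MONTH', d) would produce a Spark fallback and fail this check
}
```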
- [ ] dayofmonth
- [ ] dayofweek
- [ ] dayofyear
- [x] extract
We only support `extract` for YEAR.
@andygrove I fixed all the comments. However, you are right that sometimes we only partially support a function. Here comes an idea for a follow-up PR: introduce a "partially supported" status (or similar), with the reason why it is only partially supported.
docs/source/user-guide/overview.md
Outdated
```diff
@@ -29,7 +29,7 @@ Comet aims to support:
 - a native Parquet implementation, including both reader and writer
 - full implementation of Spark operators, including
   Filter/Project/Aggregation/Join/Exchange etc.
-- full implementation of Spark built-in expressions
+- [full implementation](../../../docs/spark_expressions_support.md) of Spark built-in expressions.
```
This won't build correctly:

```
/Users/andy/git/apache/datafusion-comet/docs/temp/user-guide/overview.md:32: WARNING: Unknown source document '../spark_expressions_support' [myst.xref_missing]
```

Let's revert this change for this PR and handle where we publish (user guide vs contributor guide) in a follow-up PR.
Thanks @comphead. LGTM.
The content generated by this tool is helpful and I learned more about what we support from reviewing this PR.
My opinion is that the generated docs make sense for the contributor guide, to help guide people on where to contribute, but we should keep a human-curated version for the user guide where we can add more context about supported expressions. We can discuss this on a future PR.
Codecov Report
All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

```
@@             Coverage Diff              @@
##               main     #455      +/-   ##
============================================
+ Coverage     34.18%   34.23%   +0.04%
+ Complexity      851      806      -45
============================================
  Files           116      105      -11
  Lines         38570    38488      -82
  Branches       8531     8562      +31
============================================
- Hits          13187    13175      -12
+ Misses        22612    22554      -58
+ Partials       2771     2759      -12
```

☔ View full report in Codecov by Sentry.
## Which issue does this PR close?
Follow up on #331
Closes #282
Related #240
## Rationale for this change
Generate the supported Spark builtin expression list into an MD file instead of a txt file. The txt file still exists, as it provides the reason why some expressions are not supported. A simplified sketch of the rendering idea is shown below.
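For context, here is a simplified sketch of the markdown-rendering step, with assumed names; the real logic lives in CometExpressionCoverageSuite.scala and also records reasons for failures:

```scala
// Group per-function results and render them as a markdown checklist,
// in the style used by docs/spark_expressions_support.md.
case class Coverage(group: String, name: String, supported: Boolean)

def toMarkdown(rows: Seq[Coverage]): String =
  rows.groupBy(_.group).toSeq.sortBy(_._1).map { case (group, fns) =>
    val items = fns.sortBy(_.name).map { fn =>
      val box = if (fn.supported) "x" else " "
      s"- [$box] ${fn.name}"
    }
    (s"### $group" +: items).mkString("\n")
  }.mkString("\n\n")

// toMarkdown(Seq(Coverage("hash_funcs", "crc32", supported = false)))
// renders the "### hash_funcs" style checklist shown earlier in this thread.
```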
## What changes are included in this PR?

## How are these changes tested?