
Minor: Generate the supported Spark builtin expression list into MD file #455

Merged — 21 commits merged into apache:main on Jun 5, 2024

Conversation

@comphead (Contributor) commented on May 21, 2024:

Which issue does this PR close?

Follow-up to #331
Closes #282
Related: #240

Rationale for this change

Generate the supported Spark builtin expression list into an MD file instead of a txt file. The txt file still exists, as it records the reason why an expression is not supported.

What changes are included in this PR?

How are these changes tested?

@andygrove (Member):

This is very cool @comphead, but it looks like it is not detecting any of the aggregate functions that we support?

Review thread on docs/spark_expressions_support.md (outdated, resolved):
- [ ] ifnull
- [ ] nanvl
- [x] nullif
- [ ] nvl
Contributor:

Hmm, it should be supported? It's essentially the same as coalesce, which is replaced during the analysis phase.

Maybe we should file an issue to track this kind of problem.
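A minimal way to see this (an illustrative sketch, not part of this PR; assumes a local SparkSession) is to print the optimized plan and observe that nvl has already become coalesce before any physical planning happens:

import org.apache.spark.sql.SparkSession

object NvlRewriteCheck extends App {
  val spark = SparkSession.builder().master("local[1]").appName("nvl-check").getOrCreate()

  // nvl is a RuntimeReplaceable expression: Spark substitutes its
  // replacement (coalesce) before building the physical plan, so a
  // native engine such as Comet only ever sees Coalesce.
  println(spark.sql("SELECT nvl(1, 2)").queryExecution.optimizedPlan)

  spark.stop()
}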


### hash_funcs
- [ ] crc32
- [ ] hash
Contributor:

hash should be supported.

Ah, the example for the hash function passes an array, which is not yet a supported type in Comet.

Maybe supporting nested types could be prioritized after we have full TPC-H and TPC-DS support.
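For illustration, the distinction is between scalar arguments and an array argument (a hedged sketch, not from this PR; assumes a local SparkSession):

import org.apache.spark.sql.SparkSession

object HashArgsCheck extends App {
  val spark = SparkSession.builder().master("local[1]").appName("hash-check").getOrCreate()

  // Scalar arguments: a shape Comet could execute natively.
  spark.sql("SELECT hash(1, 2)").show()

  // Array argument: arrays are a nested type that Comet does not yet
  // support, so this expression falls back to Spark.
  spark.sql("SELECT hash(array(1, 2))").show()

  spark.stop()
}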

@comphead (Author):

@andygrove @advancedxy I fixed the test, implementing extra parsing plus small manual tests where the parsing is complicated. I hope we now have a better picture.

@advancedxy (Contributor) left a comment:

Left some minor comments, otherwise LGTM.

@comphead (Author):

Thanks @advancedxy, I fixed the flaws you mentioned. However, I'd like to do the refactoring you recommended in a follow-up PR; this PR is getting too large for review.

@advancedxy (Contributor):

> Thanks @advancedxy, I fixed the flaws you mentioned. However, I'd like to do the refactoring you recommended in a follow-up PR; this PR is getting too large for review.

Of course, sounds good to me.

@advancedxy (Contributor) left a comment:

One minor comment, otherwise LGTM.

@advancedxy (Contributor) left a comment:

lgtm

comphead requested review from andygrove and viirya on May 28, 2024 at 15:50.
Comment on lines +60 to +62
- [x] regr_avgx
- [x] regr_avgy
- [x] regr_count
Member:

I don't think that we support these expressions

comphead (Author):

The test is exactly for YEAR, but if only YEAR is supported, what are we supposed to show to the user? Not supported?

comphead (Author):

test("regr_avgx") {
    Seq(false, true).foreach { dictionary =>
      withSQLConf(
        "parquet.enable.dictionary" -> dictionary.toString,
        "spark.comet.exec.shuffle.enabled" -> "true",
        CometConf.COMET_ENABLED.key -> "true",
        CometConf.COMET_EXEC_ENABLED.key -> "true",
        CometConf.COMET_SHUFFLE_ENFORCE_MODE_ENABLED.key -> "true",
        CometConf.COMET_EXEC_ALL_OPERATOR_ENABLED.key -> "true",
      ) {
        val table = "test"
        withTable(table) {
          sql(s"create table $table(a int, b int) using parquet")
          sql(s"insert into $table VALUES (1, 2), (2, 2), (2, 3), (2, 4)")
          checkSparkAnswerAndOperator(s"SELECT regr_avgx(a, b) FROM $table")
        }
      }
    }
  }

The regr_avgx test passes.

comphead (Author):

regr_avgx is supported by DataFusion:

> SELECT regr_avgx(1, 2);
+------------------------------+
| REGR_AVGX(Int64(1),Int64(2)) |
+------------------------------+
| 2.0                          |
+------------------------------+

So I think all is fair here.

Comment on lines 332 to 335
- [x] try_add
- [x] try_divide
- [x] try_multiply
- [x] try_subtract
Member:

I do not see any tests for these functions, and planner.rs seems to ignore the fail_on_error flag in the protobuf. If your tool says we support them, then we likely do support them, but not correctly. I will file an issue to look into this.

Member:

Actually, we already have an issue: #280

comphead (Author):

  test("try_add") {
    Seq(false, true).foreach { dictionary =>
      withSQLConf(
        "parquet.enable.dictionary" -> dictionary.toString,
        "spark.comet.exec.shuffle.enabled" -> "true") {
        val table = "test"
        withTable(table) {
          sql(s"create table $table(a int, b int) using parquet")
          sql(s"insert into $table VALUES (1, 2)")
          checkSparkAnswerAndOperator(s"SELECT try_add(a, b) FROM $table")
        }
      }
    }
  }

This test passes.

comphead (Author):

The plan is a Comet plan:

== Physical Plan ==
*(1) ColumnarToRow
+- CometProject [try_add(a, b)#57], [(a#40 + b#41) AS try_add(a, b)#57]
   +- CometScan parquet spark_catalog.default.test[a#40,b#41] Batched: true, DataFilters: [], Format: CometParquet, Location: InMemoryFileIndex(1 paths)[file:/Users/ovoievodin/dev/prj/apple/ovoievodin/rust/arrow-datafusion-..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<a:int,b:int>

The answer is the same as Spark's, which is 3.

comphead (Author):

So we do match the try_* functions; no fallback happens because we handle

case add @ Add(left, right, _)

where the last argument is responsible for TRY mode. So there is no fallback, but the calculation is not correct when overflow happens. I'll move the try tests to manual.
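To make the failure mode concrete (an illustrative sketch, not a test from this PR; assumes a local SparkSession): Spark's try_add returns NULL on integer overflow, whereas an engine that drops the TRY eval mode evaluates a plain add and wraps around:

import org.apache.spark.sql.SparkSession

object TryAddOverflowCheck extends App {
  val spark = SparkSession.builder().master("local[1]").appName("try-add-check").getOrCreate()

  // Int.MaxValue + 1 overflows a 32-bit integer. Under Spark's try_add
  // semantics this prints a single NULL value instead of failing.
  spark.sql("SELECT try_add(2147483647, 1)").show()
  // An implementation that ignores the TRY eval mode would instead wrap
  // to -2147483648, which is the incorrect overflow behavior described
  // above.

  spark.stop()
}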

Comment on lines +343 to +346
- [x] current_catalog
- [x] current_database
- [x] current_schema
- [x] current_user
Member:

We don't have a native implementation for these methods. I am guessing that the Spark planner replaces these with literal values, and we then see a CometProject for those literal values.

Member:

That's correct. These expressions are replaced with literals after the analysis stage.

comphead (Author):

Yes, those are purely Spark execution. Are we planning to natively support them in Comet?

@andygrove (Member) commented on Jun 4, 2024:

We cannot support these natively in Comet because these functions will never appear in the physical plan. Spark replaces them with literal values before this.
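This is visible directly from the plan (a minimal sketch, not part of this PR; assumes a local SparkSession): by the time the optimized plan exists, current_database() is already a string literal, so there is nothing left for Comet to implement:

import org.apache.spark.sql.SparkSession

object CurrentDatabaseCheck extends App {
  val spark = SparkSession.builder().master("local[1]").appName("current-db-check").getOrCreate()

  // Spark replaces current_database() with the literal database name
  // (e.g. "default") before physical planning, so the projection that
  // reaches Comet contains only a literal.
  println(spark.sql("SELECT current_database()").queryExecution.optimizedPlan)

  spark.stop()
}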

comphead (Author):

Sounds good, I'll remove them from the list.

- [ ] date_diff
- [ ] date_format
- [ ] date_from_unix_date
- [x] date_part
Member:

We only support date_part for YEAR

- [ ] dayofmonth
- [ ] dayofweek
- [ ] dayofyear
- [x] extract
Member:

We only support extract for YEAR
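For example (an illustrative sketch of the partial-support situation, not from this PR; assumes a local SparkSession): both queries return correct results, but per the comments above only the YEAR field would run natively, while other fields fall back to Spark:

import org.apache.spark.sql.SparkSession

object DatePartFieldsCheck extends App {
  val spark = SparkSession.builder().master("local[1]").appName("date-part-check").getOrCreate()

  // YEAR: the only field supported natively, per the review comments.
  spark.sql("SELECT date_part('YEAR', DATE'2024-06-05')").show()

  // Other fields such as MONTH are correct but execute in Spark.
  spark.sql("SELECT date_part('MONTH', DATE'2024-06-05')").show()

  spark.stop()
}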

@comphead (Author):

@andygrove I fixed all the comments. However, you are right: sometimes we only partially support a function, meaning part of its syntax or some value range is not supported.

Here is an idea for a follow-up PR: introduce a "partially supported" status (or similar), with the reason why the expression is only partially supported.

comphead requested a review from andygrove on June 3, 2024 at 16:37.
@@ -29,7 +29,7 @@ Comet aims to support:
- a native Parquet implementation, including both reader and writer
- full implementation of Spark operators, including
Filter/Project/Aggregation/Join/Exchange etc.
- full implementation of Spark built-in expressions
- [full implementation](../../../docs/spark_expressions_support.md) of Spark built-in expressions.
Member:

This won't build correctly:

/Users/andy/git/apache/datafusion-comet/docs/temp/user-guide/overview.md:32: WARNING: Unknown source document '../spark_expressions_support' [myst.xref_missing]

Let's revert this change for this PR and handle where we publish (user guide vs contributor guide) in a follow-up PR.

@andygrove (Member) left a comment:

Thanks @comphead. LGTM.

The content generated by this tool is helpful and I learned more about what we support from reviewing this PR.

My opinion is that the generated docs make sense for the contributor guide, to help guide people on where to contribute, but we should keep a human-curated version for the user guide where we can add more context about supported expressions. We can discuss this on a future PR.

@codecov-commenter:

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 34.23%. Comparing base (9ca63a2) to head (e8f3b77).
Report is 27 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main     #455      +/-   ##
============================================
+ Coverage     34.18%   34.23%   +0.04%     
+ Complexity      851      806      -45     
============================================
  Files           116      105      -11     
  Lines         38570    38488      -82     
  Branches       8531     8562      +31     
============================================
- Hits          13187    13175      -12     
+ Misses        22612    22554      -58     
+ Partials       2771     2759      -12     


comphead merged commit b3ba82f into apache:main on Jun 5, 2024
43 checks passed
himadripal pushed a commit to himadripal/datafusion-comet referencing this pull request on Sep 7, 2024:

Minor: Generate the supported Spark builtin expression list into MD file (apache#455)

* Coverage: Add a manual test for DF supporting Spark expressions directly


Co-authored-by: advancedxy <[email protected]>
Co-authored-by: Andy Grove <[email protected]>

Successfully merging this pull request may close these issues.

feat: Make spark builtin expression coverage report more readable