feat: Implement Spark-compatible CAST between integer types #340
Conversation
Hi @viirya @andygrove, first, please let me know if this PR aligns with the expectations on how to fix the issue; if not, kindly provide pointers on how I can move in the right direction. Also, regarding the errors in the CI pipeline: it looks like the error message for 3.2 is a bit different from 3.3 and 3.4 in overflow cases, so in the Rust code I will have to add a check for the Spark version and return the error message accordingly.
For this kind of case, we tend to return the same error message on the native side and handle this difference in the Scala tests.
As @viirya said, we can handle the difference in error message format in the Scala test (we already have examples of this in the …). For versions prior to 3.4, perhaps you could just check for the word "overflow" in the error message.
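A minimal sketch of the overflow-case check being suggested here, written as a hypothetical helper rather than the actual CometCastSuite code (`cometMessage` and `sparkMessage` stand for the two captured exception messages; the full comparison logic eventually adopted is shown later in the thread):

```scala
// Hypothetical helper, not the actual suite code: compare the two exception
// messages exactly on Spark 3.4+, and only by the "overflow" keyword before that.
def assertCompatibleError(cometMessage: String, sparkMessage: String): Unit = {
  if (CometSparkSessionExtensions.isSpark34Plus) {
    // Spark 3.4+ messages are expected to match exactly
    assert(cometMessage == sparkMessage)
  } else {
    // Pre-3.4 messages differ in wording, so just check both mention an overflow
    assert(sparkMessage.toLowerCase.contains("overflow"))
    assert(cometMessage.toLowerCase.contains("overflow"))
  }
}
```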
# Conflicts: # spark/src/test/scala/org/apache/comet/CometCastSuite.scala
Thanks for the inputs @viirya @andygrove. I have added another check in the assert statement where we compare exception messages. If this change is fine, I will modify the code comments to cover the "overflow" exception as well.
@@ -722,7 +746,8 @@ class CometCastSuite extends CometTestBase with AdaptiveSparkPlanHelper {
// Comet message is in format `The value 'VALUE' of the type FROM_TYPE cannot be cast to TO_TYPE`
// We just check that the comet message contains the same invalid value as the Spark message
val sparkInvalidValue = sparkMessage.substring(sparkMessage.indexOf(':') + 2)
Now that we are handling multiple types of error here, we should probably check if sparkMessage.indexOf(':')
returns a non-zero value before trying to use it.
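For illustration, a minimal sketch of that guard under the variable names from the snippet above (note that `indexOf` returns -1, not 0, when the character is absent, which is what the later commits check against):

```scala
// Hypothetical sketch of the suggested guard: only extract the invalid value
// when the Spark message actually contains ':' (indexOf returns -1 otherwise).
val idx = sparkMessage.indexOf(':')
val sparkInvalidValue =
  if (idx >= 0) sparkMessage.substring(idx + 2) else ""
```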
I think this general approach is OK to handle different types of expected errors. I left a specific comment on the code as well.
# Conflicts: # core/src/execution/datafusion/expressions/cast.rs
added a check before we fetch the sparkInvalidValue
Hi @andygrove, I have added a check before we fetch sparkInvalidValue, defaulting it to EMPTY_STRING if ':' is not present.
sparkMessage.substring(sparkMessage.indexOf(':') + 2)
}
assert(
  cometMessage.contains(sparkInvalidValue) || cometMessage.contains("overflow"))
If `sparkInvalidValue` is `EMPTY_STRING`, won't `cometMessage.contains(sparkInvalidValue)` always be true?
You are right, my bad 😅. So in case sparkMessage doesn't have ':', should I assert on just `cometMessage.contains("overflow")`? Something like this:
if (sparkMessage.indexOf(':') == -1) assert(cometMessage.contains("overflow"))
else assert(cometMessage.contains(sparkInvalidValue))
Yes, something like that. I haven't reviewed the overflow messages to see if they contain ':' though (in any of the Spark versions 3.2, 3.3, and 3.4).
Doesn't look like the overflow error message has ':' in it. I ran spark.sql("select cast(9223372036854775807 as int)").show() locally on various Spark versions:
3.4 - [CAST_OVERFLOW] The value 9223372036854775807L of the type "BIGINT" cannot be cast to "INT" due to an overflow. Use `try_cast` to tolerate overflow and return NULL instead. If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error
3.3 - The value 9223372036854775807L of the type "BIGINT" cannot be cast to "INT" due to an overflow. Use `try_cast` to tolerate overflow and return NULL instead. If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error
3.2 - Casting 9223372036854775807 to int causes overflow
# Conflicts: # core/src/execution/datafusion/expressions/cast.rs
@@ -665,6 +665,30 @@ class CometCastSuite extends CometTestBase with AdaptiveSparkPlanHelper {
castTest(generateTimestamps(), DataTypes.DateType)
}

test("cast short to byte") {
These methods already exist in main but have different naming, so I think you need to upmerge/rebase against main.
Example: `ignore("cast ShortType to ByteType")`
LGTM. I'll be happy to approve once upmerged/rebased against the main branch.
LGTM. Thanks @ganeshkumar269
Thanks @andygrove 🙏🏾, glad to be a contributor to Comet.
@ganeshkumar269 It looks like the error message check needs a little more work. Some tests are failing on Spark 3.3; this error message does not contain …
# Conflicts: # core/src/execution/datafusion/expressions/cast.rs
@ganeshkumar269 My original code for comparing errors in 3.2/3.3 was not very robust. I am also looking at this now to see if I can help improve these checks.
Is there a way I could help here?
@ganeshkumar269 here is my suggestion:

if (CometSparkSessionExtensions.isSpark34Plus) {
  // for Spark 3.4 we expect to reproduce the error message exactly
  assert(cometMessage == sparkMessage)
} else if (CometSparkSessionExtensions.isSpark33Plus) {
  // for Spark 3.3 we just need to strip the prefix from the Comet message
  // before comparing
  val cometMessageModified = cometMessage
    .replace("[CAST_INVALID_INPUT] ", "")
    .replace("[CAST_OVERFLOW] ", "")
  assert(cometMessageModified == sparkMessage)
} else if (CometSparkSessionExtensions.isSpark32) {
  // for Spark 3.2 we just make sure we are seeing a similar type of error
  if (sparkMessage.contains("causes overflow")) {
    assert(cometMessage.contains("due to an overflow"))
  } else {
    // assume that this is an invalid input message in the form:
    // `invalid input syntax for type numeric: -9223372036854775809`
    // we just check that the Comet message contains the same literal value
    val i = sparkMessage.indexOf(':') + 2
    assert(cometMessage.contains(sparkMessage.substring(i)))
  }
}

I tested this from the command line for all Spark versions using:
* handled cast for long to short
* handled cast for all overflow cases
* ran make format
* added check for overflow exception for 3.4 below.
* added comments on why we do overflow check. added a check before we fetch the sparkInvalidValue
* -1 instead of 0, -1 indicates the provided character is not present
* ran mvn spotless:apply
* check for presence of ':' and have asserts accordingly
* reusing existing test functions
* added one more check in assert when ':' is not present
* redo the compare logic as per andy's suggestions.

---------

Co-authored-by: ganesh.maddula <[email protected]>
Which issue does this PR close?
Closes #311
Rationale for this change
What changes are included in this PR?
How are these changes tested?
Added corresponding Scala test cases.
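For illustration only (not copied from the PR diff), the added tests follow the `castTest` pattern visible in the review snippets above; the `generateLongs()` helper name is an assumption:

```scala
// Hypothetical example of the shape of the added tests: cast generated long values
// to IntegerType and compare Comet's results (and error messages) against Spark's.
test("cast LongType to IntegerType") {
  castTest(generateLongs(), DataTypes.IntegerType)
}
```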