feat: Implement Spark-compatible CAST between integer types #340
Conversation
Hi @viirya @andygrove, first, please let me know if this PR aligns with the expectations on how to fix the issue; if not, kindly provide pointers on how I can move in the right direction. Also, regarding the errors in the CI pipeline: it looks like the error message for 3.2 is a bit different from 3.3 and 3.4 in overflow cases, so in the Rust code I will have to add a check for the Spark version and return the error message accordingly.
For this kind of case, we tend to return the same error message on the native side and handle this difference in the Scala tests.
As @viirya said, we can handle the difference in error message format in the Scala test (we already have examples of this in the …). For versions prior to 3.4, perhaps you could just check for the word "overflow" in the error message.
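A minimal sketch of the overflow-case check being suggested here, written as a hypothetical helper rather than the actual CometCastSuite code (`cometMessage` and `sparkMessage` stand for the two captured exception messages; the full comparison logic eventually adopted is shown later in the thread):

```scala
// Hypothetical helper, not the actual suite code: compare the two exception
// messages exactly on Spark 3.4+, and only by the "overflow" keyword before that.
def assertCompatibleError(cometMessage: String, sparkMessage: String): Unit = {
  if (CometSparkSessionExtensions.isSpark34Plus) {
    // Spark 3.4+ messages are expected to match exactly
    assert(cometMessage == sparkMessage)
  } else {
    // Pre-3.4 messages differ in wording, so just check both mention an overflow
    assert(sparkMessage.toLowerCase.contains("overflow"))
    assert(cometMessage.toLowerCase.contains("overflow"))
  }
}
```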
# Conflicts: # spark/src/test/scala/org/apache/comet/CometCastSuite.scala
Thanks for the inputs @viirya @andygrove. I have added another check in the assert statement where we compare exception messages. If this change is fine, I will modify the code comments to cover the "overflow" exception as well.
@@ -722,7 +746,8 @@ class CometCastSuite extends CometTestBase with AdaptiveSparkPlanHelper {
// Comet message is in format `The value 'VALUE' of the type FROM_TYPE cannot be cast to TO_TYPE`
// We just check that the comet message contains the same invalid value as the Spark message
val sparkInvalidValue = sparkMessage.substring(sparkMessage.indexOf(':') + 2)
Now that we are handling multiple types of error here, we should probably check if sparkMessage.indexOf(':')
returns a non-zero value before trying to use it.
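For illustration, a minimal sketch of that guard under the variable names from the snippet above (note that `indexOf` returns -1, not 0, when the character is absent, which is what the later commits check against):

```scala
// Hypothetical sketch of the suggested guard: only extract the invalid value
// when the Spark message actually contains ':' (indexOf returns -1 otherwise).
val idx = sparkMessage.indexOf(':')
val sparkInvalidValue =
  if (idx >= 0) sparkMessage.substring(idx + 2) else ""
```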
I think this general approach is OK to handle different types of expected errors. I left a specific comment on the code as well.
# Conflicts: # core/src/execution/datafusion/expressions/cast.rs
added a check before we fetch the sparkInvalidValue
Hi @andygrove, I have added a check before we fetch sparkInvalidValue, defaulting it to EMPTY_STRING if ':' is not present.
sparkMessage.substring(sparkMessage.indexOf(':') + 2)
}
assert(
  cometMessage.contains(sparkInvalidValue) || cometMessage.contains("overflow"))
If `sparkInvalidValue` is `EMPTY_STRING`, won't `cometMessage.contains(sparkInvalidValue)` always be true?
You are right, my bad 😅. So in case sparkMessage doesn't have ':', should I assert on just `cometMessage.contains("overflow")`? Something like this:
if (sparkMessage.indexOf(':') == -1) assert(cometMessage.contains("overflow"))
else assert(cometMessage.contains(sparkInvalidValue))
Yes, something like that. I haven't reviewed the overflow messages to see if they contain ':' though (in any of the Spark versions 3.2, 3.3, and 3.4).
Doesn't look like the overflow error message has ':' in it. I ran spark.sql("select cast(9223372036854775807 as int)").show() locally on various Spark versions:
3.4 - [CAST_OVERFLOW] The value 9223372036854775807L of the type "BIGINT" cannot be cast to "INT" due to an overflow. Use `try_cast` to tolerate overflow and return NULL instead. If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error
3.3 - The value 9223372036854775807L of the type "BIGINT" cannot be cast to "INT" due to an overflow. Use `try_cast` to tolerate overflow and return NULL instead. If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error
3.2 - Casting 9223372036854775807 to int causes overflow
# Conflicts: # core/src/execution/datafusion/expressions/cast.rs
@@ -665,6 +665,30 @@ class CometCastSuite extends CometTestBase with AdaptiveSparkPlanHelper {
castTest(generateTimestamps(), DataTypes.DateType)
}

test("cast short to byte") {
These methods already exist in main but have different naming, so I think you need to upmerge/rebase against main.
Example: `ignore("cast ShortType to ByteType")`
LGTM. I'll be happy to approve once upmerged/rebased against the main branch.
LGTM. Thanks @ganeshkumar269
Thanks @andygrove 🙏🏾, glad to be a contributor to Comet.
@ganeshkumar269 It looks like the error message check needs a little more work. Some tests are failing on Spark 3.3; this error message does not contain …
# Conflicts: # core/src/execution/datafusion/expressions/cast.rs
@ganeshkumar269 My original code for comparing errors in 3.2/3.3 was not very robust. I am also looking at this now to see if I can help improve these checks.
Is there a way I could help here?
@ganeshkumar269 here is my suggestion:

if (CometSparkSessionExtensions.isSpark34Plus) {
  // for Spark 3.4 we expect to reproduce the error message exactly
  assert(cometMessage == sparkMessage)
} else if (CometSparkSessionExtensions.isSpark33Plus) {
  // for Spark 3.3 we just need to strip the prefix from the Comet message
  // before comparing
  val cometMessageModified = cometMessage
    .replace("[CAST_INVALID_INPUT] ", "")
    .replace("[CAST_OVERFLOW] ", "")
  assert(cometMessageModified == sparkMessage)
} else if (CometSparkSessionExtensions.isSpark32) {
  // for Spark 3.2 we just make sure we are seeing a similar type of error
  if (sparkMessage.contains("causes overflow")) {
    assert(cometMessage.contains("due to an overflow"))
  } else {
    // assume that this is an invalid input message in the form:
    // `invalid input syntax for type numeric: -9223372036854775809`
    // we just check that the Comet message contains the same literal value
    val i = sparkMessage.indexOf(':') + 2
    assert(cometMessage.contains(sparkMessage.substring(i)))
  }
}

I tested this from the command line for all Spark versions using:
* handled cast for long to short
* handled cast for all overflow cases
* ran make format
* added check for overflow exception for 3.4 below.
* added comments on why we do overflow check. added a check before we fetch the sparkInvalidValue
* -1 instead of 0, -1 indicates the provided character is not present
* ran mvn spotless:apply
* check for presence of ':' and have asserts accordingly
* reusing existing test functions
* added one more check in assert when ':' is not present
* redo the compare logic as per andy's suggestions.

---------

Co-authored-by: ganesh.maddula <[email protected]>
Which issue does this PR close?
Closes #311
Rationale for this change
What changes are included in this PR?
How are these changes tested?
Added corresponding Scala test cases.
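For illustration only (not copied from the PR diff), the added tests follow the `castTest` pattern visible in the review snippets above; the `generateLongs()` helper name is an assumption:

```scala
// Hypothetical example of the shape of the added tests: cast generated long values
// to IntegerType and compare Comet's results (and error messages) against Spark's.
test("cast LongType to IntegerType") {
  castTest(generateLongs(), DataTypes.IntegerType)
}
```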