feat: Implement Spark-compatible CAST from String to Date #383
Conversation
@@ -107,7 +108,23 @@ macro_rules! cast_utf8_to_timestamp {
    result
}};
}

macro_rules! cast_utf8_to_date {
Any reason why this is a macro and not a function? I see only one usage (maybe I missed something).
Hi @parthchandra, thanks for the review. I am new to Rust (and this is also my first attempt at an OSS contribution).
This was my first stab at the issue and I was imitating the code from cast_utf8_to_timestamp. Removed it now.
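For readers following along, here is a minimal sketch of what the macro-to-function refactor can look like. `date_parser`, `EvalMode`, and `CometResult` are names from this PR; the builder-based body and the signature are illustrative assumptions, not the merged code.

```rust
use arrow::array::{Array, Date32Array, Date32Builder, StringArray};

// Illustrative function form of the former cast_utf8_to_date! macro: with a
// single call site, a plain function is easier to read and gets full type
// checking. The signature is an assumption, not the PR's exact code.
fn cast_utf8_to_date(
    array: &StringArray,
    eval_mode: EvalMode,
) -> CometResult<Date32Array> {
    let mut builder = Date32Builder::with_capacity(array.len());
    for i in 0..array.len() {
        if array.is_null(i) {
            builder.append_null();
        } else {
            match date_parser(array.value(i), eval_mode)? {
                Some(days) => builder.append_value(days), // days since epoch
                None => builder.append_null(),            // unparseable input
            }
        }
    }
    Ok(builder.finish())
}
```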
> I am new to Rust

So am I :). Welcome to the community!
{
    Self::spark_cast_int_to_int(&array, self.eval_mode, from_type, to_type)?
}
if self.eval_mode != EvalMode::Try =>
Seems like unnecessary re-formatting (multiple places).
Cleaned this up now with `cargo fmt`.
This PR is still in progress. I added support for String to Date32.
Hi @parthchandra @andygrove, can you please review whether this is going in the right direction? Questions:
@vidyasankarv Yes, I would say this is going in a good direction based on a very quick review. I will try and find more time tomorrow for a deeper look. To answer your questions:
@@ -954,13 +993,63 @@ fn parse_str_to_time_only_timestamp(value: &str) -> CometResult<Option<i64>> {
    Ok(Some(timestamp))
}

fn date_parser(value: &str, eval_mode: EvalMode) -> CometResult<Option<i32>> {
I wasn't familiar with Spark's string to date conversion so I took a look (https://github.com/apache/spark/blob/9d79ab42b127d1a12164cec260bfbd69f6da8b74/sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala#L312).
From the comment, the allowed formats are:
* `[+-]yyyy*`
* `[+-]yyyy*-[m]m`
* `[+-]yyyy*-[m]m-[d]d`
* `[+-]yyyy*-[m]m-[d]d `
* `[+-]yyyy*-[m]m-[d]d *`
* `[+-]yyyy*-[m]m-[d]dT*`
I honestly don't know what a string with a plus/minus at the beginning of the date even means, but you might want to handle that case.
Also, the max number of digits allowed for the year is 7.
Finally, once you've got the 'day' segment of the date you may have a ' ' or 'T' (you're only handling the latter), and the characters after that are discarded.
It looks to me like Spark's custom implementation might be slightly faster since it manages to achieve the split of the string into segments and the parsing of the digits in a single pass (it also does not need to prepare the parser with a format string). You might want to consider doing the same; a minimal sketch of the single-pass idea follows below.
You're pretty close to handling these cases (see my comment in the review).
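To make the single-pass idea concrete, here is a hedged sketch with illustrative names (not the PR's code); segment-length and range validation, the 7-digit year cap, and defaulting a missing month/day to 1 are elided for brevity.

```rust
// Single-pass sketch: walk the bytes once, splitting on '-' into up to
// three segments (year, month, day) while accumulating digits, and discard
// anything after a ' ' or 'T' that follows the day segment.
fn parse_date_segments(s: &str) -> Option<(i32, u32, u32)> {
    let bytes = s.trim().as_bytes();
    if bytes.is_empty() {
        return None;
    }
    // leading '+'/'-' sign on the year
    let (sign, mut i) = match bytes[0] {
        b'-' => (-1i32, 1),
        b'+' => (1i32, 1),
        _ => (1i32, 0),
    };
    let mut segments = [0i32; 3];
    let mut seg = 0;
    while i < bytes.len() {
        let b = bytes[i];
        if b == b'-' && seg < 2 {
            seg += 1; // move from year to month, or month to day
        } else if seg == 2 && (b == b' ' || b == b'T') {
            break; // the "[ T]*" suffix after the day is discarded
        } else if b.is_ascii_digit() {
            segments[seg] = segments[seg] * 10 + (b - b'0') as i32;
        } else {
            return None; // any other character is invalid
        }
        i += 1;
    }
    Some((sign * segments[0], segments[1] as u32, segments[2] as u32))
}

fn main() {
    assert_eq!(parse_date_segments("2024-5-3"), Some((2024, 5, 3)));
    assert_eq!(parse_date_segments("2024-05-03T12:00"), Some((2024, 5, 3)));
    assert_eq!(parse_date_segments("not_a_date"), None);
}
```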
Hi @parthchandra @andygrove, made changes as suggested. This PR does not support fuzz tests in CometCastSuite for dates.
In ANSI mode, any format validation failure returns an error, whereas the other modes return None. @parthchandra, regarding your points above:
Can you please take another look at the PR? Thank you.
Thanks @vidyasankarv. I plan on carefully reviewing this later today.
current_segment += 1;
} else {
    // increment value of current segment by the next digit
    let parsed_value = (b - b'0') as i32;
This line triggers an overflow panic if `b` is less than `'0'`. It looks like this code assumes that `b` is a digit, but with the input `3/`, it failed here on processing `/`. Perhaps check that `b` is a digit first?
@andygrove added a check for ASCII digits and some negative test cases around this in the Rust code. Thank you.
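For reference, a minimal runnable sketch of that guard (illustrative names, not the PR's exact code):

```rust
// Check the byte is an ASCII digit before subtracting b'0'; without the
// guard, `b - b'0'` underflows on u8 (panicking in debug builds) for
// non-digit input such as the '/' in "3/".
fn digit_value(b: u8) -> Option<i32> {
    if b.is_ascii_digit() {
        Some((b - b'0') as i32)
    } else {
        None // caller rejects the string as an invalid date
    }
}

fn main() {
    assert_eq!(digit_value(b'7'), Some(7));
    assert_eq!(digit_value(b'/'), None); // the failing case from "3/"
}
```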
castTest(generateStrings(datePattern, 8).toDF("a"), DataTypes.DateType)
castTest(
  Seq(
    "262142-01-01",
Could you add some invalid entries here as well so that we can ensure that Comet throws errors for invalid inputs when ANSI is enabled?
Some suggestions:
"",
"0",
"not_a_date",
Added additional negative test cases as suggested.
@@ -119,7 +119,7 @@ object CometCast {
      Unsupported
    case DataTypes.DateType =>
      // https://github.com/apache/datafusion-comet/issues/327
-     Unsupported
+     Compatible()
Although it seems we are compatible for most common use cases, we are not 100% compatible, so we should add a note here.
Suggested change:
- Compatible()
+ Compatible(Some("Only supports years between 262143 BC and 262142 AD"))
If we do not cover these cases in this PR, we should add a test with `ignore`.
Changed `Compatible` as suggested, and also added back an `ignore` test case for the fuzz test as a placeholder.
// https://github.com/apache/datafusion-comet/issues/327
castTest(generateStrings(datePattern, 8).toDF("a"), DataTypes.DateType)
Should we keep `generateStrings` because that covers random value tests?
The `generateStrings` method wasn't removed, just the fuzz test for dates. The fuzz test for dates is now added back as an `ignore` test, since some dates are not supported.
I left a suggestion for adding the fuzzing back but filtering out values that we know are not supported.
I think that we can remove the ignored test if everyone is happy with the suggested changes.
Removed now and added a filtered fuzz test.
Hi @andygrove, these particular test cases result in failure, like when I run with this sample test. However, when the CometCastSuite test for String to Date runs, it fails for the combination of Comet ANSI enabled without try_cast. I have tried to debug concurrently in CLion (https://github.com/apache/datafusion-comet/blob/main/docs/source/contributor-guide/debugging.md), but the breakpoints show as disabled and aren't hitting; I tried switching to lldb in the CLion toolchain too, but no help. I added some additional logging locally and can see that it returns a CometError for the invalid value on the Rust side as expected, but returns None on the Comet side when running from the Scala test suite. Is there something else I could be missing in terms of configuration? I'd appreciate any help when you get some time. Thank you.
} else if let Ok(Some(cast_value)) =
    date_parser(string_array.value(i), eval_mode)
@vidyasankarv This is the fix you need to make your current test pass.
The problem was that we were ignoring any error here when running in ANSI mode.
Suggested change:
- } else if let Ok(Some(cast_value)) =
-     date_parser(string_array.value(i), eval_mode)
+ } else if let Some(cast_value) = date_parser(string_array.value(i), eval_mode)?
Oh sorry, my mistake, I missed checking here.
Handling all cases of `date_parser`'s return value with a match clause now.
Thank you @andygrove
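A hedged sketch of that control flow — the builder and helper names here are illustrative, while `date_parser`, `EvalMode`, and `CometResult` are names from this PR:

```rust
use arrow::array::Date32Builder;

// Handle every variant of date_parser's CometResult<Option<i32>> explicitly,
// so an Err raised in ANSI mode propagates instead of being swallowed by an
// `if let Ok(..)` pattern.
fn append_parsed_date(
    builder: &mut Date32Builder,
    value: &str,
    eval_mode: EvalMode,
) -> CometResult<()> {
    match date_parser(value, eval_mode) {
        Ok(Some(days)) => builder.append_value(days), // valid date
        Ok(None) => builder.append_null(),            // invalid, LEGACY/TRY -> null
        Err(e) => return Err(e),                      // invalid, ANSI -> error
    }
    Ok(())
}
```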
"2020-mar-20", | ||
"not_a_date", | ||
"T2") | ||
castTest((validDates ++ invalidDates).toDF("a"), DataTypes.DateType) |
Let's add fuzzing back here, but filter out values that we know we cannot support.
Suggested change:
- castTest((validDates ++ invalidDates).toDF("a"), DataTypes.DateType)
+ // due to limitations of NaiveDate we only support years between 262143 BC and 262142 AD
+ // so we can't test all possible fuzz dates
+ val unsupportedYearPattern: Regex = "^\\s*[0-9]{5,}".r
+ val fuzzDates = generateStrings(datePattern, 8)
+   .filterNot(str => unsupportedYearPattern.findFirstMatchIn(str).isDefined)
+ castTest((validDates ++ invalidDates ++ fuzzDates).toDF("a"), DataTypes.DateType)
Thank you for the suggestion @andygrove, incorporated as suggested.
@@ -563,8 +565,54 @@ class CometCastSuite extends CometTestBase with AdaptiveSparkPlanHelper {
    castTest(generateStrings(numericPattern, 8).toDF("a"), DataTypes.BinaryType)
  }

- ignore("cast StringType to DateType") {
+ test("cast StringType to DateType") {
    // https://github.com/apache/datafusion-comet/issues/327
nit: let's move this comment with the issue number to `cast StringType to DateType - Fuzz Test`
The `ignore` test has been done away with, per @andygrove's suggestion above to use a filtered fuzz test, so removed this now.
// a string to date parser - port of Spark's SparkDateTimeUtils#stringToDate.
fn date_parser(date_str: &str, eval_mode: EvalMode) -> CometResult<Option<i32>> {
    // local functions
    fn get_trimmed_start(bytes: &[u8]) -> usize {
Is this defined because we cannot use `String#trim()` or `trim_matches`?
My understanding is that this is a direct port of Spark's logic. Spark skips characters rather than calling trim because it is more efficient (avoids extra memory allocation). It is possible that the code could be optimized more to take advantage of zero-cost abstractions in Rust, but I think we should look at optimizations as a follow up if we determine that performance needs improving.
@kazuyukitanimura yes, this is a port of the Scala implementation (https://github.com/apache/spark/blob/7e79e91dc8c531ee9135f0e32a9aa2e1f80c4bbf/sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala#L312), as suggested by @parthchandra in a previous comment.
Leaving as is for now based on the above comment from @andygrove; hope that's ok with you too.
> Spark skips characters rather than calling trim because it is more efficient (avoids extra memory allocation).

Effectively a str slice.
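As an illustration of that point, a small self-contained sketch (illustrative names, not the PR's code): trimming by scanning byte offsets returns a sub-slice of the original `&str`, with no new allocation.

```rust
// Trim ASCII whitespace by computing start/end offsets, then return a
// zero-copy sub-slice of the input, much like Spark tracks offsets instead
// of allocating a trimmed string.
fn trimmed(s: &str) -> &str {
    let bytes = s.as_bytes();
    let start = bytes
        .iter()
        .position(|b| !b.is_ascii_whitespace())
        .unwrap_or(bytes.len());
    let end = bytes
        .iter()
        .rposition(|b| !b.is_ascii_whitespace())
        .map_or(start, |i| i + 1);
    &s[start..end]
}

fn main() {
    assert_eq!(trimmed("  2024-05-10  "), "2024-05-10");
    assert_eq!(trimmed("   "), "");
}
```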
Codecov Report
All modified and coverable lines are covered by tests ✅

Additional details and impacted files:
@@ Coverage Diff @@
##             main     #383      +/-   ##
============================================
+ Coverage   34.02%   34.17%   +0.15%
+ Complexity    857      850       -7
============================================
  Files         116      116
  Lines       38565    38547     -18
  Branches     8517     8523      +6
============================================
+ Hits        13120    13174     +54
+ Misses      22691    22608     -83
- Partials     2754     2765     +11

☔ View full report in Codecov by Sentry.
There is one test failure with JDK 8 / Spark 3.2:
@vidyasankarv I would suggest that we skip the test for now when running against Spark 3.2 and file a follow-on issue to fix 3.2 compatibility (this may not be a high priority since 3.2 is quite old and we should consider dropping support for it at some point). You can add an assume to the test:

test("cast StringType to DateType") {
  assume(CometSparkSessionExtensions.isSpark33Plus)

It would be good to add a comment in here as well with a link to the follow-on issue (could you file that?)
@andygrove thank you for the suggestions. Filed issue #440 and linked it in the test.
@andygrove From the logs for ubuntu-latest/java 17-spark-3.4-scala-2.12 (https://github.com/apache/datafusion-comet/suites/23883332179/logs?attempt=2), the fuzz test fails.
Additionally, the same sample dates from above pass when run without the fuzz test (https://github.com/apache/datafusion-comet/pull/383/files#diff-41ecdd113d7a7afe33447e34f1ff0b5ed3033a89bfbcefa9e7e259d7a6e4daecR577-R585), and the test report shows them as null on the Comet side in the comparison when the fuzz tests are included (https://github.com/apache/datafusion-comet/actions/runs/9123082449/job/25132235801). The same pattern appears if you search for those dates in the report.
I have spent a fair amount of time trying to understand why this is happening, but am unable to identify the issue. I have added some more samples from the fuzz dates into my current unit tests in the Rust tests and CometCastSuite. I have pushed this build with the fuzz test removed to see whether the build passes, so I might need some help identifying the issue with the fuzz test.
LGTM
Thanks @vidyasankarv
Thanks @vidyasankarv. I plan on looking into this tomorrow. Overall, the PR looks good.
@vidyasankarv I am also very confused... values that fail in the fuzz test work in the other test 🤔 I am debugging and will let you know when I get to the bottom of this mystery.
@vidyasankarv I figured out what the issue is. I don't fully understand why, but when the fuzz test creates the DataFrame, the cast operation that gets performed is from a dictionary array, not a string array:
This means that we are not even calling your native date_parser but instead falling through to this catchall logic:
The solution is to add a specific match for casting dictionary to date:

(
    DataType::Dictionary(key_type, value_type),
    DataType::Date32,
) if key_type.as_ref() == &DataType::Int32
    && (value_type.as_ref() == &DataType::Utf8
        || value_type.as_ref() == &DataType::LargeUtf8) =>
{
    match value_type.as_ref() {
        DataType::Utf8 => {
            // unpack the dictionary to a plain Utf8 array first
            let unpacked_array =
                cast_with_options(&array, &DataType::Utf8, &CAST_OPTIONS)?;
            Self::cast_string_to_date(&unpacked_array, to_type, self.eval_mode)?
        }
        DataType::LargeUtf8 => {
            let unpacked_array =
                cast_with_options(&array, &DataType::LargeUtf8, &CAST_OPTIONS)?;
            Self::cast_string_to_date(&unpacked_array, to_type, self.eval_mode)?
        }
        dt => unreachable!(
            "{}",
            format!("invalid value type {dt} for dictionary-encoded string array")
        ),
    }
},
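To see why this arm is needed, here is a standalone sketch (using plain arrow-rs APIs, not Comet's code) showing that a string column with repeated values can arrive dictionary-encoded, and that a `cast` to Utf8 unpacks it for the string-to-date path:

```rust
use std::sync::Arc;

use arrow::array::{Array, ArrayRef, DictionaryArray, StringArray};
use arrow::compute::cast;
use arrow::datatypes::{DataType, Int32Type};

fn main() {
    // Repeated values are a natural fit for dictionary encoding, which is
    // what the fuzz test's DataFrame produced.
    let dict: DictionaryArray<Int32Type> =
        vec!["2024-01-01", "2024-01-02", "2024-01-01"]
            .into_iter()
            .collect();
    let array: ArrayRef = Arc::new(dict);
    assert!(matches!(array.data_type(), DataType::Dictionary(_, _)));

    // Unpack to a plain Utf8 array, as the match arm above does with
    // cast_with_options, before running the string-to-date parser.
    let unpacked = cast(&array, &DataType::Utf8).unwrap();
    let strings = unpacked.as_any().downcast_ref::<StringArray>().unwrap();
    assert_eq!(strings.value(0), "2024-01-01");
}
```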
@andygrove thank you very much for looking into this. Tested the fuzz test with your suggestions and it is working now. Pushed the changes in the latest commit 88af45c.
LGTM pending CI. Thank you for your patience @vidyasankarv.
Once this is merged I will rebase #461, which would have prevented some of the issues we ran into on this PR.
@andygrove @parthchandra @kazuyukitanimura thank you for the reviews and support in helping me through my first open source contribution. It's been a great learning experience. Still trying to grasp all the new things I learnt from this seemingly simple change.
The date_parser was introduced in #383 and is mostly a direct port of code in Spark. Since the original code runs on the JVM, it has defined integer-overflow behavior (wrapping). The proposed fix is to use std::num::Wrapping to get the same wrapping behavior in Rust. The overflowed value will still be discarded in a later check that uses `current_segment_digits`, so allowing the overflow does not lead to correctness issues. This resolves one of the overflows discussed in #481.
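A minimal sketch of the fix described above (illustrative values, not the PR's exact code):

```rust
use std::num::Wrapping;

fn main() {
    // JVM int arithmetic wraps on overflow; plain i32 arithmetic in Rust
    // panics in debug builds. Wrapping<i32> reproduces the JVM behavior,
    // and the wrapped value is later rejected by the digit-count check.
    let mut segment = Wrapping(i32::MAX);
    segment = segment * Wrapping(10) + Wrapping(7); // would panic with plain i32
    println!("wrapped segment value: {}", segment.0);
}
```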
* support casting DateType in comet
* Use NaiveDate methods for parsing dates and remove regex
* remove macro for date parsing
* compute correct days since epoch.
* Run String to Date test without fuzzy test
* port spark string to date processing logic for cast
* put in fixes for clippy and scalafix issues.
* handle non digit characters when parsing segement bytes.
* add note on compatability
* add negative tests for String to Date test
* simplify byte digit check
* propagate error correctly when a date cannot be parsed
* add fuzz test with unsupported date filtering
* add test for string array cast to date
* use UNIX_EPOCH constant from NaiveDateTime
* fix cargo clippy error - collapse else if
* do not run string to date test on spark-3.2
* do not run string to date test on spark-3.2
* add failing fuzz test dates to tests and remove failing fuzz test
* remove unused date pattern
* add specific match for casting dictionary to date

(cherry picked from commit a7272b9)
Which issue does this PR close?
Closes #327.
Rationale for this change
What changes are included in this PR?
How are these changes tested?