feat: Implement Spark-compatible CAST from String to Date #383
Conversation
@@ -107,7 +108,23 @@ macro_rules! cast_utf8_to_timestamp {
    result
}};
}

macro_rules! cast_utf8_to_date {
Any reason why this is a macro and not a function? I see only one usage (maybe I missed something).
Hi @parthchandra, thanks for the review. I am new to Rust (and this is also my first attempt at an OSS contribution).
This was my first stab at the issue and I was imitating the code from cast_utf8_to_timestamp. Removed it now.
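For readers following along, here is a minimal sketch of what the macro-to-function refactor can look like. `date_parser`, `EvalMode`, and `CometResult` are names from this PR; the builder-based body and the signature are illustrative assumptions, not the merged code.

```rust
use arrow::array::{Array, Date32Array, Date32Builder, StringArray};

// Illustrative function form of the former cast_utf8_to_date! macro: with a
// single call site, a plain function is easier to read and gets full type
// checking. The signature is an assumption, not the PR's exact code.
fn cast_utf8_to_date(
    array: &StringArray,
    eval_mode: EvalMode,
) -> CometResult<Date32Array> {
    let mut builder = Date32Builder::with_capacity(array.len());
    for i in 0..array.len() {
        if array.is_null(i) {
            builder.append_null();
        } else {
            match date_parser(array.value(i), eval_mode)? {
                Some(days) => builder.append_value(days), // days since epoch
                None => builder.append_null(),            // unparseable input
            }
        }
    }
    Ok(builder.finish())
}
```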
> I am new to Rust

So am I :). Welcome to the community!
{
    Self::spark_cast_int_to_int(&array, self.eval_mode, from_type, to_type)?
}
if self.eval_mode != EvalMode::Try =>
Seems like unnecessary re-formatting (multiple places).
Cleaned this up now with `cargo fmt`.
This PR is still in progress. I added support for String to Date32.
Hi @parthchandra @andygrove, can you please review whether this is going in the right direction? Questions:
@vidyasankarv Yes, I would say this is going in a good direction based on a very quick review. I will try and find more time tomorrow for a deeper look. To answer your questions:
@@ -954,13 +993,63 @@ fn parse_str_to_time_only_timestamp(value: &str) -> CometResult<Option<i64>> {
    Ok(Some(timestamp))
}

fn date_parser(value: &str, eval_mode: EvalMode) -> CometResult<Option<i32>> {
I wasn't familiar with Spark's string to date conversion so I took a look (https://github.com/apache/spark/blob/9d79ab42b127d1a12164cec260bfbd69f6da8b74/sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala#L312).
From the comment, the allowed formats are:
* `[+-]yyyy*`
* `[+-]yyyy*-[m]m`
* `[+-]yyyy*-[m]m-[d]d`
* `[+-]yyyy*-[m]m-[d]d `
* `[+-]yyyy*-[m]m-[d]d *`
* `[+-]yyyy*-[m]m-[d]dT*`
I honestly don't know what a string with a plus/minus at the beginning of the date even means, but you might want to handle that case.
Also, the max number of digits allowed for the year is 7.
Finally, once you've got the 'day' segment of the date you may have a ' ' or 'T' (you're only handling the latter), and the characters after that are discarded.
It looks to me like Spark's custom implementation might be slightly faster since it manages to achieve the split of the string into segments and the parsing of the digits in a single pass (it also does not need to prepare the parser with a format string). You might want to consider doing the same; a minimal sketch of the single-pass idea follows below.
You're pretty close to handling these cases (see my comment in the review).
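To make the single-pass idea concrete, here is a hedged sketch with illustrative names (not the PR's code); segment-length and range validation, the 7-digit year cap, and defaulting a missing month/day to 1 are elided for brevity.

```rust
// Single-pass sketch: walk the bytes once, splitting on '-' into up to
// three segments (year, month, day) while accumulating digits, and discard
// anything after a ' ' or 'T' that follows the day segment.
fn parse_date_segments(s: &str) -> Option<(i32, u32, u32)> {
    let bytes = s.trim().as_bytes();
    if bytes.is_empty() {
        return None;
    }
    // leading '+'/'-' sign on the year
    let (sign, mut i) = match bytes[0] {
        b'-' => (-1i32, 1),
        b'+' => (1i32, 1),
        _ => (1i32, 0),
    };
    let mut segments = [0i32; 3];
    let mut seg = 0;
    while i < bytes.len() {
        let b = bytes[i];
        if b == b'-' && seg < 2 {
            seg += 1; // move from year to month, or month to day
        } else if seg == 2 && (b == b' ' || b == b'T') {
            break; // the "[ T]*" suffix after the day is discarded
        } else if b.is_ascii_digit() {
            segments[seg] = segments[seg] * 10 + (b - b'0') as i32;
        } else {
            return None; // any other character is invalid
        }
        i += 1;
    }
    Some((sign * segments[0], segments[1] as u32, segments[2] as u32))
}

fn main() {
    assert_eq!(parse_date_segments("2024-5-3"), Some((2024, 5, 3)));
    assert_eq!(parse_date_segments("2024-05-03T12:00"), Some((2024, 5, 3)));
    assert_eq!(parse_date_segments("not_a_date"), None);
}
```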
Hi @parthchandra @andygrove, made changes as suggested. This PR does not support fuzz tests in CometCastSuite for dates.
In ANSI mode, any format validation failure returns an error, whereas the other modes return None. @parthchandra, regarding your points above:
Can you please take another look at the PR? Thank you.
Thanks @vidyasankarv. I plan on carefully reviewing this later today.
current_segment += 1;
} else {
    // increment value of current segment by the next digit
    let parsed_value = (b - b'0') as i32;
This line triggers an overflow panic if `b` is less than `'0'`. It looks like this code assumes that `b` is a digit, but with the input `3/`, it failed here on processing `/`. Perhaps check that `b` is a digit first?
@andygrove added a check for ASCII digits and some negative test cases around this in the Rust code. Thank you.
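For reference, a minimal runnable sketch of that guard (illustrative names, not the PR's exact code):

```rust
// Check the byte is an ASCII digit before subtracting b'0'; without the
// guard, `b - b'0'` underflows on u8 (panicking in debug builds) for
// non-digit input such as the '/' in "3/".
fn digit_value(b: u8) -> Option<i32> {
    if b.is_ascii_digit() {
        Some((b - b'0') as i32)
    } else {
        None // caller rejects the string as an invalid date
    }
}

fn main() {
    assert_eq!(digit_value(b'7'), Some(7));
    assert_eq!(digit_value(b'/'), None); // the failing case from "3/"
}
```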
castTest(generateStrings(datePattern, 8).toDF("a"), DataTypes.DateType)
castTest(
  Seq(
    "262142-01-01",
Could you add some invalid entries here as well so that we can ensure that Comet throws errors for invalid inputs when ANSI is enabled?
Some suggestions:
"",
"0",
"not_a_date",
Added additional negative test cases as suggested.
@@ -119,7 +119,7 @@ object CometCast {
      Unsupported
    case DataTypes.DateType =>
      // https://github.com/apache/datafusion-comet/issues/327
-     Unsupported
+     Compatible()
Although it seems we are compatible for most common use cases, we are not 100% compatible, so we should add a note here.
Suggested change:
- Compatible()
+ Compatible(Some("Only supports years between 262143 BC and 262142 AD"))
If we do not cover these cases in this PR, we should add a test with `ignore`.
Changed `Compatible` as suggested, and also added back an `ignore` test case for the fuzz test as a placeholder.
// https://github.com/apache/datafusion-comet/issues/327
castTest(generateStrings(datePattern, 8).toDF("a"), DataTypes.DateType)
Should we keep `generateStrings` because that covers random value tests?
The `generateStrings` method wasn't removed, just the fuzz test for dates. The fuzz test for dates is now added back as an `ignore` test, since some dates are not supported.
I left a suggestion for adding the fuzzing back but filtering out values that we know are not supported.
I think that we can remove the ignored test if everyone is happy with the suggested changes.
Removed now and added a filtered fuzz test.
Hi @andygrove, these particular test cases result in failure, like when I run with this sample test. However, when the CometCastSuite test for String to Date runs, it fails for the combination of Comet ANSI enabled without try_cast. I have tried to debug concurrently in CLion (https://github.com/apache/datafusion-comet/blob/main/docs/source/contributor-guide/debugging.md), but the breakpoints show as disabled and aren't hitting; I tried switching to lldb in the CLion toolchain too, but no help. I added some additional logging locally and can see that it returns a CometError for the invalid value on the Rust side as expected, but returns None on the Comet side when running from the Scala test suite. Is there something else I could be missing in terms of configuration? I'd appreciate any help when you get some time. Thank you.
} else if let Ok(Some(cast_value)) =
    date_parser(string_array.value(i), eval_mode)
@vidyasankarv This is the fix you need to make your current test pass.
The problem was that we were ignoring any error here when running in ANSI mode.
Suggested change:
- } else if let Ok(Some(cast_value)) =
-     date_parser(string_array.value(i), eval_mode)
+ } else if let Some(cast_value) = date_parser(string_array.value(i), eval_mode)?
Oh sorry, my mistake, I missed checking here.
Handling all cases of `date_parser`'s return value with a match clause now.
Thank you @andygrove
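A hedged sketch of that control flow — the builder and helper names here are illustrative, while `date_parser`, `EvalMode`, and `CometResult` are names from this PR:

```rust
use arrow::array::Date32Builder;

// Handle every variant of date_parser's CometResult<Option<i32>> explicitly,
// so an Err raised in ANSI mode propagates instead of being swallowed by an
// `if let Ok(..)` pattern.
fn append_parsed_date(
    builder: &mut Date32Builder,
    value: &str,
    eval_mode: EvalMode,
) -> CometResult<()> {
    match date_parser(value, eval_mode) {
        Ok(Some(days)) => builder.append_value(days), // valid date
        Ok(None) => builder.append_null(),            // invalid, LEGACY/TRY -> null
        Err(e) => return Err(e),                      // invalid, ANSI -> error
    }
    Ok(())
}
```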
"2020-mar-20", | ||
"not_a_date", | ||
"T2") | ||
castTest((validDates ++ invalidDates).toDF("a"), DataTypes.DateType) |
Let's add fuzzing back here, but filter out values that we know we cannot support.
Suggested change:
- castTest((validDates ++ invalidDates).toDF("a"), DataTypes.DateType)
+ // due to limitations of NaiveDate we only support years between 262143 BC and 262142 AD
+ // so we can't test all possible fuzz dates
+ val unsupportedYearPattern: Regex = "^\\s*[0-9]{5,}".r
+ val fuzzDates = generateStrings(datePattern, 8)
+   .filterNot(str => unsupportedYearPattern.findFirstMatchIn(str).isDefined)
+ castTest((validDates ++ invalidDates ++ fuzzDates).toDF("a"), DataTypes.DateType)
Thank you for the suggestion @andygrove, incorporated as suggested.
@@ -563,8 +565,54 @@ class CometCastSuite extends CometTestBase with AdaptiveSparkPlanHelper {
    castTest(generateStrings(numericPattern, 8).toDF("a"), DataTypes.BinaryType)
  }

- ignore("cast StringType to DateType") {
+ test("cast StringType to DateType") {
    // https://github.com/apache/datafusion-comet/issues/327
nit: let's move this comment with the issue number to `cast StringType to DateType - Fuzz Test`
The `ignore` test has been done away with, per @andygrove's suggestion above to use a filtered fuzz test, so removed this now.
// a string to date parser - port of Spark's SparkDateTimeUtils#stringToDate.
fn date_parser(date_str: &str, eval_mode: EvalMode) -> CometResult<Option<i32>> {
    // local functions
    fn get_trimmed_start(bytes: &[u8]) -> usize {
Is this defined because we cannot use `String#trim()` or `trim_matches`?
My understanding is that this is a direct port of Spark's logic. Spark skips characters rather than calling trim because it is more efficient (avoids extra memory allocation). It is possible that the code could be optimized more to take advantage of zero-cost abstractions in Rust, but I think we should look at optimizations as a follow up if we determine that performance needs improving.
@kazuyukitanimura yes, this is a port of the Scala implementation (https://github.com/apache/spark/blob/7e79e91dc8c531ee9135f0e32a9aa2e1f80c4bbf/sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala#L312), as suggested by @parthchandra in a previous comment.
Leaving as is for now based on the above comment from @andygrove; hope that's ok with you too.
> Spark skips characters rather than calling trim because it is more efficient (avoids extra memory allocation).

Effectively a str slice.
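As an illustration of that point, a small self-contained sketch (illustrative names, not the PR's code): trimming by scanning byte offsets returns a sub-slice of the original `&str`, with no new allocation.

```rust
// Trim ASCII whitespace by computing start/end offsets, then return a
// zero-copy sub-slice of the input, much like Spark tracks offsets instead
// of allocating a trimmed string.
fn trimmed(s: &str) -> &str {
    let bytes = s.as_bytes();
    let start = bytes
        .iter()
        .position(|b| !b.is_ascii_whitespace())
        .unwrap_or(bytes.len());
    let end = bytes
        .iter()
        .rposition(|b| !b.is_ascii_whitespace())
        .map_or(start, |i| i + 1);
    &s[start..end]
}

fn main() {
    assert_eq!(trimmed("  2024-05-10  "), "2024-05-10");
    assert_eq!(trimmed("   "), "");
}
```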
Codecov Report
All modified and coverable lines are covered by tests ✅

Additional details and impacted files:
@@ Coverage Diff @@
##             main     #383      +/-   ##
============================================
+ Coverage   34.02%   34.17%   +0.15%
+ Complexity    857      850       -7
============================================
  Files         116      116
  Lines       38565    38547     -18
  Branches     8517     8523      +6
============================================
+ Hits        13120    13174     +54
+ Misses      22691    22608     -83
- Partials     2754     2765     +11

☔ View full report in Codecov by Sentry.
There is one test failure with JDK 8 / Spark 3.2:
@vidyasankarv I would suggest that we skip the test for now when running against Spark 3.2 and file a follow-on issue to fix 3.2 compatibility (this may not be a high priority since 3.2 is quite old and we should consider dropping support for it at some point). You can add an assume to the test:

test("cast StringType to DateType") {
  assume(CometSparkSessionExtensions.isSpark33Plus)

It would be good to add a comment in here as well with a link to the follow-on issue (could you file that?)
@andygrove thank you for the suggestions. Filed issue #440 and linked it in the test.
@andygrove From the logs for ubuntu-latest/java 17-spark-3.4-scala-2.12 (https://github.com/apache/datafusion-comet/suites/23883332179/logs?attempt=2), the fuzz test fails.
Additionally, the same sample dates from above pass when run without the fuzz test (https://github.com/apache/datafusion-comet/pull/383/files#diff-41ecdd113d7a7afe33447e34f1ff0b5ed3033a89bfbcefa9e7e259d7a6e4daecR577-R585), and the test report shows them as null on the Comet side in the comparison when the fuzz tests are included (https://github.com/apache/datafusion-comet/actions/runs/9123082449/job/25132235801). The same pattern appears if you search for those dates in the report.
I have spent a fair amount of time trying to understand why this is happening, but am unable to identify the issue. I have added some more samples from the fuzz dates into my current unit tests in the Rust tests and CometCastSuite. I have pushed this build with the fuzz test removed to see whether the build passes, so I might need some help identifying the issue with the fuzz test.
LGTM
Thanks @vidyasankarv
Thanks @vidyasankarv. I plan on looking into this tomorrow. Overall, the PR looks good.
@vidyasankarv I am also very confused... values that fail in the fuzz test work in the other test 🤔 I am debugging and will let you know when I get to the bottom of this mystery.
@vidyasankarv I figured out what the issue is. I don't fully understand why, but when the fuzz test creates the DataFrame, the cast operation that gets performed is from a dictionary array, not a string array:
This means that we are not even calling your native date_parser but instead falling through to this catchall logic:
The solution is to add a specific match for casting dictionary to date:

(
    DataType::Dictionary(key_type, value_type),
    DataType::Date32,
) if key_type.as_ref() == &DataType::Int32
    && (value_type.as_ref() == &DataType::Utf8
        || value_type.as_ref() == &DataType::LargeUtf8) =>
{
    match value_type.as_ref() {
        DataType::Utf8 => {
            // unpack the dictionary to a plain Utf8 array first
            let unpacked_array =
                cast_with_options(&array, &DataType::Utf8, &CAST_OPTIONS)?;
            Self::cast_string_to_date(&unpacked_array, to_type, self.eval_mode)?
        }
        DataType::LargeUtf8 => {
            let unpacked_array =
                cast_with_options(&array, &DataType::LargeUtf8, &CAST_OPTIONS)?;
            Self::cast_string_to_date(&unpacked_array, to_type, self.eval_mode)?
        }
        dt => unreachable!(
            "{}",
            format!("invalid value type {dt} for dictionary-encoded string array")
        ),
    }
},
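To see why this arm is needed, here is a standalone sketch (using plain arrow-rs APIs, not Comet's code) showing that a string column with repeated values can arrive dictionary-encoded, and that a `cast` to Utf8 unpacks it for the string-to-date path:

```rust
use std::sync::Arc;

use arrow::array::{Array, ArrayRef, DictionaryArray, StringArray};
use arrow::compute::cast;
use arrow::datatypes::{DataType, Int32Type};

fn main() {
    // Repeated values are a natural fit for dictionary encoding, which is
    // what the fuzz test's DataFrame produced.
    let dict: DictionaryArray<Int32Type> =
        vec!["2024-01-01", "2024-01-02", "2024-01-01"]
            .into_iter()
            .collect();
    let array: ArrayRef = Arc::new(dict);
    assert!(matches!(array.data_type(), DataType::Dictionary(_, _)));

    // Unpack to a plain Utf8 array, as the match arm above does with
    // cast_with_options, before running the string-to-date parser.
    let unpacked = cast(&array, &DataType::Utf8).unwrap();
    let strings = unpacked.as_any().downcast_ref::<StringArray>().unwrap();
    assert_eq!(strings.value(0), "2024-01-01");
}
```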
@andygrove thank you very much for looking into this. Tested the fuzz test with your suggestions and it is working now. Pushed the changes in the latest commit 88af45c.
LGTM pending CI. Thank you for your patience @vidyasankarv.
Once this is merged I will rebase #461, which would have prevented some of the issues we ran into on this PR.
@andygrove @parthchandra @kazuyukitanimura thank you for the reviews and support in helping me through my first open source contribution. It's been a great learning experience. Still trying to grasp all the new things I learnt from this seemingly simple change.
The date_parser was introduced in #383 and is mostly a direct port of code in Spark. Since the original code runs on the JVM, it has defined integer-overflow behavior (wrapping). The proposed fix is to use std::num::Wrapping to get the same wrapping behavior in Rust. The overflowed value will still be discarded in a later check that uses `current_segment_digits`, so allowing the overflow does not lead to correctness issues. This resolves one of the overflows discussed in #481.
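A minimal sketch of the fix described above (illustrative values, not the PR's exact code):

```rust
use std::num::Wrapping;

fn main() {
    // JVM int arithmetic wraps on overflow; plain i32 arithmetic in Rust
    // panics in debug builds. Wrapping<i32> reproduces the JVM behavior,
    // and the wrapped value is later rejected by the digit-count check.
    let mut segment = Wrapping(i32::MAX);
    segment = segment * Wrapping(10) + Wrapping(7); // would panic with plain i32
    println!("wrapped segment value: {}", segment.0);
}
```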
* support casting DateType in comet
* Use NaiveDate methods for parsing dates and remove regex
* remove macro for date parsing
* compute correct days since epoch.
* Run String to Date test without fuzzy test
* port spark string to date processing logic for cast
* put in fixes for clippy and scalafix issues.
* handle non digit characters when parsing segement bytes.
* add note on compatability
* add negative tests for String to Date test
* simplify byte digit check
* propagate error correctly when a date cannot be parsed
* add fuzz test with unsupported date filtering
* add test for string array cast to date
* use UNIX_EPOCH constant from NaiveDateTime
* fix cargo clippy error - collapse else if
* do not run string to date test on spark-3.2
* do not run string to date test on spark-3.2
* add failing fuzz test dates to tests and remove failing fuzz test
* remove unused date pattern
* add specific match for casting dictionary to date

(cherry picked from commit a7272b9)
Which issue does this PR close?
Closes #327.
Rationale for this change
What changes are included in this PR?
How are these changes tested?