Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Implement Spark-compatible CAST from string to timestamp types #335

Merged
merged 30 commits into from
May 2, 2024

Conversation

vaibhawvipul
Copy link
Contributor

Which issue does this PR close?

Closes #328 .

Rationale for this change

Improve compatibility with Apache Spark

What changes are included in this PR?

Add custom implementation of CAST from string to timestamp rather than delegate to DataFusion

How are these changes tested?

  • added unit test cases

@vaibhawvipul vaibhawvipul changed the title feat: Implement Spark-compatible CAST from string to integral types feat: Implement Spark-compatible CAST from string to timestamp types Apr 27, 2024
@andygrove
Copy link
Member

@vaibhawvipul You can run make format to fix the linting issues that are causing some CI tests to fail

return Ok(None);
}

// Define regex patterns and corresponding parsing functions
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The regex approach is a good way to quickly get support for all of the format variations and fix the correctness issue but could also be quite expensive. It would be good to add some criterion benchmarks so that we can understand what performance looks like (does not have to be part of this PR).

@vaibhawvipul
Copy link
Contributor Author

Update -

Working on this bug, there is 8 hours difference in comet vs spark cast. If anyone has any pointer, please let me know.

Screenshot 2024-04-29 at 10 19 56 AM

@andygrove
Copy link
Member

Update -

Working on this bug, there is 8 hours difference in comet vs spark cast. If anyone has any pointer, please let me know.

Screenshot 2024-04-29 at 10 19 56 AM

What happens if you explicitly set the timezone to UTC in the Spark configs?

We may want to fallback to Spark for other timezones for now.

@vaibhawvipul
Copy link
Contributor Author

Thanks @andygrove , setting timestamp as UTC worked!

Any idea about this error?

test("cast string to timestamp") {
    withSQLConf(
      (SQLConf.SESSION_LOCAL_TIMEZONE.key -> "UTC"),
      (CometConf.COMET_CAST_STRING_TO_TIMESTAMP.key -> "true")) {
      val values = Seq("2020-01-01T12:34:56.123456", "T2").toDF("a")
      castTest(values, DataTypes.TimestampType)
    }
  }
Screenshot 2024-04-29 at 10 23 06 PM

@vaibhawvipul
Copy link
Contributor Author

seems like, test is failing in ANSI mode, debugging that.

@vaibhawvipul
Copy link
Contributor Author

@andygrove

I have enabled a "string to timestamp" cast test case that passes. We are supporting all the timestamps mentioned in the issue.


Screenshot 2024-04-30 at 12 30 35 PM

Input Spark Comet
2020 2020-01-01 00:00:00.0 2020-01-01 00:00:00.0
2020-01 2020-01-01 00:00:00.0 2020-01-01 00:00:00.0
2020-01-01 2020-01-01 00:00:00.0 2020-01-01 00:00:00.0
2020-01-01T12 2020-01-01 12:00:00.0 2020-01-01 12:00:00.0
2020-01-01T12:34 2020-01-01 12:34:00.0 2020-01-01 12:34:00.0
2020-01-01T12:34:56 2020-01-01 12:34:56.0 2020-01-01 12:34:56.0
2020-01-01T12:34:56.123456 2020-01-01 12:34:56.123456 2020-01-01 12:34:56.123456
T2 2024-04-25 02:00:00.0 2024-04-25 02:00:00.0

Just a note -

  • Spark supports some additional inputs like - "213" or "85678". i.e 3 digit and 5 digit years too! This is not being supported right now by Comet in this PR.

@vaibhawvipul vaibhawvipul requested a review from andygrove April 30, 2024 07:09
@andygrove
Copy link
Member

seems like, test is failing in ANSI mode, debugging that.

There is an improved version of the ANSI testing in #351

@andygrove
Copy link
Member

Thanks @vaibhawvipul this is looking great. I will review today.

@andygrove
Copy link
Member

  • Spark supports some additional inputs like - "213" or "85678". i.e 3 digit and 5 digit years too! This is not being supported right now by Comet in this PR.

I would be fine with handling this in a follow on PR

@vaibhawvipul vaibhawvipul requested a review from andygrove May 1, 2024 16:20
@andygrove
Copy link
Member

This is looking great @vaibhawvipul. I think this is close to being ready to merge and then have some follow on issues for remaining items.

I think the one thing I would like to see before merging is that we can run the fuzz test without causing any panics, so just replacing unwraps with error handling.

@vaibhawvipul vaibhawvipul requested a review from andygrove May 2, 2024 14:45
@vaibhawvipul
Copy link
Contributor Author

This is looking great @vaibhawvipul. I think this is close to being ready to merge and then have some follow on issues for remaining items.

I think the one thing I would like to see before merging is that we can run the fuzz test without causing any panics, so just replacing unwraps with error handling.

@andygrove all unwraps are removed, I am handling errors in all of my functions.

Copy link
Member

@andygrove andygrove left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @vaibhawvipul. I think this is at a good point to merge and then we can file issues for follow up items, such as:

  • Add support for 3 and 5 digit years
  • Testing with try_cast and ANSI
  • Passing all fuzz tests
  • Updating compatibility guide

@andygrove andygrove merged commit 064cb47 into apache:main May 2, 2024
28 checks passed
@vaibhawvipul vaibhawvipul deleted the issue-328 branch May 2, 2024 16:16
@parthchandra
Copy link
Contributor

As a follow up, I would also suggest adding/verifying support for any timezone. (See this test for instance:

withSQLConf(SESSION_LOCAL_TIMEZONE.key -> "Asia/Kathmandu") {
)

Also see - https://docs.rs/arrow-cast/50.0.0/arrow_cast/parse/fn.string_to_datetime.html

@andygrove
Copy link
Member

I filed #376 for the follow on work

himadripal pushed a commit to himadripal/datafusion-comet that referenced this pull request Sep 7, 2024
…pache#335)

* casting str to timestamp

* fix format

* fixing failed tests, using char as pattern

* bug fixes

* hangling microsecond

* make format

* bug fixes and core refactor

* format code

* removing print statements

* clippy error

* enabling cast timestamp test case

* code refactor

* comet spark test case

* adding all the supported format in test

* fallback spark when timestamp not utc

* bug fix

* bug fix

* adding an explainer commit

* fix test case

* bug fix

* bug fix

* better error handling for unwrap in fn parse_str_to_time_only_timestamp

* remove unwrap from macro

* improving error handling

* adding tests for invalid inputs

* removed all unwraps from timestamp cast functions

* code format
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement Spark-compatible CAST from String to Timestamp
3 participants