fix: Only delegate to DataFusion cast when we know that it is compatible with Spark #461

andygrove · 2024-05-23T16:22:51Z

Which issue does this PR close?

N/A

Rationale for this change

We have a catchall block in cast.rs that delegates to DataFusion for any cast that we don't have a specific match arm for. This is dangerous because we can inadvertently delegate casts where DataFusion is not compatible with Spark, which can lead to data corruption and hard-to-debug issues such as the one in #383 (comment).

What changes are included in this PR?

This PR introduces specific checks so that we only delegate to DataFusion for specific casts and changes the catchall to return an error.
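To illustrate the idea, here is a minimal sketch of an allowlist with an erroring catchall. The `DataType` enum is a simplified stand-in for Arrow's type, and `is_delegation_allowed` and `plan_cast` are hypothetical names, not the functions in cast.rs. The two allowlist arms mirror fragments quoted in the review below; the real list is longer.

```rust
// Simplified stand-in for Arrow's DataType (hypothetical, for illustration only).
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Boolean,
    Int8,
    Int16,
    Int32,
    Int64,
    Float32,
    Float64,
    Utf8,
    Binary,
}

/// Returns true only for casts that are explicitly listed as safe to delegate.
/// Anything not listed is treated as unsupported rather than silently delegated.
fn is_delegation_allowed(from: &DataType, to: &DataType) -> bool {
    use DataType::*;
    match from {
        Float32 | Float64 => matches!(
            to,
            Boolean | Int8 | Int16 | Int32 | Int64 | Float32 | Float64
        ),
        // Only valid when the binary data contains valid UTF-8 strings.
        Binary => matches!(to, Utf8),
        _ => false,
    }
}

/// Describes what would happen for a cast: delegate it, or fail loudly.
fn plan_cast(from: &DataType, to: &DataType) -> Result<String, String> {
    if is_delegation_allowed(from, to) {
        Ok(format!("delegate {:?} -> {:?}", from, to))
    } else {
        // The catchall is now an error instead of an implicit delegation.
        Err(format!(
            "Native cast invoked for unsupported cast from {:?} to {:?}",
            from, to
        ))
    }
}
```

With this shape, a cast that nobody has reviewed for compatibility fails with an explicit error instead of producing possibly wrong results.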

I also improved handling of dictionary-encoded string arrays so that they are unpacked early on. This would have prevented the issue in #383 (comment).

How are these changes tested?

Existing tests

andygrove · 2024-05-23T23:36:55Z

core/src/execution/datafusion/expressions/cast.rs

+                // TODO need to add tests to see if we really do support all
+                // timestamp to timestamp conversions


I filed #467 for adding timestamp to timestamp tests

andygrove · 2024-05-24T00:04:03Z

This is ready for review now @viirya @parthchandra @kazuyukitanimura @huaxingao

andygrove · 2024-05-24T00:05:47Z

core/src/execution/datafusion/expressions/cast.rs

+        let array = match &from_type {
+            DataType::Dictionary(key_type, value_type)
+                if key_type.as_ref() == &DataType::Int32
+                    && (value_type.as_ref() == &DataType::Utf8
+                        || value_type.as_ref() == &DataType::LargeUtf8) =>
+            {
+                cast_with_options(&array, value_type.as_ref(), &CAST_OPTIONS)?
+            }
+            _ => array,
+        };


We were previously unpacking dictionary-encoded string arrays only for string to int and string to date. I just moved it earlier on so that we don't have to handle it specifically for certain casts from string
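A rough sketch of why unpacking first simplifies the later cast logic. The `StringArray` enum below is a hypothetical, simplified model of a plain versus dictionary-encoded string array, not Arrow's actual types, and the string-to-int parsing is deliberately naive rather than Spark-compatible.

```rust
/// Hypothetical, simplified model of a string column that is either plain
/// or dictionary-encoded (Int32 keys pointing into a list of values).
#[derive(Debug, Clone, PartialEq)]
enum StringArray {
    Plain(Vec<Option<String>>),
    Dictionary {
        keys: Vec<Option<i32>>,
        values: Vec<String>,
    },
}

/// Unpack a dictionary-encoded array into plain strings.
/// Plain arrays pass through unchanged.
fn unpack_dictionary(array: StringArray) -> Result<Vec<Option<String>>, String> {
    match array {
        StringArray::Plain(strings) => Ok(strings),
        StringArray::Dictionary { keys, values } => keys
            .into_iter()
            .map(|key| match key {
                // A null key stays null after unpacking.
                None => Ok(None),
                Some(k) => {
                    let value = if k >= 0 { values.get(k as usize) } else { None };
                    value
                        .cloned()
                        .map(Some)
                        .ok_or_else(|| format!("dictionary key {} out of bounds", k))
                }
            })
            .collect(),
    }
}

/// Because unpacking happens once, up front, the cast itself only has to
/// handle plain strings. The parsing is a naive placeholder.
fn cast_to_int32(array: StringArray) -> Result<Vec<Option<i32>>, String> {
    let plain = unpack_dictionary(array)?;
    Ok(plain
        .into_iter()
        .map(|s| s.and_then(|s| s.trim().parse::<i32>().ok()))
        .collect())
}
```

Any other cast from string could reuse `unpack_dictionary` the same way, which is the point of moving the step earlier.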

kazuyukitanimura

A few questions

kazuyukitanimura · 2024-05-24T18:27:20Z

core/src/execution/datafusion/expressions/cast.rs

+                    | DataType::Int64
+                    | DataType::Float32
+                    | DataType::Float64
+                    | DataType::Utf8


So right now, it looks like an Int8 to Decimal128 cast is not supported?

The current code says that datafusion is compatible with Spark for all int types -> decimal:

    DataType::Int8 | DataType::Int16 | DataType::Int32 | DataType::Int64 => matches!(
        to_type,
        DataType::Boolean ... | DataType::Decimal128(_, _)

However, this is actually not correct since DataFusion does not have overflow checks for int32 and int64 -> decimal and is not compatible with Spark. I will look at removing those.

Removing that case causes a test failure:

- scalar subquery *** FAILED *** (8 seconds, 253 milliseconds)
  Cause: java.util.concurrent.ExecutionException: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 410.0 failed 1 times, most recent failure: Lost task 0.0 in stage 410.0 (TID 1286) (192.168.64.23 executor driver):
  org.apache.comet.CometNativeException: Execution error: Comet Internal Error: Native cast invoked for unsupported cast from Int32 to Decimal128(38, 10)

This test relies on a cast that we do not yet support and enables COMET_CAST_ALLOW_INCOMPATIBLE to allow it. I will revert the last change and add a comment about this
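For context on the overflow check discussed in this thread, here is a minimal sketch of what a checked int -> decimal conversion could look like. `int_to_decimal` is a hypothetical helper, not code from this PR. It returns `None` on overflow, which corresponds to Spark returning NULL in non-ANSI mode; ANSI mode would raise an error instead.

```rust
/// Convert an integer to the unscaled representation of a decimal with the
/// given precision and scale (hypothetical helper, for illustration only).
/// Returns None if the value does not fit in `precision` digits.
fn int_to_decimal(value: i64, precision: u32, scale: u32) -> Option<i128> {
    // Decimal128 supports at most 38 digits; 10^38 still fits in an i128.
    if precision > 38 || scale > precision {
        return None;
    }
    // Shift the integer left by `scale` decimal digits, checking for overflow.
    let unscaled = (value as i128).checked_mul(10i128.checked_pow(scale)?)?;
    // The unscaled value must have at most `precision` digits.
    let bound = 10i128.checked_pow(precision)?;
    if unscaled > -bound && unscaled < bound {
        Some(unscaled)
    } else {
        None
    }
}
```

For example, 123 fits in a decimal(5, 2) as the unscaled value 12300, while 1234 would need six digits and is rejected. A cast without this check would keep the out-of-range value.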

kazuyukitanimura · 2024-05-24T18:33:37Z

core/src/execution/datafusion/expressions/cast.rs

+            DataType::Float32 | DataType::Float64 => matches!(
+                to_type,
+                DataType::Boolean
+                    | DataType::Int8
+                    | DataType::Int16
+                    | DataType::Int32
+                    | DataType::Int64
+                    | DataType::Float32
+                    | DataType::Float64


For Float32/64 to Int8/16/32/64, I saw spark_cast_nonintegral_numeric_to_integral covers them above.
Is this for the case self.eval_mode == EvalMode::Try?

Yes, that is correct.
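To make the eval mode distinction concrete, here is a simplified sketch of how a Float64 to Int32 cast could behave in each mode. `cast_f64_to_i32` is a hypothetical function, and the semantics are a simplification based on my understanding of Spark (legacy follows Java's saturating cast, ANSI errors on overflow, try returns null); it is not the code in cast.rs.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum EvalMode {
    Legacy,
    Ansi,
    Try,
}

/// Cast a single f64 to i32 under the given eval mode (simplified sketch).
/// Ok(None) represents a NULL result.
fn cast_f64_to_i32(value: f64, mode: EvalMode) -> Result<Option<i32>, String> {
    // Truncate toward zero first, then check the range. NaN fails both
    // comparisons, so it counts as out of range.
    let truncated = value.trunc();
    let in_range = truncated >= i32::MIN as f64 && truncated <= i32::MAX as f64;
    match mode {
        // Rust's `as` saturates and maps NaN to 0, like Java's (int) cast.
        EvalMode::Legacy => Ok(Some(value as i32)),
        EvalMode::Ansi if !in_range => {
            Err(format!("value {} cannot be cast to Int32 due to overflow", value))
        }
        EvalMode::Try if !in_range => Ok(None),
        // In range: ANSI and try both truncate toward zero.
        EvalMode::Ansi | EvalMode::Try => Ok(Some(truncated as i32)),
    }
}
```

Only the out-of-range and NaN inputs distinguish the three modes; for in-range values they all truncate toward zero.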

kazuyukitanimura · 2024-05-24T18:37:04Z

core/src/execution/datafusion/expressions/cast.rs

+                // DataFusion only supports binary data containing valid UTF-8 strings
+                matches!(to_type, DataType::Utf8)
+            }
+            _ => false,


Are casts to a narrower type, such as Int64 to Int32, not supported when self.eval_mode == EvalMode::Try?

…ble with Spark (apache#461)

* only delegate to DataFusion cast when we know that it is compatible with Spark
* add more supported casts
* improve support for dictionary-encoded string arrays
* clippy
* fix merge conflict
* fix a regression
* fix a regression
* fix a regression
* fix regression
* fix regression
* fix regression
* remove TODO comment now that issue has been filed
* remove cast int32/int64 -> decimal from datafusion compatible list
* Revert "remove cast int32/int64 -> decimal from datafusion compatible list" This reverts commit 340e000.
* add comment

(cherry picked from commit 79431f8)

andygrove added 3 commits May 23, 2024 09:54

only delegate to DataFusion cast when we know that it is compatible with Spark

3a1c387

add more supported casts

60354bd

improve support for dictionary-encoded string arrays

35585aa

andygrove mentioned this pull request May 23, 2024

feat: Implement Spark-compatible CAST from String to Date #383

Merged

andygrove added 4 commits May 23, 2024 10:49

clippy

dc4f99c

Merge remote-tracking branch 'apache/main' into cast-df-compat

f8a6f94

fix merge conflict

2b21917

fix a regression

c2c2546

viirya reviewed May 23, 2024

core/src/execution/datafusion/expressions/cast.rs

andygrove added 5 commits May 23, 2024 15:24

fix a regression

7f1951a

fix a regression

c0f913b

fix regression

34edf00

fix regression

54043ae

fix regression

2f22e06

andygrove mentioned this pull request May 23, 2024

Add tests for casting between timestamp types #467

Open

andygrove commented May 23, 2024

andygrove commented May 24, 2024

remove TODO comment now that issue has been filed

bc18fae

kazuyukitanimura approved these changes May 24, 2024

viirya approved these changes May 24, 2024

andygrove added 3 commits May 25, 2024 06:53

remove cast int32/int64 -> decimal from datafusion compatible list

340e000

Revert "remove cast int32/int64 -> decimal from datafusion compatible list"

7d804f5

This reverts commit 340e000.

add comment

3fb2258

andygrove merged commit 79431f8 into apache:main May 25, 2024
40 checks passed

andygrove deleted the cast-df-compat branch May 25, 2024 18:48
