
[WIP][SPARK-46934][SQL][FOLLOWUP] Read/write roundtrip for struct type with special characters with HMS - a backward compatible approach #48986

Open · wants to merge 2 commits into master
Conversation

@yaooqinn (Member) commented Nov 27, 2024

What changes were proposed in this pull request?

A backward-compatible approach to #45039 that lets older versions of Spark correctly read struct-typed columns with special characters in field names created by Spark 4.x or later.

Compared with #45039, only data source tables are supported now, because we already have a dedicated mechanism for storing Hive-incompatible schemas in the table properties. Dropping support for Hive tables is a safe removal because no released version supports them.
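
As background, a hedged sketch (not part of this PR) of why that table-property mechanism sidesteps the problem: data source tables persist the Spark schema as JSON, and JSON keeps field names verbatim, so special characters survive the metastore round trip even when the Hive column type string cannot represent them.

import org.apache.spark.sql.types._

// The schema from the manual test below, expressed as a Spark StructType.
val schema = StructType(Seq(
  StructField("a.b", StructType(Seq(
    StructField("a.b.b", ArrayType(StringType)),
    StructField("a b c", MapType(IntegerType, StringType))
  )))
))

// StructType.json preserves field names verbatim, dots and spaces included.
println(schema.json)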

Why are the changes needed?

Backward-compatibility improvement.

Does this PR introduce any user-facing change?

Yes. Users can store and read struct-typed columns whose field names contain special characters.
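
A minimal sketch of the enabled round trip, assuming an active SparkSession named spark (the table name is illustrative):

// Create a table whose struct field names contain dots and spaces, then read
// it back; with this change the schema survives the Hive Metastore round trip.
spark.sql("""
  CREATE TABLE t AS
  SELECT named_struct('a.b.b', array('a'), 'a b c', map(1, 'a')) AS `a.b`
""")
spark.table("t").printSchema()
spark.table("t").show(truncate = false)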

How was this patch tested?

Tests provided by SPARK-22431:

DDLSuite.scala: test("SPARK-22431: table with nested type col with special char")
DDLSuite.scala: test("SPARK-22431: view with nested type")
HiveDDLSuite.scala: test("SPARK-22431: table with nested type")
HiveDDLSuite.scala: test("SPARK-22431: view with nested type")
HiveDDLSuite.scala: test("SPARK-22431: alter table tests with nested types")

Tests provided by the previous PR for SPARK-46934:

HiveMetastoreCatalogSuite.scala: test("SPARK-46934: HMS columns cannot handle quoted columns")
HiveMetastoreCatalogSuite.scala: test("SPARK-46934: Handle special characters in struct types")
HiveMetastoreCatalogSuite.scala: test("SPARK-46934: Handle special characters in struct types with CTAS")
HiveMetastoreCatalogSuite.scala: test("SPARK-46934: Handle special characters in struct types with hive DDL")
HiveDDLSuite.scala: test("SPARK-46934: quote element name before parsing struct")
HiveDDLSuite.scala: test("SPARK-46934: alter table tests with nested types")

Manual backward-compatibility test:

  1. Create a tarball from the current revision.
  2. cd dist
  3. Use spark-sql to mock data:
spark-sql (default)> CREATE TABLE t AS SELECT named_struct('a.b.b', array('a'), 'a b c', map(1, 'a')) AS `a.b`;
  4. Copy the metastore metadata to a 3.5.3 release:
cp -r ~/spark/dist/metastore_db .
  5. Fix the Derby version mismatch:
rm jars/derby-10.14.2.0.jar
cp -r ~/spark/dist/jars/derby-10.16.1.1.jar ./jars
  6. Read the data:
spark-sql (default)> select version();
3.5.3 32232e9ed33bb16b93ad58cfde8b82e0f07c0970
Time taken: 0.103 seconds, Fetched 1 row(s)
spark-sql (default)> select * from t;
{"a.b.b":["a"],"a b c":{1:"a"}}
Time taken: 0.09 seconds, Fetched 1 row(s)
spark-sql (default)> desc formatted t;
a.b                 	struct<a.b.b:array<string>,a b c:map<int,string>>

# Detailed Table Information
Catalog             	spark_catalog
Database            	default
Table               	t
Owner               	hzyaoqin
Created Time        	Wed Nov 27 17:40:53 CST 2024
Last Access         	UNKNOWN
Created By          	Spark 4.0.0-SNAPSHOT
Type                	MANAGED
Provider            	parquet
Statistics          	1245 bytes
Location            	file:/Users/hzyaoqin/spark/dist/spark-warehouse/t
Serde Library       	org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat         	org.apache.hadoop.mapred.SequenceFileInputFormat
OutputFormat        	org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
Time taken: 0.054 seconds, Fetched 17 row(s)

Was this patch authored or co-authored using generative AI tooling?

No

github-actions bot added the SQL label Nov 27, 2024
@@ -1130,10 +1133,6 @@ private[hive] object HiveClientImpl extends Logging {
Option(hc.getComment).map(field.withComment).getOrElse(field)
}

private def verifyColumnDataType(schema: StructType): Unit = {
yaooqinn (Member, Author) commented:
This is a misleading step on the write path: the schema here is both produced and verified by Spark itself, yet it reports a CANNOT_RECOGNIZE_HIVE_TYPE error back to us.

yaooqinn (Member, Author) commented:

After this change, since names are quoted, this check always passes.
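
A hedged illustration (not from the PR) of the before/after behavior, assuming the removed verification boils down to CatalystSqlParser.parseDataType on the Hive type string:

import org.apache.spark.sql.catalyst.parser.CatalystSqlParser

// With quoting, the catalog type string parses back successfully:
CatalystSqlParser.parseDataType("struct<`a.b.b`:array<string>>")
// Without quoting, the same field name is unparseable and would throw:
// CatalystSqlParser.parseDataType("struct<a.b.b:array<string>>")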

@@ -433,7 +433,8 @@ case class StructType(fields: Array[StructField]) extends DataType with Seq[Stru
stringConcat.append("struct<")
var i = 0
while (i < len) {
stringConcat.append(s"${fields(i).name}:${fields(i).dataType.catalogString}")
val name = QuotingUtils.quoteIfNeeded(fields(i).name)
stringConcat.append(s"$name:${fields(i).dataType.catalogString}")
yaooqinn (Member, Author) commented:

cc @zsxwing, this also seems to affect the output schema of queries.
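
For illustration (not from the thread), a minimal sketch of the difference, assuming QuotingUtils.quoteIfNeeded backtick-quotes any name that is not a simple identifier:

import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("a.b.b", ArrayType(StringType)),
  StructField("a b c", MapType(IntegerType, StringType))
))

// Before this change: struct<a.b.b:array<string>,a b c:map<int,string>>
// After this change:  struct<`a.b.b`:array<string>,`a b c`:map<int,string>>
println(schema.catalogString)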

yaooqinn (Member, Author) commented:

Also cc @cloud-fan. If this is the right option to go with, shall we create a variant of the catalogString method to minimize the impact?
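
One hypothetical shape for such a variant (the name quotedCatalogString is invented here, not code from this PR): leave catalogString untouched and add an opt-in quoted form, so existing callers see byte-identical output. This sketch only quotes the top-level field names:

import org.apache.spark.sql.catalyst.util.QuotingUtils
import org.apache.spark.sql.types.StructType

// Opt-in variant: callers that need round-trippable field names use this,
// while the default catalogString output stays unchanged.
def quotedCatalogString(st: StructType): String =
  st.fields
    .map(f => s"${QuotingUtils.quoteIfNeeded(f.name)}:${f.dataType.catalogString}")
    .mkString("struct<", ",", ">")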

yaooqinn (Member, Author) commented:

FYI, the previous discussion thread can be found at #45039 (comment).
