
[WIP][SPARK-46934][SQL][FOLLOWUP] Read/write roundtrip for struct type with special characters with HMS - a backward compatible approach #48986

Open · wants to merge 2 commits into master
Conversation

@yaooqinn (Member) commented Nov 27, 2024

What changes were proposed in this pull request?

A backward-compatible approach to #45039 that lets older versions of Spark correctly read struct-typed columns with special characters in field names created by Spark 4.x or later.

Compared with #45039, only data source tables are supported now, because we already have a dedicated mechanism for storing Hive-incompatible schemas in the table properties. Dropping support for Hive tables is a safe removal because no released version supports them.
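
As background, a hedged sketch (not part of this PR) of why that table-property mechanism sidesteps the problem: data source tables persist the Spark schema as JSON, and JSON keeps field names verbatim, so special characters survive the metastore round trip even when the Hive column type string cannot represent them.

import org.apache.spark.sql.types._

// The schema from the manual test below, expressed as a Spark StructType.
val schema = StructType(Seq(
  StructField("a.b", StructType(Seq(
    StructField("a.b.b", ArrayType(StringType)),
    StructField("a b c", MapType(IntegerType, StringType))
  )))
))

// StructType.json preserves field names verbatim, dots and spaces included.
println(schema.json)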

Why are the changes needed?

Backward-compatibility improvement.

Does this PR introduce any user-facing change?

Yes. Users can store and read struct-typed columns whose field names contain special characters.
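
A minimal sketch of the enabled round trip, assuming an active SparkSession named spark (the table name is illustrative):

// Create a table whose struct field names contain dots and spaces, then read
// it back; with this change the schema survives the Hive Metastore round trip.
spark.sql("""
  CREATE TABLE t AS
  SELECT named_struct('a.b.b', array('a'), 'a b c', map(1, 'a')) AS `a.b`
""")
spark.table("t").printSchema()
spark.table("t").show(truncate = false)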

How was this patch tested?

Tests provided by SPARK-22431:

DDLSuite.scala: test("SPARK-22431: table with nested type col with special char")
DDLSuite.scala: test("SPARK-22431: view with nested type")
HiveDDLSuite.scala: test("SPARK-22431: table with nested type")
HiveDDLSuite.scala: test("SPARK-22431: view with nested type")
HiveDDLSuite.scala: test("SPARK-22431: alter table tests with nested types")

Tests provided by the previous PR for SPARK-46934:

HiveMetastoreCatalogSuite.scala: test("SPARK-46934: HMS columns cannot handle quoted columns")
HiveMetastoreCatalogSuite.scala: test("SPARK-46934: Handle special characters in struct types")
HiveMetastoreCatalogSuite.scala: test("SPARK-46934: Handle special characters in struct types with CTAS")
HiveMetastoreCatalogSuite.scala: test("SPARK-46934: Handle special characters in struct types with hive DDL")
HiveDDLSuite.scala: test("SPARK-46934: quote element name before parsing struct")
HiveDDLSuite.scala: test("SPARK-46934: alter table tests with nested types")

Manual backward-compatibility test:

  1. Create a tarball from the current revision.
  2. cd dist
  3. Use spark-sql to mock data:
spark-sql (default)> CREATE TABLE t AS SELECT named_struct('a.b.b', array('a'), 'a b c', map(1, 'a')) AS `a.b`;
  4. Copy the metastore metadata to a 3.5.3 release:
cp -r ~/spark/dist/metastore_db .
  5. Fix the Derby version mismatch:
rm jars/derby-10.14.2.0.jar
cp -r ~/spark/dist/jars/derby-10.16.1.1.jar ./jars
  6. Read the data:
spark-sql (default)> select version();
3.5.3 32232e9ed33bb16b93ad58cfde8b82e0f07c0970
Time taken: 0.103 seconds, Fetched 1 row(s)
spark-sql (default)> select * from t;
{"a.b.b":["a"],"a b c":{1:"a"}}
Time taken: 0.09 seconds, Fetched 1 row(s)
spark-sql (default)> desc formatted t;
a.b                 	struct<a.b.b:array<string>,a b c:map<int,string>>

# Detailed Table Information
Catalog             	spark_catalog
Database            	default
Table               	t
Owner               	hzyaoqin
Created Time        	Wed Nov 27 17:40:53 CST 2024
Last Access         	UNKNOWN
Created By          	Spark 4.0.0-SNAPSHOT
Type                	MANAGED
Provider            	parquet
Statistics          	1245 bytes
Location            	file:/Users/hzyaoqin/spark/dist/spark-warehouse/t
Serde Library       	org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat         	org.apache.hadoop.mapred.SequenceFileInputFormat
OutputFormat        	org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
Time taken: 0.054 seconds, Fetched 17 row(s)

Was this patch authored or co-authored using generative AI tooling?

No

github-actions bot added the SQL label Nov 27, 2024
@@ -1130,10 +1133,6 @@ private[hive] object HiveClientImpl extends Logging {
Option(hc.getComment).map(field.withComment).getOrElse(field)
}

private def verifyColumnDataType(schema: StructType): Unit = {
yaooqinn (Member, Author) commented:
This is a misleading step on the write path: the schema here is both produced and verified by Spark itself, yet it reports a CANNOT_RECOGNIZE_HIVE_TYPE error back to us.

yaooqinn (Member, Author) commented:

After this change, since names are quoted, this check always passes.
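
A hedged illustration (not from the PR) of the before/after behavior, assuming the removed verification boils down to CatalystSqlParser.parseDataType on the Hive type string:

import org.apache.spark.sql.catalyst.parser.CatalystSqlParser

// With quoting, the catalog type string parses back successfully:
CatalystSqlParser.parseDataType("struct<`a.b.b`:array<string>>")
// Without quoting, the same field name is unparseable and would throw:
// CatalystSqlParser.parseDataType("struct<a.b.b:array<string>>")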

@@ -433,7 +433,8 @@ case class StructType(fields: Array[StructField]) extends DataType with Seq[Stru
stringConcat.append("struct<")
var i = 0
while (i < len) {
stringConcat.append(s"${fields(i).name}:${fields(i).dataType.catalogString}")
val name = QuotingUtils.quoteIfNeeded(fields(i).name)
stringConcat.append(s"$name:${fields(i).dataType.catalogString}")
yaooqinn (Member, Author) commented:

cc @zsxwing, this also seems to affect the output schema of queries.
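
For illustration (not from the thread), a minimal sketch of the difference, assuming QuotingUtils.quoteIfNeeded backtick-quotes any name that is not a simple identifier:

import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("a.b.b", ArrayType(StringType)),
  StructField("a b c", MapType(IntegerType, StringType))
))

// Before this change: struct<a.b.b:array<string>,a b c:map<int,string>>
// After this change:  struct<`a.b.b`:array<string>,`a b c`:map<int,string>>
println(schema.catalogString)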

yaooqinn (Member, Author) commented:

Also cc @cloud-fan. If this is the right option to go with, shall we create a variant of the catalogString method to minimize the impact?
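
One hypothetical shape for such a variant (the name quotedCatalogString is invented here, not code from this PR): leave catalogString untouched and add an opt-in quoted form, so existing callers see byte-identical output. This sketch only quotes the top-level field names:

import org.apache.spark.sql.catalyst.util.QuotingUtils
import org.apache.spark.sql.types.StructType

// Opt-in variant: callers that need round-trippable field names use this,
// while the default catalogString output stays unchanged.
def quotedCatalogString(st: StructType): String =
  st.fields
    .map(f => s"${QuotingUtils.quoteIfNeeded(f.name)}:${f.dataType.catalogString}")
    .mkString("struct<", ",", ">")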

yaooqinn (Member, Author) commented:

FYI, the previous discussion thread can be found at #45039 (comment).
