[SPARK-49992][SQL] Default collation resolution for DDL and DML queries #48962
Conversation
Force-pushed from 97aa72d to 70abc52.
Review threads marked outdated/resolved on:
sql/api/src/main/scala/org/apache/spark/sql/types/StringType.scala
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCoercion.scala
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/v2Commands.scala
...atalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/ResolveInlineTablesSuite.scala
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/DDLParserSuite.scala
sql/core/src/test/scala/org/apache/spark/sql/CollationSQLRegexpSuite.scala
val newPlan = if (isDDLCommand(plan)) {
  transformDDL(plan)
} else {
  val newType = stringTypeForDMLCommand
This is not only for DML but also for normal queries, right?
Not sure what you mean by normal queries; DML includes SELECT, INSERT, UPDATE and DELETE, so essentially anything that is not create/alter.
@@ -43,7 +43,7 @@ import org.apache.spark.sql.types._
case class CreateTable(
    tableDesc: CatalogTable,
How can we replace StringType without collation in tableDesc for a v1 command?
We will replace it in the parsed plan, so the CreateTable will have the correct schema of the table.
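As an illustration of what that parse-time replacement could look like for a v1 schema, here is a minimal sketch. The object and helper names are hypothetical (not the PR's actual code); only the recursion over nested types is meant to convey the idea:

```scala
import org.apache.spark.sql.types._

// Hypothetical sketch: rewrite every "default" string type in a table schema to an
// explicit UTF8_BINARY string type. For a v1 CreateTable the schema lives in
// tableDesc (a CatalogTable), so the rewritten StructType would be copied back into
// the parsed plan.
object DefaultStringTypeSchemaRewrite {

  // The parser emits the plain StringType object for STRING columns without an
  // explicit COLLATE clause, hence the reference-equality check.
  private def isDefault(dt: DataType): Boolean = dt eq StringType

  def replace(dt: DataType): DataType = dt match {
    case st if isDefault(st) => StringType("UTF8_BINARY")
    case StructType(fields) =>
      StructType(fields.map(f => f.copy(dataType = replace(f.dataType))))
    case ArrayType(et, containsNull) => ArrayType(replace(et), containsNull)
    case MapType(kt, vt, valueContainsNull) =>
      MapType(replace(kt), replace(vt), valueContainsNull)
    case other => other
  }

  def replaceSchema(schema: StructType): StructType =
    StructType(schema.fields.map(f => f.copy(dataType = replace(f.dataType))))
}
```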
  extends ResolveDefaultStringTypes(replaceWithTempType = false) {}

case class TemporaryStringType(override val collationId: Int)
  extends StringType(collationId) {
this shouldn't extend StringType, otherwise StringType(id) == TemporaryStringType(id) and we may hit the TreeNode transformation issue again.
The getTemporaryStringType fn makes sure we get a different collation id. I tried not extending StringType, but it causes many issues; for example, we have to create a literal with the new type, and that would of course fail since the new type isn't recognized.
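To make the concern in this exchange concrete, here is a self-contained toy model (not the actual Spark classes): equality based solely on the collation id makes a subtype instance with the same id compare equal to the base type, so a tree transformation that skips "equal" replacements would be a no-op, which is why a helper like getTemporaryStringType would hand back a different collation id.

```scala
// Toy model only, mimicking collation-id-based equality.
class MyStringType(val collationId: Int) {
  override def equals(other: Any): Boolean = other match {
    case s: MyStringType => s.collationId == collationId
    case _               => false
  }
  override def hashCode(): Int = collationId
}

class MyTemporaryStringType(collationId: Int) extends MyStringType(collationId)

object EqualityPitfall extends App {
  // true: a replacement with the same id looks like a no-op to an equality-skipping transform
  println(new MyStringType(0) == new MyTemporaryStringType(0))
  // false: a different id makes the change visible
  println(new MyStringType(0) == new MyTemporaryStringType(1))
}
```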
class ResolveDefaultStringTypes(replaceWithTempType: Boolean) extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = {

    val newPlan = if (isDDLCommand(plan)) {
I think we can make the code a bit more unified. How about this:

    var hasTempStringType = false
    plan.resolveOperators {
      case AddColumn =>
        var defaultStringType = getDefaultStringType(table) // always returns UTF8_BINARY for now
        if (defaultStringType == StringType) {
          hasTempStringType = true
          defaultStringType = TemporaryStringType...
        }
        // do the replacement work
      case other DDL commands => ...
      case ctas | rtas | create_view | alter_view => ...
      case normal query => ...
    }
    if (hasTempStringType) {
      // the same pattern match, but replace TempStringType only
    }

This code is closer to the eventual code, as we need to determine the default collation from the table/schema/catalog metadata, which is not a global thing and must be determined when a certain command is matched.
I don't really like that approach; even if we didn't have to do the traversal twice, I would rather decouple the logic for DDL and DML and not have it all mixed together in one match statement.
What changes were proposed in this pull request?
This PR proposes not using the session-level collation in DDL commands (create/alter view/table, add/replace columns).
Also, resolution of the default collation should happen in the analyzer and not in the parser. However, because we check for the default string type using reference equality with the StringType object, we cannot simply replace that object with StringType("UTF8_BINARY"): the two compare as equal, so the tree node framework would just return the old plan. Because of this we have to perform the resolution in two steps, first changing the StringType object into a TemporaryStringType and then back to StringType("UTF8_BINARY"), which is no longer considered a default string type.
Another change is that the dependent rules ResolveInlineTables and CollationTypeCoercion are updated so that they don't execute if there are still unresolved string types in the plan.
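To make the equality issue concrete, here is a tiny self-contained check. It is a sketch that assumes the collation-aware StringType on this branch, where UTF8_BINARY is the collation behind the plain StringType object; UTF8_LCASE is used only as an example of a non-default collation:

```scala
import org.apache.spark.sql.types.StringType

// Why replacing StringType directly with StringType("UTF8_BINARY") is not visible to
// the tree node framework: the two values compare as equal, so the old plan is kept.
object WhyATemporaryTypeIsNeeded extends App {
  println(StringType == StringType("UTF8_BINARY")) // true  -> direct replacement is skipped
  println(StringType == StringType("UTF8_LCASE"))  // false -> a different collation id does change the tree
}
```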
Why are the changes needed?
The default collation for DDL commands should be tied to the object being created or altered (e.g., table, view, schema) rather than the session-level setting. Since object-level collations are not yet supported, we will assume the UTF8_BINARY collation by default for now.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Added new unit tests.
Was this patch authored or co-authored using generative AI tooling?
No.