Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-49992][SQL] Default collation resolution for DDL and DML queries #48962

Open
wants to merge 56 commits into
base: master
Choose a base branch
from

Conversation

stefankandic
Copy link
Contributor

@stefankandic stefankandic commented Nov 25, 2024

What changes were proposed in this pull request?

This PR proposes not using session-level collation in DDL commands (create/alter view/table, add/replace columns).

Also, resolution of default collation should happen in the analyzer and not in the parser. However, due to how we are checking for default string type (using reference equals with StringType object) we cannot just replace this object with StringType("UTF8_BINARY") because they compare as equal so the tree node framework will just return the old plan. Because of this we have to perform this resolution twice, once by changing the StringType object into a TemporaryStringType and then back to StringType("UTF8_BINARY") which is not considered a default string type anymore.

Another thing is that the dependent rules ResolveInlineTables and CollationTypeCoercion are updated so that they don't execute if there are still unresolved string types in the plan.

Why are the changes needed?

The default collation for DDL commands should be tied to the object being created or altered (e.g., table, view, schema) rather than the session-level setting. Since object-level collations are not yet supported, we will assume the UTF8_BINARY collation by default for now.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added new unit tests.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the SQL label Nov 25, 2024
@stefankandic stefankandic changed the title [DRAFT] Fix session collation order [SPARK-49992][SQL] Default collation resolution for DDL and DML queries Nov 26, 2024
@stefankandic stefankandic marked this pull request as ready for review November 26, 2024 14:32
val newPlan = if (isDDLCommand(plan)) {
transformDDL(plan)
} else {
val newType = stringTypeForDMLCommand
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not only for DML but also for normal queries, right?

Copy link
Contributor Author

@stefankandic stefankandic Nov 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure what you mean by normal queries; DML includes SELECT, INSERT, UPDATE and DELETE. So essentialy anything this is not create/alter

@@ -43,7 +43,7 @@ import org.apache.spark.sql.types._
case class CreateTable(
tableDesc: CatalogTable,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How can we replace StringType without collation in tableDesc for v1 command?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will replace it in the parsed plan so the CreateTable will have the correct schema of the table

extends ResolveDefaultStringTypes(replaceWithTempType = false) {}

case class TemporaryStringType(override val collationId: Int)
extends StringType(collationId) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this shouldn't extend StringType, otherwise StringType(id) == TemporaryStringType(id) and we may hit the TreeNode transformation issue again.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getTemporaryStringType fn makes sure we get a different collation id. I tried not extending StringType but it causes many issues, since for example we have to create a literal with new type and that would of course fail since it doesn't recognize it

class ResolveDefaultStringTypes(replaceWithTempType: Boolean) extends Rule[LogicalPlan] {
def apply(plan: LogicalPlan): LogicalPlan = {

val newPlan = if (isDDLCommand(plan)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can make the code a bit more unified. How about this

var hasTempStringType = boolean
plan.resolveOperators {
  case AddColumn =>
    var defaultStringType = getDefaultStringType(table) // Always return utf8 for now.
    if (defaultStringType == StringType) {
      hasTempStringType = true
      defaultStringType = TemporaryStringType...
    }
    do the replacement work
  case other DDL commands =>  ...
  case ctas | rtas | create_view | alter_view => ...
  case normal query => ...
}
if (hasTempStringType) {
  the same pattern match but replace TempStringType only.
}

This code is closer to the eventual code, as we need to determine the default collation with the table/schema/catalog metadata, which is not a global thing, must be determined when a certain command is matched.

Copy link
Contributor Author

@stefankandic stefankandic Nov 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really like that approach, even if we didn't have to do the traversal twice I would rather decouple the logic for DDL and DML and not have it all mixed together in one match statement

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants