Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reject CREATE TABLE/VIEW with duplicate column names #13517

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

findepi
Copy link
Member

@findepi findepi commented Nov 21, 2024

Reject CREATE TABLE/VIEW with duplicate column names

DFSchema checks column for uniqueness, but allows duplicate column names
when they are qualified differently. This is because DFSchema plays
central role during query planning as a identifier resolution scope.

Those checks in their current form should not be there, since they
prevent execution of queries with duplicate column aliases, which is
legal in SQL. But even with these checks present, they are not
sufficient to ensure CREATE TABLE/VIEW is well structured. Table or
view columns need to have unique names and there is no qualification
involved.

This commit adds necessary checks in CREATE TABLE/VIEW DDL structs,
ensuring that CREATE TABLE/VIEW logical plans are valid in that regard.

This PR includes

@github-actions github-actions bot added sql SQL Planner logical-expr Logical plan and expressions core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) common Related to common crate proto Related to proto crate labels Nov 21, 2024
/// Whether the table is an infinite streams
pub unbounded: bool,
/// Table(provider) specific options
pub options: HashMap<String, String>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we can add comments what is a key and what is the value in the hashmaps?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea. I don't know what these are yet. This is copied from CreateExternalTable. This whole struct allows public full destruction of CreateExternalTable, without allowing public construction of that struct, so that validation can take place. Would it help if i removed all these doc comments, so that documentation is checked at CreateExternalTable only?

]))
)?)
.unwrap_err().strip_backtrace().to_string(),
"Schema error: Schema contains qualified fields with duplicate unqualified names t1.c1"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 I feel lower case col names are not easy to read in the error message. Will tweak the schema_err! output

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can wrap them in brackets [...] to clearly separate from surrounding text, while making it kind of obvious we're not bothering to produce proper quoted identifiers.

DFSchema checks column for uniqueness, but allows duplicate column names
when they are qualified differently. This is because DFSchema plays
central role during query planning as a identifier resolution scope.

Those checks in their current form should not be there, since they
prevent execution of queries with duplicate column aliases, which is
legal in SQL. But even with these checks present, they are not
sufficient to ensure CREATE TABLE/VIEW is well structured. Table or
view columns need to have unique names and there is no qualification
involved.

This commit adds necessary checks in CREATE TABLE/VIEW DDL structs,
ensuring that CREATE TABLE/VIEW logical plans are valid in that regard.
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @findepi and @comphead -- I think this is a very nice improvement

I am not sure about the *Fields pattern. Let me know what you think

constraints: Constraints::empty(),
column_defaults: Default::default(),
},
&CreateExternalTable::builder()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😍

pub temporary: bool,
}

impl CreateMemoryTable {
pub fn new(fields: CreateMemoryTableFields) -> Result<Self> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to call this try_new() to signal it can return an error

/// Creates an in memory table.
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Hash)]
#[non_exhaustive]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why add this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To disallow public direct construction. If we allow public direct construction, we can't enforce invariants in the factory method, since it may or may not get called.

@@ -192,11 +193,12 @@ impl DdlStatement {

/// Creates an external table.
#[derive(Debug, Clone, PartialEq, Eq)]
#[non_exhaustive]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why make it non exhaustive? That might make it harder to write match statements where all the fields are matched (and the compiler tells you when new ones are added)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why make it non exhaustive?

To disallow public direct construction. If we allow public direct construction, we can't enforce invariants in the factory method, since it may or may not get called.

That might make it harder to write match statements where all the fields are matched (and the compiler tells you when new ones are added)

💯 and i think this is important use-case. that's why those new into_fields() are added

@@ -288,8 +290,234 @@ impl PartialOrd for CreateExternalTable {
}
}

impl CreateExternalTable {
pub fn new(fields: CreateExternalTableFields) -> Result<Self> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given all the fields in CreateExternalTable are pub anyway, I don't think introducing CreateExternalTableFields requires all construction to go through new 🤔

Therefore I would personally suggest removing the CreateExternalTableFields (and equivalent for MemoryTable) and staying with the struct and the builder.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given all the fields in CreateExternalTable are pub anyway, I don't think introducing CreateExternalTableFields requires all construction to go through new 🤔

i think it's disallowed by non_exhaustive.
in fact, i did revert this PR changes in datafusion/core/src/catalog_common/listing_schema.rs and cargo check failed with

error[E0639]: cannot create non-exhaustive struct using struct expression
   --> datafusion/core/src/catalog_common/listing_schema.rs:134:26
    |
134 |                           &CreateExternalTable {
    |  __________________________^
135 | |                             schema: Arc::new(DFSchema::empty()),
136 | |                             name,
137 | |                             location: table_url,
...   |
147 | |                             column_defaults: Default::default(),
148 | |                         },
    | |_________________________^

For more information about this error, try `rustc --explain E0639`.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see -- that was non obvious.

Given that what do you think then about making all the fields not pub instead? I think that would be less confusing / more obvious.

Also I do think it is worth considering if we could make a smaller / less breaking change here -- I understand the desire to ensure only semantically correct CreateExternalTable structures can be created, but in my opinion the extra cognative overhead of the *Field structures makes it significantly harder to work with this code.

What if we

  1. Kept the CreateExternalTableBuilder (to generate errors during construction)
  2. Added another check somewhere when it was processed -- for example in that there were no duplicate columns in the schema

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that what do you think then about making all the fields not pub instead? I think that would be less confusing / more obvious.

This would be more breaking.
And we would need to add accessor for all these fields, for cases when someone wants to inspect some attributes without deconstructing

Added another check somewhere when it was processed -- for example in
datafusion/datafusion/core/src/datasource/listing_table_factory.rs

That could work, but where could this check be?
The link is currently in core byt the listing table is something we would want to move out.
And this should be the core's guarantee, regardless which table provider is used.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
common Related to common crate core Core DataFusion crate logical-expr Logical plan and expressions proto Related to proto crate sql SQL Planner sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CREATE TABLE succeeds when schema has duplicate names, resulting in a table that cannot be selected from
3 participants