Support duplicate column aliases in queries #13489

Draft · wants to merge 4 commits into main from findepi/support-duplicate-column-aliases-0bf339

Conversation

@findepi (Member) commented Nov 19, 2024

Which issue does this PR close?

Rationale for this change

In SQL, selecting a single column multiple times is legal, and most modern databases support it. This commit adds such support to DataFusion as well.
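For illustration, a minimal sketch of the newly legal behavior through DataFusion's public API (assumes the tokio runtime; output formatting varies by version):

use datafusion::error::Result;
use datafusion::prelude::SessionContext;

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();
    // Selecting the same alias twice: legal in most databases,
    // rejected by DataFusion before this change.
    let df = ctx.sql("SELECT 1 AS a, 2 AS a").await?;
    df.show().await?;
    Ok(())
}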

What changes are included in this PR?

Are these changes tested?

Yes.

Are there any user-facing changes?

Yes, more valid queries are now supported.

@github-actions bot added labels sql (SQL Planner), logical-expr (Logical plan and expressions), optimizer (Optimizer rules), core (Core DataFusion crate), sqllogictest (SQL Logic Tests (.slt)), common (Related to common crate), proto (Related to proto crate) on Nov 19, 2024
@findepi force-pushed the findepi/support-duplicate-column-aliases-0bf339 branch 2 times, most recently from 7d63cee to bd2307e, on November 19, 2024 14:03
@jonahgao (Member) commented:

allow creation of schemas for duplicated names

I think it's a good idea, as it allows the schema of the top plan to include duplicate names, thereby resolving #6543.

We can delay the name-ambiguity check until a real column reference occurs. But currently, this check does not seem sufficient. For example:

DataFusion CLI v43.0.0
> select t.a from (select 1 as a, 2 as a) t;
+---+
| a |
+---+
| 1 |
+---+
1 row(s) fetched.

This query did not return an error, as it does in PostgreSQL and as it did before this change. Perhaps we should improve the ambiguity check used when searching for field names in schemas, now that check_names has been removed.
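For illustration, a self-contained sketch (not DataFusion's actual resolution code) of the kind of check being suggested: resolving a name against a schema should fail when more than one field matches, rather than silently picking the first:

fn resolve_field(fields: &[&str], name: &str) -> Result<usize, String> {
    let mut found: Option<usize> = None;
    for (i, field) in fields.iter().enumerate() {
        if *field == name {
            if found.is_some() {
                return Err(format!("column reference \"{name}\" is ambiguous"));
            }
            found = Some(i);
        }
    }
    found.ok_or_else(|| format!("column \"{name}\" not found"))
}

fn main() {
    // `t.a` in `select t.a from (select 1 as a, 2 as a) t` should hit
    // the ambiguity branch instead of silently resolving to the first `a`:
    assert!(resolve_field(&["a", "a"], "a").is_err());
    assert_eq!(resolve_field(&["a", "b"], "b"), Ok(1));
}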

/// A struct with same fields as [`CreateExternalTable`] struct so that the DDL can be conveniently
/// destructed with validation that each field is handled, while still requiring that all
/// construction goes through the [`CreateExternalTable::new`] constructor or the builder.
pub struct CreateExternalTableFields {
Review comment (Member):
I think non_exhaustive discourages destructuring, but CreateExternalTableFields makes it possible again. CreateExternalTableFields and CreateExternalTable have the same fields, and I'm a bit worried that it introduces some code duplication 🤔.

@findepi (Member Author) replied:

Yes, it does introduce code duplication, and so does the builder.
When handling Create Table/View, destructuring without .. is valuable, as it guarantees that any new field will force the code to be revisited (rather than being silently ignored).

However, in Rust, destructuring without .. is possible only where direct construction is possible, and direct construction being possible precludes construction-time checks, which is undesirable.

Alternatively, we could allow construction of ill-formed Create Table/View objects and put the check somewhere else (a plan validator), but I would be worried that such a delayed check could be missed in some code flows. The field duplication isn't a problem from a maintainability perspective, after all.
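For illustration, a condensed sketch of the pattern under discussion, with two hypothetical fields standing in for the real ones:

#[non_exhaustive]
pub struct CreateExternalTable {
    pub name: String,
    pub location: String,
}

/// Mirror struct with the same fields, so downstream code can destructure
/// exhaustively (no `..`); a newly added field then breaks compilation at
/// every destructuring site instead of being silently ignored.
pub struct CreateExternalTableFields {
    pub name: String,
    pub location: String,
}

impl CreateExternalTable {
    /// All construction funnels through here, so invariants can be checked.
    pub fn new(name: String, location: String) -> Result<Self, String> {
        if name.is_empty() {
            return Err("table name must not be empty".to_string());
        }
        Ok(Self { name, location })
    }

    /// Convert into the destructurable mirror (allowed here because we
    /// are inside the defining crate, where #[non_exhaustive] is inert).
    pub fn into_fields(self) -> CreateExternalTableFields {
        let CreateExternalTable { name, location } = self;
        CreateExternalTableFields { name, location }
    }
}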

@findepi (Member Author) commented Nov 20, 2024

We can delay the name ambiguity check until a real column reference occurs. But currently, it seems that this check is not sufficient. For example

DataFusion CLI v43.0.0
> select t.a from (select 1 as a, 2 as a) t;

Good catch. This is easy to solve.

The less easy part is that

select * from (select 1 as a, 2 as a) t;

should work. However, the * gets expanded into Expr expressions, and those expressions have no way to differentiate between the two columns named a. This is because the schema is used both for initial query analysis and in logical plans. Relates to #1468.
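For illustration, a simplified sketch of why name-based references cannot tell the two columns apart (the real type, datafusion_common::Column, is essentially a qualifier plus a name):

#[derive(Debug, PartialEq)]
struct Column {
    relation: Option<String>,
    name: String,
}

fn main() {
    // After `*` expansion over `(select 1 as a, 2 as a) t`, both output
    // columns become the same reference; nothing tells them apart.
    let first = Column { relation: Some("t".into()), name: "a".into() };
    let second = Column { relation: Some("t".into()), name: "a".into() };
    assert_eq!(first, second);
}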

@findepi force-pushed the findepi/support-duplicate-column-aliases-0bf339 branch from bd2307e to e7e778c on November 20, 2024 14:20
@findepi marked this pull request as a draft on November 20, 2024 14:20
@jonahgao (Member) commented:

The less easy part is that

select * from (select 1 as a, 2 as a) t;

should work. However, the * gets expanded into Expr expressions, and those expressions have no way to differentiate between the two columns named a. This is because the schema is used both for initial query analysis and in logical plans. Relates to #1468.

We might need to introduce a column index to differentiate them.
Since this case was not previously supported either, maybe we can handle it later.
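For illustration, one possible shape of that suggestion (hypothetical; not what DataFusion currently has): carrying an optional ordinal alongside the name keeps duplicates distinct:

#[derive(Debug, PartialEq)]
struct Column {
    relation: Option<String>,
    name: String,
    index: Option<usize>, // position in the input schema
}

fn main() {
    let first = Column { relation: Some("t".into()), name: "a".into(), index: Some(0) };
    let second = Column { relation: Some("t".into()), name: "a".into(), index: Some(1) };
    // Same qualifier and name, but no longer indistinguishable:
    assert_ne!(first, second);
}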

@findepi (Member Author) commented Nov 22, 2024

We might need to introduce a column index to differentiate them.

This works and is definitely the easiest to weave into the current code, but IMO it is a very slippery slope.
On the last DF contributor call, @alamb mentioned many hours spent debugging incorrect column ordinals while working on PostgreSQL internals.
Out of caution for engineers' sanity, I would strongly prefer globally unique symbols, because then such errors can be caught.
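For illustration, a minimal sketch of the symbol-based alternative (hypothetical, Trino-style): every planner output gets a globally unique id, so a dangling reference is detectable instead of silently resolving to the wrong offset:

use std::sync::atomic::{AtomicU64, Ordering};

static NEXT_ID: AtomicU64 = AtomicU64::new(0);

#[derive(Debug, Clone, PartialEq, Eq, Hash)]
struct Symbol {
    name: String,
    id: u64, // globally unique within a query; never reused
}

fn new_symbol(name: &str) -> Symbol {
    Symbol {
        name: name.to_string(),
        id: NEXT_ID.fetch_add(1, Ordering::Relaxed),
    }
}

fn main() {
    let a1 = new_symbol("a");
    let a2 = new_symbol("a");
    // Two columns both named `a` stay distinct, and a reference to a
    // symbol that no input produces can be flagged by a plan-validity
    // check rather than silently reading the wrong column.
    assert_ne!(a1, a2);
}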

@alamb (Contributor) commented Nov 23, 2024

On the last DF contributor call, @alamb mentioned many hours spent debugging incorrect column ordinals while working on PostgreSQL internals.
Out of caution for engineers' sanity, I would strongly prefer globally unique symbols, because then such errors can be caught.

Yes. Specifically, when I was working on Postgres internals ~15 years ago, all column references were effectively offsets into the input schema, and I remember spending lots of time debugging when the offsets weren't right -- it was hard to keep track of what the offsets were supposed to be.

That being said, DataFusion PhysicalExpr columns use ordinal offsets, and they don't seem to have generated too many debugging headaches.
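For reference, the physical column type really is ordinal-based; a minimal sketch against DataFusion's public API (the exact module path varies by version):

use datafusion::physical_plan::expressions::Column;

fn main() {
    // Physical columns are resolved by position in the input schema,
    // not by name alone.
    let col = Column::new("a", 0);
    assert_eq!(col.name(), "a");
    assert_eq!(col.index(), 0);
}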

@findepi (Member Author) commented Nov 23, 2024

That being said, DataFusion PhysicalExpr columns use ordinal offsets, and they don't seem to have generated too many debugging headaches.

That may be because we already prune columns at the LP level?

@alamb (Contributor) commented Nov 23, 2024

That being said, DataFusion PhysicalExpr columns use ordinal offsets, and they don't seem to have generated too many debugging headaches.

That may be because we already prune columns at the LP level?

That is likely, though there is projection pushdown in the physical optimizer too: https://docs.rs/datafusion/latest/datafusion/physical_optimizer/projection_pushdown/index.html 🤔

@findepi (Member Author) commented Nov 23, 2024

This is anecdote-based, so I will add mine. In Trino, the planner uses symbols (a single per-query global namespace). Bugs happen, and they are caught by ValidateDependenciesChecker (example failure: trinodb/trino#22806). With ordinal-based references, every such bug would yield an incorrect but potentially executable plan, so instead of a clear error, it could produce incorrect results. Why would we prefer that?

@findepi (Member Author) commented Nov 25, 2024

PhysicalExpr columns use ordinal offsets and they don't seem to have generated too many debugging headaches

a gin has been invoked - #13559

@alamb (Contributor) commented Nov 25, 2024

PhysicalExpr columns use ordinal offsets and they don't seem to have generated too many debugging headaches

a gin has been invoked - #13559

🤣 😭

@alamb (Contributor) commented Nov 25, 2024

Bugs happen, and they are caught by ValidateDependenciesChecker (example failure: trinodb/trino#22806).

That error message is 😍

Suppressed: java.lang.Exception: Current plan:
                Output[columnNames = [a, _col1, _col2, _col3, _col4, _col5, _col6]]
                │   Layout: [field:integer, sum:bigint, sum_15:bigint, sum_16:bigint, sum_17:bigint, sum_18:bigint, sum_19:bigint]
                │   a := field
                │   _col1 := sum
                │   _col2 := sum_15
                │   _col3 := sum_16
                │   _col4 := sum_17
                │   _col5 := sum_18
                │   _col6 := sum_19
                └─ Aggregate[keys = [field]]
                   │   Layout: [field:integer, sum:bigint, sum_15:bigint, sum_16:bigint, sum_17:bigint, sum_18:bigint, sum_19:bigint]
                   │   sum := sum(sum_28)
                   │   sum_15 := sum(sum_29)
                   │   sum_16 := sum(sum_30)
                   │   sum_17 := sum(sum_31)
                   │   sum_18 := sum(sum_32)
                   │   sum_19 := sum(sum_33)
                   └─ Project[]
                      │   Layout: [field:integer, sum_28:bigint, sum_29:bigint, sum_30:bigint, sum_31:bigint, sum_32:bigint, sum_33:bigint]
                      │   sum_28 := (CASE WHEN (field_0 = integer '1') THEN sum_21 ELSE bigint '0' END)
                      │   sum_29 := (CASE WHEN (field_0 = integer '2') THEN sum_23 ELSE CAST(field_1 AS bigint) END)
                      │   sum_30 := (CASE WHEN (field_0 = integer '3') THEN sum_23 ELSE CAST(field_2 AS bigint) END)
                      │   sum_31 := (CASE WHEN (field_0 = integer '4') THEN sum_25 ELSE bigint '0' END)
                      │   sum_32 := (CASE WHEN (field_0 = integer '5') THEN sum_27 ELSE bigint '0' END)
                      │   sum_33 := (CASE WHEN (field_0 = integer '6') THEN sum_23 ELSE CAST(field_3 AS bigint) END)
                      └─ Aggregate[type = FINAL, keys = [field_0, field]]
                         │   Layout: [field_0:integer, field:integer, sum_21:bigint, sum_23:bigint, sum_25:bigint, sum_27:bigint]
                         │   sum_21 := sum(sum_34)
                         │   sum_23 := sum(sum_35)
                         │   sum_25 := sum(sum_36)
                         │   sum_27 := sum(sum_37)
                         └─ LocalExchange[partitioning = HASH, arguments = [field::integer]]
                            │   Layout: [field_0:integer, field:integer, sum_34:bigint, sum_35:bigint, sum_36:bigint, sum_37:bigint]
                            └─ Aggregate[type = PARTIAL, keys = [field_0, field]]
                               │   Layout: [field_0:integer, field:integer, sum_34:bigint, sum_35:bigint, sum_36:bigint, sum_37:bigint]
                               │   sum_34 := sum(expr_20)
                               │   sum_35 := sum(expr_22)
                               │   sum_36 := sum(expr_24)
                               │   sum_37 := sum(expr_26)
                               └─ Values[]
                                      Layout: [field_0:integer, field:integer, expr_20:bigint, expr_22:bigint, expr_24:bigint, expr_26:bigint]
                                      (integer '1', integer '1', bigint '1', bigint '0', bigint '1', bigint '1')
                                      (integer '2', integer '2', bigint '2', bigint '0', bigint '2', bigint '2')
                                      (integer '3', integer '3', bigint '3', bigint '0', bigint '3', bigint '3')

		at io.trino.sql.planner.sanity.PlanSanityChecker.validate(PlanSanityChecker.java:120)
		... 17 more

With ordinal-based references, every such bug would yield an incorrect but potentially executable plan, so instead of a clear error, it could produce incorrect results. Why would we prefer that?

I agree -- an error vs incorrect results sounds like the right tradeoff to me
