Support duplicate column aliases in queries #13489
I think it's a good idea, as it allows the schema of the top plan to include duplicate names, thereby resolving #6543. We can delay the name ambiguity check until a real column reference occurs. But currently, it seems that this check is not sufficient. For example:
DataFusion CLI v43.0.0
> select t.a from (select 1 as a, 2 as a) t;
+---+
| a |
+---+
| 1 |
+---+
1 row(s) fetched.
This query did not return an error, as it does in PostgreSQL and as it did before this change. Perhaps we should improve the ambiguity check when searching for field names from schemas after removing
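To make the suggestion concrete, here is a minimal sketch of a lookup-time ambiguity check. It assumes a simplified, hypothetical `Schema` type (a plain list of field names), not DataFusion's actual schema API: duplicate names are accepted when the schema is built, and an error is raised only when a reference by name matches more than one field.

```rust
/// Errors a name lookup can produce in this sketch.
#[derive(Debug, PartialEq)]
enum LookupError {
    NotFound(String),
    Ambiguous { name: String, matches: usize },
}

/// Hypothetical, simplified schema: only field names, no qualifiers or types.
struct Schema {
    fields: Vec<String>,
}

impl Schema {
    /// Duplicate names are allowed at construction time.
    fn new(fields: &[&str]) -> Self {
        Schema {
            fields: fields.iter().map(|f| f.to_string()).collect(),
        }
    }

    /// Resolve a column reference by name. The ambiguity check happens here,
    /// at the point of reference, rather than when the schema is built.
    fn index_of(&self, name: &str) -> Result<usize, LookupError> {
        let hits: Vec<usize> = self
            .fields
            .iter()
            .enumerate()
            .filter(|(_, f)| f.as_str() == name)
            .map(|(i, _)| i)
            .collect();
        match hits.as_slice() {
            [] => Err(LookupError::NotFound(name.to_string())),
            [i] => Ok(*i),
            _ => Err(LookupError::Ambiguous {
                name: name.to_string(),
                matches: hits.len(),
            }),
        }
    }
}

fn main() {
    // Mirrors `(select 1 as a, 2 as a)` with an extra unambiguous column.
    let schema = Schema::new(&["a", "a", "b"]);
    println!("{:?}", schema.index_of("b"));
    println!("{:?}", schema.index_of("a"));
}
```

Under this scheme, `select t.a` over the subquery above would fail with an ambiguity error, while a projection that never refers to `a` by name would be unaffected.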
/// A struct with same fields as [`CreateExternalTable`] struct so that the DDL can be conveniently
/// destructed with validation that each field is handled, while still requiring that all
/// construction goes through the [`CreateExternalTable::new`] constructor or the builder.
pub struct CreateExternalTableFields {
I think `non_exhaustive` discourages destructuring, but `CreateExternalTableFields` makes it possible again. `CreateExternalTableFields` and `CreateExternalTable` have the same fields, and I'm a bit worried that it introduces some code duplication 🤔.
Yes, it does introduce code duplication, and so does the builder.
When handling the Create Table/View, deconstruction without `..` is valuable, as it guarantees that any new field will force the code to be revisited (rather than new fields being ignored).
However, in Rust, deconstruction without `..` is possible only when construction is possible, and direct construction being possible precludes construction-time checks, which is undesirable.
As an alternative, we could allow construction of ill-formed Create Table/View objects and have a check somewhere else (plan validator), but I would be worried that such a delayed check could be missed in some code flows. The field duplication isn't a problem from a maintainability perspective after all.
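The tradeoff described above can be sketched roughly as follows. All names here (`CreateView`, `CreateViewFields`, `into_fields`, `describe`) are hypothetical stand-ins rather than the types in this PR, and a module with private fields is used to imitate the restriction on direct construction: the validated type can only be built through `new`, while the mirror struct with public fields can be destructured exhaustively.

```rust
mod ddl {
    /// Hypothetical DDL node. Fields are private, so code outside this
    /// module can neither construct it directly nor destructure it.
    #[derive(Debug)]
    pub struct CreateView {
        name: String,
        or_replace: bool,
        temporary: bool,
    }

    /// Mirror struct with public fields, used only for exhaustive destructuring.
    pub struct CreateViewFields {
        pub name: String,
        pub or_replace: bool,
        pub temporary: bool,
    }

    impl CreateView {
        /// The only way to build a `CreateView`; construction-time checks live here.
        pub fn new(fields: CreateViewFields) -> Result<Self, String> {
            let CreateViewFields { name, or_replace, temporary } = fields;
            if name.is_empty() {
                return Err("view name must not be empty".to_string());
            }
            Ok(CreateView { name, or_replace, temporary })
        }

        /// Hand the validated contents back in a destructurable form.
        pub fn into_fields(self) -> CreateViewFields {
            let CreateView { name, or_replace, temporary } = self;
            CreateViewFields { name, or_replace, temporary }
        }
    }
}

use ddl::{CreateView, CreateViewFields};

fn describe(view: CreateView) -> String {
    // No `..` in this pattern: adding a field to `CreateViewFields`
    // makes this function fail to compile until the new field is handled.
    let CreateViewFields { name, or_replace, temporary } = view.into_fields();
    let create = if or_replace { "create or replace " } else { "create " };
    let temp = if temporary { "temporary " } else { "" };
    format!("{}{}view {}", create, temp, name)
}

fn main() {
    let view = CreateView::new(CreateViewFields {
        name: "v".to_string(),
        or_replace: true,
        temporary: false,
    })
    .expect("a non-empty name passes validation");
    println!("{}", describe(view));
}
```

The duplication is limited to the field list itself, and the compiler flags every place that needs updating when a field is added.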
In SQL, selecting a single column multiple times is legal and most modern databases support this. This commit adds such support to DataFusion too.
Good catch. This is easy to solve. The less easy part is that `select * from (select 1 as a, 2 as a) t;` should work. However, the
We might need to introduce a column index to differentiate them.
This works and is definitely the easiest to weave into the current code, but IMO it is a very slippery slope.
Yes, specifically when I was working on postgres internals ~15 years ago, all column references were effectively offsets into the input schema and I remember spending lots of time debugging when the offsets weren't right -- it was hard to keep track of what the offsets were supposed to be. That being said, DataFusion PhysicalExpr columns use ordinal offsets and they don't seem to have generated too many debugging headaches.
That may be because we prune columns on the LP level already?
That is likely, though there is projection pushdown in the physical optimizer too: https://docs.rs/datafusion/latest/datafusion/physical_optimizer/projection_pushdown/index.html 🤔
This is anecdote-based, so I will add mine. In Trino the planner uses symbols (a single per-query global namespace). Bugs happen and these are caught by ValidateDependenciesChecker (example failure: trinodb/trino#22806). With ordinal-based references, every such bug would yield an incorrect but potentially executable plan, so instead of a clear error, it could produce incorrect results. Why would we prefer that?
a gin has been invoked - #13559
🤣 😭
That error message is 😍
I agree -- an error vs incorrect results sounds like the right tradeoff to me
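A toy illustration of that tradeoff, assuming a hypothetical row representation (a list of name/value pairs) and a deliberately buggy pruning step; none of this is DataFusion or Trino code. After the bug drops a column that is still referenced, the name-based reference fails loudly, while the ordinal reference silently reads a different column.

```rust
/// Resolve a reference by column name; fails if the column is missing.
fn by_name(row: &[(&'static str, i64)], name: &str) -> Result<i64, String> {
    row.iter()
        .find(|(n, _)| *n == name)
        .map(|(_, v)| *v)
        .ok_or_else(|| format!("column '{}' missing from input", name))
}

/// Resolve a reference by position; fails only if the position is out of range.
fn by_ordinal(row: &[(&'static str, i64)], idx: usize) -> Result<i64, String> {
    row.get(idx)
        .map(|(_, v)| *v)
        .ok_or_else(|| format!("ordinal {} out of range", idx))
}

/// A deliberately buggy optimizer step: it drops column "b" even though
/// an expression downstream still refers to it.
fn buggy_prune(row: &[(&'static str, i64)]) -> Vec<(&'static str, i64)> {
    row.iter().copied().filter(|(n, _)| *n != "b").collect()
}

fn main() {
    let row = vec![("a", 1), ("b", 2), ("c", 3)];
    let pruned = buggy_prune(&row);

    // Both reference styles are meant to read column "b".
    println!("by name:    {:?}", by_name(&pruned, "b"));
    println!("by ordinal: {:?}", by_ordinal(&pruned, 1));
}
```

With the name-based reference the bug surfaces as an error naming the missing column; with the ordinal reference the plan still runs and returns the value of `c` instead of `b`.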
Which issue does this PR close?
`select *`, followed/prefixed with explicit projections #13476
Rationale for this change
In SQL, selecting a single column multiple times is legal and most modern
databases support this. This commit adds such support to DataFusion too.
What changes are included in this PR?
Are these changes tested?
yes
Are there any user-facing changes?
yes, more valid queries are supported