Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix invalid schema for unions in ViewTables #15135

Open
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

Friede80
Copy link

Which issue does this PR close?

Rationale for this change

At the core, if a Union goes through the TypeCoercion analyzer twice, the schema can drop the qualifier information for fields that were cast in the first pass of TypeCoercion. To preserve this information, this PR changes coerce_union_schema to start from the schema of the Union instead of the schema of the first child.

Are these changes tested?

Yes

Are there any user-facing changes?

Yes, this changes the signature of coerce_union_schema so that the schema carried by the Union is available.

Before:

pub fn coerce_union_schema(inputs: &[Arc<LogicalPlan>]) -> Result<DFSchema>

After:

pub fn coerce_union_schema(union_plan: &Union) -> Result<DFSchema>

@github-actions github-actions bot added the optimizer Optimizer rules label Mar 10, 2025
@Friede80
Copy link
Author

Friede80 commented Mar 10, 2025

I'm not sure if there are still valid uses of coerce_union_schema given only the set of logical plans, but if we can't change the api of a public function, it would be easy enough to accomplish this with a wrapper. i.e.

pub fn coerce_union_schema(inputs: &[Arc<LogicalPlan>]) -> Result<DFSchema> {
    coerce_union_schema_with_schema(inputs, inputs[0].schema())
}
fn coerce_union_schema_with_schema(inputs: &[Arc<LogicalPlan>], base_schema: &DFSchemaRef) -> Result<DFSchema> {
...
}

@xudong963 xudong963 self-requested a review March 11, 2025 07:30
@github-actions github-actions bot added logical-expr Logical plan and expressions sqllogictest SQL Logic Tests (.slt) labels Mar 12, 2025
@Friede80
Copy link
Author

Friede80 commented Mar 12, 2025

There seems to have been quite a few cracks at this particular situation:

The trouble seems to stem from the decision to maintain TableReference qualifiers in unions for disambiguation. This makes it very difficult to know when these qualifiers should or shouldn't be passed through at different points in the optimizer. Unless there are well specified semantics around this that I'm missing, I think it would be better go back to the assumption that qualifiers never persist through a union. And we can continue to support field disambiguation by generating unique column names as is done in DuckDB (deduplicating-identifiers)

@alamb
Copy link
Contributor

alamb commented Mar 12, 2025

There seems to have been quite a few cracks at this particular situation:

The trouble seems to stem from the decision to maintain TableReference qualifiers in unions for disambiguation. This makes it very difficult to know when these qualifiers should or shouldn't be passed through at different points in the optimizer. Unless there are well specified semantics around this that I'm missing, I think it would be better go back to the assumption that qualifiers never persist through a union. And we can continue to support field disambiguation by generating unique column names as is done in DuckDB (deduplicating-identifiers)

Maybe @jonahgao can offer an opinion on this suggestion

@jonahgao jonahgao self-requested a review March 13, 2025 09:59
@@ -776,8 +777,32 @@ impl LogicalPlanBuilder {
&missing_cols,
is_distinct,
)?;

let mut sort_exprs = normalize_sorts(sorts, &plan)?;
if matches!(&plan, LogicalPlan::Union(_))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this? It seems to make the following invalid query executable.

DataFusion CLI v46.0.0
> create table t1(a int);
0 row(s) fetched.
Elapsed 0.007 seconds.

> create table t2(a int);
0 row(s) fetched.
Elapsed 0.007 seconds.

> select t1.a from t1 union all select t2.a from t2 order by not_exist_table.a;
+---+
| a |
+---+
+---+
0 row(s) fetched.
Elapsed 0.008 seconds.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree it feels strange and I wouldn't mind getting rid of this, but I put it in for backwards compatibility, because currently if you ran:

select t1.a from t1 union all select t2.a from t2 order by t1.a

it would work. But after this PR, it would fail without this pass

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we stripped qualifiers, it should be reasonable that order by t1.a fails. I tested this behavior in PostgreSQL, and it works that way.

psql=> select a from t1 union all select a from t2 order by t1.a;
ERROR:  missing FROM-clause entry for table "t1"
LINE 1: select a from t1 union all select a from t2 order by t1.a;
                                                             ^
psql=> select a from t1 union all select a from t2 order by a;
 a
---
(0 rows)

Moreover, it seems like a bug to me if order by not_exist_table.a executes successfully.


// Generate unique field name
let name = if let Some(count) = name_counts.get_mut(&base_name) {
*count += 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sense to me👍. But it doesn't work for this case:

> create table t1(a int);
0 row(s) fetched.
Elapsed 0.005 seconds.

> create table t2(a int);
0 row(s) fetched.
Elapsed 0.007 seconds.

> select * from t1,t2 union all select * from t1,t2;
type_coercion
caused by
Schema error: Schema contains duplicate unqualified field name

The above query works on the main branch.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like TypeCoercion undoes the renaming efforts here.
Maybe TypeCoercion should only modify the data types of union inputs, rather than field names 🤔

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh right, because the union schema is discarded during coercion. My initial attempt to fix this issue propagated the union schema through, but I removed it when I decided not to maintain qualifiers, but I think adding that back would resolve this.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 612c217

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
logical-expr Logical plan and expressions optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Invalid schema for unions in ViewTable
3 participants