perf: unwrap cast for comparing ints =/!= strings #15110

alan910127 · 2025-03-09T14:23:52Z

Which issue does this PR close?

Closes Avoid casting columns when comparing ints and strings #15035.

Rationale for this change

What changes are included in this PR?

Add a new function to cast literals to the target type when specific operators are applied.
Update unwrap_cast_in_comparison_for_binary and is_cast_expr_and_support_unwrap_cast_in_comparison_for_binary to use the new function.

Are these changes tested?

The following test cases are added:

CAST(c1, UTF8) < "123"
CAST(c1, UTF8) = "123"
CAST(c1, UTF8) != "123"
CAST(c6, UTF8) = "123"
CAST(c6, UTF8) != "123"

Are there any user-facing changes?

No.

datafusion/optimizer/src/simplify_expressions/unwrap_cast.rs

alamb

Thank you so much @alan910127

I think this is looking quite close. I left a suggestion about how to do the conversion

I also think this needs a few more tests as suggested below.

FYI @jayzhan211 and @ion-elgreco

negative / null cases

Can you also test predicates with null literals and

-- compare with Null literal (ScalarValue::Utf8(None)
cast(c1, Utf8) = NULL

-- compare with strings that are not a number
cast(c1, Utf8) = 'not a number'

-- compare with strings that are numbers but do not fit in the range
-- where c1 is a Int8 so max value of 128, try comparing to a number
-- that is greater than 128
cast(c1, Utf8) = '1000'

"End to end" (sql level)

Can we please add at least one "end to end" style test in sqllogictest style (EXPLAIN ....)

Here are the instructions: https://github.com/apache/datafusion/tree/main/datafusion/sqllogictest

Ideally you should be able to extend one of the existing test files in https://github.com/apache/datafusion/tree/main/datafusion/sqllogictest/test_files

datafusion/optimizer/src/simplify_expressions/unwrap_cast.rs

alamb · 2025-03-10T11:07:36Z

datafusion/optimizer/src/simplify_expressions/unwrap_cast.rs

+            | ScalarValue::Utf8View(Some(ref str))
+            | ScalarValue::LargeUtf8(Some(ref str)),
+        ) => match target_type {
+            DataType::Int8 => str.parse::<i8>().ok().map(ScalarValue::from),


by calling ok() here I think this converts a predicate like cast(col, utf8) = 'foo' into col = 'null'

Which is not quite the same thing.

Also, str.parse() likely has subltely different semantics than the arrow cast kernels.

This, can can you please make this function:

Return None if the conversion fails (following the convention of other functions in this module like try_cast_dictionary)

Use the cast kernel to parse the values?

The cast kernel can be called by:

Converting ScalarValue to an array with ScalarValue::to_array

Call cast: https://docs.rs/arrow/latest/arrow/array/cast/index.html

Calling ScalarValue::try_from_array

I tried to use your suggested casting implementation, could you help me check if I'm implementing it correctly?

Also 2 sqllogictest test cases are added (optimized/number not in range)

alamb

Thank you so much @alan910127 -- this is great

While reviewing this PR I had a few more tests I wanted to suggest, but instead of doing another round of PR reviews I figured I would just push the commits directly to save time.

I also verified that DuckDB does this type of simplification too (note the CAST( │ │ '123' AS STRUCT(x INTEGER) which is casting the string ot int

D create table t (x int);
D insert into t values (1);
D explain select * from t where t = '123';

┌─────────────────────────────┐
│┌───────────────────────────┐│
││       Physical Plan       ││
│└───────────────────────────┘│
└─────────────────────────────┘
┌───────────────────────────┐
│           FILTER          │
│    ────────────────────   │
│  (struct_pack(x) = CAST(  │
│ '123' AS STRUCT(x INTEGER)│
│             ))            │
│                           │
│          ~1 Rows          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│         SEQ_SCAN          │
│    ────────────────────   │
│          Table: t         │
│   Type: Sequential Scan   │
│       Projections: x      │
│                           │
│          ~1 Rows          │
└───────────────────────────┘

Strangely they don't seem to care about errors:

D  select * from t where t = '999999999999999999999999999999999999';
Conversion Error:
Type VARCHAR with value '999999999999999999999999999999999999' can't be cast to the destination type STRUCT

LINE 1: select * from t where t = '999999999999999999999999999999999999';

D select * from t where t = 'foo';
Conversion Error:
Type VARCHAR with value 'foo' can't be cast to the destination type STRUCT

LINE 1: select * from t where t = 'foo';
                                  ^
D select * from t;
┌───────┐
│   x   │
│ int32 │
├───────┤
│   1   │
└───────┘

alamb · 2025-03-10T18:53:47Z

datafusion/optimizer/src/simplify_expressions/unwrap_cast.rs

+    target_type: &DataType,
+    op: Operator,
+) -> Option<ScalarValue> {
+    macro_rules! cast_or_else_return_none {


Since ScalarValue is already dynamic, I don't think we need a macro to run on each type

I played around a little locally and this is what I came up with. What do you think?

/// Try to move a cast from a literal to the other side of a `=` / `!=` operator /// /// Specifically, rewrites /// ```sql /// cast(col) <op> <litera> /// ``` /// /// To /// /// ```sql /// col <op> cast(<literal>) /// col <op> <cast_literal> /// ``` fn cast_literal_to_type_with_op( lit_value: &ScalarValue, target_type: &DataType, op: Operator, ) -> Option<ScalarValue> { let ( Operator::Eq | Operator::NotEq, ScalarValue::Utf8(Some(_)) | ScalarValue::Utf8View(Some(_)) | ScalarValue::LargeUtf8(Some(_)), ) = (op, lit_value) else { return None; }; // Only try for integer types (TODO can we do this for other types // like timestamps)? use DataType::*; if matches!( target_type, Int8 | Int16 | Int32 | Int64 | UInt8 | UInt16 | UInt32 | UInt64 ) { let opts = arrow::compute::CastOptions { safe: false, format_options: Default::default(), }; let array = ScalarValue::to_array(lit_value).ok()?; let casted = arrow::compute::cast_with_options(&array, target_type, &opts).ok()?; ScalarValue::try_from_array(&casted, 0).ok() } else { None } }

instead of doing another round of PR reviews I figured I would just push the commits directly to save time.

Thank you! I’m just curious—is storing a parquet file in slt tests common? The tests I’ve worked with so far use tables created in the same test file.

Strangely they don't seem to care about errors:

Interesting! This makes me wonder if they treat this behavior as an "optimization" or if it's simply an expected behavior to cast the literal to the column's type.

I played around a little locally and this is what I came up with. What do you think?

Oh, this looks much better! However, I’d prefer leaving the outer match as is, maybe there’s potential for another optimization in the future?

if they treat this behavior as an "optimization" or if it's simply an expected behavior to cast the literal to the column's type

It looks like my guess might be correct—they always cast the literal to the column's type, regardless of the operator.

D create table t as values (1), (2), (3); D select * from t where col0 < '10'; ┌───────┐ │ col0 │ │ int32 │ ├───────┤ │ 1 │ │ 2 │ │ 3 │ └───────┘

alamb · 2025-03-10T18:56:12Z

@findepi I wonder if you have some time to double check the correctness of this optimization (distributing a cast)

findepi · 2025-03-11T09:03:00Z

datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs

@@ -1758,7 +1758,7 @@ impl<S: SimplifyInfo> TreeNodeRewriter for Simplifier<'_, S> {
            // try_cast/cast(expr as data_type) op literal
            Expr::BinaryExpr(BinaryExpr { left, op, right })
                if is_cast_expr_and_support_unwrap_cast_in_comparison_for_binary(
-                    info, &left, &right,
+                    info, &left, &right, op,


add op between left and right args -- this helps understand the role of the two &Expr args to the function
(or maybe even pass the whole &BinaryExpr into the function)

findepi · 2025-03-11T09:07:50Z

datafusion/optimizer/src/simplify_expressions/unwrap_cast.rs

+            ScalarValue::Utf8(Some(_))
+            | ScalarValue::Utf8View(Some(_))
+            | ScalarValue::LargeUtf8(Some(_)),
+        ) => match target_type {


Does target_type represent source type of the cast?

findepi · 2025-03-11T09:14:14Z

@findepi I wonder if you have some time to double check the correctness of this optimization (distributing a cast)

unwrapping cast is not an easy feat. The current implementation isn't exactly correct, as it conflates different strings which cast back to the same number:

CREATE OR REPLACE TABLE t AS SELECT arrow_cast('123', 'Int64') a;

-- correctly finds the row
SELECT * FROM t WHERE cast(a AS string) = '123';

-- incorrectly also finds the row
SELECT * FROM t WHERE cast(a AS string) = '0123';

alan910127 · 2025-03-11T13:10:41Z

@findepi so the test cases in unwrap_cast are not correct, and they're showing that I'm over-unwrapping. Am I understanding this correctly?

alamb · 2025-03-11T14:16:32Z

@findepi so the test cases in unwrap_cast are not correct, and they're showing that I'm over-unwrapping. Am I understanding this correctly?

I think @findepi is saying that the mapping proposed in the PR is more subtle

casting an integer 123 to a string will result in '123'.

It would not match the string '0123'

However, as written this PR will also match '0123' as we will cast 0123 to an integer which results in 123

So in other words, we should not do the conversion from '0123' (string) to 123 (int)

Maybe we can add a check that the string would be the same when we round tripped it

As in only do the rewrite if the sequence

cast to int
cast (back) to string

results in the exact same string as went int

alamb · 2025-03-11T14:17:02Z

(thank you for checking this @findepi -- 🙏

findepi · 2025-03-11T14:59:41Z

Maybe we can add a check that the string would be the same when we round tripped it

As in only do the rewrite if the sequence

cast to int

cast (back) to string

results in the exact same string as went int

neat

findepi · 2025-03-11T15:09:05Z

datafusion/optimizer/src/simplify_expressions/unwrap_cast.rs

+    match (op, lit_value) {
+        (
+            Operator::Eq | Operator::NotEq,
+            ScalarValue::Utf8(Some(_))
+            | ScalarValue::Utf8View(Some(_))
+            | ScalarValue::LargeUtf8(Some(_)),
+        ) => match target_type {
+            DataType::Int8 => cast_or_else_return_none!(lit_value, DataType::Int8),
+            DataType::Int16 => cast_or_else_return_none!(lit_value, DataType::Int16),
+            DataType::Int32 => cast_or_else_return_none!(lit_value, DataType::Int32),
+            DataType::Int64 => cast_or_else_return_none!(lit_value, DataType::Int64),
+            DataType::UInt8 => cast_or_else_return_none!(lit_value, DataType::UInt8),
+            DataType::UInt16 => cast_or_else_return_none!(lit_value, DataType::UInt16),
+            DataType::UInt32 => cast_or_else_return_none!(lit_value, DataType::UInt32),
+            DataType::UInt64 => cast_or_else_return_none!(lit_value, DataType::UInt64),
+            _ => None,
+        },
+        _ => None,


As it's written, and as the function is named, this looks like generic approach that will work many different source/target type pairs, we just need to add logic.
However, the exact checks necessary for soundness will likely differ.

For casts between exact numbers (ints and decimals) and string types, the #15110 (comment) looks like a sound plan.
Plus, we could also simplify impossible target literals like this

cast(an_int as string) = 'abc' or cast(an_int as string) = '007' to if(an_int is null, null, false)

For casts between numbers and numbers, we likely need to recognize max addressible values, and cast impreciseness, especially when we want to handle <, <=, >, >= operators.
https://github.com/trinodb/trino/blob/a870656f406b0f76e97c740659a58554d472994d/core/trino-main/src/main/java/io/trino/sql/planner/iterative/rule/UnwrapCastInComparison.java#L199-L354 might serve as a good reference. Note how it handles NaN, max values, lossy round-tripping, sometimes converting < to <= as necessary.

For e.g cast from timestamp to date, the logic will likely be different again
https://github.com/trinodb/trino/blob/a870656f406b0f76e97c740659a58554d472994d/core/trino-main/src/main/java/io/trino/sql/planner/iterative/rule/UnwrapCastInComparison.java#L357-L389.

With all this in mind, I'd suggest structuring this code so it's clear it addresses only the exact number and string case, by checking expr and literal types and delegating to a function that has "int" and "utf8" or "string" in the name.

alan910127 · 2025-03-11T15:17:00Z

As in only do the rewrite if the sequence

cast to int

cast (back) to string

results in the exact same string as went int

@alamb I found that this approach is different from how postgres and duckdb handle this situation:

SELECT * FROM t WHERE a = '0123' - they will cast the '0123' to an integer and it will match, so 1 row is returned.
SELECT * FROM t WHERE cast(a AS string) = '0123' - they will not cast the '0123', and this is a string comparison, so no rows are returned.

postgres

db=# create table t as select 123 a;
SELECT 1
db=# select * from t where a = '0123';
  a
-----
 123
(1 row)

db=# select * from t where cast(a AS text) = '0123';
 a
---
(0 rows)

duckdb

D create table t as select 123 a;
D select * from t where a = '0123';
┌───────┐
│   a   │
│ int32 │
├───────┤
│  123  │
└───────┘
D select * from t where cast(a AS string) = '0123';
┌────────┐
│   a    │
│ int32  │
├────────┤
│ 0 rows │
└────────┘

I'm not sure if it's only possible to handle it when the cast is created (type coercion?)

alamb · 2025-03-11T15:32:49Z

I'm not sure if it's only possible to handle it when the cast is created (type coercion?)

That may well be the right thing to do 🤔

It is funny that @scsmithr filed a ticket this morning with a similar symptom:

Incorrect cast of integer columns to utf8 when comparing with utf8 constant #15161

🤔

This reverts commit 808d6ab.

findepi · 2025-03-11T16:34:29Z

This cast unwrapping is valuable orthogonally to what coercions get applied by the frontend.

And it looks feasible -- the plan in #15110 (comment) IMO should work (for integer and string type pairs).

scsmithr · 2025-03-11T16:34:58Z

I'm not sure if it's only possible to handle it when the cast is created (type coercion?)

That may well be the right thing to do 🤔

It is funny that @scsmithr filed a ticket this morning with a similar symptom:
* [Incorrect cast of integer columns to utf8 when comparing with utf8 constant #15161](https://github.com/apache/datafusion/issues/15161)
🤔

For context, I was running some implicit cast queries on a few different systems to compare outputs, and hit the aforementioned issue with datafusion. "Unexpected" might be a better word than "incorrect", but datafusion is the outlier here.

My take on this:

I don't believe manipulating casts after-the-fact is the right approach.

Given the query select * from t1 where column1 < '10'; where column1 is an Int32, the literal '10' should be treated as unknown, and the preference should be to cast that. The generated predicate should then be column1 < cast('10' as int), and whatever const folding happens will then take care of parsing '10' as an int during planning.

Without a proper "unknown" type, I think it should be sufficient to rank implicitly casting to utf8 last compared to any other implicit cast.

alan910127 · 2025-03-11T16:48:35Z

@findepi sorry I didn't notice your comment and I just pushed a new version with the unwrapping logic deleted.

So you think the two optimizations should coexist?

And it looks feasible -- the plan in #15110 (comment) IMO should work

But we have no way to tell if the cast is set by the user or not in the unwrap cast function. Doesn't this matter?

…cast_for_int_eq_string

alamb · 2025-03-11T21:28:22Z

I don't believe manipulating casts after-the-fact is the right approach.

Given the query select * from t1 where column1 < '10'; where column1 is an Int32, the literal '10' should be treated as unknown, and the preference should be to cast that. The generated predicate should then be column1 < cast('10' as int), and whatever const folding happens will then take care of parsing '10' as an int during planning.

I agree @scsmithr -- For the case in #15161 I agree that changing the coercion rules as you suggest is a better plan than trying to fix it up afterwards

However, I think the optimization in this PR is valuable for cases when the user has ended up with the casts some other way and DataFusion is now optimizing the expression

@findepi sorry I didn't notice your comment and I just pushed a new version with the unwrapping logic deleted.

So you think the two optimizations should coexist?

That is my understanding and also what I think should happen

And it looks feasible -- the plan in #15110 (comment) IMO should work

But we have no way to tell if the cast is set by the user or not in the unwrap cast function. Doesn't this matter?

I don't think the source of the cast matters (for this particular optimization)

alan910127 · 2025-03-12T00:27:58Z

I don't think the source of the cast matters (for this particular optimization)

Ah, yes. Since the string -> int -> string is checked and we're only checking equality, then they'll be the same. But I' m not sure if this unwrapping would happen after the string literal coercion.

That is my understanding and also what I think should happen

Got it, should I revert this PR to the unwrap cast optimization or directly combine both the optimization and the literal coercion in this PR?

jayzhan211 · 2025-03-12T04:18:10Z

I don't believe manipulating casts after-the-fact is the right approach.

Given the query select * from t1 where column1 < '10'; where column1 is an Int32, the literal '10' should be treated as unknown, and the preference should be to cast that. The generated predicate should then be column1 < cast('10' as int), and whatever const folding happens will then take care of parsing '10' as an int during planning.

Without a proper "unknown" type, I think it should be sufficient to rank implicitly casting to utf8 last compared to any other implicit cast.

Maybe we can figure out if it is possible to parse to the idea type for scalar, so '123' can be casted to i64 not str and '1.2' can be casted to float not str.

jayzhan211 · 2025-03-13T12:24:02Z

I try a bit and I think there are still some works to be done in how we parse the sql. #15202.

jayzhan211 · 2025-03-13T13:17:41Z

Actually I'm quite curious is string literal really an issue? If we want string, we can have query with quote select * from t1 where column1 < '10';, while if we want numeric, we can have query without quote select * from t1 where column1 < 10;. In practice, when is it beneficial to convert string literal into numeric type? 🤔

alamb

To summarize the feedback so far:

I think this PR needs one more tests as suggested by @findepi #15110 (comment)

-- should not remove cast on string
explain select a from t where cast(a as string) = '0100';

Here is a suggestion on making that pass: #15110 (comment)

Maybe we can add @findepi 's analysis in as comments too

I recommend that we file separate tickets for future work (such as supporting other types).

Does this make sense @alan910127 ?

alamb · 2025-03-13T20:16:23Z

datafusion/core/tests/expr_api/mod.rs

@@ -330,12 +330,12 @@ async fn test_create_physical_expr_coercion() {
    create_expr_test(lit(1i32).eq(col("id")), "CAST(1 AS Utf8) = id@0");
    // compare int col to string literal `i = '202410'`
    // Note this casts the column (not the field)
-    create_expr_test(col("i").eq(lit("202410")), "CAST(i@1 AS Utf8) = 202410");
-    create_expr_test(lit("202410").eq(col("i")), "202410 = CAST(i@1 AS Utf8)");
+    create_expr_test(col("i").eq(lit("202410")), "i@1 = 202410");


perf: unwrap cast for comparing ints =/!= strings

808d6ab

github-actions bot added the optimizer Optimizer rules label Mar 9, 2025

alamb reviewed Mar 10, 2025

View reviewed changes

datafusion/optimizer/src/simplify_expressions/unwrap_cast.rs Outdated Show resolved Hide resolved

alamb reviewed Mar 10, 2025

View reviewed changes

alan910127 added 2 commits March 10, 2025 21:59

fix: update casting logic

dd71241

test: add more unit test and new sqllogictest

dc93afe

github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Mar 10, 2025

Tweak slt tests

8fab722

alamb approved these changes Mar 10, 2025

View reviewed changes

findepi reviewed Mar 11, 2025

View reviewed changes

alamb mentioned this pull request Mar 11, 2025

Incorrect cast of integer columns to utf8 when comparing with utf8 constant #15161

Open

Revert "perf: unwrap cast for comparing ints =/!= strings"

a878b07

This reverts commit 808d6ab.

fix: eliminate column cast and cast literal before coercion

b212292

alan910127 added 2 commits March 12, 2025 02:31

Merge branch 'main' of github.com:apache/datafusion into perf/unwrap_…

587dc0b

…cast_for_int_eq_string

fix: physical expr coercion test

3059d5c

github-actions bot added the core Core DataFusion crate label Mar 11, 2025

alamb reviewed Mar 13, 2025

View reviewed changes

alamb mentioned this pull request Mar 13, 2025

Weekly Plan (Andrew Lamb) March 10, 2025 #15121

Open

11 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf: unwrap cast for comparing ints =/!= strings #15110

perf: unwrap cast for comparing ints =/!= strings #15110

alan910127 commented Mar 9, 2025 •

edited

Loading

alamb left a comment

alamb Mar 10, 2025

alan910127 Mar 10, 2025

alan910127 Mar 10, 2025

alamb left a comment

alamb Mar 10, 2025

alan910127 Mar 11, 2025 •

edited

Loading

alan910127 Mar 11, 2025

alamb commented Mar 10, 2025

findepi Mar 11, 2025

findepi Mar 11, 2025

findepi commented Mar 11, 2025

alan910127 commented Mar 11, 2025

alamb commented Mar 11, 2025

alamb commented Mar 11, 2025

findepi commented Mar 11, 2025

findepi Mar 11, 2025

alan910127 commented Mar 11, 2025

alamb commented Mar 11, 2025

findepi commented Mar 11, 2025

scsmithr commented Mar 11, 2025 •

edited

Loading

alan910127 commented Mar 11, 2025

alamb commented Mar 11, 2025

alan910127 commented Mar 12, 2025

jayzhan211 commented Mar 12, 2025

jayzhan211 commented Mar 13, 2025

jayzhan211 commented Mar 13, 2025

alamb left a comment

alamb Mar 13, 2025

perf: unwrap cast for comparing ints =/!= strings #15110

Are you sure you want to change the base?

perf: unwrap cast for comparing ints =/!= strings #15110

Conversation

alan910127 commented Mar 9, 2025 • edited Loading

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

alamb left a comment

Choose a reason for hiding this comment

negative / null cases

"End to end" (sql level)

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alamb left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alan910127 Mar 11, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alamb commented Mar 10, 2025

Choose a reason for hiding this comment

Choose a reason for hiding this comment

findepi commented Mar 11, 2025

alan910127 commented Mar 11, 2025

alamb commented Mar 11, 2025

alamb commented Mar 11, 2025

findepi commented Mar 11, 2025

Choose a reason for hiding this comment

alan910127 commented Mar 11, 2025

alamb commented Mar 11, 2025

findepi commented Mar 11, 2025

scsmithr commented Mar 11, 2025 • edited Loading

alan910127 commented Mar 11, 2025

alamb commented Mar 11, 2025

alan910127 commented Mar 12, 2025

jayzhan211 commented Mar 12, 2025

jayzhan211 commented Mar 13, 2025

jayzhan211 commented Mar 13, 2025

alamb left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alan910127 commented Mar 9, 2025 •

edited

Loading

alan910127 Mar 11, 2025 •

edited

Loading

scsmithr commented Mar 11, 2025 •

edited

Loading