Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Saner handling of nulls inside arrays #15149

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

joroKr21
Copy link
Contributor

@joroKr21 joroKr21 commented Mar 11, 2025

Which issue does this PR close?

Rationale for this change

#10790 changed empty arrays to default to Int64 but that is just a band-aid which creates other problems downstream, e.g. type errors when nesting expressions. It also introduced a bug (and possible panic) in array_union and array_intersect which are fixed here.

Apart from that, from a type-theoretical PoV empty arrays should indeed be typed as Array<Null> because Null is coercible to any other type and by extension (immutable arrays are covariant) Array<Null> is coercible to any other array type.

The crux of the problem is that Arrow is incapable of dealing with the mismatching array types that arise from this encoding. To workaround this limitation of Arrow, we unify array function argument types in advance and coerce them to the common type. In addition, we handle DataType::Null explicitly in Array functions for the case when all arguments are empty arrays or nulls.

What changes are included in this PR?

We use the existing type_union_resolution type unification algorithm to handle array function arguments. This is a more robust and consistent approach as opposed to implementing custom coercion rules. Depending on the function we unify either the base types (because some functions work on arrays of arbitrary dimensions) or the array / element types. Then we coerce all arguments to the common type. We still need to handle the edge case when all arguments are nulls or empty arrays explicitly in the array function implementations.

Are these changes tested?

Yes, I have amended the array.slt tests accordingly.

Are there any user-facing changes?

Yes, there are some changes in array function semantics that can be seen in the updated tests:

  • More expressions involving empty arrays and null arrays now work out of the box
  • Fixed bugs in array_union and array_intersect (which are now also commutative)
  • Some type-incorrect expressions no longer compile (e.g. array_position(Array<Int64>, Utf8))

I also extended the public API of Signature with more helper methods and by default it coerces fixed-size lists to non-fixed-size lists because most array functions don't preserve the arity of lists (e.g. append, prepend, concat, union, intersection, etc.)

@github-actions github-actions bot added logical-expr Logical plan and expressions sqllogictest SQL Logic Tests (.slt) labels Mar 11, 2025
@joroKr21 joroKr21 force-pushed the bugfix/array-nulls branch 6 times, most recently from 5fe5bf3 to 57d105b Compare March 11, 2025 10:25
@joroKr21 joroKr21 marked this pull request as ready for review March 11, 2025 11:08
@alamb
Copy link
Contributor

alamb commented Mar 13, 2025

Thank you @joroKr21 !

Perhaps @Dandandan or @thinkharderdev or @avantgardnerio could help review this PR

Copy link
Contributor

@thinkharderdev thinkharderdev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems reasonable to me. I'm unclear on the coercion to variable length arrays though

ArrayFunctionArgument::Element,
ArrayFunctionArgument::Array,
],
array_coercion: Some(ListCoercion::FixedSizedListToList),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tbh I'm not entirely clear on how the coercion comes into play but an append to a fixed size list would still be a fixed size list right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From a type signature PoV yes, it's possible to infer that the return type should also be a fixed-size list with size n + 1 but the implementation itself is not able to handle it:

pub(crate) fn array_append_inner(args: &[ArrayRef]) -> Result<ArrayRef> {
    let [array, _] = take_function_args("array_append", args)?;

    match array.data_type() {
        DataType::LargeList(_) => general_append_and_prepend::<i64>(args, true),
        _ => general_append_and_prepend::<i32>(args, true),
    }
}

It's unclear to me how an implementation for FixedSizeList would look like and I would say probably it's out of scope for this PR, it should be separate.

@joroKr21 joroKr21 force-pushed the bugfix/array-nulls branch from 57d105b to 0d49f0d Compare March 13, 2025 16:02
@joroKr21
Copy link
Contributor Author

Also added a commit to fix array_sort for empty record batches

Copy link
Contributor

@thinkharderdev thinkharderdev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
logical-expr Logical plan and expressions sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants