feat: Change SQL-Explode/UNNEST to Dataframe.explode method #22546

Open
wants to merge 1 commit into
base: main
from

Conversation

Felix-Blom

@Felix-Blom Felix-Blom commented May 1, 2025

Closes: #22545

Summary

Currently, UNNEST (which corresponds to explode in Polars) in the SQLContext is implemented with the Expr.explode() method, which does not preserve the row-wise mapping between the exploded list and the other columns in the DataFrame. As a result, selecting UNNEST of a list column alongside another column (e.g., sort_key) does not yield the expected exploded shape: the non-list column keeps its original length, and the query fails with a shape mismatch error.

Example

import polars as pl

df = pl.DataFrame(
    {
        "list_long": [[1, 2, 3], [4, 5, 6]],
        "sort_key": [2, 1],
    }
)

print(df.sql("SELECT UNNEST(list_long), sort_key FROM self"))

Old behaviour

polars.exceptions.ShapeError: Series length 2 doesn't match the DataFrame height of 6

New behaviour


shape: (6, 2)
┌───────────┬──────────┐
│ list_long ┆ sort_key │
│ ---       ┆ ---      │
│ i64       ┆ i64      │
╞═══════════╪══════════╡
│ 1         ┆ 2        │
│ 2         ┆ 2        │
│ 3         ┆ 2        │
│ 4         ┆ 1        │
│ 5         ┆ 1        │
│ 6         ┆ 1        │
└───────────┴──────────┘

Solution

Changed the SQL UNNEST implementation to use DataFrame.explode() instead of the Expr.explode()/List.explode() method.

@github-actions github-actions bot added enhancement New feature or an improvement of an existing feature python Related to Python Polars rust Related to Rust Polars labels May 1, 2025

codecov bot commented May 1, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 80.89%. Comparing base (716c902) to head (d1a0fbc).
Report is 10 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #22546      +/-   ##
==========================================
- Coverage   80.93%   80.89%   -0.04%     
==========================================
  Files        1651     1656       +5     
  Lines      233014   234059    +1045     
  Branches     2752     2752              
==========================================
+ Hits       188599   189352     +753     
- Misses      43754    44047     +293     
+ Partials      661      660       -1     


@Felix-Blom Felix-Blom force-pushed the main branch 6 times, most recently from 37b7864 to 56abf91 Compare May 3, 2025 18:17
feat: Change SQL-Explode to Dataframe.explode method
@Felix-Blom Felix-Blom changed the title feat: Change SQL-Explode to Dataframe.explode method feat: Change SQL-Explode/UNNEST to Dataframe.explode method May 7, 2025
@Felix-Blom
Author

@alexander-beedie Based on your previous commits, this seems like something right up your alley. I'm still learning, so I wanted to lay out a couple of my assumptions for you (and others) to check:

Regarding the ORDER BY tests in Python: the ordering now happens after the EXPLODE/UNNEST step, which seems logically correct to me. However, I'm not entirely confident about the downstream implications this might have. Does anything come to mind? The previous NULL checks now seem redundant to me, and the new behaviour seems more in line with what I would expect to happen, but maybe my assumptions are wrong.

I did test some of my assumptions in other SQL tools, and the results matched my expectations!

import duckdb
import pandas as pd

# In-memory DuckDB database; the queries find `df` in the local Python scope.
conn = duckdb.connect(database=":memory:")
df = pd.DataFrame.from_dict({"list_long": [[1, 2, 3], [4, 5, 6]], "sort_key": [2, 1]})

# UNNEST alone, UNNEST next to a scalar column, and UNNEST followed by ORDER BY.
results = conn.sql("SELECT UNNEST(list_long) FROM df").df()
results2 = conn.sql("SELECT UNNEST(list_long), sort_key FROM df").df()
results3 = conn.sql(
    "SELECT UNNEST(list_long), sort_key FROM df ORDER BY sort_key"
).df()
print(results3)

Curious to hear your thoughts! Thanks in advance!

Development

Successfully merging this pull request may close these issues.

Change SQL unnest to map to Dataframe.explode()
1 participant