feat(python): Add cast_options parameter to control type casting in scan_parquet #22617

Merged 1 commit on May 9, 2025

Conversation

nameexhaustion (Collaborator) commented May 5, 2025

Introduces a cast_options parameter that lets users scan a list of files with differing schemas. It will later be used by Iceberg / Delta scans.

Error messages now also hint at this option, e.g.:

SchemaError: data type mismatch for column literal: incoming: Datetime(Nanoseconds, Some("Australia/Sydney")) != target: Datetime(Milliseconds, Some("Europe/Amsterdam")), hint: pass cast_options=pl.ScanCastOptions(datetime_cast='convert-timezone')
SchemaError: data type mismatch for column literal: incoming: Float64 != target: Float32, hint: pass cast_options=pl.ScanCastOptions(float_cast='downcast')
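
To illustrate how a hint like the ones above could be derived, here is a minimal pure-Python sketch. It is a hypothetical model, not the Polars implementation: the function name check_cast, the range table, and the reading of "lossless" as range containment are assumptions made for illustration. Only the option names and values (integer_cast, float_cast, "upcast", "downcast", "forbid") come from this PR.

```python
from __future__ import annotations

# Value ranges used to decide whether an integer cast is lossless (assumption:
# "lossless" means every incoming value fits in the target type).
INT_RANGES = {
    "Int8": (-(2**7), 2**7 - 1),
    "Int16": (-(2**15), 2**15 - 1),
    "Int32": (-(2**31), 2**31 - 1),
    "Int64": (-(2**63), 2**63 - 1),
    "UInt8": (0, 2**8 - 1),
    "UInt16": (0, 2**16 - 1),
    "UInt32": (0, 2**32 - 1),
    "UInt64": (0, 2**64 - 1),
}
FLOAT_BITS = {"Float32": 32, "Float64": 64}


def check_cast(
    incoming: str,
    target: str,
    *,
    integer_cast: str = "forbid",
    float_cast: str | list[str] = "forbid",
) -> bool:
    """Hypothetical helper: return True if the cast is allowed, else raise with a hint."""
    if incoming == target:
        return True

    # float_cast may be a single mode or a list of modes, as in the PR examples.
    float_modes = {float_cast} if isinstance(float_cast, str) else set(float_cast)
    hint = None

    if incoming in INT_RANGES and target in INT_RANGES:
        (lo, hi), (tlo, thi) = INT_RANGES[incoming], INT_RANGES[target]
        if tlo <= lo and hi <= thi:  # target range contains incoming range
            if integer_cast == "upcast":
                return True
            hint = "integer_cast='upcast'"
    elif incoming in FLOAT_BITS and target in FLOAT_BITS:
        needed = "upcast" if FLOAT_BITS[incoming] < FLOAT_BITS[target] else "downcast"
        if needed in float_modes:
            return True
        hint = f"float_cast='{needed}'"

    message = f"data type mismatch: incoming: {incoming} != target: {target}"
    if hint is not None:
        message += f", hint: pass cast_options=pl.ScanCastOptions({hint})"
    raise TypeError(message)
```

With the defaults, check_cast("Float64", "Float32") raises an error whose message ends with the float_cast='downcast' hint, while passing float_cast=["upcast", "downcast"] lets the same cast through.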

API Examples

pl.scan_parquet(
    path,
    cast_options=pl.ScanCastOptions(datetime_cast="nanosecond-downcast"),
)

# Full set of options available from this PR:
pl.scan_parquet(
    path,
    cast_options=pl.ScanCastOptions(
        integer_cast="upcast",              # Allow lossless integer->integer casting
        float_cast=["upcast", "downcast"],  # Allow Float64<->Float32 (both ways)

        # datetime_cast options:
        # * nanosecond-downcast: Equivalent of pyarrow's `coerce_int96_timestamp_unit`
        # * convert-timezone: Allow casting to convert timezone
        datetime_cast=["nanosecond-downcast", "convert-timezone"],

        missing_struct_fields="insert",     # Inserts missing struct fields. Also accepts "raise" to instead error.
        extra_struct_fields="ignore",       # Ignores extra struct fields. Also accepts "raise" to instead error.
    ),
)

# Default options:
pl.scan_parquet(
    path,
    cast_options=pl.ScanCastOptions(
        integer_cast="forbid",
        float_cast="forbid",
        datetime_cast="forbid",
        missing_struct_fields="raise",
        extra_struct_fields="raise",
    ),
)

# Ideas for future parameters:
# * datetime_cast: ["upcast-strict", "upcast-non-strict"]
#   * Allows casting to higher precision, where strict will error if the timestamp goes out of range
# * integer_cast: ["downcast-strict", "downcast-overflowing", "downcast-non-strict"]
#   * Allows casting to smaller types. Not sure if we want these any time soon.
#
# Or, maybe as a separate parameter:
# * integer_downcast: Literal["strict", "non-strict", "overflowing", "forbid"] = "forbid"
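
The struct-field options above can be pictured with a small pure-Python sketch. This is a hypothetical model operating on plain dicts, not Polars code: the function name reconcile_struct is invented for illustration, and filling inserted fields with None (null) is an assumption. The option names and their "insert" / "ignore" / "raise" values are taken from the examples above.

```python
from __future__ import annotations


def reconcile_struct(
    row: dict,
    target_fields: list[str],
    *,
    missing_struct_fields: str = "raise",
    extra_struct_fields: str = "raise",
) -> dict:
    """Hypothetical helper: align one struct value (a dict) to the target field list."""
    missing = [name for name in target_fields if name not in row]
    extra = [name for name in row if name not in target_fields]

    if missing and missing_struct_fields != "insert":
        raise ValueError(f"missing struct fields: {missing}")
    if extra and extra_struct_fields != "ignore":
        raise ValueError(f"extra struct fields: {extra}")

    # Keep the target field order; missing fields become None (assumed null fill),
    # and extra fields are dropped.
    return {name: row.get(name) for name in target_fields}
```

For example, reconcile_struct({"a": 1, "c": 3}, ["a", "b"], missing_struct_fields="insert", extra_struct_fields="ignore") returns {"a": 1, "b": None}, whereas the default options raise on the same input.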

Pinging @alexander-beedie and @MarcoGorelli for review on the Python API

@github-actions github-actions bot added enhancement New feature or an improvement of an existing feature python Related to Python Polars rust Related to Rust Polars labels May 5, 2025
@nameexhaustion nameexhaustion changed the title feat: Add parameter to control type casting in scan_parquet feat(python): Add parameter to control type casting in scan_parquet May 5, 2025
@nameexhaustion nameexhaustion removed the rust Related to Rust Polars label May 5, 2025
@nameexhaustion nameexhaustion force-pushed the parquet-scan-type-cast branch from 93a2844 to 9401600 Compare May 6, 2025 12:17
codecov bot commented May 6, 2025

Codecov Report

Attention: Patch coverage is 87.12329% with 47 lines in your changes missing coverage. Please review.

Project coverage is 80.99%. Comparing base (e71569a) to head (8be5a28).
Report is 13 commits behind head on main.

Files with missing lines | Patch % | Lines
crates/polars-python/src/conversion/mod.rs | 72.07% | 31 Missing ⚠️
...ources/multi_file_reader/extra_ops/cast_columns.rs | 90.80% | 16 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #22617      +/-   ##
==========================================
+ Coverage   80.98%   80.99%   +0.01%     
==========================================
  Files        1661     1662       +1     
  Lines      234869   235182     +313     
  Branches     2773     2774       +1     
==========================================
+ Hits       190198   190477     +279     
- Misses      44004    44037      +33     
- Partials      667      668       +1     

@nameexhaustion nameexhaustion changed the title feat(python): Add parameter to control type casting in scan_parquet feat(python): Add cast_options parameter to control type casting in scan_parquet May 6, 2025
@nameexhaustion nameexhaustion marked this pull request as ready for review May 6, 2025 14:55
@nameexhaustion nameexhaustion force-pushed the parquet-scan-type-cast branch from eaa2552 to a97a8c9 Compare May 7, 2025 04:55
@nameexhaustion nameexhaustion marked this pull request as draft May 7, 2025 12:36
@nameexhaustion nameexhaustion force-pushed the parquet-scan-type-cast branch from a97a8c9 to 3b5d2a5 Compare May 7, 2025 12:36
@nameexhaustion nameexhaustion force-pushed the parquet-scan-type-cast branch from 3b5d2a5 to 8be5a28 Compare May 7, 2025 12:36
@@ -13,7 +13,7 @@ use super::*;
 // (Major, Minor)
 // Add a field -> increment minor
 // Remove or modify a field -> increment major and reset minor
-pub static DSL_VERSION: (u16, u16) = (2, 0);
+pub static DSL_VERSION: (u16, u16) = (3, 0);
Collaborator Author

Bumped major DSL version as I've changed the CastColumnsPolicy (part of unified_scan_args).
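
The versioning rule stated in the diff comments (add a field: increment minor; remove or modify a field: increment major and reset minor) can be written out as a short sketch. The function name bump_dsl_version and the change labels "add" / "remove" / "modify" are hypothetical and exist only to illustrate the rule; they are not part of the Polars codebase.

```python
from __future__ import annotations


def bump_dsl_version(version: tuple[int, int], change: str) -> tuple[int, int]:
    """Hypothetical helper applying the (Major, Minor) rule from the diff comments."""
    major, minor = version
    if change == "add":
        # Adding a field only increments the minor version.
        return (major, minor + 1)
    if change in ("remove", "modify"):
        # Removing or modifying a field increments major and resets minor.
        return (major + 1, 0)
    raise ValueError(f"unknown change kind: {change!r}")
```

Under this rule, modifying CastColumnsPolicy takes the version from (2, 0) to (3, 0), which matches the change in this PR.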

@nameexhaustion nameexhaustion marked this pull request as ready for review May 7, 2025 12:48
@ritchie46 ritchie46 merged commit 50185f4 into pola-rs:main May 9, 2025
30 checks passed
teotwaki (Contributor) commented May 9, 2025

@ritchie46 @nameexhaustion This appears to have broken main again for me.

cargo init --bin foobar
cd foobar
cargo add --git https://github.com/pola-rs/polars.git polars -F lazy
cargo build

This results in:

   Compiling polars-core v0.47.1 (https://github.com/pola-rs/polars.git#da27decd)
   Compiling polars-ops v0.47.1 (https://github.com/pola-rs/polars.git#da27decd)
   Compiling polars-plan v0.47.1 (https://github.com/pola-rs/polars.git#da27decd)
   Compiling polars-stream v0.47.1 (https://github.com/pola-rs/polars.git#da27decd)
   Compiling polars-lazy v0.47.1 (https://github.com/pola-rs/polars.git#da27decd)
   Compiling polars v0.47.1 (https://github.com/pola-rs/polars.git#da27decd)
   Compiling polars-error v0.47.1 (https://github.com/pola-rs/polars.git#da27decd)
   Compiling polars-row v0.47.1 (https://github.com/pola-rs/polars.git#da27decd)
   Compiling polars-parquet v0.47.1 (https://github.com/pola-rs/polars.git#da27decd)
   Compiling polars-time v0.47.1 (https://github.com/pola-rs/polars.git#da27decd)
   Compiling polars-io v0.47.1 (https://github.com/pola-rs/polars.git#da27decd)
error[E0599]: no variant or associated item named `StringExpr` found for enum `dsl::function_expr::FunctionExpr` in the current scope
   --> /Users/slau/.cargo/git/checkouts/polars-54176cdfb679e240/da27dec/crates/polars-plan/src/plans/aexpr/properties.rs:195:43
    |
195 | ...                   FunctionExpr::StringExpr(StringFunction::Strptime(_, strptime_options)),
    |                                     ^^^^^^^^^^ variant or associated item not found in `FunctionExpr`
    |
   ::: /Users/slau/.cargo/git/checkouts/polars-54176cdfb679e240/da27dec/crates/polars-plan/src/dsl/function_expr/mod.rs:123:1
    |
123 | pub enum FunctionExpr {
    | --------------------- variant or associated item `StringExpr` not found for this enum

error[E0433]: failed to resolve: use of undeclared type `StringFunction`
   --> /Users/slau/.cargo/git/checkouts/polars-54176cdfb679e240/da27dec/crates/polars-plan/src/plans/aexpr/properties.rs:195:54
    |
195 | ...                   FunctionExpr::StringExpr(StringFunction::Strptime(_, strptime_options)),
    |                                                ^^^^^^^^^^^^^^
    |                                                |
    |                                                use of undeclared type `StringFunction`
    |                                                help: an enum with a similar name exists: `StatsFunction`

Some errors have detailed explanations: E0433, E0599.
For more information about an error, try `rustc --explain E0433`.
error: could not compile `polars-plan` (lib) due to 2 previous errors
warning: build failed, waiting for other jobs to finish...

This seems to indicate that the strings feature is not enabled in this build, or that this code should be gated on that feature.

Labels: enhancement, python
Linked issue (may be closed by this pull request): Scan cast options API draft
3 participants