Support metadata columns (location, size, last_modified) in ListingTableProvider #15181

Open
wants to merge 14 commits into
base: main

phillipleblanc
Contributor

@phillipleblanc phillipleblanc commented Mar 12, 2025

Which issue does this PR close?

Rationale for this change

The ListingTableProvider in DataFusion provides an implementation of a TableProvider that organizes a collection of (potentially hive partitioned) files in an object store into a single table.

Hive partition columns are injected into the listing table schema even though they don't exist in the physical Parquet files. In the same way, this PR adds the ability to ask the ListingTable to inject metadata columns whose values come from the ObjectMeta provided by the object store crate. Consumers opt in to the metadata columns they want.

Note: This is related to the ongoing work in #13975 / #14057 / #14362 -- these new metadata columns could be marked as proper system/metadata columns as defined in those PRs - but I don't see that as a prerequisite for this change. Since this would be an opt-in from the consumer, automatic filtering out on a SELECT * doesn't seem required. We could consider automatically enabling these if we decide on proper support for system columns.

What changes are included in this PR?

I've added a new API on the ListingOptions struct. ListingOptions is passed to a ListingTableConfig, which in turn is passed to ListingTable::try_new.

    /// Set metadata columns on [`ListingOptions`] and returns self.
    ///
    /// "metadata columns" are columns that are computed from the `ObjectMeta` of the files from object store.
    ///
    /// Available metadata columns:
    /// - `location`: The full path to the object
    /// - `last_modified`: The last modified time
    /// - `size`: The size in bytes of the object
    ///
    /// For example, given the following files in object store:
    ///
    /// ```text
    /// /mnt/nyctaxi/tripdata01.parquet
    /// /mnt/nyctaxi/tripdata02.parquet
    /// /mnt/nyctaxi/tripdata03.parquet
    /// ```
    ///
    /// If the `last_modified` field in the `ObjectMeta` for `tripdata01.parquet` is `2024-01-01 12:00:00`,
    /// then the table schema will include a column named `last_modified` with the value `2024-01-01 12:00:00`
    /// for all rows read from `tripdata01.parquet`.
    ///
    /// | <other columns> | last_modified         |
    /// |-----------------|-----------------------|
    /// | ...             | 2024-01-01 12:00:00   |
    /// | ...             | 2024-01-02 15:30:00   |
    /// | ...             | 2024-01-03 09:15:00   |
    ///
    /// # Example
    /// ```
    /// # use std::sync::Arc;
    /// # use datafusion::datasource::{listing::ListingOptions, file_format::parquet::ParquetFormat};
    ///
    /// let listing_options = ListingOptions::new(Arc::new(
    ///     ParquetFormat::default()
    ///   ))
    ///   .with_metadata_cols(vec![MetadataColumn::LastModified]);
    ///
    /// assert_eq!(listing_options.metadata_cols, vec![MetadataColumn::LastModified]);
    /// ```
    pub fn with_metadata_cols(mut self, metadata_cols: Vec<MetadataColumn>) -> Self {
        self.metadata_cols = metadata_cols;
        self
    }

This controls which metadata columns the ListingTableProvider adds to the schema, similar to how partition columns are added.

The definition for MetadataColumn is a simple enum:

/// A metadata column that can be used to filter files
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum MetadataColumn {
    /// The location of the file in object store
    Location,
    /// The last modified timestamp of the file
    LastModified,
    /// The size of the file in bytes
    Size,
}
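To make the opt-in concrete, here is a minimal standalone sketch (no DataFusion dependency) of how the requested variants could map to the column names listed in the doc comment and be appended to a table's columns. The `name` method and the `table_column_names` helper are hypothetical, and the ordering (file columns, then partition columns, then metadata columns) is an assumption for illustration, not necessarily what the PR does.

```rust
/// Mirrors the enum from the PR description.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum MetadataColumn {
    Location,
    LastModified,
    Size,
}

impl MetadataColumn {
    /// Column name as listed in the `with_metadata_cols` doc comment.
    /// (Hypothetical accessor, named here only for the sketch.)
    pub fn name(&self) -> &'static str {
        match self {
            MetadataColumn::Location => "location",
            MetadataColumn::LastModified => "last_modified",
            MetadataColumn::Size => "size",
        }
    }
}

/// Hypothetical helper: build the table's column names by appending the
/// partition columns and then the requested metadata columns to the columns
/// found in the files. The ordering is an assumption.
pub fn table_column_names(
    file_cols: &[&str],
    partition_cols: &[&str],
    metadata_cols: &[MetadataColumn],
) -> Vec<String> {
    file_cols
        .iter()
        .chain(partition_cols.iter())
        .map(|c| c.to_string())
        .chain(metadata_cols.iter().map(|m| m.name().to_string()))
        .collect()
}
```

Passing an empty slice of metadata columns leaves the column list unchanged, which matches the opt-in behavior described in this PR.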

Filters applied directly to metadata columns can prune files that don't need to be read. For example, SELECT * FROM my_listing_table WHERE last_modified > '2025-03-10' will only scan files that were modified after '2025-03-10'.
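The pruning idea can be sketched without DataFusion as a filter over per-file metadata before any file is opened. `FileMeta` below is a simplified, hypothetical stand-in for the object store's `ObjectMeta` (timestamps are plain Unix seconds to keep the sketch dependency-free), and `prune_by_last_modified` is an illustrative name, not a function from this PR.

```rust
/// Simplified stand-in for the object store's `ObjectMeta`.
/// `last_modified` is a Unix timestamp in seconds (an assumption made
/// so the sketch needs no external crates).
#[derive(Debug, Clone, PartialEq)]
pub struct FileMeta {
    pub location: String,
    pub last_modified: i64,
    pub size: u64,
}

/// Keep only the files whose `last_modified` is strictly after `cutoff`,
/// mirroring a `WHERE last_modified > <cutoff>` predicate. Files that fail
/// the predicate are dropped from the scan without being read.
pub fn prune_by_last_modified(files: &[FileMeta], cutoff: i64) -> Vec<&FileMeta> {
    files.iter().filter(|f| f.last_modified > cutoff).collect()
}
```

Because the predicate depends only on metadata returned by the object store listing, it can be evaluated per file rather than per row.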

Are these changes tested?

Yes, I've added tests in several places, including new tests for functions I changed that previously had none.

Are there any user-facing changes?

The main change is the new with_metadata_cols API on the ListingOptions struct. This is not a breaking change: by default no metadata columns are added unless with_metadata_cols is explicitly called.

@github-actions github-actions bot added the core (Core DataFusion crate) and datasource (Changes to the datasource crate) labels on Mar 12, 2025
Comment on lines +464 to +467
    pub fn with_metadata_cols(mut self, metadata_cols: Vec<MetadataColumn>) -> Self {
        self.metadata_cols = metadata_cols;
        self
    }
Contributor Author

This is the main API change that consumers would use to enable these columns on the Listing Table. They aren't added by default.

@phillipleblanc phillipleblanc changed the title Support metadata columns (location, size, last_modified) in ListingTableProvider Support metadata columns (location, size, last_modified) in ListingTableProvider Mar 12, 2025