Support metadata columns (location, size, last_modified) in ListingTableProvider #15181
Open: phillipleblanc wants to merge 14 commits into apache:main from phillipleblanc:phillip/250312-listing-table-metadata-cols
… ListingTableProvider (apache#74)
* Initial work on metadata columns
* Metadata filtering working
* Working on plumbing to file scan config
* wip
* All wired up
* Working!
* Use MetadataColumn enum
* Add integration tests for metadata selection + pushdown filtering
phillipleblanc commented Mar 12, 2025
Comment on lines +464 to +467
    pub fn with_metadata_cols(mut self, metadata_cols: Vec<MetadataColumn>) -> Self {
        self.metadata_cols = metadata_cols;
        self
    }
This is the main API change that consumers would use to enable these columns on the Listing Table. They aren't added by default.
Which issue does this PR close?
Closes #15173 (Support metadata columns (location, size, last_modified) in ListingTableProvider).
Also potentially solves #8906 (No efficient way to load a subset of files from partitioned table).
Rationale for this change
The ListingTableProvider in DataFusion provides an implementation of a TableProvider that organizes a collection of (potentially hive partitioned) files in an object store into a single table.
Hive partition columns are injected into the listing table schema even though they don't actually exist in the physical Parquet files. In the same way, this PR adds the ability to ask the ListingTable to inject metadata columns whose values come from the ObjectMeta provided by the object_store crate. Consumers opt in to the metadata columns they want.
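As a rough illustration of where the values come from, the sketch below repeats one file's metadata for every row read from that file. `FileMeta` is a hypothetical stand-in for the object_store crate's ObjectMeta (only the three relevant fields, with simplified types), and `MetadataValues` and `metadata_values` are illustrative names, not code from this PR.

```rust
/// Hypothetical stand-in for object_store's `ObjectMeta`, reduced to the
/// three fields that back the metadata columns (types simplified).
#[derive(Debug, Clone)]
pub struct FileMeta {
    pub location: String,
    pub size: u64,
    /// Seconds since the Unix epoch; the real field is a timestamp type.
    pub last_modified: i64,
}

/// Column values for the rows read from a single file (illustrative type).
#[derive(Debug, PartialEq)]
pub struct MetadataValues {
    pub location: Vec<String>,
    pub size: Vec<u64>,
    pub last_modified: Vec<i64>,
}

/// Repeat the file-level metadata once per row, the same way a partition
/// value is constant for every row of a file.
pub fn metadata_values(meta: &FileMeta, num_rows: usize) -> MetadataValues {
    MetadataValues {
        location: vec![meta.location.clone(); num_rows],
        size: vec![meta.size; num_rows],
        last_modified: vec![meta.last_modified; num_rows],
    }
}
```

The point of the sketch is only that the values are per-file constants taken from the listing, so nothing has to be read from the file contents to produce them.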
Note: This is related to the ongoing work in #13975 / #14057 / #14362 -- these new metadata columns could be marked as proper system/metadata columns as defined in those PRs - but I don't see that as a prerequisite for this change. Since this would be an opt-in from the consumer, automatic filtering out on a SELECT * doesn't seem required. We could consider automatically enabling these if we decide on proper support for system columns.
What changes are included in this PR?
I've added a new API on the ListingOptions struct that is passed to a ListingTableConfig which is passed to ListingTable::try_new.
That controls whether the ListingTableProvider will add the metadata columns to the schema, similar to how partition columns are added.
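A minimal, self-contained sketch of the opt-in follows. The `with_metadata_cols` signature matches the one in the diff; everything else is an assumption for illustration: the `MetadataColumn` variant names are inferred from the column names in the PR title, `ListingOptions` is a cut-down stand-in for the real struct, and `table_column_names` (including the order file columns, partition columns, metadata columns) is a hypothetical helper.

```rust
/// Assumed shape of the enum; variant names are inferred from the column
/// names in the PR title, not copied from the diff.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MetadataColumn {
    Location,
    Size,
    LastModified,
}

impl MetadataColumn {
    /// Column name as it would appear in the table schema.
    pub fn name(&self) -> &'static str {
        match self {
            MetadataColumn::Location => "location",
            MetadataColumn::Size => "size",
            MetadataColumn::LastModified => "last_modified",
        }
    }
}

/// Simplified stand-in for `ListingOptions`; the real struct has more
/// fields and stores partition columns together with their data types.
#[derive(Debug, Default, Clone)]
pub struct ListingOptions {
    pub table_partition_cols: Vec<String>,
    pub metadata_cols: Vec<MetadataColumn>,
}

impl ListingOptions {
    /// Same signature as the builder method added in this PR.
    pub fn with_metadata_cols(mut self, metadata_cols: Vec<MetadataColumn>) -> Self {
        self.metadata_cols = metadata_cols;
        self
    }
}

/// Hypothetical helper: file columns first, then partition columns, then
/// any requested metadata columns (the ordering is an assumption).
pub fn table_column_names(file_cols: &[&str], options: &ListingOptions) -> Vec<String> {
    let mut names: Vec<String> = file_cols.iter().map(|c| c.to_string()).collect();
    names.extend(options.table_partition_cols.iter().cloned());
    names.extend(options.metadata_cols.iter().map(|c| c.name().to_string()));
    names
}
```

With default options the helper returns only the file and partition columns. Calling `with_metadata_cols(vec![MetadataColumn::Location, MetadataColumn::LastModified])` appends `location` and `last_modified`, which mirrors the opt-in behaviour described here.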
The definition for MetadataColumn is a simple enum.
Filters on metadata columns can be used directly to prune files that don't need to be read. For example,
SELECT * FROM my_listing_table WHERE last_modified > '2025-03-10'
will only scan files that were modified after '2025-03-10'.
Are these changes tested?
Yes, I've added tests in several places, including new tests for functions I changed that previously had none.
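For readers who want a concrete picture of the pruning behaviour being tested, here is a small sketch of filtering a file listing on metadata before any file is opened. It is not the PR's implementation: `ListedFile`, `MetadataFilter` and `prune_files` are hypothetical names, and timestamps are simplified to integer seconds.

```rust
/// Hypothetical per-file metadata as returned by a listing call.
#[derive(Debug, Clone, PartialEq)]
pub struct ListedFile {
    pub location: String,
    pub size: u64,
    /// Seconds since the Unix epoch (simplified).
    pub last_modified: i64,
}

/// Illustrative subset of predicates that refer only to metadata columns.
pub enum MetadataFilter {
    LastModifiedAfter(i64),
    SizeAtLeast(u64),
    LocationStartsWith(String),
}

impl MetadataFilter {
    /// Evaluate the predicate against a single file's metadata.
    pub fn matches(&self, file: &ListedFile) -> bool {
        match self {
            MetadataFilter::LastModifiedAfter(ts) => file.last_modified > *ts,
            MetadataFilter::SizeAtLeast(min) => file.size >= *min,
            MetadataFilter::LocationStartsWith(prefix) => file.location.starts_with(prefix.as_str()),
        }
    }
}

/// Keep only the files that satisfy every filter; the rest are dropped
/// from the scan without being read.
pub fn prune_files(files: Vec<ListedFile>, filters: &[MetadataFilter]) -> Vec<ListedFile> {
    files
        .into_iter()
        .filter(|file| filters.iter().all(|f| f.matches(file)))
        .collect()
}
```

A filter such as `MetadataFilter::LastModifiedAfter(cutoff)` plays the role of `WHERE last_modified > '2025-03-10'`: files at or before the cutoff never reach the scan.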
Are there any user-facing changes?
The main change is adding the with_metadata_cols API on the ListingOptions struct. This is not a breaking change: no metadata columns are added unless with_metadata_cols is explicitly called.