Building indices removes user defined metadata #489
Thanks, all in all this looks good to me. There are two things I'd like to see in addition:
- This is worth an entry in the changelog.
- We should understand why load_dataset_metadata was always set to False, and what implications (functionality, performance, ...) it has to set it to True as the default now. Maybe @fjetter has an idea.
@@ -10,7 +10,7 @@ def dispatch_files_to_gc(dataset_uuid, store_factory, chunk_size, factory):
dataset_uuid=dataset_uuid,
store=store_factory,
factory=factory,
load_dataset_metadata=False,
@jochen-ott-by As this is gc, I think we should set it to False here.
I think it does not really matter and we can use load_dataset_metadata=True everywhere.
kartothek/io/testing/index.py
Outdated
{"label": "cluster_1", "data": [("core", pd.DataFrame({"p": [1, 2]}))]},
{"label": "cluster_2", "data": [("core", pd.DataFrame({"p": [2, 3]}))]},
]
with freeze_time(TIME_TO_FREEZE_ISO):
IIRC, we had some issues with the freeze_time approach in the past, which is why almost no test uses it nowadays. I think this test can be rewritten without freeze_time, simply by not checking a value for metadata["creation_time"]. This would not only drop the dependency on freezegun here, but also make the test clearer.
Makes sense. Removed freeze_time and pushed the changes.
1c8ed8b to 0618558
Description:
When building indices for an existing dataset via the build_dataset_indices methods, user-defined metadata (the "metadata" key in the deserialized by-metadata.json file) is removed. The reason is that the build_dataset_indices functions pass load_dataset_metadata=False to the DatasetFactory, e.g. here: https://github.com/JDASoftwareGroup/kartothek/blob/master/kartothek/io/eager.py#L817-L822
This has the effect of actively removing user-defined metadata:
https://github.com/JDASoftwareGroup/kartothek/blob/master/kartothek/core/factory.py#L99-L100
so no metadata is written when the by-metadata.json file is written at the end.
The fix is to pass load_dataset_metadata=True to the DatasetFactory.
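A minimal sketch of the mechanism described above, using a toy stand-in rather than kartothek's real classes. ToyDatasetFactory, build_indices, demo and the in-memory dict store are hypothetical names invented for illustration; only the load_dataset_metadata flag and the "metadata" key come from the description.

```python
import json


class ToyDatasetFactory:
    """Hypothetical stand-in for a dataset factory; not kartothek's API."""

    def __init__(self, store, key, load_dataset_metadata=True):
        self.store = store
        self.key = key
        self.dataset = json.loads(store[key])
        if not load_dataset_metadata:
            # Mirrors the behaviour described above: user-defined metadata
            # is actively dropped from the loaded dataset.
            self.dataset["metadata"] = {}


def build_indices(store, key, columns, load_dataset_metadata):
    """Toy index build: load the dataset, record indices, write it back."""
    factory = ToyDatasetFactory(store, key, load_dataset_metadata)
    dataset = factory.dataset
    dataset["indices"] = sorted(set(dataset.get("indices", [])) | set(columns))
    # Whatever metadata survived loading is what gets persisted.
    store[key] = json.dumps(dataset)
    return dataset


def demo(load_dataset_metadata):
    """Run a toy index build and return the metadata found in the store."""
    store = {"ds": json.dumps({"metadata": {"owner": "me"}, "indices": []})}
    build_indices(store, "ds", ["p"], load_dataset_metadata)
    return json.loads(store["ds"])["metadata"]
```

In this toy model, demo(False) loses the user-defined entry while demo(True) keeps it, which is the difference the fix relies on.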