Building indices removes user defined metadata #489

Open
wants to merge 2 commits into
base: master
from

pavankumar-jamanjyothi-by

Description:

When building indices for an existing dataset via the build_dataset_indices functions, user-defined metadata (the "metadata" key in the deserialized by-metadata.json file) is removed. The reason is that these functions pass load_dataset_metadata=False to the DatasetFactory, e.g. here:
https://github.com/JDASoftwareGroup/kartothek/blob/master/kartothek/io/eager.py#L817-L822
This has the effect of actively removing user-defined metadata:
https://github.com/JDASoftwareGroup/kartothek/blob/master/kartothek/core/factory.py#L99-L100
so no metadata is written when the by-metadata.json file is written in the end.

The fix is to pass load_dataset_metadata=True to the DatasetFactory.
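
To make the mechanism concrete, here is a minimal, self-contained sketch of the load-modify-write cycle described above. It does not use kartothek: `load_dataset`, `build_indices`, the dict-based store, and the key name are all hypothetical stand-ins, and the real factory code may differ in detail.

```python
import json

# Hypothetical key name; the real storage key layout is not shown in this PR.
KEY = "my_dataset.by-metadata.json"


def make_store():
    """Return a dict-based stand-in for a key-value store holding one dataset."""
    payload = {
        "dataset_uuid": "my_dataset",
        "metadata": {"owner": "team-a"},  # user-defined metadata
        "indices": {},
    }
    return {KEY: json.dumps(payload)}


def load_dataset(store, key, load_dataset_metadata=True):
    """Hypothetical stand-in for the factory's loading step."""
    dataset = json.loads(store[key])
    if not load_dataset_metadata:
        # Assumption: mimics the behaviour described above, where the
        # user-defined metadata is dropped from the in-memory object.
        dataset.pop("metadata", None)
    return dataset


def build_indices(store, key, columns, load_dataset_metadata):
    """Load the dataset, attach placeholder indices, and write it back."""
    dataset = load_dataset(store, key, load_dataset_metadata)
    dataset["indices"] = {col: f"{col}.index" for col in columns}
    # Whatever was dropped at load time is now missing from the stored file.
    store[key] = json.dumps(dataset)
    return dataset


broken = make_store()
build_indices(broken, KEY, ["p"], load_dataset_metadata=False)

fixed = make_store()
build_indices(fixed, KEY, ["p"], load_dataset_metadata=True)

print("with False:", json.loads(broken[KEY]).get("metadata"))
print("with True: ", json.loads(fixed[KEY]).get("metadata"))
```

The point of the sketch is that the loss happens on write-back: the flag only affects what is loaded, but because the loaded object is serialized again, anything skipped at load time disappears from the stored file.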

Contributor

@steffen-schroeder-by steffen-schroeder-by left a comment

Thanks, all in all this looks good to me. There are two things I'd like to see in addition:

  1. This is worth an entry in the changelog.
  2. We should understand why load_dataset_metadata was always set to False, and what the implications (functionality, performance, ...) are of setting it to True by default now. Maybe @fjetter has an idea.

@@ -10,7 +10,7 @@ def dispatch_files_to_gc(dataset_uuid, store_factory, chunk_size, factory):
dataset_uuid=dataset_uuid,
store=store_factory,
factory=factory,
load_dataset_metadata=False,

@jochen-ott-by As this is gc, I think we should set it to False here.

Contributor

I think it does not really matter and we can use load_dataset_metadata=True everywhere.

{"label": "cluster_1", "data": [("core", pd.DataFrame({"p": [1, 2]}))]},
{"label": "cluster_2", "data": [("core", pd.DataFrame({"p": [2, 3]}))]},
]
with freeze_time(TIME_TO_FREEZE_ISO):
Contributor

IIRC, we had some issues with the freeze_time approach in the past, which is why almost no test nowadays uses it. I think this test can be re-written without using freeze_time, simply by not checking a value for metadata["creation_time"]. This would not only drop the dependency on freezegun here, but also make the test clearer.
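
One way to do this, sketched here under assumptions rather than taken from the PR, is to compare metadata with the time-dependent key filtered out. The helper name `without_volatile_keys` and the sample dicts are hypothetical; only the `creation_time` key comes from the comment above.

```python
def without_volatile_keys(metadata, volatile=("creation_time",)):
    """Return a copy of the metadata without keys whose values depend on the clock."""
    return {key: value for key, value in metadata.items() if key not in volatile}


# Hypothetical metadata before and after an index build; the timestamps
# differ, which is why a direct equality check would need a frozen clock.
before = {"creation_time": "2021-09-01T09:37:00", "owner": "team-a"}
after = {"creation_time": "2021-09-01T09:38:12", "owner": "team-a"}

# Check that the key exists without pinning its value ...
assert "creation_time" in after
# ... and that everything user-defined survived unchanged.
assert without_volatile_keys(before) == without_volatile_keys(after)
```

This keeps the assertion focused on what the test is about (user-defined metadata surviving the index build) and needs no clock patching.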

Makes sense. Removed freeze_time and pushed the changes.

@pavankumar-jamanjyothi-by pavankumar-jamanjyothi-by force-pushed the building-indices-removes-user-defined-metadata branch from 1c8ed8b to 0618558 Compare September 1, 2021 09:37
@johan-olsson-by johan-olsson-by removed their request for review April 20, 2022 11:47

3 participants