Building indices removes user defined metadata #489

pavankumar-jamanjyothi-by · 2021-08-03T07:12:28Z

Description:

When building indices for an existing dataset via the build_dataset_indices functions, user-defined metadata (the "metadata" key in the deserialized by-metadata.json file) is removed. The reason is that the build_dataset_indices functions pass load_dataset_metadata=False to the DatasetFactory, e.g. here:
https://github.com/JDASoftwareGroup/kartothek/blob/master/kartothek/io/eager.py#L817-L822
This has the effect of actively removing user-defined metadata:
https://github.com/JDASoftwareGroup/kartothek/blob/master/kartothek/core/factory.py#L99-L100
so no metadata is written when the by-metadata.json file is written in the end.

The fix is to pass load_dataset_metadata=True to the DatasetFactory.
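
To illustrate the mechanism described above, here is a minimal, self-contained sketch. It is a toy model and not kartothek's code: the in-memory store, the key name `my_dataset.by-metadata.json`, and the helpers `load_dataset`, `build_indices` and `fresh_store` are all hypothetical names introduced for this example. It only shows why clearing the metadata on load leads to it being lost once the file is written back.

```python
import json

# Hypothetical key name for this sketch; the store is a plain dict.
METADATA_KEY = "my_dataset.by-metadata.json"


def load_dataset(store, load_dataset_metadata):
    """Toy stand-in for a dataset loader (assumption, not kartothek's implementation)."""
    dataset = json.loads(store[METADATA_KEY])
    if not load_dataset_metadata:
        # Mimics the behaviour described in the PR: user metadata is cleared on load.
        dataset["metadata"] = {}
    return dataset


def build_indices(store, columns, load_dataset_metadata):
    """Load the dataset, register the index columns, and write the file back."""
    dataset = load_dataset(store, load_dataset_metadata)
    dataset["indices"] = sorted(set(dataset.get("indices", [])) | set(columns))
    # Whatever was loaded (including cleared metadata) is what gets persisted.
    store[METADATA_KEY] = json.dumps(dataset)
    return dataset


def fresh_store():
    """Create a store holding one dataset with illustrative user metadata."""
    return {METADATA_KEY: json.dumps({"metadata": {"owner": "team-a"}, "indices": []})}


lossy = build_indices(fresh_store(), ["p"], load_dataset_metadata=False)
kept = build_indices(fresh_store(), ["p"], load_dataset_metadata=True)

print(lossy["metadata"])  # metadata is gone after the write-back
print(kept["metadata"])   # metadata survives the write-back
```

In this toy model the only difference between the two runs is the flag, which mirrors the one-line nature of the proposed fix.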

steffen-schroeder-by

Thanks. All in all, this looks good to me. There are two things I'd like to see in addition:

This is worth an entry in the changelog.
We should understand why load_dataset_metadata was always set to False, and what the implications are of setting it to True by default now (functionality, performance, ...). Maybe @fjetter has an idea.

pavankumar-jamanjyothi-by · 2021-08-03T09:18:31Z

kartothek/io_components/gc.py

@@ -10,7 +10,7 @@ def dispatch_files_to_gc(dataset_uuid, store_factory, chunk_size, factory):
        dataset_uuid=dataset_uuid,
        store=store_factory,
        factory=factory,
-        load_dataset_metadata=False,


@jochen-ott-by As this is gc, I think we should set it to False here.

jochen-ott-by · 2021-08-24T10:55:08Z

I think it does not really matter and we can use load_dataset_metadata=True everywhere.

jochen-ott-by · 2021-08-24T10:59:21Z

kartothek/io/testing/index.py

+        {"label": "cluster_1", "data": [("core", pd.DataFrame({"p": [1, 2]}))]},
+        {"label": "cluster_2", "data": [("core", pd.DataFrame({"p": [2, 3]}))]},
+    ]
+    with freeze_time(TIME_TO_FREEZE_ISO):


IIRC, we had some issues with the freeze_time approach in the past, which is why almost no test uses it nowadays. I think this test can be rewritten without freeze_time, simply by not checking a value for metadata["creation_time"]. This would not only drop the dependency on freezegun here, but also make the test clearer.

Makes sense. Removed freeze_time and pushed the changes.
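
A minimal sketch of the comparison suggested above, assuming the metadata is a plain dict with a "creation_time" key. The helper name `metadata_without_creation_time` and the sample values are hypothetical and only serve to show the idea; this is not the test that was pushed.

```python
def metadata_without_creation_time(metadata):
    """Return a copy of the metadata dict with the time-dependent key left out."""
    return {key: value for key, value in metadata.items() if key != "creation_time"}


# Illustrative values only: the timestamps differ, the user metadata does not.
metadata_before = {"creation_time": "2021-01-01T00:00:00", "owner": "team-a"}
metadata_after = {"creation_time": "2021-01-02T00:00:00", "owner": "team-a"}

# The check passes without freezing time, because creation_time is never compared.
assert metadata_without_creation_time(metadata_after) == metadata_without_creation_time(
    metadata_before
)
```

Comparing everything except the timestamp keeps the assertion focused on the user-defined metadata, which is what the fix is meant to preserve.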

fixed removing metadata while building indices.

23d1e95

pavankumar-jamanjyothi-by requested review from ilia-zaitcev-by, fjetter, lr4d, aaron-tal-by, florian-jetter-by, jakob-ernst-by, johan-olsson-by and steffen-schroeder-by August 3, 2021 07:12

steffen-schroeder-by requested changes Aug 3, 2021

pavankumar-jamanjyothi-by commented Aug 3, 2021

jochen-ott-by requested changes Aug 24, 2021

added tests to assert metadata is not lost after adding indices

0618558

pavankumar-jamanjyothi-by force-pushed the building-indices-removes-user-defined-metadata branch from 1c8ed8b to 0618558 Compare September 1, 2021 09:37

pavankumar-jamanjyothi-by requested review from jochen-ott-by and steffen-schroeder-by and removed request for florian-jetter-by September 1, 2021 09:43

johan-olsson-by removed their request for review April 20, 2022 11:47
