Enums and Categoricals with lexical ordering are not preserved by write_parquet when use_pyarrow=True #22586


Open · talawahtech opened this issue May 3, 2025 · 2 comments
Labels: bug (Something isn't working), needs triage (Awaiting prioritization by a maintainer), python (Related to Python Polars)

Comments


talawahtech commented May 3, 2025

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl
import pyarrow.parquet as pq
from io import BytesIO

with BytesIO() as f:
    df = pl.DataFrame([
        pl.Series('enum_col',['a','b','c'], pl.Enum(["a","b","c"])),
        pl.Series('cat_col',['a','b','c'], pl.Categorical(ordering="lexical")),
    ])
    
    df.write_parquet(f, use_pyarrow=True)
    print("Arrow Schema:\n", pq.ParquetFile(f).schema_arrow, "\n", sep="")
    
    df = pl.read_parquet(f)
    print("Polars Schema:\n", df.schema, sep="")

Log output

Arrow Schema:
enum_col: dictionary<values=string, indices=int32, ordered=0>
cat_col: dictionary<values=string, indices=int32, ordered=0>

Polars Schema:
Schema({'enum_col': Categorical(ordering='physical'), 'cat_col': Categorical(ordering='physical')})

Issue description

When writing a Parquet file using PyArrow (use_pyarrow=True), the Arrow field metadata that Polars uses to preserve type information for Enums and lexically ordered Categoricals is omitted, so the type information is lost.

Related: #2732, #13260

Expected behavior

Arrow Schema:
enum_col: dictionary<values=string, indices=int32, ordered=0>
  -- field metadata --
  _PL_ENUM_VALUES: '1;a1;b1;c'
cat_col: dictionary<values=string, indices=int32, ordered=0>
  -- field metadata --
  _PL_CATEGORICAL: 'lexical'

Polars Schema:
Schema({'enum_col': Enum(categories=['a', 'b', 'c']), 'cat_col': Categorical(ordering='lexical')})

Installed versions

--------Version info---------
Polars:              1.29.0
Index type:          UInt32
Platform:            Linux-5.10.235-.xxxxx.x86_64-x86_64-with-glibc2.26
Python:              3.12.8 | packaged by conda-forge | (main, Dec  5 2024, 14:24:40) [GCC 13.3.0]
LTS CPU:             False

----Optional dependencies----
Azure CLI            <not installed>
adbc_driver_manager  <not installed>
altair               5.5.0
azure.identity       <not installed>
boto3                1.35.79
cloudpickle          <not installed>
connectorx           0.4.0
deltalake            <not installed>
fastexcel            <not installed>
fsspec               <not installed>
gevent               <not installed>
google.auth          <not installed>
great_tables         <not installed>
matplotlib           <not installed>
numpy                2.2.0
openpyxl             <not installed>
pandas               <not installed>
polars_cloud         <not installed>
pyarrow              18.1.0
pydantic             <not installed>
pyiceberg            <not installed>
sqlalchemy           <not installed>
torch                <not installed>
xlsx2csv             <not installed>
xlsxwriter           <not installed>
talawahtech (Author) commented

@coastalwhite I think this should probably be tracked by #20089 as well.

talawahtech changed the title from “Enums and Categoricals with lexical ordering not preserved by write_parquet when use_pyarrow=True” to “Enums and Categoricals with lexical ordering are not preserved by write_parquet when use_pyarrow=True” on May 4, 2025

talawahtech commented May 4, 2025

Looking at the source code for write_parquet(), it seems the DataFrame gets converted to a PyArrow table, which is used to populate a dict, which is then converted back to a pa.Table. It seems this is intended to handle cases where the column name is None?

            tbl = self.to_arrow()
            data = {}

            for i, column in enumerate(tbl):
                # extract the name before casting
                name = f"column_{i}" if column._name is None else column._name

                data[name] = column

            tbl = pa.table(data)

That is where the field metadata gets dropped. Using the original pa.Table directly preserves the metadata, e.g.:

import polars as pl
import pyarrow.parquet as pq
from io import BytesIO

with BytesIO() as f:
    df = pl.DataFrame([
        pl.Series('enum_col',['a','b','c'], pl.Enum(["a","b","c"])),
        pl.Series('cat_col',['a','b','c'], pl.Categorical(ordering="lexical")),
    ])
    
    table = df.to_arrow()
    pq.write_table(table, f)
    print("Arrow Schema:\n", pq.ParquetFile(f).schema_arrow, "\n", sep="")
    
    df = pl.read_parquet(f)
    print("Polars Schema:\n", df.schema, sep="")

This seems to work as expected. I'm not sure whether there are other unintended consequences of using the output of df.to_arrow() directly that the dict/loop approach avoids.
