Enums and Categoricals with lexical ordering are not preserved by write_parquet when use_pyarrow=True #22586


Open · talawahtech opened this issue May 3, 2025 · 2 comments
Labels: bug (Something isn't working), needs triage (Awaiting prioritization by a maintainer), python (Related to Python Polars)

Comments


talawahtech commented May 3, 2025

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl
import pyarrow.parquet as pq
from io import BytesIO

with BytesIO() as f:
    df = pl.DataFrame([
        pl.Series('enum_col',['a','b','c'], pl.Enum(["a","b","c"])),
        pl.Series('cat_col',['a','b','c'], pl.Categorical(ordering="lexical")),
    ])
    
    df.write_parquet(f, use_pyarrow=True)
    print("Arrow Schema:\n", pq.ParquetFile(f).schema_arrow, "\n", sep="")
    
    df = pl.read_parquet(f)
    print("Polars Schema:\n", df.schema, sep="")

Log output

Arrow Schema:
enum_col: dictionary<values=string, indices=int32, ordered=0>
cat_col: dictionary<values=string, indices=int32, ordered=0>

Polars Schema:
Schema({'enum_col': Categorical(ordering='physical'), 'cat_col': Categorical(ordering='physical')})

Issue description

When writing a Parquet file using PyArrow (use_pyarrow=True), the Arrow field metadata that Polars uses to preserve type information for Enums and lexically ordered Categoricals is omitted, so the type information is lost.

Related: #2732, #13260

Expected behavior

Arrow Schema:
enum_col: dictionary<values=string, indices=int32, ordered=0>
  -- field metadata --
  _PL_ENUM_VALUES: '1;a1;b1;c'
cat_col: dictionary<values=string, indices=int32, ordered=0>
  -- field metadata --
  _PL_CATEGORICAL: 'lexical'

Polars Schema:
Schema({'enum_col': Enum(categories=['a', 'b', 'c']), 'cat_col': Categorical(ordering='lexical')})

Installed versions

--------Version info---------
Polars:              1.29.0
Index type:          UInt32
Platform:            Linux-5.10.235-.xxxxx.x86_64-x86_64-with-glibc2.26
Python:              3.12.8 | packaged by conda-forge | (main, Dec  5 2024, 14:24:40) [GCC 13.3.0]
LTS CPU:             False

----Optional dependencies----
Azure CLI            <not installed>
adbc_driver_manager  <not installed>
altair               5.5.0
azure.identity       <not installed>
boto3                1.35.79
cloudpickle          <not installed>
connectorx           0.4.0
deltalake            <not installed>
fastexcel            <not installed>
fsspec               <not installed>
gevent               <not installed>
google.auth          <not installed>
great_tables         <not installed>
matplotlib           <not installed>
numpy                2.2.0
openpyxl             <not installed>
pandas               <not installed>
polars_cloud         <not installed>
pyarrow              18.1.0
pydantic             <not installed>
pyiceberg            <not installed>
sqlalchemy           <not installed>
torch                <not installed>
xlsx2csv             <not installed>
xlsxwriter           <not installed>
talawahtech (Author) commented

@coastalwhite I think this should probably be tracked by #20089 as well.

talawahtech changed the title from “Enums and Categoricals with lexical ordering not preserved by write_parquet when use_pyarrow=True” to “Enums and Categoricals with lexical ordering are not preserved by write_parquet when use_pyarrow=True” on May 4, 2025

talawahtech commented May 4, 2025

Looking at the source code for write_parquet(), it seems the DataFrame gets converted to a PyArrow table, which is used to populate a dict, which is then converted back to a pa.Table. It seems this is intended to handle cases where the column name is None?

            tbl = self.to_arrow()
            data = {}

            for i, column in enumerate(tbl):
                # extract the name before casting
                name = f"column_{i}" if column._name is None else column._name

                data[name] = column

            tbl = pa.table(data)

That is where the field metadata gets dropped. Using the original pa.Table directly preserves the metadata, e.g.:

import polars as pl
import pyarrow.parquet as pq
from io import BytesIO

with BytesIO() as f:
    df = pl.DataFrame([
        pl.Series('enum_col',['a','b','c'], pl.Enum(["a","b","c"])),
        pl.Series('cat_col',['a','b','c'], pl.Categorical(ordering="lexical")),
    ])
    
    table = df.to_arrow()
    pq.write_table(table, f)
    print("Arrow Schema:\n", pq.ParquetFile(f).schema_arrow, "\n", sep="")
    
    df = pl.read_parquet(f)
    print("Polars Schema:\n", df.schema, sep="")

This seems to work as expected. I'm not sure whether there are other unintended consequences of using the output of df.to_arrow() directly that the dict/loop approach avoids.
