feat: Add "OTHER" category to the encoders when min_frequency is supplied #178

Open
vspinu opened this issue Mar 6, 2025 · 0 comments

vspinu commented Mar 6, 2025

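Currently (as of v0.1.3), the categorical encoders do not handle infrequent categories when min_frequency is supplied. Categories below the threshold are dropped: OneHotEncode produces an all-zero row for them and OrdinalEncode produces a null. For example: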

    import pandas as pd
    import ibis
    import ibis_ml as ml

    con = ibis.duckdb.connect()

    df = pd.DataFrame({
        'cat1': ['AA', 'BBB', 'AA', 'BBB', 'CCC'],
        'cat2': ['X', 'Y', 'Y', 'X', 'Z'],
        'value': [10, 20, 30, 40, 50]
    })

    tbl = con.create_table("tmp", df, overwrite=True)
 
    tr_ohe = ml.Recipe(
        ml.OneHotEncode(ml.string(), min_frequency=2),
    ).fit(tbl.drop("value"), tbl.value)
    
    tr_ohe.to_ibis(tbl).to_pandas()
    #    value  cat1_AA  cat1_BBB  cat2_X  cat2_Y
    # 0     10        1         0       1       0
    # 1     20        0         1       0       1
    # 2     30        1         0       0       1
    # 3     40        0         1       1       0
    # 4     50        0         0       0       0


    tr_oe = ml.Recipe(
        ml.OrdinalEncode(ml.string(), min_frequency=2),
        ml.FillNA(ml.integer(), "OTHER")       # <-- this does not work!!! 
    ).fit(tbl.drop("value"), tbl.value)
    
    tr_oe.to_ibis(tbl).to_pandas()
    #    value  cat1  cat2
    # 0     10   0.0   0.0
    # 1     20   1.0   1.0
    # 2     30   0.0   1.0
    # 3     40   1.0   0.0
    # 4     50   NaN   NaN

(Note that this also exposes a separate issue: filling NaN does not work on integer columns.)

Would it be possible to extend the encoders so that the "other" value can be specified? More concretely:

  • For OHE, add the ability to generate a column that is 1 for all the swallowed (infrequent) categories.
  • For OHE and OE, add an "others_value" argument to indicate which value the swallowed categories should assume.
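
To make the requested semantics concrete, here is a minimal pure-Python sketch of what the two bullets could mean. It is not ibis-ml code and does not use its API: the helper names (`frequent_categories`, `one_hot_with_other`, `ordinal_with_other`) are hypothetical, `others_value` is the argument name proposed above, and the defaults ("OTHER" for OHE, -1 for OE) are assumptions for illustration only. It also assumes the input contains no nulls and that `others_value` does not collide with a real category.

```python
from collections import Counter


def frequent_categories(values, min_frequency):
    """Return the sorted categories occurring at least `min_frequency` times."""
    counts = Counter(values)
    return sorted(cat for cat, n in counts.items() if n >= min_frequency)


def one_hot_with_other(values, min_frequency, others_value="OTHER"):
    """One-hot encode `values` with one extra column for swallowed categories.

    `others_value` is the hypothetical argument proposed in this issue;
    here it names the extra column.
    """
    kept = frequent_categories(values, min_frequency)
    columns = kept + [others_value]
    rows = []
    for v in values:
        # Infrequent categories are relabelled instead of being dropped.
        label = v if v in kept else others_value
        rows.append([int(label == col) for col in columns])
    return columns, rows


def ordinal_with_other(values, min_frequency, others_value=-1):
    """Ordinal-encode `values`; swallowed categories map to `others_value`."""
    kept = frequent_categories(values, min_frequency)
    codes = {cat: i for i, cat in enumerate(kept)}
    return [codes.get(v, others_value) for v in values]


cat1 = ["AA", "BBB", "AA", "BBB", "CCC"]

columns, rows = one_hot_with_other(cat1, min_frequency=2)
print(columns)  # ['AA', 'BBB', 'OTHER']
print(rows[4])  # [0, 0, 1]  <- row 4 is no longer all zeros

print(ordinal_with_other(cat1, min_frequency=2))  # [0, 1, 0, 1, -1]
```

With this behavior, row 4 of the example table would be distinguishable from a genuinely missing value, and no separate FillNA step would be needed after the ordinal encoder.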

vspinu changed the title from `FEAT: Add "OTHER" category to the encoders when min_frequency is supplied` to `feat: Add "OTHER" category to the encoders when min_frequency is supplied` on Mar 6, 2025
Projects
Status: backlog