feat: alternative OHE using vectors #66

Open
jitingxu1 opened this issue Apr 15, 2024 · 2 comments

@jitingxu1
Collaborator

Background:

Instead of a full SQL translation of the one-hot encoding algorithm, he was envisioning it more as a backend-registered function, which would probably be more performant.

The SQL translation requires SQL metaprogramming to first enumerate the list of unique values in the columns and then craft the CREATE statement.

Additionally, it pigeonholes you into one-hot encoding having to return a separate column per value, whereas many frameworks, e.g. Spark MLlib if I remember correctly, return something like a 2D vector in lieu of those columns (which can be handled much more efficiently from a memory/hardware perspective).
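
For illustration, the SQL metaprogramming described above can be sketched in plain Python. This is only a sketch under assumptions: the function names `distinct_values_query` and `build_ohe_create`, the `column_value` naming of the output columns, and the `CREATE TABLE ... AS SELECT` form are all hypothetical and are not taken from ibisml or any backend. Table and column names are interpolated as-is, so they are assumed to be trusted identifiers.

```python
def distinct_values_query(table: str, column: str) -> str:
    """Step 1: the query that enumerates the unique values of a column."""
    return f"SELECT DISTINCT {column} FROM {table}"


def build_ohe_create(target: str, table: str, column: str, categories: list[str]) -> str:
    """Step 2: craft a CREATE statement with one 0/1 column per category.

    Hypothetical helper: `categories` is assumed to be the result of
    running the query from `distinct_values_query`.
    """
    exprs = []
    for cat in categories:
        literal = cat.replace("'", "''")  # escape quotes inside the string literal
        alias = f"{column}_{cat}".replace('"', '""')  # escape quotes inside the identifier
        exprs.append(
            f"CASE WHEN {column} = '{literal}' THEN 1 ELSE 0 END AS \"{alias}\""
        )
    select_list = ",\n  ".join(exprs)
    return f"CREATE TABLE {target} AS\nSELECT\n  {select_list}\nFROM {table}"


if __name__ == "__main__":
    print(distinct_values_query("t", "color"))
    print(build_ohe_create("encoded", "t", "color", ["red", "blue"]))
```

The statement grows by one expression per distinct value, which is the drawback raised here for high-cardinality columns.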

Some response:

When fitting a one-hot-encoder we already have to collect all the cases so they're consistent across all applications of transform.
The only difference here would be whether a one-hot-encoder should return a column per case or a single column holding an array of cases. Since the consuming tooling will want a flat array, I'd argue that the correct approach, for now, is to not special-case one-hot encoding and to return a column per case.
Also note that ibisml already has a OneHotEncode step that does all this.
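
To make the two output shapes concrete, here is a minimal pure-Python sketch. The `SimpleOneHot` class and its method names are hypothetical and are not ibisml's `OneHotEncode` API; it only illustrates that the categories are collected once at fit time and that both representations carry the same information.

```python
class SimpleOneHot:
    """Toy encoder: categories are fixed at fit time (hypothetical, not ibisml)."""

    def fit(self, values):
        # Collect every case once so all later transforms agree on the layout.
        self.categories_ = sorted(set(values))
        return self

    def transform_columns(self, values):
        # Column-per-case: one named 0/1 column for each fitted category.
        return {
            f"is_{cat}": [int(v == cat) for v in values]
            for cat in self.categories_
        }

    def transform_vector(self, values):
        # Vector form: one 0/1 array per row, ordered like categories_.
        return [[int(v == cat) for cat in self.categories_] for v in values]


if __name__ == "__main__":
    enc = SimpleOneHot().fit(["red", "blue", "red"])
    # "green" was not seen during fit, so it encodes as all zeros.
    print(enc.transform_columns(["blue", "green"]))
    print(enc.transform_vector(["blue", "green"]))
```

Flattening the column-per-case output row by row yields the same arrays as the vector form, so a consumer that wants a flat array can use either.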

@deepyaman
Collaborator

Related to #26

@deepyaman
Collaborator

There are pros and cons to representing the encoded output as separate columns vs. vectors. For now, we will wait for additional signal before switching to make further use of vector abstractions. In the short term, more backends cleanly support standard column operations than vector operations, and the aforementioned SQL metaprogramming to generate CREATE statements with many columns is less of a drawback when using Python/Ibis.

Performance considerations should be evaluated more thoroughly at a later point.

@deepyaman deepyaman changed the title feat: Alternative OHE using vectors feat: alternative OHE using vectors Jul 1, 2024