Handle "multicategorical" columns #227

Open
davidfstein opened this issue Aug 26, 2024 · 4 comments
@davidfstein

The pytorch_frame library natively handles categorical variables where the variable may take on multiple categories simultaneously, e.g. row1 = [1, .5, ['a', 'b', 'c']], row2 = [2, .3, ['a']] ...

It would be a nice quality-of-life enhancement to have this sort of functionality added to the widedeep library.

I believe, though I need to look more carefully, that they do something along the lines of: 1) label-encode the categories; 2) convert to tensors such that each multicategorical feature is replaced with an "embedding" of shape (n rows x max number of categories in any single row). Rows with fewer categories than that maximum take -1 in the "missing" columns. I imagine there are other options for handling this as well.
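To make the two steps above concrete, here is a minimal sketch in plain Python. It is only an illustration of the idea as I understand it, not pytorch_frame's actual implementation; the function name `encode_multicategorical` and the `PAD` constant are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

PAD = -1  # assumed filler for rows with fewer categories than the longest row


def encode_multicategorical(
    column: Sequence[Sequence[str]],
) -> Tuple[List[List[int]], Dict[str, int]]:
    """Label-encode a multicategorical column and pad rows to equal length.

    Hypothetical helper for illustration only.
    """
    # 1) Label encode: map every distinct category to an integer id.
    #    Sorting makes the mapping deterministic.
    categories = sorted({cat for row in column for cat in row})
    vocab = {cat: idx for idx, cat in enumerate(categories)}

    # 2) Pad: every row gets as many slots as the longest row,
    #    with PAD filling the "missing" positions.
    max_len = max((len(row) for row in column), default=0)
    padded = [
        [vocab[cat] for cat in row] + [PAD] * (max_len - len(row))
        for row in column
    ]
    return padded, vocab


column = [["a", "b", "c"], ["a"]]
padded, vocab = encode_multicategorical(column)
print(padded)  # [[0, 1, 2], [0, -1, -1]]
```

The nested list has shape (n rows x max categories per row), so it could be passed to something like `torch.tensor(padded)` afterwards. A model consuming it would need to mask or otherwise ignore the -1 positions.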

@jrzaurin
Owner

jrzaurin commented Aug 26, 2024

Hey @davidfstein

I can look into this, but in the meantime you could treat the column that can take multiple categorical values as text and use the library as it is. Or, if that is possible, turn the multicategorical column into multiple columns and proceed as usual?
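One possible reading of the second option is a multi-hot split, where each category becomes its own 0/1 column. The sketch below is plain Python under that assumption; the helper name `to_indicator_columns` and the `tags` prefix are hypothetical and not part of the library.

```python
from typing import List, Sequence, Tuple


def to_indicator_columns(
    column: Sequence[Sequence[str]], prefix: str
) -> Tuple[List[str], List[List[int]]]:
    """Split a multicategorical column into one 0/1 column per category.

    Hypothetical helper for illustration only.
    """
    # Collect every category seen anywhere in the column.
    categories = sorted({cat for row in column for cat in row})
    names = [f"{prefix}_{cat}" for cat in categories]

    # For each row, mark which categories are present.
    rows = []
    for row in column:
        present = set(row)
        rows.append([1 if cat in present else 0 for cat in categories])
    return names, rows


names, rows = to_indicator_columns([["a", "b", "c"], ["a"]], prefix="tags")
print(names)  # ['tags_a', 'tags_b', 'tags_c']
print(rows)   # [[1, 1, 1], [1, 0, 0]]
```

Each resulting column is an ordinary binary feature, so it can be handled like any other categorical or continuous column. The number of columns grows with the number of distinct categories, which may be impractical for high-cardinality features.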

But I will look into this :)

@jrzaurin jrzaurin self-assigned this Aug 27, 2024
@davidfstein
Author

Thanks @jrzaurin! Right now I am actually following your first suggestion and processing them as text. My only concern is that this might become inefficient with many features if a separate RNN needs to be trained for each one. As for splitting into multiple columns, I was thinking you might lose information if each column doesn't contain the full complement of possible categories, but I'm not sure whether that would lead to a substantive performance decrease.

@jrzaurin
Owner

Let's see if I can put together some functioning code tomorrow :)

@davidfstein
Author

That would be awesome! Thanks for the great library!
