Handle "multicategorical" columns #227

davidfstein · 2024-08-26T18:23:51Z

The pytorch_frame library natively handles categorical variables where the variable may take on multiple categories simultaneously, e.g. row1 = [1, .5, ['a', 'b', 'c']], row2 = [2, .3, ['a']] ...

It would be a nice quality-of-life enhancement to have this sort of functionality added to the widedeep library.

I believe, though I need to look more carefully, that they do something along the lines of: 1) label encode the categories, and 2) convert to tensors such that each multicategorical feature is replaced with an "embedding" of shape n rows x max number of categories in any single row. Rows that take on fewer than the max number of categories get -1 in the "missing" columns. I imagine there are other options for handling this as well.
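
To make the two steps above concrete, here is a rough sketch in plain Python. It is only an illustration of the idea as described, not pytorch_frame's actual implementation; the function name `encode_multicategorical`, the `PAD` constant, and the first-appearance ordering of the label ids are all assumptions made for this example.

```python
from typing import Dict, List, Sequence, Tuple

PAD = -1  # assumed filler for rows with fewer than the max number of categories


def encode_multicategorical(
    column: Sequence[Sequence[str]],
) -> Tuple[List[List[int]], Dict[str, int]]:
    """Label encode a multicategorical column and pad rows to equal length."""
    # Step 1: label encode, assigning integer ids in order of first appearance.
    vocab: Dict[str, int] = {}
    for cats in column:
        for cat in cats:
            vocab.setdefault(cat, len(vocab))

    # Step 2: pad every row with PAD up to the length of the longest row,
    # giving a rectangular (n_rows x max_categories) structure.
    max_len = max((len(cats) for cats in column), default=0)
    encoded = [
        [vocab[cat] for cat in cats] + [PAD] * (max_len - len(cats))
        for cats in column
    ]
    return encoded, vocab


# The multicategorical column from row1 and row2 in the example above.
encoded, vocab = encode_multicategorical([["a", "b", "c"], ["a"]])
print(encoded)  # [[0, 1, 2], [0, -1, -1]]
```

The nested list could then be passed to something like `torch.tensor(encoded)`; a downstream embedding layer would need to treat -1 as padding (for example by shifting all ids by one and using index 0 as the padding index).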

jrzaurin · 2024-08-26T20:53:30Z

Hey @davidfstein

I can look into this, but you can just consider the column that can take multiple categorical values as text and use this library as it is (?) Or turn the multicategorical columns into multiple columns if that is possible and proceed as usual?

But I will look into this :)
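
For reference, the second workaround (turning one multicategorical column into several ordinary columns) might look roughly like the sketch below. This is one possible reading of that suggestion, with one output column per position in the list; the helper name `split_into_columns`, the `tags` column name, and the `"missing"` placeholder are hypothetical and not part of the library.

```python
from typing import Dict, List, Sequence


def split_into_columns(
    column: Sequence[Sequence[str]], name: str, fill: str = "missing"
) -> Dict[str, List[str]]:
    """Split a multicategorical column into one plain column per position."""
    # The longest row decides how many output columns are needed.
    width = max((len(cats) for cats in column), default=0)
    return {
        # Rows that are too short get the placeholder category instead.
        f"{name}_{i}": [cats[i] if i < len(cats) else fill for cats in column]
        for i in range(width)
    }


cols = split_into_columns([["a", "b", "c"], ["a"]], name="tags")
print(cols)
# {'tags_0': ['a', 'a'], 'tags_1': ['b', 'missing'], 'tags_2': ['c', 'missing']}
```

Each resulting column can then be handled as a regular categorical column. Note that every column only sees the categories that happen to occur in its position, so the same category may end up encoded separately in several columns.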

davidfstein · 2024-08-27T14:56:59Z

Thanks @jrzaurin! Actually, right now I am following your first suggestion and processing them as text. I was only concerned that this might become inefficient with many features if a separate RNN needs to be trained for each one. As for splitting into multiple columns, I was thinking you might lose information if each column doesn't contain the full complement of possible categories, but I'm not sure whether this would lead to a substantive decrease in performance.

jrzaurin · 2024-08-27T21:26:37Z

Let's see if I can put together some functioning code tomorrow :)

davidfstein · 2024-08-27T21:45:18Z

That would be awesome! Thanks for the great library!

jrzaurin self-assigned this Aug 27, 2024
