[D] How do you handle sparse features?
I am working on a problem where I have a sequence of events. Each event generates a set of tokens (some tokens are shared between event types, but not all), and the task is to categorize the behavior that generated the sequence of events.
Here is a simple example to give you an idea of the input.
event_type | order | value_type_1 | value_1 | value_type_2 | value_2 |
---|---|---|---|---|---|
E1 | 1 | alpha 1 | 24 | alpha 2 | 33 |
E2 | 2 | beta | 120 | | |
E1 | 3 | alpha 1 | 234 | alpha 2 | 56 |
E3 | 4 | theta | 150 | | |
E4 | 5 | | | | |
Notice, for example, that the token “theta” doesn’t exist in event type E2; it only exists in some event types.
If I want to do feature engineering in this case, what is the best way to vectorize my data? If I give each token its own column, as below, I end up with very sparse features.
event_type | order | alpha 1 | alpha 2 | beta | theta |
---|---|---|---|---|---|
E1 | 1 | 24 | 33 | | |
E2 | 2 | | | 120 | |
E1 | 3 | 234 | 56 | | |
E3 | 4 | | | | 150 |
E4 | 5 | | | | |
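For concreteness, here is a minimal pandas sketch of how a wide table like this could be built from the raw events. The long-format frame and its column names (`order`, `event_type`, `token`, `value`) are assumptions made for illustration; only the values come from the example above.

```python
import pandas as pd

# Long format: one row per (event, token) pair, taken from the example above.
# E4 carries no tokens, so it contributes no rows here.
long = pd.DataFrame(
    {
        "order": [1, 1, 2, 3, 3, 4],
        "event_type": ["E1", "E1", "E2", "E1", "E1", "E3"],
        "token": ["alpha 1", "alpha 2", "beta", "alpha 1", "alpha 2", "theta"],
        "value": [24, 33, 120, 234, 56, 150],
    }
)

# One column per token; tokens an event never emits show up as NaN.
wide = long.pivot(index="order", columns="token", values="value")

# Re-add events that emitted no tokens at all (E4, order 5).
wide = wide.reindex(range(1, 6))

# Fraction of cells that are empty in the wide layout.
sparsity = wide.isna().to_numpy().mean()
print(wide)
print(f"{sparsity:.0%} of cells are empty")
```

Even on these five events, 14 of the 20 cells are empty, and the ratio gets worse as the number of distinct tokens grows.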
If I construct my features this way, they will be very sparse, and it doesn’t make sense to treat the empty cells as missing data (because the data doesn’t exist in the first place).
I don’t want to apply a data imputation method such as filling in the last value (in the example below, the filled-in numbers are shown in bold). The reason is that some event types are very frequent and some are not.
event_type | order | alpha 1 | alpha 2 | beta | theta |
---|---|---|---|---|---|
E1 | 1 | 24 | 33 | **0** | **0** |
E2 | 2 | **24** | **33** | 120 | **0** |
E1 | 3 | 234 | 56 | **120** | **0** |
E3 | 4 | **234** | **56** | **120** | 150 |
E4 | 5 | **234** | **56** | **120** | **150** |
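The last-value fill shown in this table corresponds roughly to the pandas sketch below. The frame name `wide` and its layout (one column per token, indexed by `order`) are assumptions made for illustration; the zero fill before the first observation mirrors the table above.

```python
import numpy as np
import pandas as pd

# Sparse wide table from the example: NaN means the event did not emit that token.
wide = pd.DataFrame(
    {
        "alpha 1": [24, np.nan, 234, np.nan, np.nan],
        "alpha 2": [33, np.nan, 56, np.nan, np.nan],
        "beta": [np.nan, 120, np.nan, np.nan, np.nan],
        "theta": [np.nan, np.nan, np.nan, 150, np.nan],
    },
    index=pd.Index([1, 2, 3, 4, 5], name="order"),
)

# Carry the last observed value forward, then use 0 before the first observation.
filled = wide.ffill().fillna(0).astype(int)

# Count how many cells were never observed and are therefore invented by the fill.
n_imputed = int(wide.isna().sum().sum())
print(filled)
print(f"{n_imputed} of {wide.size} values were filled in")
```

Most of the resulting values are carried over rather than observed, which is the concern with this approach when event types differ a lot in frequency.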
If you were in my shoes, how would you treat this problem? Ideas and references are welcome.
In case you are wondering what I want to do: I want to categorize the behavior that generated this set of events. I can experiment with any method once I get the feature engineering right (think of clustering as an example).
submitted by /u/__Julia