[D] How do you handle sparse features?

Written by torontoai on November 2, 2019. Posted in Reddit MachineLearning.

I am working on a problem where I have a sequence of events. Each event generates a set of tokens (some tokens are shared between events, but not all), and the task is to categorize the behavior that generated this set of events.

Here is a simple example to illustrate the input.

event_type	order	value_type_1	value_1	value_type_2	value_2
E1	1	alpha 1	24	alpha 2	33
E2	2	beta	120
E1	3	alpha 1	234	alpha 2	56
E3	4	theta	150
E4	5

Notice, for example, that the token “theta” doesn’t exist in event type E2; it only exists in some event types.

If I want to do feature engineering in this case, what is the best way to vectorize my data? If I turn each token into its own column, as below, I end up with very sparse features.

event_type	order	alpha 1	alpha 2	beta	theta
E1	1	24	33
E2	2			120
E1	3	234	56
E3	4				150
E4	5

If I construct my features this way, they will be very sparse, and it doesn’t make sense to treat the empty cells as missing data (because the data doesn’t exist in the first place).
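To make the sparsity concrete, here is a minimal sketch of the one-column-per-token layout described above, using only the values from the example. The representation of each event as a `(event_type, order, tokens)` tuple and the names `events`, `columns`, `wide`, and `sparsity` are assumptions made for illustration, not part of the original data format.

```python
# Events from the example, as (event_type, order, {token: value}).
# E4 produced no tokens at all, so its dict is empty.
events = [
    ("E1", 1, {"alpha 1": 24, "alpha 2": 33}),
    ("E2", 2, {"beta": 120}),
    ("E1", 3, {"alpha 1": 234, "alpha 2": 56}),
    ("E3", 4, {"theta": 150}),
    ("E4", 5, {}),
]

# One column per distinct token, in sorted order.
columns = sorted({token for _, _, tokens in events for token in tokens})

# None marks "this event never produces the token", which is
# deliberately kept distinct from a real value of 0.
wide = [[tokens.get(col) for col in columns] for _, _, tokens in events]

# Fraction of cells that are structurally empty.
empty = sum(cell is None for row in wide for cell in row)
sparsity = empty / (len(wide) * len(columns))

for (event_type, order, _), row in zip(events, wide):
    print(event_type, order, row)
print("sparsity:", sparsity)
```

Even in this five-event example, most cells are empty, and the proportion can only grow as more event types with their own tokens are added.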

I don’t want to apply a data imputation method such as filling in the last value (the example below shows what that would look like: each empty cell takes the most recent earlier value in its column, or 0 if there is none). The reason is that some event types are very frequent and others are not.

event_type	order	alpha 1	alpha 2	beta	theta
E1	1	24	33	0	0
E2	2	24	33	120	0
E1	3	234	56	120	0
E3	4	234	56	120	150
E4	5	234	56	120	150
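For reference, the last-value fill shown in the table can be reproduced with a short sketch like the one below. It assumes the sparse matrix is held as a list of rows with `None` for absent tokens; the function name `forward_fill` and the `default` parameter are hypothetical choices for this illustration.

```python
# Sparse matrix from the example; columns are alpha 1, alpha 2, beta, theta.
wide = [
    [24, 33, None, None],      # E1, order 1
    [None, None, 120, None],   # E2, order 2
    [234, 56, None, None],     # E1, order 3
    [None, None, None, 150],   # E3, order 4
    [None, None, None, None],  # E4, order 5
]


def forward_fill(rows, default=0):
    """Replace each None with the most recent value seen in that column.

    Cells with no earlier value in their column get `default`.
    """
    last = [default] * len(rows[0])
    filled = []
    for row in rows:
        # Build a new list each time so earlier rows are not overwritten.
        last = [prev if cell is None else cell for prev, cell in zip(last, row)]
        filled.append(last)
    return filled


filled = forward_fill(wide)
for row in filled:
    print(row)
```

The output makes the concern visible: the E4 row ends up fully populated even though E4 produced no tokens of its own.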

If you were in my shoes, how would you treat this problem? Ideas and references are welcome.

In case you are wondering what I want to do: I want to categorize the behavior that generated this set of events. Once I get the feature engineering right, I can experiment with any method (clustering, for example).

submitted by /u/__Julia