Skip to main content

Blog

Learn About Our Meetup

5000+ Members

MEETUPS

LEARN, CONNECT, SHARE

Join our meetup, learn, connect, share, and get to know your Toronto AI community. 

JOB POSTINGS

INDEED POSTINGS

Browse through the latest deep learning, ai, machine learning postings from Indeed for the GTA.

CONTACT

CONNECT WITH US

Are you looking to sponsor space, be a speaker, or volunteer, feel free to give us a shout.

[D] How do you handle sparse features?

I am working on a problem where I have a sequence of events happening, every event generate a set of tokens (some of the tokens are shared between the events, but not all), the task is to categorize the behavior that generated this set of events.

Let me give you a simple example to have an understanding on the input.

event_type order value_type_1 value_1 value_type_2 value_2
E1 1 alpha 1 24 alpha 2 33
E2 2 beta 120
E1 3 alpha 1 234 alpha 2 56
E3 4 theta 150
E4 5

You can notice for example that the token “theta” doesn’t exist in event_type E2, it only exist in some event types.

If I want to do feature engineering in this case, what is the best way to vectorize my data. If I take the token, and try to put this way, I will end up with a very sparse features.

event_type order alpha 1 alpha 2 beta theta
E1 1 24 33
E2 2 120
E1 3 234 56
E3 4 150
E4 5

If I construct my features this way, it will be very sparse and it doesn’t make sense to consider it as missing data (because the data doesn’t exist in first place).

I don’t want to apply data imputation method such filling the last value (You can see below the example, I have added the number in bold to show it as an example) . The reason is that some event type are very frequent, and some event types are not.

event_type order alpha 1 alpha 2 beta theta
E1 1 24 33 0 0
E2 2 24 33 120 0
E1 3 234 56 120 0
E3 4 234 56 120 150
E4 5 234 56 120 150

If you were in my shoes, how would you treat this problem?. Ideas, references are welcomed.

If you are wondering what do I want to do, I want to categorize the behavior that generated this set of events. I can experiment with any method if I get feature engineering right (you can think of clustering as an example).

submitted by /u/__Julia
[link] [comments]