[D] Best practice when dealing with feature pairs with strong Pearson Correlation scores

Let's say we have some feature pairs with strong Pearson correlation scores that are:

  • Exactly +1 or -1 (let's call these T1 pairs)
  • Very close to +1 or -1, i.e. at or beyond a threshold in absolute value (T2)
  • Correlating very closely like the T2s, but where one common feature correlates with multiple other features and those other features do not correlate suspiciously with each other (T3)

Let's call the threshold past which we say feature pairs are T2 or T3 the THRESH.
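To make the T1/T2 split concrete, here is a minimal sketch of how the pairs could be pulled out of a pandas correlation matrix. The function name `classify_pairs`, the default threshold of 0.8, and the small tolerance `tol` (used because floating point rarely yields exactly 1.0) are assumptions for illustration, not part of the original question.

```python
import numpy as np
import pandas as pd


def classify_pairs(df, thresh=0.8, tol=1e-9):
    """Split feature pairs into T1 (|r| == 1 within tol) and T2 (thresh <= |r| < 1).

    Hypothetical helper: `thresh` plays the role of THRESH in the post.
    """
    corr = df.corr(method="pearson")
    cols = list(corr.columns)
    t1, t2 = [], []
    # Walk the upper triangle so each unordered pair is visited once.
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if np.isnan(r):  # constant columns have undefined correlation
                continue
            if abs(r) >= 1.0 - tol:
                t1.append((cols[i], cols[j], r))
            elif abs(r) >= thresh:
                t2.append((cols[i], cols[j], r))
    return t1, t2
```

Calling `classify_pairs(features)` on a DataFrame of feature columns only (label excluded) returns two lists of `(feature_a, feature_b, r)` tuples.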

Let's also make the following assumptions about these suspicious feature pairs:

  1. They are not one-hot encoded and do not use some kind of ordinal encoding
  2. They are all floating point numbers with high variances
  3. At least one of them has good correlation with the label(s)

What I would like to discuss is the following:

Options with T1 and T2:

  1. Drop the one with a bad or lower correlation with the label(s)
  2. Drop one regardless
  3. Drop both and replace them with a new interaction feature that combines the two
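As a rough sketch of options 1 and 3 for a single pair: the helper names below are made up, and using the elementwise product as the "interaction" is an assumption, since the post does not say which combination is meant.

```python
import pandas as pd


def drop_weaker(df, pair, label):
    """Option 1: drop whichever feature of the pair has the lower |r| with the label."""
    a, b = pair
    r_a = abs(df[a].corr(df[label]))  # Series.corr defaults to Pearson
    r_b = abs(df[b].corr(df[label]))
    loser = a if r_a < r_b else b
    return df.drop(columns=[loser])


def replace_with_interaction(df, pair):
    """Option 3: drop both features and add their product (assumed interaction)."""
    a, b = pair
    out = df.drop(columns=[a, b])
    out[f"{a}_x_{b}"] = df[a] * df[b]
    return out
```

One thing this makes visible: for an exact T1 pair the two features are affine functions of each other, so their absolute correlations with the label are equal and option 1 effectively collapses into option 2.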

Options with T3:

  1. Drop the common feature if it correlates worse with the label(s) than any of the other features it's correlating strongly with
  2. Drop the common feature regardless
  3. Drop all and interact the common feature with each of its buddies
  4. Drop all and interact the entire group with each other
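For T3, a sketch of how the "common feature plus buddies" groups might be found, followed by option 3. The names `find_t3_groups` and `interact_hub_with_buddies` are hypothetical, "hub" is just shorthand for the common feature, and the product is again an assumed choice of interaction.

```python
import pandas as pd


def find_t3_groups(df, thresh=0.8):
    """Return {hub: [buddies]} where hub has |r| >= thresh with 2+ features
    whose pairwise |r| with one another all stay below thresh."""
    corr = df.corr(method="pearson").abs()
    groups = {}
    for hub in corr.columns:
        buddies = [c for c in corr.columns
                   if c != hub and corr.loc[hub, c] >= thresh]
        if len(buddies) < 2:
            continue
        # The buddies must not be strongly correlated among themselves.
        independent = all(
            corr.loc[x, y] < thresh
            for i, x in enumerate(buddies)
            for y in buddies[i + 1:]
        )
        if independent:
            groups[hub] = buddies
    return groups


def interact_hub_with_buddies(df, hub, buddies):
    """T3 option 3: drop the hub and its buddies, add hub * buddy per buddy."""
    out = df.drop(columns=[hub, *buddies])
    for b in buddies:
        out[f"{hub}_x_{b}"] = df[hub] * df[b]
    return out
```

Pass only the feature columns to `find_t3_groups`; including the label would let it show up as a hub or buddy.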

Options with THRESH:

  1. Always a constant value (specify the value)
  2. A low custom value when you want to do feature reduction and a high custom value for when you want the most descriptive features only

Sample strategies:

  • T11 T21 T31 THRESH: 0.8 means drop the worst feature in each T1 pair, then the worst in each T2 pair, then the worst in T3, at a constant suspicion threshold of +0.8 and -0.8
  • T11 T31 T23 THRESH: 2 means drop the worst feature in each T1 pair, then the worst in T3, then interact the remaining T2s with each other, using THRESH option 2, i.e. a suspicion threshold that depends on what you seek to accomplish with the current dataset
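If anyone wants to turn replies written in this convention into something machine-readable, a small parser could look like the sketch below. The function name `parse_strategy` is made up, and one reading is assumed: a THRESH value strictly between 0 and 1 is a constant threshold (THRESH option 1), while a literal 2 means THRESH option 2.

```python
import re


def parse_strategy(text):
    """Parse e.g. 'T11 T21 T31 THRESH: 0.8' into ordered steps and a threshold spec."""
    head, sep, tail = text.partition("THRESH:")
    if not sep:
        raise ValueError("strategy must contain 'THRESH:'")
    # Each token is T<type><option>, e.g. T23 = option 3 applied to T2 pairs.
    steps = [(f"T{t}", int(o))
             for t, o in re.findall(r"\bT([123])(\d)\b", head)]
    value = float(tail.strip())
    if 0.0 < value < 1.0:
        thresh = ("constant", value)   # assumed: THRESH option 1 with this value
    elif value == 2:
        thresh = ("custom", None)      # assumed: THRESH option 2, value set per goal
    else:
        raise ValueError(f"unrecognised THRESH spec: {tail.strip()!r}")
    return steps, thresh
```

The steps come back as a list rather than a dict because the order in which the options are applied matters.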

Notice how the order matters. Please use this convention to make it fast and easy to understand what you think is best and then add your reasons. Feel free to suggest new options. I will add them to the post.

submitted by /u/times_of_change