
Success rate/scoring of categorical features as features? [Discussion]

Hi all,

Let’s say I have a dataset with a mix of continuous and categorical variables, and I’m building an imbalanced binary classification model. One categorical variable has thousands of non-ordinal values. This feature is very important: certain values have high success rates, where success rate = (number of rows with that value and a flag of 1) / (number of rows with that value). I have created features including cat_var_num_success, cat_var_success_rate, and a score for each value of the categorical feature. The score assigns the overall mean success rate when a value has too few observations; when a value has enough observations, the score is raised above the mean if the value’s success rate exceeds the overall mean, and lowered if the value performs worse than the overall mean.
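For concreteness, here’s a minimal sketch of the scoring scheme as described above (the column names cat_var/flag and the min_samples cutoff are placeholders, not my real ones):

```python
import numpy as np
import pandas as pd

def success_rate_features(df, cat_col="cat_var", target_col="flag", min_samples=30):
    """Per-value success-rate features.

    Values with fewer than `min_samples` observations get the overall
    mean success rate as their score; well-sampled values get their own
    success rate, which sits above or below the overall mean depending
    on how the value performs.
    """
    global_mean = df[target_col].mean()
    # count / mean / sum of the binary flag per categorical value
    stats = df.groupby(cat_col)[target_col].agg(["count", "mean", "sum"])
    score = np.where(stats["count"] >= min_samples, stats["mean"], global_mean)
    mapping = pd.DataFrame(
        {
            "cat_var_num_success": stats["sum"],
            "cat_var_success_rate": stats["mean"],
            "cat_var_score": score,
        },
        index=stats.index,
    )
    return df.join(mapping, on=cat_col)
```

(This is essentially target/mean encoding with a fallback to the global mean; a smoother variant would shrink each value toward the global mean with a weight like count / (count + k) instead of a hard cutoff.)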

These generated features have proven highly predictive and improve the model (XGBoost). My concern is that I calculated them on the whole dataset, which I subsequently split into train and test. I’m afraid the model’s performance is inflated because test-set information leaks into training via the generated features.

Should I create an additional holdout set that does not contribute to the feature-generation calculations, and evaluate on that?
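One leak-free variant I’m considering, sketched below with the same placeholder names: split first, fit the per-value success rates on the training rows only, then map them onto the test rows, falling back to the training-set mean for values unseen in training.

```python
from sklearn.model_selection import train_test_split

# Split FIRST, so the test rows never contribute to the encoding.
train_df, test_df = train_test_split(
    df, test_size=0.2, stratify=df["flag"], random_state=0
)

train_mean = train_df["flag"].mean()
rates = train_df.groupby("cat_var")["flag"].mean()

# Map training-derived rates onto both splits; unseen test values
# fall back to the training-set global mean.
train_df = train_df.assign(cat_var_success_rate=train_df["cat_var"].map(rates))
test_df = test_df.assign(
    cat_var_success_rate=test_df["cat_var"].map(rates).fillna(train_mean)
)
```

An out-of-fold scheme (encoding each training fold with rates computed from the other folds) would reduce the remaining leakage within the training set itself, at the cost of extra bookkeeping.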

Thoughts?

Any feedback is appreciated!

submitted by /u/JohnnyCaggz