Success rate/scoring of categorical features as features? [Discussion]
Let’s say I have a dataset with a mix of continuous and categorical variables, and I’m building a binary classification model on imbalanced classes. One categorical variable has thousands of non-ordinal values. This feature is very important: certain values have high success rates, where success rate = num success (rows with that value and a flag of 1) / num instances (all rows with that value). I have created features that include cat_var_num_success, cat_var_success_rate, and a score for each value of the categorical feature. If a value has too few observations, its score is the overall mean success rate. If a value has a sufficient number of observations, its score is raised above the overall mean when its success rate is higher than the mean, and lowered when it performs worse than the overall mean.
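To make the feature definitions concrete, here is a minimal sketch in plain Python. The helper name `category_stats`, the `min_count` threshold, and the exact scoring rule (use the value's own success rate once it has enough observations, otherwise fall back to the overall mean) are assumptions for illustration; the actual score may adjust the mean differently.

```python
from collections import defaultdict

def category_stats(rows, min_count=20):
    """Per-value statistics for one categorical column (hypothetical helper).

    rows: iterable of (category_value, flag) pairs, with flag in {0, 1}.
    Returns (stats, overall_rate); stats maps each category value to a dict
    with num_instances, num_success, success_rate and score.
    """
    rows = list(rows)
    overall_rate = sum(flag for _, flag in rows) / len(rows)

    # value -> [num_instances, num_success]
    counts = defaultdict(lambda: [0, 0])
    for value, flag in rows:
        counts[value][0] += 1
        counts[value][1] += flag

    stats = {}
    for value, (n, k) in counts.items():
        rate = k / n
        # Assumed rule: low-sample values get the overall mean; well-sampled
        # values get their own rate, which lands above or below that mean.
        score = rate if n >= min_count else overall_rate
        stats[value] = {
            "num_instances": n,
            "num_success": k,
            "success_rate": rate,
            "score": score,
        }
    return stats, overall_rate


# Tiny illustrative dataset (made up): "a" is well sampled, "b" is not.
example_rows = [("a", 1), ("a", 1), ("a", 0), ("b", 0)]
example_stats, example_overall = category_stats(example_rows, min_count=2)
```

With `min_count=2`, value "a" keeps its own rate as its score, while "b" has a single row and falls back to the overall rate.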
These generated features have proven to be highly predictive and improve the model (XGBoost). My concern is that I calculated these values using the whole dataset, which I then split into train and test sets. I am afraid the model’s performance is improving because test-set information is leaking into training via the generated features.
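For comparison, a leak-free variant would split first and compute the per-value statistics from the training rows only, then look them up for the test rows. The sketch below assumes the same simplified scoring rule as a stand-in for the real score; `fit_rate_encoder`, `encode`, and `min_count` are hypothetical names, and values that are unseen or under-sampled in training fall back to the training-set mean.

```python
def fit_rate_encoder(train_rows, min_count=20):
    """Learn value -> success rate from TRAINING rows only (hypothetical helper).

    train_rows: iterable of (category_value, flag) pairs, with flag in {0, 1}.
    Returns (table, overall_rate); table holds only well-sampled values.
    """
    train_rows = list(train_rows)
    overall_rate = sum(flag for _, flag in train_rows) / len(train_rows)

    counts = {}  # value -> (num_instances, num_success)
    for value, flag in train_rows:
        n, k = counts.get(value, (0, 0))
        counts[value] = (n + 1, k + flag)

    table = {v: k / n for v, (n, k) in counts.items() if n >= min_count}
    return table, overall_rate


def encode(values, table, overall_rate):
    """Map category values to scores; unknown values get the training mean."""
    return [table.get(v, overall_rate) for v in values]


# Split BEFORE computing any statistics (made-up rows, fixed split).
train = [("a", 1), ("a", 0), ("b", 1)]
test = [("a", 1), ("b", 0), ("z", 0)]

table, train_mean = fit_rate_encoder(train, min_count=2)
train_feature = encode([v for v, _ in train], table, train_mean)
test_feature = encode([v for v, _ in test], table, train_mean)  # test flags never used
```

Here the test flags never enter the statistics, so any lift measured on the test rows cannot come from this kind of leak. Each training row's own flag still contributes to its own feature, which this sketch does not address.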
Should I create an additional holdout set that does not contribute to the feature-generation calculations, and test on that?
Any feedback is appreciated!