[D] Feature selection with categorical & continuous features
If this is too elementary and you’d like me to bother the r/learnmachinelearning people, just let me know.
We have a binary classification problem with a dataset of size N ~ 100 and p ~ 50 features. Some of the features are categorical, so we one-hot encode them into binary columns using sklearn.preprocessing.OneHotEncoder, which increases p substantially (a categorical variable with 10 levels expands to 10 columns).
Building a random forest or an XGBoost classifier on a subset of the features chosen by a subject matter expert (SME) works quite well on this dataset, where by “works quite well” I mean “it does significantly better than predicting the majority class or using logistic regression”.
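To make the one-hot expansion mentioned above concrete, here is a minimal sketch. The toy column, its level names, and the random seed are made up for illustration and are not the real dataset; the only thing taken from the post is the use of sklearn.preprocessing.OneHotEncoder on a 10-level categorical variable with N = 100.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: a single categorical feature with 10 levels, N = 100 rows.
rng = np.random.default_rng(0)
levels = np.array([f"level_{i}" for i in range(10)])
# Tile the levels so every level appears exactly 10 times, then shuffle the rows.
x_cat = rng.permutation(np.tile(levels, 10)).reshape(-1, 1)

# Default encoding: one binary column per level (returned as a sparse matrix).
encoder = OneHotEncoder()
x_onehot = encoder.fit_transform(x_cat)
n_rows, n_cols = x_onehot.shape

# Dropping one reference level per variable gives one column fewer.
encoder_drop = OneHotEncoder(drop="first")
n_cols_drop = encoder_drop.fit_transform(x_cat).shape[1]

print(n_rows, n_cols, n_cols_drop)
```

With several such variables, the encoded p can approach or exceed N ~ 100, which is the situation described above.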
Now, instead of leaving the feature selection to the SME, the data scientist working on this project would like to perform “proper” feature selection, because some features look highly correlated (and thus the feature importance measures produced by the random forests are unreliable). How do you do feature selection when 1) you have both continuous and categorical variables with many levels, and 2) you are using a non-additive model such as a random forest or XGBoost? If this were a generalized linear model, it would be straightforward to perform feature selection by introducing L_1 regularization (an L_2 penalty alone shrinks coefficients but does not set them exactly to zero). However, I’m not sure how to do this in a principled way with nonlinear, non-additive models such as random forests & XGBoost.
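For reference, the generalized linear model case that the post calls straightforward could look roughly like the sketch below. It is not a solution for the tree-based models. The synthetic data from make_classification only stands in for the real N ~ 100, p ~ 50 dataset, and the regularization strength C=0.5 is an arbitrary assumption that would normally be tuned by cross-validation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data (assumption): 100 rows, 50 features.
X, y = make_classification(
    n_samples=100, n_features=50, n_informative=5, n_redundant=5, random_state=0
)

# Standardize so the L_1 penalty treats all columns on the same scale.
X_scaled = StandardScaler().fit_transform(X)

# L_1-penalized logistic regression; C=0.5 is an arbitrary illustrative value.
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)

# Keep the features whose fitted coefficient is (numerically) non-zero.
selector = SelectFromModel(l1_model, threshold=1e-5).fit(X_scaled, y)
mask = selector.get_support()
X_selected = selector.transform(X_scaled)

print("kept", int(mask.sum()), "of", mask.size, "features")
print("indices:", np.flatnonzero(mask))
```

Note that a plain L_1 penalty acts on individual columns, so with one-hot encoded inputs it can keep some levels of a categorical variable while dropping others.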