[D] Is this a correct way to test the inclusion of new feature variables in a model?
Hello. I have a model in XGBoost, and as a way of making small improvements, I have been testing the introduction of new variables as follows:
-Run a k-fold cross-validation process with the new variable (from now on, X), so that I get k values of, for instance, recall (or any other metric: F1-score, F2, whatever), stored in X_list.
-Drop the variable X and run the k-fold cross-validation again, ending with another list of k recall values, called Y_list.
-Run a hypothesis test comparing the recall of the first sample (X_list) against the second sample (Y_list).
-If the hypothesis test says that X_list has greater recall than Y_list, I take X as a useful feature and keep it.
Is there any error in the reasoning here? Anything to improve?
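For context, the procedure I describe above looks roughly like this sketch. It uses sklearn's GradientBoostingClassifier as a stand-in for XGBoost and synthetic data, so everything runs self-contained; the column index of the candidate feature and all dataset parameters are made up for illustration. One detail worth noting: if both runs use the same folds, the two recall samples are paired, so a paired t-test (scipy's `ttest_rel`) seems more appropriate than an unpaired one:

```python
import numpy as np
from scipy.stats import ttest_rel
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier  # stand-in for XGBoost
from sklearn.metrics import recall_score
from sklearn.model_selection import StratifiedKFold

# Synthetic data; pretend column 5 is the candidate feature "X"
X_full, y = make_classification(n_samples=600, n_features=6,
                                n_informative=4, random_state=0)
X_without = np.delete(X_full, 5, axis=1)

def cv_recalls(X, y, k=5):
    """Recall on each of k folds, with fixed fold assignments."""
    recalls = []
    for train, test in StratifiedKFold(k, shuffle=True, random_state=0).split(X, y):
        model = GradientBoostingClassifier(random_state=0).fit(X[train], y[train])
        recalls.append(recall_score(y[test], model.predict(X[test])))
    return np.array(recalls)

with_x = cv_recalls(X_full, y)       # "X_list" in the post
without_x = cv_recalls(X_without, y) # "Y_list" in the post

# Same folds in both runs -> paired samples -> one-sided paired t-test
t_stat, p_value = ttest_rel(with_x, without_x, alternative="greater")
print(f"recall with X: {with_x.mean():.3f}  without X: {without_x.mean():.3f}  "
      f"p={p_value:.3f}")
```

With only k paired values the test has little power, so a small k (e.g. 5) can easily fail to detect a real improvement; repeated CV with more folds or repetitions gives more samples, at the cost of violating the test's independence assumptions.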
If I want to test another variable, I keep X, introduce the variable Z, and repeat the same process. If the test says Z is useful, I keep it too. Let's imagine that I also introduce the variable W. So far I have been keeping X, Z and W as new variables, but I have a doubt:
Does getting improvements with X, Z and W in the order I tested them mean that this is the best combination of variables? Or, if I had tested W right after X, might W or Z not have thrown better results? Should I test every possible combination of those 3 variables, or is it okay the way I am doing it?
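In case it helps to make the question concrete: exhaustively evaluating every subset of the 3 candidate variables is only 2^3 = 8 CV runs, so it's cheap to check whether the greedy order matters. A minimal sketch, again with sklearn instead of XGBoost and with hypothetical column indices standing in for X, Z and W:

```python
from itertools import combinations

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier  # stand-in for XGBoost
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic data: columns 0-4 are the existing features,
# columns 5, 6, 7 play the role of the candidates X, Z, W
X_all, y = make_classification(n_samples=600, n_features=8,
                               n_informative=5, random_state=0)
base = [0, 1, 2, 3, 4]
candidates = [5, 6, 7]

cv = StratifiedKFold(5, shuffle=True, random_state=0)
results = {}
for r in range(len(candidates) + 1):
    for subset in combinations(candidates, r):
        cols = base + list(subset)
        score = cross_val_score(GradientBoostingClassifier(random_state=0),
                                X_all[:, cols], y, cv=cv,
                                scoring="recall").mean()
        results[subset] = score  # mean CV recall for this candidate subset

best = max(results, key=results.get)
print(f"best candidate subset: {best}  mean recall: {results[best]:.3f}")
```

The greedy sequential procedure in the post corresponds to one particular path through this lattice of subsets; the exhaustive loop shows whether a different path would have landed on a better combination.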
Thank you very much
submitted by /u/L3GOLAS234