[D] Is this a correct way to test the inclusion of new feature variables in a model?
Hello. I have a model in XGBoost, and as a way of making small improvements, I have been testing the introduction of new variables as follows:
-Run a k-fold cross-validation process with the new variable (from now on, X), so that I get k values of, for instance, recall (or any other metric: F1-score, F2, whatever), stored in X_list.
-Drop the variable X and run the k-fold cross-validation again, ending with another list of k recall values, called Y_list.
-Run a hypothesis test comparing the recall of the first sample (X_list) against the second sample (Y_list).
-If the hypothesis test says that X_list has greater recall than Y_list, I take X as a useful feature and keep it.
Is there any error in the reasoning here? Anything to improve?
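For context, the procedure I describe above looks roughly like this sketch. It uses sklearn's GradientBoostingClassifier as a stand-in for XGBoost and synthetic data, so everything runs self-contained; the column index of the candidate feature and all dataset parameters are made up for illustration. One detail worth noting: if both runs use the same folds, the two recall samples are paired, so a paired t-test (scipy's `ttest_rel`) seems more appropriate than an unpaired one:

```python
import numpy as np
from scipy.stats import ttest_rel
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier  # stand-in for XGBoost
from sklearn.metrics import recall_score
from sklearn.model_selection import StratifiedKFold

# Synthetic data; pretend column 5 is the candidate feature "X"
X_full, y = make_classification(n_samples=600, n_features=6,
                                n_informative=4, random_state=0)
X_without = np.delete(X_full, 5, axis=1)

def cv_recalls(X, y, k=5):
    """Recall on each of k folds, with fixed fold assignments."""
    recalls = []
    for train, test in StratifiedKFold(k, shuffle=True, random_state=0).split(X, y):
        model = GradientBoostingClassifier(random_state=0).fit(X[train], y[train])
        recalls.append(recall_score(y[test], model.predict(X[test])))
    return np.array(recalls)

with_x = cv_recalls(X_full, y)       # "X_list" in the post
without_x = cv_recalls(X_without, y) # "Y_list" in the post

# Same folds in both runs -> paired samples -> one-sided paired t-test
t_stat, p_value = ttest_rel(with_x, without_x, alternative="greater")
print(f"recall with X: {with_x.mean():.3f}  without X: {without_x.mean():.3f}  "
      f"p={p_value:.3f}")
```

With only k paired values the test has little power, so a small k (e.g. 5) can easily fail to detect a real improvement; repeated CV with more folds or repetitions gives more samples, at the cost of violating the test's independence assumptions.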
If I want to test another variable, I keep X, introduce the variable Z, and repeat the same process. If the test says Z is useful, I keep it too. Let's imagine that I also introduce the variable W. So far I have been keeping X, Z and W as new variables, but I have a doubt:
Does getting improvements with X, Z and W in the order I tested them mean that this is the best combination of variables? Or, if I had tested W right after X, might W or Z not have thrown better results? Should I test every possible combination of those 3 variables, or is it okay the way I am doing it?
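In case it helps to make the question concrete: exhaustively evaluating every subset of the 3 candidate variables is only 2^3 = 8 CV runs, so it's cheap to check whether the greedy order matters. A minimal sketch, again with sklearn instead of XGBoost and with hypothetical column indices standing in for X, Z and W:

```python
from itertools import combinations

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier  # stand-in for XGBoost
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic data: columns 0-4 are the existing features,
# columns 5, 6, 7 play the role of the candidates X, Z, W
X_all, y = make_classification(n_samples=600, n_features=8,
                               n_informative=5, random_state=0)
base = [0, 1, 2, 3, 4]
candidates = [5, 6, 7]

cv = StratifiedKFold(5, shuffle=True, random_state=0)
results = {}
for r in range(len(candidates) + 1):
    for subset in combinations(candidates, r):
        cols = base + list(subset)
        score = cross_val_score(GradientBoostingClassifier(random_state=0),
                                X_all[:, cols], y, cv=cv,
                                scoring="recall").mean()
        results[subset] = score  # mean CV recall for this candidate subset

best = max(results, key=results.get)
print(f"best candidate subset: {best}  mean recall: {results[best]:.3f}")
```

The greedy sequential procedure in the post corresponds to one particular path through this lattice of subsets; the exhaustive loop shows whether a different path would have landed on a better combination.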
Thank you very much
submitted by /u/L3GOLAS234