
[D] Is this a correct way to test the inclusion of new feature variables in a model?

Hello. I have an XGBoost model and, as a way of making small improvements, I have been testing the introduction of new variables in the following way:

- Run a k-fold cross-validation with the new variable (from now on, X), so that I get k values of some metric, for instance recall (or any other: F1-score, F2, whatever), stored in X_list.

- Drop the variable X and run the k-fold cross-validation again, ending with another list of k recall values, called Y_list.

- Run a hypothesis test to compare the recall of the first sample (X_list) against the second sample (Y_list).

- If the hypothesis test says that X_list has greater recall than Y_list, I take X as a useful feature and keep it (a code sketch of this loop follows below).
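
A minimal sketch of that loop, in case it helps pin down the details. The data, feature names (f1..f4, X), fold count, and significance level are all made-up placeholders, not from the post; the essential points are that both runs use the same fold splits (so the k scores are paired) and that the test is one-sided.

    import pandas as pd
    from scipy import stats
    from sklearn.datasets import make_classification
    from sklearn.model_selection import KFold, cross_val_score
    from xgboost import XGBClassifier

    def cv_recall(df, y, features, k=10, seed=0):
        """Return the k recall scores from k-fold CV on a feature subset."""
        model = XGBClassifier(eval_metric="logloss")
        folds = KFold(n_splits=k, shuffle=True, random_state=seed)
        return cross_val_score(model, df[list(features)], y, cv=folds,
                               scoring="recall")

    # Made-up stand-in data; replace with your own DataFrame and target.
    arr, y = make_classification(n_samples=1000, n_features=5, random_state=0)
    df = pd.DataFrame(arr, columns=["f1", "f2", "f3", "f4", "X"])
    base = ["f1", "f2", "f3", "f4"]

    x_list = cv_recall(df, y, base + ["X"])  # with the candidate X
    y_list = cv_recall(df, y, base)          # without it

    # Same seed -> identical fold splits, so the k scores are paired.
    # One-sided paired t-test: does adding X increase recall?
    # (Caveat: fold scores share training data, so they are not fully
    # independent; the plain t-test is a simplification here.)
    t_stat, p_value = stats.ttest_rel(x_list, y_list, alternative="greater")
    print(f"p = {p_value:.4f} -> keep X: {p_value < 0.05}")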

Is there any error in the reasoning of this part? Anything to improve?

If I want to test another variable, I keep X, introduce the variable Z, and run the same process. If the test says Z is useful, I keep it too. Let's imagine that I also introduce the variable W. Until now, I have been keeping X, Z, and W as new variables, but I have a doubt:

Does getting improvements with X, Z, and W in the order I tested them mean that this is the best combination of variables? Or, if I had tested W right after X, might W or Z not have given better results? Should I test every possible combination of those 3 variables, or is it okay the way I am doing it?
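
For reference, the exhaustive alternative asked about here could look something like the sketch below. It reuses the hypothetical cv_recall helper from the previous sketch, with made-up data in which Z and W are extra columns, and simply scores every subset of the three candidates.

    from itertools import combinations
    import pandas as pd
    from sklearn.datasets import make_classification

    # Made-up data in which all three candidate columns exist.
    arr, y = make_classification(n_samples=1000, n_features=7, random_state=0)
    df = pd.DataFrame(arr, columns=["f1", "f2", "f3", "f4", "X", "Z", "W"])
    base = ["f1", "f2", "f3", "f4"]

    # Score every subset of the candidates (2^3 = 8 subsets here),
    # using the cv_recall helper defined in the previous sketch.
    candidates = ["X", "Z", "W"]
    mean_recall = {}
    for r in range(len(candidates) + 1):
        for subset in combinations(candidates, r):
            mean_recall[subset] = cv_recall(df, y, base + list(subset)).mean()

    best = max(mean_recall, key=mean_recall.get)
    print("best subset:", best, "mean recall:", round(mean_recall[best], 4))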

Thank you very much

submitted by /u/L3GOLAS234