[D] Does training data have to be randomly sampled/representative of the population?
I am a survey statistician by profession, but I am incorporating more and more machine learning techniques into my work, and I am curious how survey sampling may affect the results of various machine learning models (e.g., decision trees, random forests, SVMs, gradient-boosted trees, k-NN, etc.)
What I mean by survey sampling is that we often employ sampling designs on a target population that, on their own, return data that looks nothing like the population we are trying to study.
In simpler situations, it could be an over-sample of a rare minority population that we would never capture adequately through simple random sampling. Or it could be more complex sampling: sampling proportional to a variable of interest (e.g., revenue of a company), or sampling area clusters to reduce the cost of in-person data collection (e.g., it is cheaper to survey 100 people in 3 states than 100 people spread across 50 states…), or a combination of any number of sampling techniques.
In survey statistics, we usually use special procedures in combination with case weights and sample design variables to “fix” the known imbalances due to sampling, so that the results look like the population of interest. We also take special care to properly inflate the variances (e.g., confidence intervals) of all of our estimates due to non-random sampling (e.g., a non-random sample of 99 men and 1 woman is not the same as a random sample that returns 50 men and 50 women, even though the total is n = 100 in both cases).
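To make the 99-men/1-woman example concrete, here is a minimal sketch (all numbers hypothetical) of the two ideas mentioned above: design weights correcting a known imbalance, and Kish's effective sample size showing how unequal weights inflate variance relative to a simple random sample of the same nominal size.

```python
# Over-sample of a 50/50 population: 99 men, 1 woman.
# Weight for each respondent = population share / sample share.
y = [1.0] * 99 + [5.0]                    # some hypothetical outcome of interest
w = [0.5 / 0.99] * 99 + [0.5 / 0.01]      # design (case) weights

unweighted_mean = sum(y) / len(y)
weighted_mean = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)

# Kish's effective sample size: n_eff = (sum w)^2 / sum(w^2).
# Unequal weights shrink n_eff well below the nominal n, which is
# roughly why variances must be inflated under such a design.
n_eff = sum(w) ** 2 / sum(wi ** 2 for wi in w)

print(round(unweighted_mean, 3))   # 1.04 (pulled toward the over-sampled group)
print(round(weighted_mean, 3))     # 3.0  (recovers the population mean)
print(round(n_eff, 1))             # 4.0  (far below the nominal n = 100)
```

The weighted mean fixes the point estimate, but the effective sample size shows the price paid in precision, which is the part most off-the-shelf ML tools ignore.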
But I am not sure how that would influence various machine learning algorithms, since their goals are slightly different, with less focus on p-values and confidence intervals.