[D] The “test set” is nonsense
I often see ML practitioners, and even experts, pose the idea of the “test set” as the ultimate benchmark of a model’s performance. This is nonsense – and I’ll explain why.
Suppose you gather some data, label it, preprocess it, and compile it into a dataset. Now it’s time to split your data – train, validation, test; how will you do it?
- Random selection; may work well if the dataset is large enough
- Engineered selection – assign samples to each set according to some rationale
The ultimate goal of an ML model is generalization; as such, said ‘rationale’ could be:
- The best test set is comprised of the most “realistic” or “difficult” samples. Problem: model performance is harmed by artificially biasing the train set to exclude realistic/difficult samples.
- The best set split is, each gets same quality data. Justification: “poorer” quality usually means (a) noise; (b) low complexity (“too obvious”). Whatever the description, if you test the model on an information landscape ( / probability distribution) that it wasn’t trained on, the model may perform poorly simply because it “learned” little that’s relevant to the test set.
Thus, “split equally” should work best. Onto the problem: why do we use a test set at all? Because – we “fit” the validation set with our hyperparameters, and we need to test on “never seen” data to avoid bias. Indeed, agreed – the test set does suppress said bias. But here’s its red line: variance.
Direct statistical theory: a sample is an approximation of the population distribution, with an uncertain mean, standard deviation, & other. The more complex the problem, the greater the variation. — So, is there a solution? Yes: K-Fold Cross-Validation. Per known theory, K-Fold CV can significantly slash variance of model performance – the higher the “K”, the better. Without it, classification error can easily differ by 5-15%, if not 20-30%. When deciding what’s “SOTA”, every single percentage point can be a battle hard-fought – so a “mere 5%” is already astronomical.
One may counter-argue, “it’s fine if the test set is large enough”. Except it’s not fine; you get a “large enough” test set by either sacrificing train data, or, dataset is large enough so that you can make an even validation-test split. Former’s undesirable for obvious reasons – and in latter, unless you have a gargantuan dataset (extremely rare), your test samples are still subject to significant-enough variance; merely swapping test & validation samples can flip tables.
As a final punchline, note that the random seed can also substantially impact final outcome, further amplifying variance. Consequence: you don’t know you did well because
Dropout(0.5) works better than
Dropout(0.2) or because dice rolled nicely. K-Fold CV will also reduce seed variance as a side-effect, but ideally (though often prohibitively) you’d do “K seeds”.
Verdict: test set isn’t good for testing. Instead, use K-fold CV, which both better estimates generalization performance by reducing variance, and allows using more train data.
Though I am knowledgeable on the topic, I’m not an “expert” – and even experts disagree. Thus, counterarguments welcome.