[D] 17 interviews (4 phone screens, 13 onsite, 5 different companies): all but two of the interviewers asked this one basic classification question, and I still don’t know the answer…
I’ve been trying to get back into a more ML/science-based role (currently I’m more on the tech business side). Within my own specific domain (time series and regression models), I know all of the major algorithms and have been able to shine on that particular topic. When it comes to generic data science, I have been able to handle myself quite well on most fronts (probability questions, conceptual questions, what is the central limit theorem? can you explain MLE? etc.).
One topic kept coming up though, with 15 out of the 17 interviewers, across all 5 companies (including two of the biggest names in tech) asking this exact question:
Suppose you have a binary classifier (logistic regression, neural net, etc.): how do you handle imbalanced data sets in production?
I don’t know 🙁. I know that you need to be careful with which metric you use to evaluate your model: you should look at precision and recall, or the ROC curve, instead of just accuracy, and your sampling strategy should change to better reflect each class. But all of this applies during training.
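To make the metric point concrete, here is the toy arithmetic I have in mind (plain Python, made-up numbers; the degenerate always-negative "classifier" is just for illustration):

```python
# Toy illustration: with 1% positives, a model that always predicts
# the majority class scores 99% accuracy but 0% recall.
y_true = [1] * 10 + [0] * 990   # 1% positive class
y_pred = [0] * 1000             # degenerate "always negative" model

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
recall = tp / (tp + fn) if (tp + fn) else 0.0
precision = tp / (tp + fp) if (tp + fp) else 0.0

print(accuracy)   # 0.99
print(recall)     # 0.0
```

So accuracy looks great while the model has learned nothing about the minority class, which is why precision/recall (or the ROC curve) is the thing to report.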
Once in production, I know that you face a catch-22 situation:
- If you don’t skew your training data, then you don’t have enough data from the sparse class for the classifier to learn anything, and it will just learn to always predict the dense class.
- If you do skew your data, then you’re facing a situation where the distribution of the training data and the distribution of the production data are completely different, so your model won’t predict well (at least my understanding is that a mismatch between the training and production distributions is always a recipe for disaster).
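To put numbers on the second bullet (my own back-of-the-envelope Bayes-rule sketch, with a made-up likelihood ratio): even if the classifier learns the class-conditional likelihoods perfectly from a balanced sample, the mismatched prior alone inflates its predicted probabilities in production.

```python
# Bayes-rule arithmetic for the train/production prior mismatch.
# Suppose for some input x the likelihood ratio P(x|pos)/P(x|neg) = 10.
lr = 10.0

def posterior(prior_pos, likelihood_ratio):
    """P(pos | x) given a class prior and the likelihood ratio."""
    odds = likelihood_ratio * prior_pos / (1 - prior_pos)
    return odds / (1 + odds)

p_train = posterior(0.50, lr)   # prior baked in by a balanced (skewed) sample
p_prod = posterior(0.01, lr)    # true production prevalence is 1%

print(round(p_train, 3))  # 0.909 -> looks like a confident positive
print(round(p_prod, 3))   # 0.092 -> actually still probably negative
```

Same evidence, wildly different posteriors, purely because the prior the model absorbed during training doesn’t match production.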
Is my assessment of the dilemma correct? And how do you solve it?
Why is this question so popular? (FWIW, none of these companies were doing medical or security applications…)
Some follow up questions and/or hints that were given (but I still couldn’t really answer the question in a satisfactory way):
- If this is the case, but you noticed that your binary classifier was not performing well only after you had already deployed it to production and had been scoring with it for a few weeks, what do you do? (My answer: go back to training, and either re-evaluate which features to use or find more data to train on.) Second follow-up from the same person: what if I told you that you are stuck with the same model and can’t get any more data, what do you do then? (I answered: L1 or L2 regularization? But those apply to any data set; they aren’t specific to imbalanced data. Fiddle with the K in your K-fold CV? That wouldn’t work either; by this point I felt like I was being Kobayashi Maru’d…)
- Can you adjust your classifier after training, but before deploying it, so that it is adjusted to the original distribution rather than the skewed (downsampled or upsampled) distribution you used during training? (Drew a blank; as far as I know, any adjustment to the model based on knowledge prior to deployment constitutes training in one form or another…)
With regard to the second question, I did come across [this thread and the blog it links to](https://stats.stackexchange.com/a/403244/89649). As far as I can tell, it applies only to logistic regression, not to other binary classifiers. What about other classifiers? (Or is logistic regression the only applicable algorithm in the imbalanced case?)
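If I’m reading that thread right, the adjustment amounts to shifting the predicted log-odds by the difference between the production and training class priors (for logistic regression this is exactly an intercept adjustment). Here is my sketch of it, with made-up prevalences; I believe the same rescaling is sometimes applied as an approximation to any classifier that outputs calibrated probabilities, but I’d love confirmation:

```python
import math

def prior_correct(p_skewed, prior_train, prior_prod):
    """Rescale a predicted P(pos|x) from the skewed training prior to the
    production prior by shifting the log-odds (for logistic regression
    this is equivalent to adjusting the fitted intercept)."""
    logit = math.log(p_skewed / (1 - p_skewed))
    shift = (math.log(prior_prod / (1 - prior_prod))
             - math.log(prior_train / (1 - prior_train)))
    return 1 / (1 + math.exp(-(logit + shift)))

# Model trained on 50/50 downsampled data predicts 0.909 for some input,
# but the true production prevalence is only 1%:
print(round(prior_correct(0.909, 0.50, 0.01), 3))  # 0.092
```

Note the correction only fixes the prior mismatch; if the skewed sampling also distorted the learned likelihoods, this won’t repair that part (which may be why the thread frames it specifically for logistic regression).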