[D] Should my dataset be balanced if the distribution in the real world is imbalanced?
Say I am predicting smokers in my dataset, and from prior knowledge I know that 15% of adults in the U.S. are smokers. My end goal is to run my model against a database that has information on 200+ million U.S. adults to find potential smokers for a marketing campaign.
For my modeling data, should I purposefully mimic the “distribution” of smokers in the U.S., with 15% of the data being smokers and 85% non-smokers? Most of my coworkers have said “it’s easier to just balance them,” but I believe the model would not generalize to the entire U.S. population if I train it on balanced data.
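To make the concern concrete, here is a minimal sketch of what I think happens if I do balance. It assumes the model is trained on a 50/50 sample, outputs probabilities, and that balancing changes only the class proportions (not what smokers and non-smokers look like). Under those assumptions the raw scores are calibrated to a 50% base rate, and the usual prior-shift correction rescales the odds back to 15%. The function name `adjust_for_prior` and its defaults are hypothetical, just for illustration.

```python
def adjust_for_prior(p_balanced, train_rate=0.5, true_rate=0.15):
    """Rescale a predicted probability from the training base rate to the real one.

    Assumes 0 < p_balanced < 1 and that resampling changed only the class
    proportions, not the feature distribution within each class.
    """
    # Odds the model reports under the (balanced) training prior.
    odds = p_balanced / (1.0 - p_balanced)
    # Ratio of real-world prior odds to training prior odds.
    shift = (true_rate / (1.0 - true_rate)) / (train_rate / (1.0 - train_rate))
    adjusted_odds = odds * shift
    return adjusted_odds / (1.0 + adjusted_odds)


# A "coin flip" score from the balanced model maps back to the 15% base rate.
print(round(adjust_for_prior(0.5), 4))  # 0.15
```

The correction is monotonic, so the ranking of people by score is unchanged; only the probabilities themselves (and any threshold applied to them) move.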