[P] Predict gender of people over the phone in a highly unbalanced dataset
I am an academic researcher venturing into machine learning for one of my projects. I am trying to identify the gender of company executives based on their recorded voice over the phone. Here are the different datasets that I am working with:
- Prediction dataset: 50k recorded voices analyzed for frequency, 8-10% female.
- Testing dataset: Subsample of 1k data points from the prediction dataset, manually coded by me, 20% female (I deliberately included more women because I was worried there weren’t enough).
- Training dataset: 10k voices from public datasets, 50% female. Voices are resampled at 8 kHz to match phone standards and resemble the final dataset.
I have tried a few different models on the training dataset, with great success. When I split the training dataset 80%/20% to run some tests, I get an accuracy of roughly 97% on the held-out 20%. When I apply the saved model to the testing dataset from the real population, accuracy drops to 85%. I am worried that this is in part due to the imbalance in gender.
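To make the concern concrete, here is a minimal sketch of how overall accuracy depends on the class mix when per-class recall stays fixed. The function names (`per_class_recall`, `expected_accuracy`) and the toy labels are hypothetical and only for illustration; they are not my real data or results. It isolates the class-mix effect only and says nothing about other differences between the public recordings and the phone recordings.

```python
from collections import Counter

def per_class_recall(y_true, y_pred):
    """Return {label: fraction of that label's examples predicted correctly}."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}

def expected_accuracy(recall, female_share):
    """Overall accuracy implied by fixed per-class recalls at a given female share."""
    return female_share * recall["F"] + (1 - female_share) * recall["M"]

# Hypothetical toy labels (NOT my real data): 10 men, 10 women.
y_true = ["M"] * 10 + ["F"] * 10
# Hypothetical predictions: all men correct, two women misclassified as men.
y_pred = ["M"] * 10 + ["F"] * 8 + ["M"] * 2

recall = per_class_recall(y_true, y_pred)

# Same per-class recalls, evaluated at the three class mixes in my datasets.
for share in (0.5, 0.2, 0.1):
    acc = expected_accuracy(recall, share)
    print(f"female share {share:.0%}: accuracy {acc:.3f}")
```

Since plain accuracy moves with the class mix, the 97% (50% female) and the 85% (20% female) are not directly comparable; per-class recall would probably be a fairer way to compare the two datasets.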
What would be the best practices to tackle such a problem?
Thanks a lot!