[P] Predict gender of people over the phone in a highly unbalanced dataset

Written by torontoai on July 21, 2019. Posted in Reddit MachineLearning.

Hi everyone,

I am an academic researcher venturing into machine learning for one of my projects. I am trying to identify the gender of company executives based on their recorded voice over the phone. Here are the different datasets that I am working with:

Prediction dataset: 50k recorded voices analyzed for frequency, 8-10% female.
Testing dataset: Subsample of 1k data points from the prediction dataset, manually coded by me, 20% female (I added more females because I was worried there weren’t enough).
Training dataset: 10k voices from public datasets, 50% female. Voices are resampled at 8 kHz to match phone standards and resemble the final dataset.

I have tried a few different models on the training dataset, with great success. When I split the training dataset 80/20 to run some tests, I get an accuracy of roughly 97% on the held-out part. When I apply the saved model to the testing dataset from the real population, accuracy drops to 85%. I am worried that this is in part due to the imbalance in gender.
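One way to check how much of that drop could come from the class mix alone is to look at recall per gender instead of a single accuracy number. The following is a minimal sketch in plain Python, not the code used for the project: the function names, the label strings "female" and "male", and the recall values in the demo are all hypothetical and only illustrate the arithmetic.

```python
def per_class_recall(y_true, y_pred, label):
    """Share of examples whose true class is `label` that were predicted as `label`."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    total = sum(1 for t in y_true if t == label)
    return hits / total if total else float("nan")


def balanced_accuracy(y_true, y_pred, labels=("female", "male")):
    """Unweighted mean of per-class recall; it does not depend on the class mix."""
    return sum(per_class_recall(y_true, y_pred, lab) for lab in labels) / len(labels)


def expected_accuracy(recall_female, recall_male, female_share):
    """Overall accuracy implied by fixed per-class recalls at a given female share."""
    return female_share * recall_female + (1.0 - female_share) * recall_male


if __name__ == "__main__":
    # Hypothetical recalls, chosen only to show how the class mix moves accuracy.
    recall_f, recall_m = 0.70, 0.90
    # Female shares from the post: training (50%), testing (20%), prediction (~9%).
    for share in (0.50, 0.20, 0.09):
        acc = expected_accuracy(recall_f, recall_m, share)
        print(f"female share {share:.0%}: expected accuracy {acc:.3f}")
```

If the two recalls measured on the 1k hand-coded sample are both well below 97%, the drop is not only a class-mix effect, and differences between the public recordings and the real phone recordings are a more likely cause.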

What would be the best practices to tackle such a problem?

Thanks a lot!

submitted by /u/newtomtl83