[P] Predict gender of people over the phone in a highly unbalanced dataset

Hi everyone,

I am an academic researcher venturing into machine learning for one of my projects. I am trying to identify the gender of company executives based on their recorded voice over the phone. Here are the different datasets that I am working with:

  1. Prediction dataset: 50k recorded voices analyzed for frequency, 8-10% female.
  2. Testing dataset: Subsample of 1k data points from the prediction dataset, manually coded by me, 20% female (I added more females because I was worried there weren’t enough).
  3. Training dataset: 10k voices from public datasets, 50% female. Voices are resampled at 8 kHz to match phone standards and resemble the final dataset.

I have tried a few different models on the training dataset, with great success. When I split the training dataset 80/20 to run some tests, I get an accuracy of roughly 97% on the held-out 20%. When I apply the saved model to the testing dataset drawn from the real population, accuracy drops to 85%. I am worried that this is due in part to the gender imbalance.
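One way to check whether the imbalance is part of the story would be to look at per-class recall instead of a single accuracy number, since overall accuracy depends on how many women are in the sample. Below is a minimal sketch of that idea in plain Python. The label coding (`"F"` / `"M"`), the helper names, and the toy labels are all made up for illustration and are not my real results; the 9% prevalence is just a point inside the 8-10% range of the prediction dataset.

```python
from collections import Counter


def per_class_recall(y_true, y_pred):
    """Return {label: fraction of that label's samples predicted correctly}."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}


def balanced_accuracy(recalls):
    """Unweighted mean of the per-class recalls."""
    return sum(recalls.values()) / len(recalls)


def expected_accuracy(recalls, prevalence):
    """Accuracy expected if each class keeps its recall but occurs
    with the given prevalence (shares should sum to 1)."""
    return sum(recalls[label] * prevalence[label] for label in recalls)


# Toy labels, invented purely for illustration (hypothetical coding "F"/"M").
y_true = ["F"] * 2 + ["M"] * 8
y_pred = ["F", "M"] + ["M"] * 7 + ["F"]

recalls = per_class_recall(y_true, y_pred)
print("per-class recall:", recalls)
print("balanced accuracy:", balanced_accuracy(recalls))

# Assumed population mix: 9% female, taken from the 8-10% range above.
print("expected accuracy at 9% female:",
      expected_accuracy(recalls, {"F": 0.09, "M": 0.91}))
```

If the two recalls are very different, the same model will show different accuracy on the 50% female training split, the 20% female testing sample, and the 8-10% female prediction set, even though nothing about the model has changed.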

What would be the best practices for tackling such a problem?

Thanks a lot!

submitted by /u/newtomtl83