
[D] How to deal with a classification problem on a big imbalanced dataset?

I have a dataset of 8 million unique members, approximately 800 million records. Of those 8 million members, I have a positive sample of about 25,000. It’s a binary classification problem. I’d like to do better than simply downsampling, although a downsampled RF performs pretty well. The data is on a Hadoop cluster that I can only access through a Zeppelin notebook with PySpark, and the PySpark kernel runs Python 2.7, which I don’t really use anymore. Getting packages approved for installation is a pain in the ass: the notebook runs in a VM with no internet access, so I’d have to rewrite solutions like SMOTE from scratch to use them. I found a package that would help, but approval takes about a week and I only have two more weeks for the project. I wanted to use a balanced or weighted random forest, but I don’t see a native implementation in Spark. I’m also kind of new to Spark.
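If installing packages is off the table, one workaround that needs nothing beyond PySpark itself is class weighting: compute a per-record weight inversely proportional to class frequency and pass it to an estimator that accepts a `weightCol` (`LogisticRegression` supports this in Spark 2.x; `RandomForestClassifier` only gained `weightCol` in Spark 3.0). A minimal sketch follows, with the weight arithmetic in plain Python and the PySpark wiring shown in comments; the `label`/`features`/`weight` column names are illustrative, not from the post:

```python
# Sketch: "balanced" class weights, n_total / (n_classes * n_class),
# so each class contributes equal total weight. Counts are from the
# post: ~800M records, ~25k positives.

def balanced_weights(n_pos, n_neg):
    """Return (w_pos, w_neg) so the two classes carry equal total weight."""
    n_total = float(n_pos + n_neg)
    return n_total / (2.0 * n_pos), n_total / (2.0 * n_neg)

n_pos = 25000
n_neg = 800000000 - n_pos
w_pos, w_neg = balanced_weights(n_pos, n_neg)
print(w_pos, w_neg)  # positives get a much larger weight than negatives

# In PySpark this would become a weight column, roughly:
#
#   from pyspark.sql import functions as F
#   from pyspark.ml.classification import LogisticRegression
#   df = df.withColumn(
#       "weight", F.when(F.col("label") == 1, w_pos).otherwise(w_neg))
#   lr = LogisticRegression(featuresCol="features", labelCol="label",
#                           weightCol="weight")
```

This won’t give you a weighted random forest on Spark 2.x, but a weighted logistic regression is often a reasonable baseline on data this size, and the same weight column carries over to tree ensembles if the cluster is ever upgraded to Spark 3.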

Any tips or advice on how to proceed would be highly appreciated.

submitted by /u/melesigenes