[D] How to deal with a classification problem on a big imbalanced dataset?

Written by torontoai on October 5, 2019. Posted in Reddit MachineLearning.

I have a dataset of approximately 800 million records covering 8 million unique members. Of those 8 million members, only about 25,000 are positive samples. It’s a binary classification problem. I would rather not simply downsample, although a random forest trained on the downsampled data performs pretty well.

The data is on a Hadoop cluster, and I only have access to it via a Zeppelin notebook with PySpark. The notebook runs in a VM that’s not connected to the internet, and it’s a pain in the ass to get approval to install packages. PySpark is even on Python 2.7, and I don’t really use Python 2. I would have to rewrite solutions like SMOTE myself if I wanted to use them. I found a package, but approval takes about a week and I only have two more weeks for the project. I wanted to use a balanced or weighted random forest, but I don’t see a native spark.ml implementation. I’m also kind of new to Spark. What should I do?
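For reference, the class-weighting idea mentioned above can be sketched in plain Python with no extra packages. This is only a rough illustration, not a tested solution for this cluster: the helper name `balanced_class_weights`, the `label` and `weight` column names, and the "total / (n_classes * count)" heuristic are all assumptions, and whether a given spark.ml estimator accepts a weight column depends on the installed Spark version.

```python
from __future__ import division, print_function


def balanced_class_weights(class_counts):
    """Return per-class weights inversely proportional to class frequency.

    Uses the common "balanced" heuristic: total / (n_classes * count),
    so every class contributes the same total weight.
    """
    total = sum(class_counts.values())
    n_classes = len(class_counts)
    return dict(
        (label, total / (n_classes * count))
        for label, count in class_counts.items()
    )


# Member-level counts taken from the post: ~25,000 positives in ~8 million.
counts = {1: 25000, 0: 8000000 - 25000}
weights = balanced_class_weights(counts)
print(weights)

# In PySpark (not run here; column names are hypothetical), the weights
# could be attached as a column:
#   from pyspark.sql import functions as F
#   df = df.withColumn(
#       "weight",
#       F.when(F.col("label") == 1, weights[1]).otherwise(weights[0]))
# and then passed to an estimator that exposes a weightCol parameter.
```

With these counts each positive row gets a weight of 160 and each negative row a weight of roughly 0.5, so the two classes carry equal total weight during training.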

Any tips or advice on how to proceed? I would highly appreciate it.

submitted by /u/melesigenes
