[D] How to deal with a classification problem on a large, imbalanced dataset?
I have a dataset of 8 million unique members, approximately 800 million records. Of those 8 million members, only about 25,000 are positives, so it's a heavily imbalanced binary classification problem. I'd prefer not to simply downsample, although the downsampled RF performs pretty well.

The data is on a Hadoop cluster, and I can only access it through a Zeppelin notebook with PySpark. Getting packages approved and installed is a pain in the ass, the PySpark kernel is still on Python 2.7 (and I don't really use Python 2), and the notebook runs in a VM with no internet access, so I'd have to rewrite solutions like SMOTE myself if I wanted to use them. I did find a package that would help, but approval takes about a week and I only have two more weeks for the project. I wanted to use a balanced or weighted random forest, but I don't see a native spark.ml implementation. I'm also fairly new to Spark. What should I do?
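For concreteness, here's a rough sketch of the two workarounds I'm considering: a per-row class-weight column (as far as I can tell, RandomForestClassifier only accepts weightCol from Spark 3.0, so on an older cluster I'd fall back to something like LogisticRegression, which has taken weightCol since 1.6) and stratified downsampling of the negatives with sampleBy. The column names, the 10:1 ratio, and the assumption that I already have an assembled DataFrame called `df` are all placeholders; I haven't actually run this.

```python
from pyspark.sql import functions as F
from pyspark.ml.classification import RandomForestClassifier, LogisticRegression

# Assumed: df has a double "label" column (0.0/1.0) and an assembled "features" vector.
n_pos = df.filter(F.col("label") == 1).count()
n_neg = df.filter(F.col("label") == 0).count()

# Option 1: class-weight column, usable with any estimator that exposes weightCol.
# Weight positives by inverse class frequency so both classes contribute equally.
pos_weight = float(n_neg) / float(n_pos)
weighted = df.withColumn(
    "weight",
    F.when(F.col("label") == 1, pos_weight).otherwise(1.0),
)
# LogisticRegression supports weightCol on older Spark versions;
# RandomForestClassifier only gained weightCol in Spark 3.0.
lr = LogisticRegression(featuresCol="features", labelCol="label", weightCol="weight")
lr_model = lr.fit(weighted)

# Option 2: stratified downsampling of the majority class with sampleBy,
# keeping all positives and roughly a 10:1 negative-to-positive ratio.
neg_frac = min(1.0, 10.0 * n_pos / float(n_neg))
downsampled = df.sampleBy("label", fractions={0.0: neg_frac, 1.0: 1.0}, seed=42)
rf = RandomForestClassifier(featuresCol="features", labelCol="label", numTrees=100)
rf_model = rf.fit(downsampled)
```

Is something along these lines reasonable, or is there a better way to get a balanced/weighted forest out of stock spark.ml?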
Any tips or advice on how to proceed would be highly appreciated.
submitted by /u/melesigenes