[P] Methodology for an application paper on imbalanced classification
Hi all, I’m a PhD student in the UK and I am currently working on a paper. I won’t go into the specifics of the problem, but I will lay out my methodology and would love to hear your opinions. My aim is to publish in a journal related to the application, not a machine learning journal. Because of this, the machine learning techniques used here are, by design, not novel.
The problem boils down to imbalanced binary classification: around 15,000 data points, each with 9 features, of which only around 600 (about 4%) belong to the minority class. I call this DataSet1. In preliminary experiments I implemented steps 1 and 2 on DataSet1.
Step 1: Using scikit-learn in Python, use k-fold cross-validation to compare the performance of 10 popular classifiers (SVM, Random Forest, Logistic Regression, etc.). Grid search would be used for hyperparameter selection, and the models would be scored using the area under the ROC curve (AUC).
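A minimal sketch of what Step 1 could look like, assuming scikit-learn is installed. The data is a synthetic stand-in generated with `make_classification` (not the real DataSet1), and the two candidate models and their small grids are hypothetical placeholders for the full list of 10 classifiers.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 9 features, roughly 4% minority class (assumption, not real data).
X, y = make_classification(
    n_samples=2000, n_features=9, n_informative=5,
    weights=[0.96, 0.04], random_state=0,
)

# Hypothetical candidates: (estimator, hyperparameter grid) per model name.
candidates = {
    "logreg": (
        Pipeline([("scale", StandardScaler()),
                  ("clf", LogisticRegression(max_iter=1000))]),
        {"clf__C": [0.1, 1.0, 10.0]},
    ),
    "rf": (
        Pipeline([("clf", RandomForestClassifier(n_estimators=50, random_state=0))]),
        {"clf__max_depth": [3, None]},
    ),
}

# Stratified folds keep the minority fraction roughly constant across folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

results = {}
for name, (pipe, grid) in candidates.items():
    search = GridSearchCV(pipe, grid, scoring="roc_auc", cv=cv)
    search.fit(X, y)
    results[name] = (search.best_score_, search.best_params_)

for name, (auc, params) in results.items():
    print(f"{name}: mean CV AUC = {auc:.3f}, best params = {params}")
```

Putting the scaler inside the `Pipeline` means it is refit on each training fold, so no information from the held-out fold reaches the model. Note that `best_score_` is the score of the best grid point on the same folds used to choose it, so it tends to be slightly optimistic when used to rank models.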
Step 2: The three best-performing classifiers would then be assessed in conjunction with sampling techniques such as undersampling, oversampling and SMOTE. Implementations of classifiers that incorporate sampling internally (such as balanced random forest and balanced bagging) would also be tested.
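A minimal sketch of one way Step 2 could be evaluated, using plain random undersampling written by hand with NumPy so that only scikit-learn is required. The helper `random_undersample` is hypothetical, the data is synthetic, and the minority class is assumed to be labelled 1. The point it illustrates is that resampling is applied to the training fold only, while each test fold keeps its natural class balance.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold


def random_undersample(X, y, rng):
    """Drop majority-class rows at random until both classes have equal counts.

    Assumes label 1 is the minority class and label 0 the majority class.
    """
    minority_idx = np.flatnonzero(y == 1)
    majority_idx = np.flatnonzero(y == 0)
    keep_majority = rng.choice(majority_idx, size=len(minority_idx), replace=False)
    keep = np.concatenate([minority_idx, keep_majority])
    return X[keep], y[keep]


# Synthetic stand-in data (assumption, not the real DataSet1).
X, y = make_classification(
    n_samples=2000, n_features=9, n_informative=5,
    weights=[0.96, 0.04], random_state=0,
)

rng = np.random.default_rng(0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

fold_aucs = []
for train_idx, test_idx in cv.split(X, y):
    # Resample the training fold only; the test fold is left untouched.
    X_res, y_res = random_undersample(X[train_idx], y[train_idx], rng)
    model = LogisticRegression(max_iter=1000).fit(X_res, y_res)
    scores = model.predict_proba(X[test_idx])[:, 1]
    fold_aucs.append(roc_auc_score(y[test_idx], scores))

print(f"mean AUC with undersampling: {np.mean(fold_aucs):.3f}")
```

The imbalanced-learn package provides SMOTE, balanced random forest and balanced bagging, along with a pipeline class that applies samplers during fitting only, which gives the same train-only behaviour without the manual loop.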
In my case DataSet1 is a time series. As this experimentation did not give good enough results, I decided to implement step 3 on DataSet1.
Step 3: Reformulate the data set so that each data point also includes the first λ lags (the λ previous observations) of each variable, where λ is a natural number.
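A minimal sketch of the lag construction in Step 3, assuming the rows of the feature matrix are ordered in time and equally spaced. The helper name `add_lags` is hypothetical. This version keeps the current values alongside the lags, so each output row has `(n_lags + 1) * n_features` columns, and it drops the first `n_lags` rows because they have no complete history.

```python
import numpy as np


def add_lags(X, n_lags):
    """Return rows [x_t, x_{t-1}, ..., x_{t-n_lags}] for every t >= n_lags.

    X is assumed to have shape (n_samples, n_features) and to be time-ordered.
    """
    n_samples = X.shape[0]
    # Block k holds the observations shifted back by k time steps.
    blocks = [X[n_lags - k : n_samples - k] for k in range(n_lags + 1)]
    return np.hstack(blocks)


X = np.arange(12, dtype=float).reshape(6, 2)  # toy series: 6 time steps, 2 features
X_lagged = add_lags(X, n_lags=2)

print(X_lagged.shape)  # (4, 6)
print(X_lagged[0])     # [4. 5. 2. 3. 0. 1.]
```

The label vector has to be trimmed the same way (`y[n_lags:]`) so that each label stays aligned with the row for its own time step.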
This produced DataSet2, in which each data point now contained (λ+1)*9 features (the 9 current values plus 9 for each of the λ lags). The aim was then to implement steps 1 and 2 on this new data set. However, the high dimensionality, combined with the large number of data points, cross-validation and grid search for hyperparameter selection, led to these experiments having an inconveniently long run time.
To reduce the dimensionality, PCA is applied to DataSet1 to produce DataSet3, in which each data point consists of the principal component scores needed to explain 95% of the total variance. Step 3 is then implemented on DataSet3, creating DataSet4. Steps 1 and 2 can then be implemented on DataSet4. For a suitable value of λ (12 in my case), this led to much more accurate classification.
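A minimal sketch of the PCA-then-lag construction described above, assuming scikit-learn. The data is synthetic (9 features driven by 3 latent factors), the variable names `X1`, `X3` and `X4` simply mirror DataSet1, DataSet3 and DataSet4, and standardising before PCA is an assumption of this sketch rather than something stated in the post.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for DataSet1: 500 time steps, 9 correlated features.
latent = rng.normal(size=(500, 3))
X1 = latent @ rng.normal(size=(3, 9)) + 0.1 * rng.normal(size=(500, 9))

# DataSet3: keep just enough principal components to explain >= 95% of the variance.
X_scaled = StandardScaler().fit_transform(X1)
pca = PCA(n_components=0.95, svd_solver="full")
X3 = pca.fit_transform(X_scaled)

# DataSet4: current component scores plus their first n_lags lags.
n_lags = 12
n = X3.shape[0]
X4 = np.hstack([X3[n_lags - k : n - k] for k in range(n_lags + 1)])

print("components kept:", pca.n_components_)
print("DataSet3 shape:", X3.shape)
print("DataSet4 shape:", X4.shape)
```

Here the scaler and PCA are fit on the whole series for brevity. Inside cross-validation they would normally be fit on the training portion only, otherwise the held-out data influences the projection.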
Is this a good line of experimentation? Is AUC scoring alone sufficient? Is there anything about this method that is bad practice? Are the models that I am using outdated? Any feedback would be greatly appreciated! Thanks!