[D] Clustering methodology for high-dimensional data where some features are strongly correlated with one another?
Hi, I’m working on a model to cluster users based on their demographic and behavioral features.
I was reading up on the literature and found that strongly correlated features can skew the dimensionality reduction (PCA, for now): the leading components end up dominated by whichever features are highly correlated with each other.
I was thinking of computing a simple correlation matrix to remove those features and cut through the clutter before clustering.
But right now, our methodology looks like…
1. Normalizing our features (mean 0, stdev 1)
2. Correlation matrix to weed out some features
3. PCA or some other dimensionality reduction
4. K-Means Clustering
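
To make the four steps above concrete, here is a minimal sketch in plain NumPy so each step is visible. Everything in it is an assumption for illustration: the data is synthetic, and the 0.9 correlation cutoff, the 90% explained-variance target, k = 2, and the `kmeans` helper are hypothetical choices, not our actual setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the user table: 300 users, 6 features.
# Feature 1 is a near-copy of feature 0 (strongly correlated on purpose).
X = rng.normal(size=(300, 6))
X[:, 1] = X[:, 0] + 0.05 * rng.normal(size=300)
X[:150, 2:] += 3.0  # shift half the users so there are two groups to find

# 1. Normalize each feature to mean 0, stdev 1.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Correlation filter: keep a feature only if its |r| with every
#    already-kept feature is at or below the cutoff.
threshold = 0.9  # hypothetical cutoff
corr = np.corrcoef(Z, rowvar=False)
keep = []
for j in range(Z.shape[1]):
    if all(abs(corr[j, k]) <= threshold for k in keep):
        keep.append(j)
Z_kept = Z[:, keep]

# 3. PCA via SVD (columns of Z_kept are already centered).
#    Keep enough components to explain 90% of the variance.
_, S, Vt = np.linalg.svd(Z_kept, full_matrices=False)
explained = S**2 / np.sum(S**2)
n_comp = int(np.searchsorted(np.cumsum(explained), 0.90) + 1)
scores = Z_kept @ Vt[:n_comp].T

# 4. K-means (Lloyd's algorithm) on the PCA scores.
def kmeans(data, k, n_iter=50, seed=0):
    r = np.random.default_rng(seed)
    centers = data[r.choice(len(data), size=k, replace=False)]
    for _ in range(n_iter):
        dist = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        new_centers = centers.copy()
        for c in range(k):
            members = data[labels == c]
            if len(members) > 0:  # leave an empty cluster's center in place
                new_centers[c] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

labels, centers = kmeans(scores, k=2)
print("kept features:", keep, "| components:", n_comp)
print("cluster sizes:", np.bincount(labels, minlength=2))
```

Note that the filter in step 2 is order-dependent: of each correlated pair it keeps whichever feature comes first, which may or may not be the one you care about.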
The problem is that there are some features we might not be able to cut: category mixes (e.g. a user has spent x% on category A, y% on category B, z% on category C, where x + y + z = 100%) should still be relevant in our case, but they will be highly correlated with one another because they are constrained to sum to 100%. Any ideas on how we can handle this?
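
A small sketch of why the category mix is awkward, plus two workarounds that are commonly suggested for sum-to-one (compositional) features. The shares here are simulated from a Dirichlet distribution, and `eps` is a hypothetical offset to guard against zero shares; none of this is our real data.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical category mixes for 500 users: shares of spend on A, B, C.
shares = rng.dirichlet(alpha=[2.0, 2.0, 2.0], size=500)
print("rows sum to 1:", np.allclose(shares.sum(axis=1), 1.0))

# The sum-to-one constraint pushes the shares toward negative correlation.
corr = np.corrcoef(shares, rowvar=False)
print("correlation matrix:\n", corr.round(2))

# It also makes the covariance matrix singular: any one share is fully
# determined by the other two, so one eigenvalue is numerically zero.
min_eig = np.linalg.eigvalsh(np.cov(shares, rowvar=False)).min()
print("smallest covariance eigenvalue:", min_eig)

# Option A: drop one share. No information is lost, since it equals
# 1 minus the sum of the others.
shares_drop = shares[:, :2]

# Option B: centered log-ratio (CLR) transform, a standard transform for
# compositional data: log of each share minus the row's mean log share.
eps = 1e-6  # hypothetical offset so log(0) cannot occur
log_s = np.log(shares + eps)
clr = log_s - log_s.mean(axis=1, keepdims=True)
```

One caveat on Option B: CLR rows sum to zero, so the transformed columns are still linearly dependent. Whether that matters depends on what comes next in the pipeline (PCA tolerates it; anything that inverts the covariance matrix does not).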
And as an aside, how do clustering algorithms (K-means specifically) handle missing (null) values?
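
To show what I mean by the null question: plain K-means has no notion of a missing value, because a single NaN makes every distance for that row NaN. The sketch below uses a toy 4×2 array with fixed, made-up centers, and then applies column-mean imputation as one simple (assumed, not necessarily best) workaround.

```python
import numpy as np

# Toy data: four users, two features, one missing value.
X = np.array([[1.0, 2.0],
              [1.2, np.nan],
              [8.0, 9.0],
              [7.5, 8.5]])
centers = np.array([[1.0, 2.0],
                    [8.0, 9.0]])  # made-up cluster centers

def distances(data, centers):
    # Euclidean distance from every row to every center.
    return np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)

# With the NaN left in, the second row's distance to every center is NaN,
# so the assignment step has nothing meaningful to compare.
d_raw = distances(X, centers)
print(d_raw)

# Simple workaround: replace each NaN with its column mean (ignoring NaNs),
# then assign each row to its nearest center as usual.
col_means = np.nanmean(X, axis=0)
X_filled = np.where(np.isnan(X), col_means, X)
labels = distances(X_filled, centers).argmin(axis=1)
print(labels)
```

Mean imputation pulls incomplete rows toward the middle of the data, so dropping rows or using a model-based imputer may be better depending on how much is missing.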
Would love your take on the methodology! All help is appreciated, thanks!
submitted by /u/ibetDELWYN