[D] Suggestions on good practice when merging k-means centroids
Hi, I posted this on /r/datascience but thought I’d x-post for visibility.
I was wondering if I could get some feedback on whether my methodology is problematic.
I’m working with a pre-established set of 12 cluster centroids in a classification problem, based on the output of a 42-element 2D joint histogram. When classifying points, these histograms are collapsed so that the classification is only done on a 3-element vector representing the mean quantities of the data point.
The purpose of this is to identify cloud types, which feed into some of my cluster-specific analysis. Three adjacent cluster centroids all refer to the same cloud type, but with different ‘thicknesses’. In my final work, I’d like just a single cluster to represent this type. My worry is that if I merge them by, for instance, taking the mean of these three centroids, the classification step will miss many points that would ordinarily have been assigned to one of these clusters, because those points might then be closer to another centroid that doesn’t represent them accurately.
My idea is to classify my datapoints with a codebook that still contains the three centroids separately (say, clusters 1, 2, and 3). After allocating all my points, I’d then merge the three sets of labels into a single class. The result would be a cluster that has been manually extended to capture points that a single merged centroid wouldn’t ordinarily pick up.
Is this a problematic way of merging clusters, as opposed to say, taking the mean of the three cluster centroids? Or are there better ways of doing this?
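To make the comparison concrete, here is a minimal sketch of the two options: classify with all centroids and merge the labels afterwards, versus replacing the three centroids with their mean before classifying. The centroid values, the points, and the helper names `assign_nearest` and `merge_labels` are all made up for illustration, and it assumes plain Euclidean nearest-centroid assignment on the 3-element vectors.

```python
import numpy as np

def assign_nearest(points, centroids):
    """Return the index of the nearest centroid (Euclidean) for each point."""
    # Pairwise distances, shape (n_points, n_centroids)
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def merge_labels(labels, group, merged_label):
    """Relabel every point assigned to a cluster in `group` as `merged_label`."""
    labels = labels.copy()
    labels[np.isin(labels, group)] = merged_label
    return labels

# Hypothetical toy codebook: clusters 0-2 are the same cloud type at
# increasing thickness, cluster 3 is a different type.
centroids = np.array([
    [0.0, 0.0, 0.0],
    [2.0, 0.0, 0.0],
    [4.0, 0.0, 0.0],
    [5.5, 0.0, 0.0],
])
group = [0, 1, 2]
merged_label = 0
points = np.array([
    [0.2, 0.0, 0.0],
    [3.9, 0.0, 0.0],
    [5.4, 0.0, 0.0],
])

# Option A: classify against the full codebook, then merge the labels.
labels_post = merge_labels(assign_nearest(points, centroids), group, merged_label)

# Option B: replace the grouped centroids with their mean, then classify.
others = [i for i in range(len(centroids)) if i not in group]
codebook = np.vstack([centroids[group].mean(axis=0)[None, :], centroids[others]])
label_map = np.array([merged_label] + others)  # codebook row -> original label
labels_mean = label_map[assign_nearest(points, codebook)]

print(labels_post)  # the point at 3.9 stays with the merged cloud type
print(labels_mean)  # the point at 3.9 moves to cluster 3
```

In this toy case the point at 3.9 sits right next to the ‘thick’ centroid, but after averaging it ends up closer to the unrelated cluster, which is exactly the failure I’m worried about with the mean approach.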
I’ve drawn out a basic diagram attempting to illustrate what I mean: https://i.imgur.com/aGIW3f5.jpg
Thanks a lot in advance.
EDIT: I’ve looked at agglomerative clustering, but I’m working off an (almost) nicely defined set of clusters, aside from this issue. I tried merging the cluster centroids using agglomerative clustering, but unfortunately it merged two clusters that I didn’t want merged. (PS. Is this how you do agglomerative clustering? Can you just fit the agglomerative algorithm by passing it the original k-means centroids?)
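For reference, a minimal sketch of what passing centroids straight to an agglomerative algorithm could look like. It assumes SciPy's `linkage` and `fcluster` with average linkage, and the centroid values are made up. Each centroid is treated as a single observation, so the number of points behind each cluster is ignored.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical centroids: the first three are the ones that should be merged,
# the last two are distinct types that happen to sit close together.
centroids = np.array([
    [0.0, 0.0, 0.0],
    [2.0, 0.0, 0.0],
    [4.5, 0.0, 0.0],
    [10.0, 0.0, 0.0],
    [11.0, 0.0, 0.0],
])

# Build the merge tree from pairwise Euclidean distances between centroids.
Z = linkage(centroids, method="average", metric="euclidean")

# Cut the tree so that at most two groups remain.
merged = fcluster(Z, t=2, criterion="maxclust")
print(merged)  # one group id per original centroid
```

The merges are driven purely by distance, so the last two centroids get grouped here whether or not they are the same cloud type, which matches the unwanted merge described above.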