[P] Help implementing “Deep Unsupervised Clustering with Gaussian Mixture Variational Autoencoders” paper?
Here is the link to the paper I am referring to: https://arxiv.org/pdf/1611.02648.pdf
I am having trouble understanding how to implement this paper correctly. I understand that instead of using an isotropic Gaussian as the prior for the latent space, they use a mixture of Gaussians. What I am really struggling with is how they calculate their lower bound, specifically its four terms: the reconstruction term, the conditional prior term, the w-prior term, and the z-prior term. As far as I can tell, the z-prior term uses a direct probability of the class a data point belongs to, and I am not sure where they get this from. If anybody could offer any help or point me to somewhere I could find some, it would be greatly appreciated!
In summary, here are my questions:
- How are they generating a mixture of Gaussians for the latent space? Does this mean creating n distributions for the latent space (where n is the number of clusters), so basically having n sets of mean and variance layers (each with the size of the latent space), rather than just one set of these layers like in a normal variational autoencoder? Or is the latent space still representing all the data, and then Gaussian distributions are sampled from the latent space?
- How are they reparameterizing multiple distributions (assuming my understanding of how they handle the multiple Gaussians is correct, which it very likely is not)?
- How are they directly outputting the probability that a sample belongs to a certain distribution, which represents a cluster?
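For the third question, one way such a probability could be obtained without a dedicated output layer is Bayes' rule over the mixture components: weight each component's density at the latent point by its mixing weight, then normalize. The sketch below illustrates only that general computation, assuming diagonal Gaussians. The function names, the two-cluster setup, and all numbers are hypothetical, and whether this matches the paper's exact formulation should be checked against its equations.

```python
import math

def diag_gaussian_logpdf(point, mean, var):
    """Log-density of a diagonal Gaussian evaluated at `point`."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (p - m) ** 2 / v)
        for p, m, v in zip(point, mean, var)
    )

def cluster_probabilities(point, means, variances, weights):
    """Posterior probability of each mixture component given `point`."""
    log_joint = [
        math.log(w) + diag_gaussian_logpdf(point, m, v)
        for m, v, w in zip(means, variances, weights)
    ]
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(log_joint)
    unnormalized = [math.exp(lj - top) for lj in log_joint]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

# Hypothetical setup: two clusters in a 2-D latent space, uniform weights.
means = [[-2.0, 0.0], [2.0, 0.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]
weights = [0.5, 0.5]

probs = cluster_probabilities([1.5, 0.1], means, variances, weights)
print(probs)  # the second cluster should receive most of the probability mass
```

Under this reading, the cluster probabilities are computed from the component means and variances rather than predicted by an extra softmax layer.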
Any help is much appreciated, thank you!
Update: While thinking about it, I got this idea: are they generating n (the number of clusters) mean and variance layers, and then averaging them in order to do the reparameterization for the latent space? Or maybe averaging the reparameterization terms? I don’t know if that’s right, and it seems very expensive computationally if n is large. And wouldn’t the mean and variance layers all end up the same if done this way? I don’t know, I’m just confused, and it is very late.
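One alternative to averaging, offered as an unverified reading and not as the paper's confirmed method, is that only a single encoder Gaussian is reparameterized, while the n component Gaussians appear only in the loss, as a KL penalty weighted by the cluster probabilities. The sketch below shows that idea with diagonal Gaussians. Every name and number in it is hypothetical, and the cluster probabilities are hard-coded placeholders.

```python
import math
import random

def reparameterize(mean, var, rng):
    """Sample mean + sqrt(var) * eps with eps ~ N(0, 1), elementwise."""
    return [m + math.sqrt(v) * rng.gauss(0.0, 1.0) for m, v in zip(mean, var)]

def kl_diag_gaussians(mean_q, var_q, mean_p, var_p):
    """KL(q || p) between two diagonal Gaussians."""
    return 0.5 * sum(
        math.log(vp / vq) + (vq + (mq - mp) ** 2) / vp - 1.0
        for mq, vq, mp, vp in zip(mean_q, var_q, mean_p, var_p)
    )

def weighted_prior_kl(mean_q, var_q, prior_means, prior_vars, cluster_probs):
    """Sum over clusters of P(cluster) * KL(q || that cluster's Gaussian)."""
    return sum(
        p * kl_diag_gaussians(mean_q, var_q, m, v)
        for p, m, v in zip(cluster_probs, prior_means, prior_vars)
    )

# Hypothetical encoder output for one data point (a single Gaussian).
mean_q, var_q = [1.5, 0.0], [0.5, 0.5]

# Hypothetical component Gaussians, one per cluster.
prior_means = [[-2.0, 0.0], [2.0, 0.0]]
prior_vars = [[1.0, 1.0], [1.0, 1.0]]

# Placeholder cluster probabilities (assumed to be computed elsewhere).
cluster_probs = [0.1, 0.9]

rng = random.Random(0)
sample = reparameterize(mean_q, var_q, rng)  # one sample, no averaging
penalty = weighted_prior_kl(mean_q, var_q, prior_means, prior_vars, cluster_probs)
print(sample, penalty)
```

In this version the cost grows linearly with n through the KL sum, and the component means and variances are free to differ because each is weighted by its own cluster probability.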