[D] Why does Beta-VAE help in learning disentangled/independent latent representations?
In the Beta-VAE paper (https://openreview.net/pdf?id=Sy2fzU9gl), the authors mention that setting Beta > 1 helps the network learn independent latent representations. However, in a VAE the posterior distribution itself is already assumed to be a Gaussian with a diagonal covariance matrix, i.e.
q(z|x) = N(mu(x), Sigma(x)), where Sigma(x) is a diagonal covariance matrix.
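
To make this concrete, here is a minimal PyTorch-style sketch of the closed-form KL term for that posterior (the function name is mine, just for illustration). Note that it decomposes into a sum of independent per-dimension terms, which is exactly the factorized structure I mean:

```python
import torch

def diag_gaussian_kl(mu, logvar):
    # Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) ):
    # 0.5 * sum_j (mu_j^2 + sigma_j^2 - log sigma_j^2 - 1),
    # computed per sample over the last (latent) dimension.
    return 0.5 * torch.sum(mu.pow(2) + logvar.exp() - logvar - 1.0, dim=-1)
```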
This means that, given an input image x, we are by construction generating latents that are conditionally independent. So why should putting more pressure on the KL divergence term between the posterior and the Gaussian prior help any further in learning independent latents, when the posterior is already assumed to factorize?
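
For context, here is roughly where Beta enters the training objective as I understand it (a sketch under my own assumptions: the function name, the Bernoulli decoder, and beta=4.0 are illustrative choices, not necessarily the paper's exact setup):

```python
import torch
import torch.nn.functional as F

def beta_vae_loss(x, x_recon, mu, logvar, beta=4.0):
    # Reconstruction term, assuming a Bernoulli decoder over pixels in [0, 1].
    recon = F.binary_cross_entropy(x_recon, x, reduction="sum")
    # Same closed-form diagonal-Gaussian KL as above, summed over batch and latents.
    kl = 0.5 * torch.sum(mu.pow(2) + logvar.exp() - logvar - 1.0)
    # Beta > 1 upweights the KL term relative to reconstruction;
    # Beta = 1 recovers the standard VAE ELBO.
    return recon + beta * kl
```

With Beta = 1 this is the usual ELBO, so the only change is extra weight on a KL term that already factorizes across latent dimensions, which is what puzzles me.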