[D] Maximizing Likelihood results in a Degenerate VAE?, aka More Variational Autoencoder Confusion
TL;DR I’m still confused about VAEs. My experience is indicating if I take a Beta VAE formulation and try to maximize the likelihood of some hold-out data while varying Beta, my model collapses. Please bare with me:
With Variational Autoencoders, we estimate an approximate likelihood by maximizing the ELBO, aka data likelihood:
log(p(x)) >= L(x) = E(log p(x|z)) – KL(q(z|x)||p(z))
ELBO = L(x) = – (Distortion + Rate)
where E(logo(x|z)) is equivalent to the reconstruction loss (up to scale) and KL divergence is determined via the reparameterization trick.
One interpretation of this is that the model cares about two things: Good reconstruction up to a highly compressed representation. These sort of things are good for interpreting data with respect to the latent space, and whatnot.
We can add a coefficient, B, to explore the trade off between Rate (KLD) and Distortion (reconstruction). That gives us the BetaVAE formulation:
E(log p(x|z)) – B * KL(q(z|x)||p(z)). (e.g. https://openreview.net/forum?id=Sy2fzU9gl )
or you can do other things to “pin” the rate:
There are many papers that use VAEs, often by training the VAE and picking the point in the training that maximizes the ELBO on some hold out set (e.g. https://www.nature.com/articles/s41592-018-0229-2.pdf )
Let’s say I take a VAE and one of the R vs D formulations and scan over Beta and plot the rate vs distortion of the held-out data for different models. I often get something like this (this is a screenshot from the ELBO pape, but I also get approximately these results):
The dotted line is the R vs D tradeoff at the maximum likelihood model, and this occurs when the rate drops to zero. In the case of a “vanilla” VAE, this means a degenerate latent space where all points represent the same thing. All points in Z are the same (i.e. N(0,1)) and I have experienced the (grotesquely named) posterior collapse. In this case the model (usually) only emits the “average” input unless there is some side-channel of information. This is often considered a bad thing, and L(x) is therefore only determined by the reconstruction loss from the input and the “average” emission.
But this is the maximum likelihood model!
Let’s go back to the Lopez et al paper above: If I were to scan over Beta to find the maximum likelihood model and I get a collapsed latent space, I wouldn’t have a model that is particularly useful in that it would not provide “interpretable” latents. In the context where I am performing a conditional prediction task (e.g. https://scholar.google.com/scholar?q=variational+autoencoder+prediction), the VAE would emit the same value no matter the condition.
Is maximum likelihood/ELBO right? One could imagine that there are alternate ways to evaluate this result (i.e. Frechet Inception Distance on generated examples). Should I use those? Should I just be using exact inference methods instead?