# Blog

## 5000+ Members

### MEETUPS

LEARN, CONNECT, SHARE

### JOB POSTINGS

INDEED POSTINGS

Browse through the latest deep learning, ai, machine learning postings from Indeed for the GTA.

### CONTACT

CONNECT WITH US

Are you looking to sponsor space, be a speaker, or volunteer, feel free to give us a shout.

# [D] Maximumizing Likelihood results in a Degenerate VAE?, aka More Variational Autoencoder Confusion TL;DR I’m still confused about VAEs. My experience is indicating if I take a Beta VAE formulation and try to maximize the likelihood of some hold-out data while varying Beta, my model collapses. Please bare with me: ​ With Variational Autoencoders, we estimate an approximate likelihood by maximizing the ELBO, aka data likelihood: log(p(x)) >= L(x) = E(log p(x|z)) – KL(q(z|x)||p(z)) aka: ELBO = L(x) = – (Distortion + Rate) where E(logo(x|z)) is equivalent to the reconstruction loss (up to scale) and KL divergence is determined via the reparameterization trick. ​ One interpretation of this is that the model cares about two things: Good reconstruction up to a highly compressed representation. These sort of things are good for interpreting data with respect to the latent space, and whatnot. ​ We can add a coefficient, B, to explore the trade off between Rate (KLD) and Distortion (reconstruction). That gives us the BetaVAE formulation: E(log p(x|z)) – B * KL(q(z|x)||p(z)). (e.g. https://openreview.net/forum?id=Sy2fzU9gl ) or you can do other things to “pin” the rate: E(log p(x|z)) + | C – KL(q(z|x)||p(z)) | (e.g. https://arxiv.org/abs/1804.03599 or https://arxiv.org/pdf/1711.00464.pdf ) ​ There are many papers that use VAEs, often by training the VAE and picking the point in the training that maximizes the ELBO on some hold out set (e.g. https://www.nature.com/articles/s41592-018-0229-2.pdf ) ​ My problem: Let’s say I take a VAE and one of the R vs D formulations and scan over Beta and plot the rate vs distortion of the held-out data for different models. I often get something like this (this is a screenshot from the ELBO pape, but I also get approximately these results): Fixing a Broken ELBO, fig 3a The dotted line is the R vs D tradeoff at the maximum likelihood model, and this occurs when the rate drops to zero. In the case of a “vanilla” VAE, this means a degenerate latent space where all points represent the same thing. All points in Z are the same (i.e. N(0,1)) and I have experienced the (grotesquely named) posterior collapse. In this case the model (usually) only emits the “average” input unless there is some side-channel of information. This is often considered a bad thing, and L(x) is therefore only determined by the reconstruction loss from the input and the “average” emission. ​ But this is the maximum likelihood model! ​ Let’s go back to the Lopez et al paper above: If I were to scan over Beta to find the maximum likelihood model and I get a collapsed latent space, I wouldn’t have a model that is particularly useful in that it would not provide “interpretable” latents. In the context where I am performing a conditional prediction task (e.g. https://scholar.google.com/scholar?q=variational+autoencoder+prediction), the VAE would emit the same value no matter the condition. ​ Is maximum likelihood/ELBO right? One could imagine that there are alternate ways to evaluate this result (i.e. Frechet Inception Distance on generated examples). Should I use those? Should I just be using exact inference methods instead? submitted by /u/idioth [link] [comments]