[D] Maximizing Likelihood results in a Degenerate VAE?, aka More Variational Autoencoder Confusion

TL;DR I’m still confused about VAEs. In my experience, if I take a Beta VAE formulation and try to maximize the likelihood of some held-out data while varying Beta, my model collapses. Please bear with me:

With Variational Autoencoders, we approximate the data log-likelihood by maximizing the ELBO, which is a lower bound on it:

log(p(x)) >= L(x) = E(log p(x|z)) – KL(q(z|x)||p(z))

aka:

ELBO = L(x) = – (Distortion + Rate)

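where –E(log p(x|z)) is the Distortion, equivalent to the reconstruction loss (up to scale), and KL(q(z|x)||p(z)) is the Rate. The expectation is estimated by sampling z with the reparameterization trick; for a Gaussian q(z|x) and p(z), the KL divergence is available in closed form.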
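As a minimal sketch of the decomposition above, assuming a diagonal Gaussian encoder, a standard normal prior, and a unit-variance Gaussian decoder (none of these choices are fixed by the post, and all function names and toy values below are hypothetical):

```python
import numpy as np

def gaussian_rate(mu, logvar):
    """Rate: KL(q(z|x) || p(z)) with q = N(mu, diag(exp(logvar))), p = N(0, I)."""
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar, axis=-1)

def gaussian_distortion(x, x_hat):
    """Distortion: -log p(x|z) for a unit-variance Gaussian decoder with mean x_hat."""
    d = x.shape[-1]
    return 0.5 * np.sum((x - x_hat) ** 2, axis=-1) + 0.5 * d * np.log(2.0 * np.pi)

def elbo(x, x_hat, mu, logvar):
    """Single-sample ELBO estimate: -(Distortion + Rate)."""
    return -(gaussian_distortion(x, x_hat) + gaussian_rate(mu, logvar))

# Toy placeholder values, not taken from any trained model.
x = np.array([[0.5, -1.0, 2.0]])       # one input with 3 features
x_hat = np.array([[0.4, -0.8, 1.5]])   # decoder mean for a sampled z
mu = np.array([[0.3, -0.2]])           # encoder mean (2 latent dims)
logvar = np.array([[-0.5, 0.1]])       # encoder log-variance

D = gaussian_distortion(x, x_hat)
R = gaussian_rate(mu, logvar)
L = elbo(x, x_hat, mu, logvar)
print("distortion:", D, "rate:", R, "elbo:", L)
```

With a different decoder likelihood (e.g. Bernoulli), only `gaussian_distortion` would change; the rate term stays the same.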

One interpretation of this is that the model cares about two things: good reconstruction, subject to a highly compressed representation. These properties are useful for interpreting data with respect to the latent space, and whatnot.

We can add a coefficient, B, to explore the trade-off between Rate (KLD) and Distortion (reconstruction). That gives us the BetaVAE formulation:

E(log p(x|z)) – B * KL(q(z|x)||p(z)). (e.g. https://openreview.net/forum?id=Sy2fzU9gl )

or you can do other things to “pin” the rate:

E(log p(x|z)) – | KL(q(z|x)||p(z)) – C | (e.g. https://arxiv.org/abs/1804.03599 or https://arxiv.org/pdf/1711.00464.pdf )
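A rough sketch of the two objectives as functions of a model's distortion and rate, both written as quantities to maximize. The pinned-rate version is written as a penalty on the distance between the rate and the target C, since that is the form that pulls the rate toward C when maximized. The weight `gamma` is an assumption added here for illustration (with `gamma=1.0` it reduces to the unweighted form), and the numbers in the usage lines are placeholders:

```python
def beta_vae_objective(distortion, rate, beta=1.0):
    """E[log p(x|z)] - beta * KL, with E[log p(x|z)] = -distortion."""
    return -distortion - beta * rate

def pinned_rate_objective(distortion, rate, target_rate, gamma=1.0):
    """E[log p(x|z)] - gamma * |KL - C|; gamma is a hypothetical penalty weight."""
    return -distortion - gamma * abs(rate - target_rate)

# Placeholder distortion/rate values (in nats), chosen only to show the arithmetic.
elbo_value = beta_vae_objective(80.0, 5.0, beta=1.0)      # beta = 1 recovers the ELBO
heavier_kl = beta_vae_objective(80.0, 5.0, beta=4.0)      # same model, rate penalized 4x
on_target = pinned_rate_objective(80.0, 5.0, target_rate=5.0)
off_target = pinned_rate_objective(80.0, 8.0, target_rate=5.0)
print(elbo_value, heavier_kl, on_target, off_target)
```

Note that only `beta=1.0` gives a value that is a lower bound on log p(x); for other settings the objective is a training signal, and the ELBO of the resulting model has to be computed separately as –(Distortion + Rate).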

There are many papers that use VAEs, often by training the VAE and picking the point in training that maximizes the ELBO on some held-out set (e.g. https://www.nature.com/articles/s41592-018-0229-2.pdf ).

My problem:

Let’s say I take a VAE with one of the R vs D formulations, scan over Beta, and plot the rate vs distortion on the held-out data for the different models. I often get something like this (this is a screenshot from the Fixing a Broken ELBO paper, but I also get approximately these results):

[Image: Fixing a Broken ELBO, fig. 3a]

The dotted line marks the R vs D trade-off of the maximum likelihood model, and it occurs where the rate drops to zero. In the case of a “vanilla” VAE, this means a degenerate latent space: every input maps to the same posterior, q(z|x) = p(z) = N(0,1), so z carries no information about x. This is the (grotesquely named) posterior collapse. In this case the model (usually) emits only the “average” input unless there is some side channel of information, and L(x) is determined entirely by the reconstruction loss between the input and that “average” emission. This is often considered a bad thing.
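To make the collapsed case concrete, here is a small numerical sketch. It assumes a diagonal Gaussian encoder with an N(0, I) prior, a decoder that ignores z, and a squared-error distortion; the data are synthetic and the names are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 4))  # synthetic data, for illustration only

# Collapsed encoder: q(z|x) = N(0, I) for every x, so the rate is exactly zero.
mu = np.zeros((x.shape[0], 2))
logvar = np.zeros((x.shape[0], 2))
rate = 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar, axis=-1)

def mean_sq_distortion(x, c):
    """Average squared-error distortion (up to scale and an additive constant)
    for a decoder that outputs the same constant c for every input."""
    return 0.5 * np.mean(np.sum((x - c) ** 2, axis=-1))

c_mean = x.mean(axis=0)                          # the "average" input
d_at_mean = mean_sq_distortion(x, c_mean)
d_shifted = mean_sq_distortion(x, c_mean + 0.1)  # any other constant does worse

print("max rate:", rate.max())
print("distortion at mean:", d_at_mean, "at shifted constant:", d_shifted)
```

Under these assumptions the best a z-independent decoder can do is emit the data mean, which is the “average” emission described above.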

But this is the maximum likelihood model!
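For reference, a short sketch of the standard identity behind this observation. It assumes a decoder flexible enough to model p(x) on its own, i.e. p(x|z) = p(x) for all z, which is an assumption and not something every VAE satisfies:

```latex
\begin{align}
\log p(x) &= L(x) + \mathrm{KL}\big(q(z|x)\,\|\,p(z|x)\big) \\
p(x|z) = p(x)\ \ \forall z \quad &\Rightarrow \quad p(z|x) = \frac{p(x|z)\,p(z)}{p(x)} = p(z) \\
q(z|x) = p(z) \quad &\Rightarrow \quad \mathrm{KL}\big(q(z|x)\,\|\,p(z)\big) = 0
  \ \text{ and } \ \mathrm{KL}\big(q(z|x)\,\|\,p(z|x)\big) = 0 \\
&\Rightarrow \quad L(x) = \log p(x)
\end{align}
```

So under that assumption, a zero-rate model makes the bound tight, and nothing in the ELBO itself rewards using the latent.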

Let’s go back to the Lopez et al. paper above: if I were to scan over Beta to find the maximum likelihood model and got a collapsed latent space, the model would not be particularly useful, because it would not provide “interpretable” latents. In a context where I am performing a conditional prediction task (e.g. https://scholar.google.com/scholar?q=variational+autoencoder+prediction), the VAE would emit the same value no matter the condition.

Is maximum likelihood/ELBO the right selection criterion? One could imagine alternative ways to evaluate the result (e.g. Frechet Inception Distance on generated examples). Should I use those? Or should I just be using exact inference methods instead?

submitted by /u/idioth