Math Intuition Summary on Variational Autoencoder

Summer Hu · Analytics Vidhya · Jan 10, 2021

A detailed explanation of the Variational Autoencoder algorithm

Cover image: Sumerian, the earliest known civilization

My math intuition summary for Variational Autoencoders (VAEs) is based on the classical VAE architecture shown below.

VAEs Architecture @https://github.com/AndrewSpano/Disentangled_Variational_Autoencoder
  1. We expect x ≈ x’, the same requirement as in a plain Autoencoder neural network.
  2. The probabilistic encoder qϕ(z|x) maps each input x to a corresponding normal distribution N(μ,σ) in latent space. The latent space is therefore effectively a superposition of all the normal distributions mapped from the entire input dataset, much like a Gaussian Mixture Model.
  3. The probabilistic decoder pθ(x|z) maps a latent-space sample z back into the input data space (see the minimal code sketch after this list).
  4. z is a sample from the latent space. The latent space is expected to follow a standard multivariate normal distribution and usually has far fewer dimensions than the input x.
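As a rough illustration of the encoder/decoder pair above, here is a minimal PyTorch sketch. The layer sizes (784-dimensional input, 20-dimensional latent space) and class names are my own illustrative choices, not taken from the referenced repository.

```python
# A minimal sketch of the encoder/decoder pair, assuming PyTorch.
# Layer sizes and class names are illustrative assumptions only.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """q_phi(z|x): maps x to the mean and log-variance of a Gaussian in latent space."""
    def __init__(self, x_dim=784, h_dim=256, z_dim=20):
        super().__init__()
        self.hidden = nn.Linear(x_dim, h_dim)
        self.mu = nn.Linear(h_dim, z_dim)        # mean mu(x)
        self.log_var = nn.Linear(h_dim, z_dim)   # log sigma^2(x), for numerical stability

    def forward(self, x):
        h = torch.relu(self.hidden(x))
        return self.mu(h), self.log_var(h)

class Decoder(nn.Module):
    """p_theta(x|z): maps a latent-space sample z back to the input data space."""
    def __init__(self, x_dim=784, h_dim=256, z_dim=20):
        super().__init__()
        self.hidden = nn.Linear(z_dim, h_dim)
        self.out = nn.Linear(h_dim, x_dim)

    def forward(self, z):
        h = torch.relu(self.hidden(z))
        return torch.sigmoid(self.out(h))        # reconstruction x'
```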

The mapping relationships between x, qϕ(z|x), pθ(x|z), and z are illustrated in the flowchart below:

VAEs Space Mapping @An Introduction to Variational Autoencoders

Building the Objective Function

The Variational Autoencoder (VAE) objective function is built from the maximum log-likelihood principle.

Maximum Log-Likelihood

Given the VAE model's input dataset D = { x1, x2, …, xn }, let pθ(x) denote the model output probability density function (the PDF over the output x). The log-likelihood the VAE model aims to maximize is then:

Maximum Log-Likelihood
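Written out, under the usual assumption that the data points are independent, the quantity in the figure above is the sum of per-sample log-likelihoods:

```latex
\log p_\theta(D) = \log \prod_{i=1}^{n} p_\theta(x_i) = \sum_{i=1}^{n} \log p_\theta(x_i)
```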

To maximize the sum of the log(pθ(xᵢ)) terms, we can try to maximize each log(pθ(xᵢ)) individually.

Maximize log(pθ(x))

Below is the mathematical derivation from the original paper. ELBO stands for Evidence Lower Bound, and KL denotes the KL divergence, a measure of the distance between two distributions.

Source from @An Introduction to Variational Autoencoders
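Written out, the decomposition in that figure is the standard one from the paper:

```latex
\begin{aligned}
\log p_\theta(x)
  &= \mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x)\big] \\
  &= \mathbb{E}_{q_\phi(z|x)}\left[\log \frac{p_\theta(x,z)}{p_\theta(z|x)}\right] \\
  &= \underbrace{\mathbb{E}_{q_\phi(z|x)}\left[\log \frac{p_\theta(x,z)}{q_\phi(z|x)}\right]}_{\text{ELBO}}
   + \underbrace{\mathbb{E}_{q_\phi(z|x)}\left[\log \frac{q_\phi(z|x)}{p_\theta(z|x)}\right]}_{KL(q_\phi(z|x)\,\|\,p_\theta(z|x))}
\end{aligned}
```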

Because any KL divergence is always ⩾ 0, it follows that

log(pθ(x)) = ELBO + KL(qϕ(z|x)||pθ(z|x)) ⩾ ELBO

ELBO Derivation

ELBO = 𝔼qϕ(z|x) [ log ( pθ(x,z) / qϕ(z|x) ) ]

= 𝔼qϕ(z|x) [ log ( pθ(x|z) * pθ(z) / qϕ(z|x) ) ]

= 𝔼qϕ(z|x) [log ( pθ(z) / qϕ(z|x) ) ] + 𝔼qϕ(z|x) [ log ( pθ(x|z) ) ]

= -KL(qϕ(z|x)||pθ(z)) + 𝔼qϕ(z|x) [ log ( pθ(x|z) )]

Here pθ(z) is the known prior distribution N(0, I), so the ELBO is determined only by qϕ(z|x) and pθ(x|z).
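When the encoder outputs a diagonal Gaussian N(μ, σ²I), which is the usual choice, the KL term against the prior pθ(z) = N(0, I) even has a closed form (d is the latent dimension):

```latex
KL\big(\mathcal{N}(\mu, \sigma^2 I)\,\|\,\mathcal{N}(0, I)\big)
  = \frac{1}{2}\sum_{j=1}^{d}\left(\mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1\right)
```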

Maximizing log(pθ(x)): A Reconsideration

From the log(pθ(x)) derivation above, we know

log(pθ(x)) = ELBO + KL(qϕ(z|x)||pθ(z|x)) ⩾ ELBO

The original paper mentions that, in order to maximize log(pθ(x)), we need to maximize the ELBO. The intuition is that the ELBO is maximized through qϕ(z|x); qϕ(z|x) cannot directly change the value of log(pθ(x)), but it does control a lower bound on log(pθ(x)) via the ELBO.

So maximizing the ELBO raises the lower bound of log(pθ(x)) and tends to make log(pθ(x)) larger overall; this is my understanding.

Maximize ELBO in order to Maximize log(pθ(x))

ELBO = -KL(qϕ(z|x)||pθ(z)) + 𝔼qϕ(z|x) [ log ( pθ(x|z) )]

To maximize the ELBO, we need to:

1. Minimize KL(qϕ(z|x)||pθ(z))

This means driving KL(qϕ(z|x)||pθ(z)) towards 0, so the encoder qϕ(z|x) needs to be trained to approximate the standard normal prior pθ(z).

So qϕ(z|x) ≈ N(0, I)

2. Maximize 𝔼qϕ(z|x) [ log ( pθ(x|z) ) ]

Maximizing this term means that, for a given input x, the model must maximize the probability of generating that same x as its output. In other words, it tries to make x’ = x.
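To connect this term with the reconstruction error used later: if we assume pθ(x|z) is a Gaussian whose mean is the decoder output x’ with a fixed variance (a Bernoulli decoder would give a cross-entropy term instead), then maximizing it is equivalent to minimizing the squared error:

```latex
\log p_\theta(x|z) = -\frac{\lVert x - x' \rVert^2}{2\sigma^2} + \text{const}
\quad\Longrightarrow\quad
\max \; \mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x|z)\big]
\;\equiv\;
\min \; \mathbb{E}_{q_\phi(z|x)}\big[\lVert x - x' \rVert^2\big]
```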

Below is a good presentation of the ELBO from the Variational Autoencoder architecture point of view.

Source from https://www.youtube.com/watch?v=Tc-XfiDPLf4

Final Objective Function

Summarizing the above analysis, the final loss function for Variational Autoencoders (VAEs) is:

LOSS = |X - X’|² + KL( qϕ(z|x) || N(0, I) )

The loss function guides the model to achieve the following (a code sketch of this loss follows the list):

  1. The output reproduces the input as closely as possible
  2. Each input’s mapped region in latent space (a normal distribution) is pushed to be centered at 0 with variance 1
  3. The latent space is compacted around 0 rather than breaking into separate, disconnected regions
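A minimal sketch of this loss in PyTorch, assuming the encoder outputs μ and log σ². The names vae_loss, x_hat, mu, and log_var are mine, not from the post.

```python
# A sketch of the VAE loss: squared reconstruction error + closed-form KL term.
import torch
import torch.nn.functional as F

def vae_loss(x, x_hat, mu, log_var):
    # |X - X'|^2 : squared reconstruction error, summed over features and batch
    recon = F.mse_loss(x_hat, x, reduction="sum")
    # KL( N(mu, sigma^2 I) || N(0, I) ) in closed form:
    # 0.5 * sum( mu^2 + sigma^2 - log(sigma^2) - 1 )
    kl = 0.5 * torch.sum(mu.pow(2) + log_var.exp() - log_var - 1)
    return recon + kl
```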

Reparameterization Trick

Variational Autoencoder network training still uses error backpropagation and the derivative chain rule. But in the VAE forward-propagation process, the latent vector z is randomly sampled, and this prevents backpropagation from being applied directly.

So VAEs work around this non-differentiable sampling step with the transformation below:

z’ = μ(x) + σ(x)*ϵ

where ϵ ~ N(0, I), and μ(x) and σ(x) are outputs of the encoder qϕ(z|x)

The transformation has the following features

  1. z’ is still a stochastic value because of the random ϵ
  2. z’ follows the same N(μ,σ) distribution as z
  3. z’ becomes differentiable with respect to the parameters ϕ, because μ(x) and σ(x) are outputs of qϕ(z|x)

After the transformation, the backpropagation derivative chain is reopened, as the map below shows. (Note that we do not need a derivative path through ϵ.)
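A minimal PyTorch sketch of this trick, assuming the encoder outputs μ(x) and log_var(x) = log σ²(x). The function name is my own.

```python
# Reparameterization trick: sample z' as a deterministic, differentiable
# function of (mu, log_var) plus an external noise source eps.
import torch

def reparameterize(mu, log_var):
    sigma = torch.exp(0.5 * log_var)   # sigma(x)
    eps = torch.randn_like(sigma)      # eps ~ N(0, I); no gradient flows through eps
    return mu + sigma * eps            # z' = mu(x) + sigma(x) * eps, differentiable w.r.t. phi
```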

Variational Inference in Variational Autoencoders

In statistics, variational inference (VI) is a technique for approximating complex, intractable distributions with a set of simple, tractable distributions.

In the Variational Autoencoder model, the intractable distribution is pθ(z|x). The encoder qϕ(z|x) maps each input x to a Gaussian distribution N(μ,σ), and this group of Gaussians together approximates pθ(z|x) through gradient-descent training of the model.

REFERENCES

  1. Auto-Encoding Variational Bayes
  2. An Introduction to Variational Autoencoders
  3. Understanding Variational Autoencoders (VAEs)
  4. https://github.com/AndrewSpano/Disentangled_Variational_Autoencoder
  5. Auto-Encoding Variational Bayes | AISC Foundational
