Demystifying Diffusion Models

Diffusion models have emerged as a groundbreaking advancement in AI, offering a fresh approach to generating high-quality synthetic data. These models gained prominence after the success of Generative Adversarial Networks (GANs), which revolutionized image synthesis and creative AI. While GANs rely on a game-theoretic framework between a generator and a discriminator, diffusion models take inspiration from physics, simulating the reverse process of noise diffusion to generate data from pure randomness. Their robustness, scalability, and ability to produce diverse outputs have positioned them as a promising alternative, paving the way for innovations in fields like art, healthcare, and language generation.

Limitations of GANs compared with diffusion models:

Mode Collapse: In GANs, mode collapse happens when the generator focuses on producing a limited set of data patterns that deceive the discriminator. It becomes fixated on a few dominant modes in the training data and fails to capture the full diversity of the data distribution.
Lower image quality than diffusion models: Diffusion model outputs tend to be sharper and more realistic.
Less explainable than diffusion models.
Lower training stability than diffusion models: GANs are unstable and sensitive to hyperparameters, often requiring careful tuning and experimentation to achieve good results. Oscillating losses and vanishing gradients are also common problems.
Prone to overfitting and may not generalize to new unseen data: Diffusion models are quite robust to overfitting due to the different training processes they use. A primary advantage of diffusion models over GANs and VAEs is the ease of training with simple and efficient loss functions and their ability to generate highly realistic images. They excel at closely matching the distribution of real images, outperforming GANs in this aspect.

Denoising Diffusion Probabilistic Models (DDPMs)

Objective:

The training objective of diffusion-based generative models amounts to maximizing the log-likelihood that the sample x produced at the end of the reverse process belongs to the original data distribution.
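In practice this log-likelihood is not optimized directly. As a sketch of how Ho et al. state it, training minimizes a variational upper bound on the negative log-likelihood. The symbols here are notation assumed for illustration: x_0 is a data sample, T is the number of diffusion steps, and q and p_θ are the forward and reverse processes.

```latex
\mathbb{E}\left[-\log p_\theta(x_0)\right]
\le \mathbb{E}_q\left[-\log \frac{p_\theta(x_{0:T})}{q(x_{1:T} \mid x_0)}\right]
= \mathbb{E}_q\left[-\log p(x_T) - \sum_{t \ge 1} \log \frac{p_\theta(x_{t-1} \mid x_t)}{q(x_t \mid x_{t-1})}\right]
```

Minimizing the right-hand side pushes the learned reverse transitions toward the true reversal of the forward noising steps.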

Features:

DDPMs are a class of generative models that work by iteratively adding noise to an input signal (like an image, text, or audio) and then learning to denoise from the noisy signal to generate new samples.
Diffusion models often use U-Net architectures with ResNet blocks and self-attention layers to represent the reverse process. For example, Ho et al. used a U-Net based on a Wide ResNet, with four feature map resolutions, two convolutional residual blocks per resolution level, and self-attention blocks.

Mathematics:

The two key terms are:

q(x_t | x_t-1):

This term is also known as the forward diffusion kernel (FDK).
It defines the probability density function (PDF) of the image x_t at timestep t in the forward diffusion process, given the previous image x_t-1.

It denotes the “transition function” applied at each step in the forward diffusion process.
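A minimal sketch of the forward diffusion kernel, assuming the Gaussian form used by Ho et al., q(x_t | x_t-1) = N(sqrt(1 - beta_t) * x_t-1, beta_t * I), and their linear schedule from 1e-4 to 0.02 over 1000 steps. The function names and the four-value toy "image" are hypothetical, chosen only for illustration.

```python
import math
import random


def linear_betas(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances (endpoints as reported by Ho et al.)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def forward_step(x_prev, beta_t, rng):
    """Sample x_t ~ N(sqrt(1 - beta_t) * x_prev, beta_t * I)."""
    scale = math.sqrt(1.0 - beta_t)  # shrinks the signal slightly
    std = math.sqrt(beta_t)          # standard deviation of the added noise
    return [scale * v + std * rng.gauss(0.0, 1.0) for v in x_prev]


rng = random.Random(0)
x0 = [1.0, -0.5, 0.25, 0.0]  # toy stand-in for flattened pixel values
betas = linear_betas(1000)

x = x0
for beta_t in betas:
    x = forward_step(x, beta_t, rng)
# After all steps, x is close to a draw from a standard Gaussian:
# almost none of the original signal in x0 remains.
```

Each call applies the transition function once, so running it T times walks a clean sample all the way to noise.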

p_𝜭(x_t-1 | x_t):

By analogy with the forward process, this term is known as the reverse diffusion kernel (RDK).
It stands for the PDF of x_t-1 given x_t as parameterized by 𝜭. The 𝜭 means that the parameters of the distribution of the reverse process are learned using a neural network.
It’s the “transition function” applied at each step in the reverse diffusion process.

Gaussian distribution parameterization: Because each reverse step is modeled as a Gaussian, the model only needs to predict the mean and standard deviation of that distribution, given the noisy image and the time step. Ho et al. predicted only the mean of the Gaussian and kept the variance fixed, and that is what we did as well.

During the forward process, the beta values control the variance of the noise added at each step. During the reverse denoising process, we use sigma, the standard deviation that appears in the denoising formula. With a linear beta schedule, sigma squared is often simply set equal to beta.
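The reverse step with a fixed variance can be sketched as follows. This assumes the mean parameterization from Ho et al., where the network predicts the noise and the mean is (x_t - beta_t / sqrt(1 - alpha_bar_t) * eps) / sqrt(alpha_t), together with sigma_t squared equal to beta_t. The `dummy_predict_noise` function is a hypothetical placeholder for a trained network, so the output here is not a meaningful sample.

```python
import math
import random


def reverse_step(x_t, t, betas, alpha_bars, predict_noise, rng):
    """One sampling step x_t -> x_{t-1} with fixed variance sigma_t^2 = beta_t."""
    beta_t = betas[t]
    alpha_t = 1.0 - beta_t
    eps = predict_noise(x_t, t)  # model's estimate of the noise in x_t
    coef = beta_t / math.sqrt(1.0 - alpha_bars[t])
    mean = [(x - coef * e) / math.sqrt(alpha_t) for x, e in zip(x_t, eps)]
    if t == 0:
        return mean  # no noise is added on the final step
    sigma_t = math.sqrt(beta_t)
    return [m + sigma_t * rng.gauss(0.0, 1.0) for m in mean]


# Linear beta schedule and the running product alpha_bar_t = prod(1 - beta_s).
T = 50
betas = [1e-4 + i * (0.02 - 1e-4) / (T - 1) for i in range(T)]
alpha_bars = []
running = 1.0
for b in betas:
    running *= 1.0 - b
    alpha_bars.append(running)


def dummy_predict_noise(x_t, t):
    """Hypothetical stand-in for the trained noise-prediction network."""
    return [0.0 for _ in x_t]


rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(4)]  # start from pure noise
for t in reversed(range(T)):
    x = reverse_step(x, t, betas, alpha_bars, dummy_predict_noise, rng)
```

Fixing sigma this way means the network has only one quantity to learn per step, which matches the mean-only prediction described above.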

What is KL Divergence?

Kullback-Leibler (KL) divergence, or relative entropy, is a measure used to compare two probability distributions. It is a concept from information theory that contrasts the information contained in the two distributions.

A divergence is a measure of the statistical distance between two distributions.
KL divergence is asymmetric: D_KL(p || q) generally differs from D_KL(q || p). With base-2 logarithms, it is the expected number of extra bits needed to encode samples from p using a code optimized for q.
A zero KL divergence score means that the two distributions are exactly the same. A higher score indicates a larger difference between them.
KL divergence is used in AI as a loss function to compare the predicted distribution with the true one.
Some other AI applications include generative adversarial networks (GANs) and detecting data and model drift.

KL divergence is a way to measure how different two probability distributions are from each other. It gives you an idea of how much extra information you would need if you used one distribution (q(x)) to approximate another distribution (p(x)).

For two discrete probability distributions p(x) and q(x), the KL divergence formula is:

D_KL(p || q) = Σ_x p(x) log( p(x) / q(x) )

This formula sums up the product of the probability of each event x in p(x) and the logarithm of the ratio between p(x) and q(x). If p and q are very similar, the KL divergence will be close to 0. If they are very different, the KL divergence will be larger.
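The sum described above can be sketched in a few lines of Python. This version assumes discrete distributions passed as lists of probabilities and uses the natural logarithm, so the result is in nats (divide by math.log(2) to get bits). The function name and the example distributions are illustrative only.

```python
import math


def kl_divergence(p, q):
    """D_KL(p || q) in nats for two discrete distributions."""
    total = 0.0
    for p_x, q_x in zip(p, q):
        if p_x == 0.0:
            continue  # a term with p(x) = 0 contributes nothing
        if q_x == 0.0:
            return math.inf  # p puts mass where q has none
        total += p_x * math.log(p_x / q_x)
    return total


p = [0.5, 0.5]
q = [0.9, 0.1]

print(kl_divergence(p, p))  # 0.0: identical distributions
print(kl_divergence(p, q))  # about 0.511
print(kl_divergence(q, p))  # about 0.368: swapping the arguments changes the value
```

The last two lines give different values, which is a small demonstration that KL divergence is not symmetric.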

Why can KL Divergence never be negative?

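Because the logarithm is a concave function. By Jensen's inequality, the p-weighted average of log(q(x) / p(x)) can never exceed the log of the p-weighted average of q(x) / p(x), which is at most log(1) = 0. The KL divergence is the negative of that average, so it is always greater than or equal to 0, and it equals 0 only when p and q are identical.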
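The role the logarithm plays can be made precise with a short derivation. This is the standard argument via Jensen's inequality for the discrete case, with the sum taken over the events x where p(x) > 0.

```latex
-D_{\mathrm{KL}}(p \,\|\, q)
= \sum_x p(x) \log \frac{q(x)}{p(x)}
\le \log \sum_x p(x) \, \frac{q(x)}{p(x)}
= \log \sum_x q(x)
\le \log 1 = 0
```

The first inequality is Jensen's inequality applied to the concave logarithm, and the second holds because the probabilities q(x) sum to at most 1 over those events.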

What is the loss function?

The model predicts the noise at each time step. The final loss function used to train DDPMs is simply the mean squared error (MSE) between the noise added in the forward process and the noise predicted by the model. This simplification is the most impactful contribution of the paper Denoising Diffusion Probabilistic Models.
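A minimal sketch of one loss evaluation, assuming the closed-form forward sample from Ho et al., x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise, where alpha_bar_t is the running product of (1 - beta). The helper names and the `zero_predictor` stand-in for the network are hypothetical.

```python
import math
import random


def q_sample(x0, alpha_bar_t, noise):
    """Jump straight from x_0 to x_t using the closed-form forward process."""
    a = math.sqrt(alpha_bar_t)
    b = math.sqrt(1.0 - alpha_bar_t)
    return [a * v + b * n for v, n in zip(x0, noise)]


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((u - v) ** 2 for u, v in zip(a, b)) / len(a)


def ddpm_loss(x0, alpha_bars, predict_noise, rng):
    """MSE between the noise that was added and the noise the model predicts."""
    t = rng.randrange(len(alpha_bars))         # random time step
    noise = [rng.gauss(0.0, 1.0) for _ in x0]  # noise added in the forward process
    x_t = q_sample(x0, alpha_bars[t], noise)
    return mse(noise, predict_noise(x_t, t))


# Linear beta schedule and its running product alpha_bar_t.
T = 100
betas = [1e-4 + i * (0.02 - 1e-4) / (T - 1) for i in range(T)]
alpha_bars = []
running = 1.0
for b in betas:
    running *= 1.0 - b
    alpha_bars.append(running)


def zero_predictor(x_t, t):
    """Hypothetical untrained model that always predicts zero noise."""
    return [0.0 for _ in x_t]


x0 = [1.0, -0.5, 0.25, 0.0]  # toy stand-in for an image
loss = ddpm_loss(x0, alpha_bars, zero_predictor, random.Random(0))
```

In a real training loop, `predict_noise` would be the U-Net and this loss would be averaged over a batch and backpropagated.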

Hope you found the article insightful! Subscribe to CSE Insights by Simran Anand. Follow me on LinkedIn for technical content.

