Make Your Own DALL-E? A Step-by-Step Guide to Latent Diffusion in Python

Devesh Surve
7 min read · Apr 19, 2024


Introduction

Have you ever used one of those image-making tools like DALL-E? Ever wondered how these magic machines churn out pictures from a few typed words? Well, I did, and let me tell you, it's a fascinating process called latent diffusion, a really neat trick in the world of AI.

It took me a while to understand all the complex elements that make up the architecture, and most of the material I found was, unfortunately, way too technical.

With no good explanation available online that walks through all these steps in simple words, I decided to make one myself. So whether you're a beginner in AI or a pro, I really hope you get some value out of this.

Without further ado, let's get started!

From Noise to… Art? Diving Into the World of Latent Diffusion

So what is this latent diffusion we're talking about?

Picture latent diffusion as a highly skilled artist who decides to work smarter, not harder. Instead of painting directly on a huge canvas, this artist sketches on a smaller, more manageable pad, capturing all the essential elements before scaling the work up. That's what latent diffusion does. It operates not on raw, pixel-packed images but within a compressed, easier-to-handle latent space, making the whole process quicker and less resource-intensive.
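
To make the "smaller sketchpad" idea concrete, here is a quick back-of-the-envelope comparison. The numbers assume the standard Stable Diffusion setup we will use later in this guide, where a 512x512 RGB image is compressed into a 4x64x64 latent:

## A rough sense of the savings: pixel space vs. latent space
pixels = 512 * 512 * 3   # values in a 512x512 RGB image
latents = 4 * 64 * 64    # values in its compressed latent
print(pixels, latents, pixels / latents)  # 786432 16384 48.0 -> roughly 48x fewer numbers to process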

Okay, so the obvious question becomes: what is latent diffusion actually made of to pull off this rigmarole?

Meet the Team Behind the Magic

Now, if I dive straight into the technical terms, I'd have to explain that latent diffusion is made up of CLIP (Contrastive Language–Image Pre-training), a Variational Autoencoder (VAE), and a UNet (a convolutional neural network), and I'm pretty sure that by the time I finish this sentence you'll probably be fast asleep...

So let’s try another route. Imagine you’re putting together a superhero team where each member has a unique role that is crucial for the mission.

CLIP: The Mind Reader

Think of CLIP as the team’s mind reader. It looks at a line of text — your wish for what the image should depict — and understands it like a pro. It’s like a doctor who listens to your symptoms and figures out exactly what you need. CLIP takes these insights and makes sure the image reflects them, acting as a bridge between words and pictures.
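
If you want to see what those insights look like in practice, here's a minimal sketch using the same CLIP checkpoint we load later in this guide; the shapes shown are what openai/clip-vit-large-patch14 produces (running it will download the text encoder):

import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

## Tokenize the prompt and run it through the text encoder
tokens = tokenizer(["a sunny day at the beach"], padding="max_length",
                   max_length=tokenizer.model_max_length, truncation=True, return_tensors="pt")
with torch.no_grad():
    embeddings = text_encoder(tokens.input_ids)[0]

print(embeddings.shape)  # torch.Size([1, 77, 768]) -- 77 token slots, each a 768-dimensional vector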

VAE: The Efficient Packer

Next up is the VAE, or Variational Autoencoder. Think of it as your packing expert when you're going on a trip: it knows just how to fold everything neatly so it fits into a smaller suitcase. In our case, the VAE compresses images into a more compact form, a latent space, ensuring all the important details are retained in a way that's easier to handle and process.
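
As a tiny illustration of that packing and unpacking, the sketch below runs a random tensor (standing in for a real photo) through the same VAE we load later; the exact shapes assume Stable Diffusion v1-4:

import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="vae")

## A batch of one fake 512x512 RGB "image"
fake_image = torch.randn(1, 3, 512, 512)
with torch.no_grad():
    latent = vae.encode(fake_image).latent_dist.sample()   # pack: 3x512x512 -> 4x64x64
    decoded = vae.decode(latent).sample                     # unpack: 4x64x64 -> 3x512x512

print(latent.shape, decoded.shape)  # torch.Size([1, 4, 64, 64]) torch.Size([1, 3, 512, 512])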

UNet: The Refiner

Then there’s UNet, the detail-oriented refiner of the group. It’s like having a meticulous editor who goes through a draft, correcting little errors and polishing it until it’s just right. In the world of latent diffusion, UNet works step-by-step to reduce noise and enhance the image’s quality, refining the visual until it’s a masterpiece.

Visualize all of them working together like this (a quick code-level sketch follows the list):

  1. CLIP: The Insight Gatherer : First, CLIP takes the stage. It’s like the group’s scout, venturing out to understand the lay of the land — or in our case, the text prompt you feed it. It reads your description, like “a sunny day at the beach,” and converts this textual input into a set of digital insights, known as embeddings. These embeddings are rich with the context and nuances of your words, packed into a format that the rest of the team can work with.
  2. VAE: The Shaper : Next comes the VAE, our efficient packer. It defines the compact latent space in which the whole generation happens: its encoder can squeeze a full image down into a small, dense latent representation, and its decoder can turn such a latent back into pixels. This step is crucial because it reduces the complexity and size of the data, making the denoising that follows faster and more focused.
  3. UNet: The Sculptor : Finally, UNet steps up. This part of the process is akin to a sculptor refining a block of marble. Working inside that latent space, UNet starts from pure random noise and, guided by CLIP's embeddings, begins the meticulous task of denoising. It incrementally adjusts and enhances the latent, step by step, removing imperfections and adding details. With each iteration it becomes clearer and more detailed, and once the denoising is done, the VAE's decoder turns the finished latent into the image that matches your initial prompt.
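
For a bird's-eye view before we build each piece by hand, it's worth knowing that the diffusers library ships this exact trio (plus a scheduler) prewired as a single pipeline. The sketch below is only for orientation; the rest of this article recreates what it does step by step:

import torch
from diffusers import StableDiffusionPipeline

## The prepackaged pipeline bundles the three team members together
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

print(type(pipe.text_encoder).__name__)  # CLIPTextModel        -> the mind reader
print(type(pipe.vae).__name__)           # AutoencoderKL        -> the packer
print(type(pipe.unet).__name__)          # UNet2DConditionModel -> the refiner

image = pipe("a sunny day at the beach").images[0]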

Understood the theory? Ready to make one of your own?

Well then, what are you waiting for? Let's get started.

Here's the complete Kaggle notebook for those who want to follow along:

https://www.kaggle.com/code/deveshsurve/step-by-step-guide-to-implement-latent-diffusion

Step-by-Step Implementation

We start by installing the required libraries that aren't available by default on Kaggle:

!pip install diffusers

Importing Libraries

import torch, logging

## Disable warnings
logging.disable(logging.WARNING)

## Imaging library
from PIL import Image
from torchvision import transforms as tfms

## Basic libraries
import numpy as np
from tqdm.auto import tqdm
import matplotlib.pyplot as plt
%matplotlib inline
import shutil
import os

## For notebook and video display
from IPython.display import display, clear_output, HTML
from base64 import b64encode

## Import the CLIP, VAE, UNet and scheduler artifacts
from transformers import CLIPTextModel, CLIPTokenizer
from diffusers import AutoencoderKL, UNet2DConditionModel, LMSDiscreteScheduler

Next, set up GPU support if you can:

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

Now, let's define our helper functions (I've added line-by-line details in the Kaggle notebook):

## Helper functions
def load_image(p):
    '''
    Function to load an image from a given path and resize it to 512x512
    '''
    return Image.open(p).convert('RGB').resize((512, 512))

def pil_to_latents(image):
    '''
    Function to convert a PIL image to latents using the VAE encoder
    '''
    init_image = tfms.ToTensor()(image).unsqueeze(0) * 2.0 - 1.0   # scale pixels to [-1, 1]
    init_image = init_image.to(device="cuda", dtype=torch.float16)
    init_latent_dist = vae.encode(init_image).latent_dist.sample() * 0.18215
    return init_latent_dist

def latents_to_pil(latents):
    '''
    Function to convert latents back to PIL images using the VAE decoder
    '''
    latents = (1 / 0.18215) * latents
    with torch.no_grad():
        image = vae.decode(latents).sample
    image = (image / 2 + 0.5).clamp(0, 1)
    image = image.detach().cpu().permute(0, 2, 3, 1).numpy()
    images = (image * 255).round().astype("uint8")
    pil_images = [Image.fromarray(image) for image in images]
    return pil_images

def text_enc(prompts, maxlen=None):
    '''
    Function to take a textual prompt and convert it into CLIP embeddings
    '''
    if maxlen is None: maxlen = tokenizer.model_max_length
    inp = tokenizer(prompts, padding="max_length", max_length=maxlen, truncation=True, return_tensors="pt")
    return text_encoder(inp.input_ids.to("cuda"))[0].half()

Now, let's move on to the actual diffusion process. One thing to know here is that the VAE, UNet, and CLIP are pretrained models. We could theoretically train our own, buuuut that would take a ginormously long time.

So, we just use pretrained ones.


## Initiating tokenizer and encoder.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14", torch_dtype=torch.float16)
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14", torch_dtype=torch.float16).to("cuda")

## Initiating the VAE
vae = AutoencoderKL.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="vae", torch_dtype=torch.float16).to("cuda")

## Initializing a scheduler and Setting number of sampling steps
scheduler = LMSDiscreteScheduler(beta_start=0.00085, beta_end=0.012, beta_schedule="scaled_linear", num_train_timesteps=1000)
scheduler.set_timesteps(50)

## Initializing the U-Net model
unet = UNet2DConditionModel.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="unet", torch_dtype=torch.float16).to("cuda")
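
Before wiring these pieces into the generation loop, it helps to peek at what the scheduler and the latents actually look like. A quick sanity check (the shapes assume the v1-4 UNet, which works on 4-channel 64x64 latents for 512x512 images):

## How many noise levels the scheduler will walk through, largest first
print(len(scheduler.timesteps), scheduler.timesteps[:5])

## The latent shape the U-Net expects: 4 channels at 64x64 resolution
print(unet.config.in_channels, unet.config.sample_size)

## A batch of pure-noise latents -- the same shape our loop will start from
latents = torch.randn((1, unet.config.in_channels, 64, 64))
print(latents.shape)  # torch.Size([1, 4, 64, 64])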

Now comes our loop, which will take the noise and turn it into art:

def prompt_2_img(prompts, g=7.5, seed=100, steps=70, dim=512, save_int=True):
    """
    Diffusion process to convert a prompt into an image
    """

    # Defining batch size
    bs = len(prompts)

    # Converting textual prompts to embeddings
    text = text_enc(prompts)

    # Adding an unconditional (empty) prompt, which is used for guidance later
    uncond = text_enc([""] * bs, text.shape[1])
    emb = torch.cat([uncond, text])

    # Setting the seed
    if seed: torch.manual_seed(seed)

    # Initiating random noise in latent space
    latents = torch.randn((bs, unet.config.in_channels, dim//8, dim//8))

    # Setting number of steps in scheduler
    scheduler.set_timesteps(steps)

    # Scaling the latents to the scheduler's initial noise level
    latents = latents.to("cuda").half() * scheduler.init_noise_sigma

    # Making sure the folder for intermediate images exists
    if save_int: os.makedirs("steps2", exist_ok=True)

    print("Processing text prompts:", prompts)
    print("Visualizing initial latents...")
    latents_norm = torch.norm(latents.view(latents.shape[0], -1), dim=1).mean().item()
    print(f"Initial Latents Norm: {latents_norm}")

    # Iterating through the defined steps
    for i, ts in enumerate(tqdm(scheduler.timesteps)):
        # We need to scale the input latents to match the variance expected by the U-Net
        inp = scheduler.scale_model_input(torch.cat([latents] * 2), ts)

        # Predicting the noise residual using the U-Net (unconditional and text-conditioned halves)
        with torch.no_grad(): u, t = unet(inp, ts, encoder_hidden_states=emb).sample.chunk(2)

        # Performing classifier-free guidance
        pred = u + g * (t - u)

        # Conditioning the latents (one denoising step)
        latents = scheduler.step(pred, ts, latents).prev_sample

        # Tracking how the latents evolve
        latents_norm = torch.norm(latents.view(latents.shape[0], -1), dim=1).mean().item()
        print(f"Step {i+1}/{steps} Latents Norm: {latents_norm}")

        # Saving and displaying an intermediate image every 13 steps
        if save_int and i % 13 == 0:
            image_path = f'steps2/la_{i:04d}.jpeg'
            latents_to_pil(latents)[0].save(image_path)
            display(latents_to_pil(latents)[0])

    return latents_to_pil(latents)

And now, time for the magic:

images = prompt_2_img(["A dog wearing a hat"], save_int=True)
for img in images: display(img)
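
Once that works, the knobs worth experimenting with are the guidance scale g (how strictly the image follows the prompt, via pred = u + g*(t-u)), the seed, and the number of steps. For example, a higher guidance scale and a different seed with the same prompt:

## Same prompt, different settings: stronger guidance, new starting noise, 50 steps
images = prompt_2_img(["A dog wearing a hat"], g=12, seed=42, steps=50, save_int=False)
display(images[0])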

Conclusion

There you have it! A fun and accessible way to understand and build your very own AI-powered image generator. It’s like assembling a dream team of tech wizards, each playing their part to bring your creative visions to life. Why not give it a whirl and see what incredible images you can create?

Now, I usually don't ask for these, but if you read all the way here and found this article useful, I would really appreciate a comment or a clap!

Please follow if this type of content interests you!



Devesh Surve

Grad student by day, lifelong ML/AI explorer by night. I dive deep, then share easy-to-understand, step-by-step guides to demystify the complex.