Diffusion LLMs: Generating Text the Way We Generate Images
What are Diffusion LLMs? How do they work?
Welcome to Infinite Curiosity, a newsletter that explores the intersection of Artificial Intelligence and Startups. Tech enthusiasts across 200 countries have been reading what I write. Subscribe to this newsletter for free to receive it in your inbox every week:
Large Language Models (LLMs) have transformed the way text is processed and generated. The most famous example of this is ChatGPT. There are many such LLMs in the world today. Most of them are built as autoregressive models, which generate text sequentially from left to right, one token at a time.
This has worked spectacularly well so far, but it often suffers from issues like error compounding and limited global coherence.
Diffusion LLMs offer a new way of doing this by starting with a noisy text representation and iteratively refining it. This is similar to how images are progressively denoised in diffusion-based image generation.
These models have been difficult to build and commercialize. But Inception Labs recently announced a model named Mercury, which is the first commercial-grade diffusion large language model. So I decided to dig in.
What Are Diffusion LLMs?
Diffusion models have been very successful in image generation. These models generate images by starting with pure noise and applying iterative denoising steps to gradually reveal the final structure. The same principle can be applied to text generation.
In Diffusion LLMs, text generation begins with a noisy latent representation of the entire sequence rather than producing tokens one at a time. A denoising process then refines this representation over multiple iterations, improving coherence, grammar, and factual accuracy.
How are these models trained?
Noise Injection: During training, clean text embeddings are progressively corrupted with noise.
Denoising Objective: The model learns to recover the original text from its noisy version. This improves its ability to refine textual outputs.
Iterative Refinement: Rather than predicting text token by token, the model iteratively updates entire segments. This leads to more coherent and contextually aware generation.
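The three steps above can be sketched in toy form. This sketch assumes a masking-based corruption scheme, which is one common way to define "noise" for discrete tokens; the function names and mask schedule are illustrative, not a description of any specific model such as Mercury.

```python
import random

MASK = "<mask>"

def corrupt(tokens, t, num_steps, rng):
    """Noise injection: at noise level t, mask each token with
    probability t/num_steps. Masking is a common discrete analogue
    of adding Gaussian noise, since text lives in a discrete space."""
    p = t / num_steps
    return [MASK if rng.random() < p else tok for tok in tokens]

def training_example(clean_tokens, num_steps=10, seed=0):
    """Build one (noisy, clean) training pair. A real model would be
    trained to predict the original tokens at the masked positions
    (a cross-entropy denoising objective)."""
    rng = random.Random(seed)
    t = rng.randint(1, num_steps)  # sample a random noise level
    noisy = corrupt(clean_tokens, t, num_steps, rng)
    return noisy, clean_tokens     # model input and target

noisy, target = training_example("the cat sat on the mat".split())
print(noisy)
```

Iterating this over many (noisy, clean) pairs at varying noise levels is what teaches the model to refine text at any stage of corruption.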
The Image Loading Analogy: From Blurry to Crisp
To better understand Diffusion LLMs, consider how images load on a slow network. Initially, the image appears as a pixelated blur. And as the data loads, more details progressively emerge until the full-resolution image is revealed. This is akin to how Diffusion LLMs generate text:
Initial Blurry Representation: The model starts with a rough, noisy approximation of the intended output.
Successive Refinement: Multiple iterations improve fluency, structure, and logical consistency.
Final High-Quality Text: The denoised output is polished, contextually accurate, and globally coherent.
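The blurry-to-crisp progression can be sketched as an unmasking loop. The `toy_denoiser` below is a stand-in that simply reads tokens off a fixed target sentence so the loop runs end to end; a real diffusion LLM would predict these tokens from context and keep only its most confident guesses at each step.

```python
MASK = "<mask>"

def toy_denoiser(seq, target):
    """Stand-in for a trained model: propose a token for every
    masked slot. A real model would predict these from context."""
    return {i: target[i] for i, tok in enumerate(seq) if tok == MASK}

def generate(length, target, num_steps=3):
    """Iterative refinement: start fully masked, then unmask a
    growing fraction of the sequence at each denoising step."""
    seq = [MASK] * length
    for step in range(1, num_steps + 1):
        proposals = toy_denoiser(seq, target)
        # Unmask enough positions to reach step/num_steps coverage.
        want_unmasked = length * step // num_steps
        currently_unmasked = length - len(proposals)
        for i in sorted(proposals)[: want_unmasked - currently_unmasked]:
            seq[i] = proposals[i]
        print(f"step {step}: {' '.join(seq)}")
    return seq

generate(6, "the cat sat on the mat".split())
```

Each printed step is the textual analogue of the image getting sharper: early passes pin down a rough skeleton, later passes fill in the remaining detail.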
In contrast, autoregressive models operate more like typing out a sentence without the ability to go back and adjust earlier words. This often leads to inconsistencies or forced corrections.
Why are Diffusion LLMs Better Than Autoregressive Models?
There are a few advantages:
Global Coherence: Autoregressive models suffer from issues related to long-range dependencies because they generate text in a strictly left-to-right manner. Diffusion LLMs refine text holistically. And this leads to better structural integrity and logical consistency across entire paragraphs.
Reduced Exposure Bias: Exposure bias occurs in autoregressive models because each token is generated based on previously predicted tokens rather than actual ground-truth sequences. Mistakes in early tokens propagate forward, compounding errors. Diffusion LLMs mitigate this by refining the entire text representation simultaneously, which prevents errors from accumulating.
More Control & Flexibility: Diffusion models allow for greater control over the generation process. Text can be conditioned on additional constraints (e.g. style, length, factuality) more effectively. Intermediate denoising steps provide an interpretable refinement process. This makes it easier to steer the output towards desired properties.
Potential for Higher Quality Outputs: Iterative refinement gives the model repeated chances to polish the final text. It also allows non-sequential adjustments, which enables more contextually rich responses.
What are the challenges?
There are a few challenges to think of:
Computational Cost: Autoregressive models need only one forward pass per generated token; diffusion models instead run multiple denoising passes over the entire sequence. When many denoising steps are needed, this makes them computationally expensive.
Architectural Challenges: Defining effective noise functions for text is more complex than for images. Why? Because text operates in a discrete space rather than a continuous one.
Optimization Issues: Training diffusion models for text generation remains an active research area. It requires efficient sampling strategies to balance quality and inference speed.
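A back-of-the-envelope count of forward computations illustrates the cost trade-off. The functions and numbers below are illustrative assumptions only; real costs depend on model size, caching, and hardware parallelism.

```python
def autoregressive_cost(num_tokens):
    # With KV caching, each forward pass processes roughly one new
    # token position, and one pass is needed per generated token.
    return num_tokens

def diffusion_cost(num_tokens, num_steps):
    # Each denoising step re-processes every position in the sequence.
    return num_tokens * num_steps

# Illustrative: a 500-token output.
print(autoregressive_cost(500))   # 500 position-updates
print(diffusion_cost(500, 50))    # 25000 position-updates over 50 steps
```

What this simple count hides is that a diffusion step updates all positions at once, so the number of sequential steps can be much smaller than the output length. That is exactly why efficient sampling strategies matter so much.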
Where do we go from here?
Diffusion LLMs represent a fundamental shift in how we approach text generation. They address many of the limitations inherent in autoregressive models. By refining text holistically, they offer greater control over the output.
Future developments may explore hybrid models that combine the strengths of diffusion and autoregressive techniques. This could pave the way for more efficient and high-quality text generation systems. As research progresses, diffusion-based approaches have the potential to surpass traditional LLMs. What will this unlock? I’m curious to find out.
If you're a founder or an investor who has been thinking about this, I'd love to hear from you.
If you are getting value from this newsletter, consider subscribing for free and sharing it with 1 friend who’s curious about AI: