Wow! Large Language Diffusion Models Are Here—Why Settle for Predicting the Next Token? | Renmin University's Gilead & Ant Group

Models & Benchmarks · Published: Feb 18, 2025 · Elena Volkov · ~9 min read

Author

Elena Volkov · Machine Learning Research Editor

Papers, benchmarks, and training economics — with the caveats spelled out.

The core technical claim is that replacing autoregression with diffusion allows Large Language Models to solve the “reverse curse”—the failure of standard LLMs to predict preceding text from context—without sacrificing generative quality. This hypothesis would be falsified if LLaDA-8B fails to match LLaMA3-8B on standard causal benchmarks or if its bidirectional nature introduces prohibitive inference latency that negates the training efficiency gains.

The “Reverse Curse” of Large Models Solved by Replacing Autoregression with Diffusion Models

I read the joint proposal from Renmin University’s Institute for AI (AIIS) and Ant Group introducing LLaDA (Large Language Diffusion with mAsking). Their central argument is that we should stop settling for next-token prediction when bidirectional understanding is required.

LLaDA-8B demonstrates capabilities in in-context learning that are comparable to LLaMA3-8B, while surpassing GPT-4o in reverse poetry tasks.

In the field of large language models, reverse poetry is a specialized task used to evaluate a model’s ability to handle bidirectional dependencies and logical reasoning within language models. For example, it asks the model to generate the preceding line for “A row of white egrets ascends into the blue sky.”

Typically, autoregressive models (such as GPT) perform suboptimally when inferring previous text from subsequent context. This is because the fundamental principle of autoregressive models is to use preceding elements in a sequence to predict the current element—i.e., predicting the next token.

Wow! Large Language Diffusion Models Are Here—Why Settle for Predicting the Next Token? | Renmin … — figure 2

In contrast, LLaDA is a bidirectional model based on diffusion models, which naturally captures bidirectional dependencies in text more effectively.

I think bidirectional attention often struggles with long-context coherence compared to causal masks; I suspect this advantage may degrade as sequence length increases beyond the training distribution.

The authors state in the abstract that LLaDA challenges the inherent connection between key capabilities of Large Language Models (LLMs) and autoregressive models.

These findings have sparked considerable discussion.

Some observers have asked:

Are we reconstructing masked language model modeling?

Wow! Large Language Diffusion Models Are Here—Why Settle for Predicting the Next Token? | Renmin … — figure 3

Could this paradigm also perform better in RAG and embedding similarity search?

Wow! Large Language Diffusion Models Are Here—Why Settle for Predicting the Next Token? | Renmin … — figure 4

Notably, LLaDA was trained on 2.3 trillion tokens of corpus using only 130,000 H800 GPU hours, followed by Supervised Fine-Tuning (SFT) on 4.5 million token pairs.

From the paper, the efficiency claim relies heavily on the specific hardware mix and parallelization strategy; I cannot verify if this scaling law holds on consumer-grade GPUs without access to their training logs.

Rewriting Autoregression: The Case for Forward Masking

I read the paper’s claim that bidirectional masking outperforms autoregression; this hinges on the assumption that parallel training scales linearly with compute. One caveat: the “reverse curse” breakthrough is impressive, but I remain skeptical about its generalizability to open-ended creative writing tasks. I think comparing LLaDA-8B against LLaMA2-7B and LLaMA3-8B requires careful normalization of evaluation protocols to be truly fair.

The central technical question driving this work from Renmin University and Ant Group is straightforward: Is autoregression the only viable path to Large Language Model (LLM) intelligence? Current autoregressive models generate tokens sequentially, a process that incurs high computational costs during inference and limits performance in reverse reasoning tasks due to their left-to-right bias. To address these constraints, the authors propose LLaDA (Large Language Diffusion Models), which utilizes a forward masking and reverse prediction mechanism to capture bidirectional dependencies in text more effectively.

Wow! Large Language Diffusion Models Are Here—Why Settle for Predicting the Next Token? | Renmin … — figure 5

The study follows standard data preparation, pre-training, Supervised Fine-Tuning (SFT), and evaluation pipelines to scale LLaDA to 8 billion parameters. The model was pretrained from scratch on 2.3 trillion tokens, consuming 130,000 H800 GPU hours, followed by SFT on 4.5 million data pairs.

Performance across diverse tasks—including language understanding, mathematics, coding, and Chinese—demonstrates the following:

Strong Scalability: LLaDA scales effectively to $10^{23}$ FLOPs of computational resources. On six benchmark tasks (such as MMLU and GSM8K), it achieves results comparable to self-built autoregressive baseline models trained on identical data.

Wow! Large Language Diffusion Models Are Here—Why Settle for Predicting the Next Token? | Renmin … — figure 6

In-Context Learning: Notably, LLaDA-8B surpasses LLaMA2-7B in almost all 15 standard zero-shot/few-shot learning tasks and performs comparably to LLaMA3-8B.

Instruction Following: LLaDA significantly enhances instruction-following capabilities after SFT, as demonstrated in case studies such as multi-turn conversations.

Reverse Reasoning: LLaDA effectively breaks the reverse curse, performing consistently on both forward and reverse tasks. Specifically, in the reverse poetry completion task, LLaDA outperforms GPT-4o.

Wow! Large Language Diffusion Models Are Here—Why Settle for Predicting the Next Token? | Renmin … — figure 7

LLaDA employs a Transformer architecture as a masked predictor. Unlike autoregressive models, LLaDA’s Transformer does not use causal masking (Causal Mask), allowing it to observe all tokens in the input sequence simultaneously. While its parameter count is comparable to traditional large language models like GPT, architectural details—such as multi-head attention settings—differ slightly to accommodate masked prediction tasks.

Its forward masking process operates as follows:

LLaDA employs a random masking mechanism. For an input sequence $x_0$, the model randomly selects a proportion of tokens to mask, generating a partially masked sequence $x_t$. The probability of each token being masked is $t$, where $t$ is sampled uniformly from [0,1]. This differs from traditional fixed masking ratios (such as 15% in BERT); LLaDA’s random masking mechanism demonstrates better performance on large-scale data.

The model’s objective is to learn a masked predictor capable of predicting the masked tokens based on the partially masked sequence $x_t$. During training, the model calculates loss only for the masked tokens.

Wow! Large Language Diffusion Models Are Here—Why Settle for Predicting the Next Token? | Renmin … — figure 8

Where $1[\cdot]$ is an indicator function, signifying that loss is calculated only for masked tokens.

In the SFT phase, LLaDA uses supervised data (such as dialogue pairs and instruction-response pairs) to further optimize the model, improving its performance on specific tasks. For each task, the model fine-tunes based on the characteristics of the task data. For example, in conversational generation tasks, the model learns how to generate appropriate responses given a conversation history.

During SFT, the model selectively masks response tokens based on task-specific data characteristics, enabling it to better learn task-relevant patterns.

For inference, in generative tasks, LLaDA generates text through a reverse sampling process. **Starting from a fully masked sequence, it progressively predicts the masked tokens until complete text i

Sampling Strategies and Benchmark Performance

During sampling, LLaDA employs various strategies (such as random remasking, low-confidence remasking, and semi-autoregressive remasking) to balance generation efficiency and quality. I read the methodology closely; while these strategies aim for a trade-off, the reproducibility of “low-confidence” thresholds across different hardware setups remains an open question.

From the paper, the reliance on specific masking heuristics suggests performance may degrade under distribution shift.

In conditional probability evaluation tasks, LLaDA assesses the model’s conditional probability based on a given prompt and partially masked response. This allows LLaDA to be evaluated across various benchmark tasks. I followed the release notes; this approach is clever but assumes that partial masking preserves semantic integrity as well as autoregressive next-token prediction does.

Wow! Large Language Diffusion Models Are Here—Why Settle for Predicting the Next Token? | Renmin … — figure 9

The performance of pre-trained LLMs on different benchmarks is as follows.

Wow! Large Language Diffusion Models Are Here—Why Settle for Predicting the Next Token? | Renmin … — figure 10

Performance on different benchmarks after post-training is shown below. Note that LLaDA underwent only SFT, while other models underwent additional Reinforcement Learning alignment. This discrepancy in training pipelines makes direct comparison difficult; I suspect the RL-aligned competitors have a significant edge in instruction following that isn’t captured by raw pre-training metrics alone.

One caveat: comparing SFT-only models against RLHF-tuned systems introduces a confounding variable in performance evaluation.

Wow! Large Language Diffusion Models Are Here—Why Settle for Predicting the Next Token? | Renmin … — figure 11

In reverse poetry tasks, LLaDA surpassed GPT-4o. I read this claim with skepticism; “reverse poetry” is a niche creative task, and beating a frontier model here doesn’t necessarily translate to robust generalization on standard logical benchmarks.

I think success in constrained creative tasks like reverse poetry does not validate broad reasoning capabilities.

Wow! Large Language Diffusion Models Are Here—Why Settle for Predicting the Next Token? | Renmin … — figure 12

LLaDA’s performance in multi-turn dialogue tasks is shown below. Darker colors indicate tokens predicted in the later stages of sampling, while lighter colors indicate tokens predicted in the early stages. The visualization highlights how diffusion models distribute prediction uncertainty over time, though I note that this temporal distribution doesn’t explicitly address error propagation in long-context scenarios.

From the paper, visualizing token prediction timing is interesting but doesn’t guarantee stability in extended multi-turn interactions.

Wow! Large Language Diffusion Models Are Here—Why Settle for Predicting the Next Token? | Renmin … — figure 13

Netizens: Looking Forward to Practical Application

I read through the practical demonstrations of LLaDA released by the research team. The model appears capable of solving standard mathematical reasoning problems, as shown in the accompanying figures. It also handles programming tasks with apparent competence.

Wow! Large Language Diffusion Models Are Here—Why Settle for Predicting the Next Token? | Renmin … — figure 14

Wow! Large Language Diffusion Models Are Here—Why Settle for Predicting the Next Token? | Renmin … — figure 15

One caveat: visual success on curated examples does not guarantee robustness on out-of-distribution benchmarks.

A foreign netizen commented: “This will certainly push Chinese AI research to focus more on smaller models. However, this does not mean they are abandoning scaling laws.” Others suggested that this architecture might open up possibilities for hybrid models. Some also mentioned that Meta has conducted similar work combining Transformers and diffusion.

Wow! Large Language Diffusion Models Are Here—Why Settle for Predicting the Next Token? | Renmin … — figure 16

Wow! Large Language Diffusion Models Are Here—Why Settle for Predicting the Next Token? | Renmin … — figure 17

Wow! Large Language Diffusion Models Are Here—Why Settle for Predicting the Next Token? | Renmin … — figure 18

I think the claim that diffusion models favor smaller architectures needs verification against current scaling trends.

Of course, some expressed concern that many architectures surpassing Transformers have been proposed previously, yet none have been truly adopted by academia or industry. As one observer noted: “Let’s wait and see what happens next.”

Wow! Large Language Diffusion Models Are Here—Why Settle for Predicting the Next Token? | Renmin … — figure 19

This research was jointly conducted by the Institute for AI at Renmin University of China and Ant Group. The corresponding author is Chongxuan Li, currently an Associate Professor (Tenure-Track) at the Institute for AI, Renmin University. His current focus is on deep generative models: understanding the capabilities and limitations of existing models to design effective and scalable next-generation architectures.

Paper Link:
https://arxiv.org/abs/2502.09992 Project Homepage:
https://ml-gsai.github.io/LLaDA-demo/