New RNN Architecture Surpasses Transformer: Each Hidden State Is a Model, First Author Says It Fundamentally Changes Language Models

Models & Benchmarks · Published: Jul 09, 2024 · James Hayes · ~9 min read

Author

James Hayes · Cloud & MLOps Staff Writer

Shipping models: inference, observability, cost, and what breaks in production.

New RNN Architecture Challenges Transformer Again!

The production implication here is immediate: if we can replace self-attention with a learnable hidden state, we might finally cut the latency and memory costs that are currently strangling our inference budgets. This isn’t just another incremental tweak; it’s a structural shift in how we handle context windows.

Core Idea: Replace the hidden state in RNNs with a learnable model. It can even learn during inference, which is why this method is called TTT (Test-Time Training).

Karan Dalal, co-first author from UC Berkeley, stated: “I believe this will fundamentally change language models.”

A TTT layer possesses hidden states with greater expressive power than those in RNNs, allowing it to directly replace the computationally expensive self-attention layers in Transformers.

In experiments, TTT-Linear, where the hidden state is a linear model, outperformed both Transformer and Mamba, achieving lower perplexity at lower compute cost and making better use of long contexts.

In practice, test-time training adds inference latency; we need to know the cold-start penalty before deploying. I think linear models are cheap, but can they handle our complex semantic queries without hallucinating?

Furthermore, TTT-MLP, where the hidden state is an MLP model, performed even better at 32k long contexts.

Operationally, 32k context is nice, but does the memory footprint fit in our current GPU clusters?

Karan Dalal also pointed out that, in theory, the learnable hidden state can be any model. For longer contexts, it could be a CNN or even a complete Transformer nested inside the layer.

The recently released TTT paper has already attracted attention and discussion in the academic community. Andrew Gao, a PhD student at Stanford, believes that this paper might become the next “Attention Is All You Need.”

Others have noted that whether these new architectures can truly surpass Transformers depends on how well they scale to larger sizes.

Karan Dalal revealed that a 7B parameter model will be released soon.

The Cost of Context: Why Linear Beats Quadratic

I’ve watched the industry burn GPU hours on attention mechanisms that scale quadratically with context length. It’s a latency and cost nightmare for anyone running long-context inference in production. Traditional RNNs were too rigid, but Transformers are too expensive at scale.
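
To put rough numbers on that scaling argument, here is a back-of-the-envelope sketch. The cost formulas and the hidden width are my own simplifying assumptions (one layer, no constant factors tuned to real kernels), not measurements from the paper; the point is only the shape of the curves.

```python
# Rough scaling sketch: attention work per generated token grows with context,
# while a fixed-size recurrent state does the same work at every position.
# All formulas and constants are illustrative assumptions, not measurements.

def attention_decode_cost(context_len: int, d_model: int) -> int:
    """Approximate multiply-adds to attend over a cache of context_len tokens."""
    # One dot product per cached key plus one weighted sum over cached values.
    return 2 * context_len * d_model

def fixed_state_decode_cost(d_model: int) -> int:
    """Approximate multiply-adds to read and update a d_model x d_model state."""
    # One matrix-vector product to read the state, one rank-1 update to write it.
    return 2 * d_model * d_model

def kv_cache_floats(context_len: int, d_model: int) -> int:
    """Keys plus values kept per layer for the whole context."""
    return 2 * context_len * d_model

d_model = 2048  # hypothetical hidden width
for context_len in (2_048, 8_192, 32_768):
    ratio = attention_decode_cost(context_len, d_model) / fixed_state_decode_cost(d_model)
    cache = kv_cache_floats(context_len, d_model)
    print(f"{context_len:>6} tokens: cost ratio {ratio:.1f}, KV cache {cache:,} floats")
```

Under these assumptions the per-token cost ratio is simply context_len / d_model, so attention only starts losing once the context outgrows the hidden width. Summed over a whole sequence, that per-token growth is where the quadratic bill comes from.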

Recent iterations like RWKV and Mamba tried to bridge this gap by combining linear attention benefits with parallel training capabilities. They perform well on short contexts, yet Transformers still hold the crown for ultra-long sequences beyond 32k tokens. The trade-off has always been between hardware efficiency and raw expressive power.

The TTT (Test-Time Training) team proposes a shift in perspective. Instead of passive hidden states, they argue these states should actively learn to compress context with minimal parameters. This approach keeps the state at a fixed size while boosting its expressive power. It’s essentially meta-learning applied to sequence modeling.

Wang Xiaolong, Assistant Professor at UCSD and co-supervisor of the paper, noted:

Transformers explicitly store all input tokens. If you consider neural networks to be an effective method for compressing information, then compressing these tokens is also meaningful.

This architecture keeps time complexity linear, which is critical for reducing inference latency. It decomposes sequence modeling into two nested learning loops: the outer loop handles language modeling, while the inner loop compresses context via self-supervised learning. The parameters of the outer loop effectively become hyperparameters for the inner one.
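
One way to write the two loops down, as a sketch. The notation is my own shorthand for how I read the paper: W_t is the hidden state (the weights of the inner model f), eta is the inner learning rate, and theta_K, theta_V, theta_Q are learned projections owned by the outer loop. None of these symbols appear in this article, so treat them as assumptions.

```latex
% Inner loop: the hidden state W_t takes one gradient step per token
% on a self-supervised reconstruction loss.
W_t = W_{t-1} - \eta \, \nabla_W \ell(W_{t-1}; x_t), \qquad
\ell(W; x_t) = \bigl\| f(\theta_K x_t; W) - \theta_V x_t \bigr\|^2

% Output rule: the updated state makes the prediction for token t.
z_t = f(\theta_Q x_t; W_t)

% Outer loop: ordinary next-token training over the projections (and the
% rest of the network), which the inner loop treats as fixed.
\min_{\theta_K, \theta_V, \theta_Q, \dots} \; \sum_t \mathcal{L}_{\mathrm{LM}}(z_t, x_{t+1})
```

Read this way, the projections decide what the inner loop tries to remember, which is why they behave like hyperparameters from the inner loop's point of view.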

Unlike standard meta-learning that adapts to different tasks, TTT adapts the model to each individual test sample. While a single sample contains limited information, it is sufficient for training this hidden state model. This allows for rapid adaptation without retraining the entire network.

The framework is flexible: if the inner loop uses a linear model trained with batch gradient descent, it is equivalent to linear attention. If it employs a Nadaraya-Watson estimator, TTT becomes equivalent to self-attention. This theoretical unification suggests we can get the best of both worlds: linear efficiency with Transformer-like expressiveness.
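
The linear-attention equivalence is easy to check numerically. The sketch below assumes the conditions under which I understand it to hold: a linear inner model, a zero initial state, every gradient taken at that initial state (batch gradient descent), and a learning rate of 1/2 to cancel the factor of 2 from the squared loss. Function names are mine, and the attention here is the unnormalized causal variant.

```python
import numpy as np

def ttt_linear_batch_gd(K, V, Q, lr=0.5):
    """TTT with a linear inner model f(k; W) = W @ k, W_0 = 0, and every
    gradient taken at W_0 (batch gradient descent)."""
    d_out, d_in = V.shape[1], K.shape[1]
    W0 = np.zeros((d_out, d_in))
    W = W0.copy()
    outputs = []
    for k, v, q in zip(K, V, Q):
        # Gradient of ||W0 @ k - v||^2 with respect to W, evaluated at W0.
        grad = 2.0 * np.outer(W0 @ k - v, k)
        W = W - lr * grad
        outputs.append(W @ q)
    return np.stack(outputs)

def causal_linear_attention(K, V, Q):
    """Unnormalized causal linear attention: z_t = sum over s <= t of v_s (k_s . q_t)."""
    scores = np.tril(Q @ K.T)   # scores[t, s] = q_t . k_s, kept only for s <= t
    return scores @ V

rng = np.random.default_rng(0)
K, V, Q = (rng.normal(size=(6, 4)) for _ in range(3))
print(np.allclose(ttt_linear_batch_gd(K, V, Q), causal_linear_attention(K, V, Q)))
```

Drop any of those assumptions (for example, take each gradient at the latest weights instead of at W_0) and the two functions stop agreeing, which is exactly the extra expressiveness the TTT layer is after.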

In practice, linear complexity means latency grows predictably with context length instead of blowing up quadratically. I think meta-learning per sample adds inference overhead that might kill throughput. Operationally, if it matches Transformer accuracy at lower cost, we deploy it immediately.

Learning at Test Time

The production implication here is that we are moving from static weights to dynamic inference-time adaptation. That sounds great until you realize every token update requires a gradient step. I read the details on how TTT compresses context into hidden states, and it’s fundamentally different from standard RNNs. The team uses self-supervised learning to treat the context as an unlabeled dataset.

This means the model isn’t just passing a vector along; it’s actively training on the fly.

Instead of a fixed vector, the hidden state becomes a linear model or small neural network. The update rule applies one step of gradient descent on the self-supervised loss for each input. In practice, if every token triggers a backward pass, your GPU memory will scream before your latency metrics do. This approach allows the hidden state to remember inputs that generate large gradients, offering better fitting than selective forgetting. It effectively trains different parameters for each sequence during inference.
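
Here is a minimal sketch of that update rule with a linear hidden-state model, processed one token at a time. The projection matrices theta_K, theta_V, theta_Q, the squared reconstruction loss, and the learning rate are assumptions on my part for illustration; this is not the authors' code.

```python
import numpy as np

def ttt_linear_online(X, theta_K, theta_V, theta_Q, lr=0.1):
    """Naive sequential TTT layer with a linear hidden-state model."""
    d = theta_K.shape[0]
    W = np.zeros((d, d))            # hidden state: weights of the inner model
    outputs, losses = [], []
    for x in X:
        k, v, q = theta_K @ x, theta_V @ x, theta_Q @ x
        err = W @ k - v             # reconstruction error before the update
        losses.append(float(err @ err))
        W = W - lr * 2.0 * np.outer(err, k)   # one gradient step on ||W k - v||^2
        outputs.append(W @ q)       # predict with the freshly updated state
    return np.stack(outputs), np.array(losses), W

I = np.eye(4)
x = np.array([1.0, 0.0, 0.0, 0.0])
X = np.tile(x, (5, 1))              # the same token five times
_, losses, W = ttt_linear_online(X, I, I, I)
print(losses)                       # loss measured before each update
print(W.shape)                      # the state stays d x d however long X is
```

Feeding the same token repeatedly shows the "training on the fly" part: the loss recorded before each update shrinks as the state fits that token. For a linear model the gradient has a closed form, so no autograd tape is needed; a deeper inner model is where the backward-pass memory concern starts to bite.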

The naive TTT layer works but cannot be parallelized. That’s a hard stop for most inference engines. The team proposed mini-batch gradient descent to remove this bottleneck: gradients for all tokens in a mini-batch are taken with respect to the weights at the start of that mini-batch, so they can be computed in parallel. Using the dual form, they compute only the end-of-mini-batch weights and the output tokens, without materializing the intermediate per-token weights. Their JAX implementation is more than five times faster than the naive approach.
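
To make the dual form concrete, the sketch below computes one mini-batch two ways for a linear inner model: a token-by-token loop, and a handful of matrix products that never build the per-token weights. It assumes a squared reconstruction loss and that every gradient in the mini-batch is taken at the starting weights W0; the function names and shapes are mine, and the real implementation handles more (normalization, learned learning rates, GPU kernels).

```python
import numpy as np

def minibatch_primal(W0, K, V, Q, lr):
    """Token-by-token loop; every gradient in the mini-batch is taken at W0."""
    W, outputs = W0.copy(), []
    for k, v, q in zip(K, V, Q):
        W = W - lr * 2.0 * np.outer(W0 @ k - v, k)
        outputs.append(W @ q)
    return np.stack(outputs), W

def minibatch_dual(W0, K, V, Q, lr):
    """Same result from a few matrix products, no per-token weight updates."""
    E = K @ W0.T - V                       # E[s] = W0 @ k_s - v_s
    causal = np.tril(Q @ K.T)              # causal[t, s] = q_t . k_s for s <= t
    Z = Q @ W0.T - 2.0 * lr * causal @ E   # outputs for all tokens in the batch
    W_end = W0 - 2.0 * lr * E.T @ K        # state handed to the next mini-batch
    return Z, W_end

rng = np.random.default_rng(0)
d, b = 8, 16                               # hypothetical width and mini-batch size
W0 = rng.normal(size=(d, d)) * 0.1
K, V, Q = (rng.normal(size=(b, d)) for _ in range(3))
Z_loop, W_loop = minibatch_primal(W0, K, V, Q, lr=0.01)
Z_dual, W_dual = minibatch_dual(W0, K, V, Q, lr=0.01)
print(np.allclose(Z_loop, Z_dual), np.allclose(W_loop, W_dual))
```

The dual version trades b small rank-1 updates for dense matrix multiplies, which is the kind of work accelerators are good at. The mini-batch size then becomes a knob: larger batches parallelize better but take staler gradients.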

Can TTT Become the “Transformer Killer”?

With theoretical feasibility established, how does TTT perform in experiments? The simplest test would be replacing self-attention layers directly. However, the team found that modern RNN backbones like Mamba include temporal convolutions before the RNN layer, which also benefit TTT. Therefore, in experiments, TTT-Linear and TTT-MLP were primarily applied to the Mamba backbone, with other training details strictly following the settings in the Mamba paper.

In short-context tests on the Pile dataset:

At 2k context length, TTT-Linear, Mamba, and Transformer exhibited comparable performance, while TTT-MLP performed slightly worse.
At 8k context length, both TTT-Linear and TTT-MLP outperformed Mamba and Transformer. The TTT-MLP applied to a Transformer backbone (T) with approximately 1.3B parameters also performed slightly better than Mamba.

Overall, as the context length increases, the advantage of the TTT layer over Mamba expands. The team hypothesizes that linear models have less expressive power than MLPs, thus benefiting more from the convolutions in the Mamba backbone. I think swapping attention for dynamic state updates is a significant architectural shift; ensure your serving infrastructure supports this variability before committing.

Long-context experiments used a subset of the Pile dataset, Books3:

At 32k context length, both TTT-Linear and TTT-MLP outperformed Mamba, similar to observations at 8k on Pile. Even the TTT-MLP with a Transformer backbone (T) performed slightly better than Mamba.
At the 1.3B parameter scale, TTT-MLP (T) was only slightly worse than TTT-MLP (M). The Transformer backbone may be more suitable for larger models and longer contexts outside the scope of this paper’s evaluation.

In speed tests on A100 GPUs, TTT-Linear was slightly faster than Mamba during the prefill phase and nearly identical in speed during decoding. Compared to Transformers, TTT-MLP also maintained an advantage of linear complexity overall. Operationally, linear scaling is attractive for cost, but if the per-token compute overhead is high, your total bill might not drop as much as you hope.

Co-first author Karan Dalal stated: “A question I am constantly asked is whether we believe TTT is the ‘Transformer killer.’ I still think we need to continue working hard.”

Hidden states can be any model, but current research only involves linear models and small MLPs; more complex ones remain to be studied.
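
For a sense of what a non-linear hidden state costs, here is a sketch of one inner-loop step when the state is a small two-layer MLP, with the backward pass written out by hand. The tanh activation, layer sizes, and learning rate are my own choices to keep the gradient short; as I understand it, the paper's TTT-MLP adds details (a different activation, normalization, a residual connection) that are left out here.

```python
import numpy as np

def mlp_state_step(W1, W2, k, v, lr=0.05):
    """One inner-loop step when the hidden state is a two-layer MLP."""
    h = np.tanh(W1 @ k)                 # hidden activations
    err = W2 @ h - v                    # reconstruction error
    loss = float(err @ err)
    grad_W2 = 2.0 * np.outer(err, h)
    grad_pre = (W2.T @ (2.0 * err)) * (1.0 - h ** 2)   # backprop through tanh
    grad_W1 = np.outer(grad_pre, k)
    return W1 - lr * grad_W1, W2 - lr * grad_W2, loss

rng = np.random.default_rng(0)
d, hidden = 4, 16                       # hypothetical sizes
W1 = rng.normal(size=(hidden, d)) * 0.5
W2 = rng.normal(size=(d, hidden)) * 0.5
k, v = rng.normal(size=d), rng.normal(size=d)
for _ in range(3):
    W1, W2, loss = mlp_state_step(W1, W2, k, v)
    print(round(loss, 4))               # loss measured before each update
```

The state is now two matrices instead of one, and every token pays for a forward and a backward pass through both. That is the per-token overhead to weigh against the better long-context numbers.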

The learning of hidden state models could use Adam instead of standard gradient descent, among other optimizations.
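
As a sketch of what swapping in Adam would mean operationally: the optimizer's moment estimates have to travel with the hidden state. The update below is the standard Adam rule applied to a weight matrix; using it as the inner-loop optimizer is the open idea mentioned above, not something the paper evaluates as far as I know, and the hyperparameters are just common defaults.

```python
import numpy as np

def adam_state_step(W, m, v, grad, step, lr=1e-2, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update of the hidden-state weights W; step counts from 1."""
    m = beta1 * m + (1.0 - beta1) * grad            # running mean of gradients
    v = beta2 * v + (1.0 - beta2) * grad ** 2       # running mean of squared gradients
    m_hat = m / (1.0 - beta1 ** step)               # bias correction
    v_hat = v / (1.0 - beta2 ** step)
    W = W - lr * m_hat / (np.sqrt(v_hat) + eps)
    return W, m, v

W = np.zeros((4, 4))
m, v = np.zeros_like(W), np.zeros_like(W)           # two extra buffers per state
grad = np.full((4, 4), 0.3)                         # stand-in for a TTT gradient
W, m, v = adam_state_step(W, m, v, grad, step=1)
print(W[0, 0])
```

The two moment buffers are the same shape as W, so the per-sequence state roughly triples in size. For a serving stack that has to keep one state per active sequence, that is the number to budget for.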

Video Modeling Implications

The three co-first authors bring diverse backgrounds to this work: Dr. Yu Sun, a UC Berkeley graduate and current Stanford postdoc; Xinhao Li, who holds a master’s from UCSD and an undergraduate degree from UESTC; and Karan Dalal, a UC Berkeley alum currently interning at robotics startup 1X.

I followed the release notes from Wang Xiaolong, an assistant professor at UCSD and co-author of the study, who revealed that the TTT method applies to video modeling, not just language. He calls TTT a “Transformer killer,” though I believe we still need further engineering efforts before this is production-ready.

In the future, when modeling long videos, we can sample frames densely rather than at 1 FPS. These dense frames are a burden for Transformers but a boon for TTT layers.

In practice, dense frame sampling sounds great for accuracy but will spike memory pressure on edge devices. I think if TTT scales linearly, we might finally offload inference from expensive GPU clusters. Operationally, lab demos of video modeling rarely survive the transition to real-time streaming constraints.

Paper link: https://arxiv.org/abs/2407.04620