DeepSeek's Next-Gen Tech Leaked Early; Liang Wenfeng Wins ACL2025 Best Paper Award

Author

Amara Okonkwo · Robotics & Embodied AI Editor

Humanoids, industrial robots, and what is demo vs. deployed.

About this contributor →

I read the ACL 2025 proceedings, and while the ceremony was flashy, the real story is in the code. DeepSeek’s Liang Wenfeng took home Best Paper for a mechanism that claims to untether long-context processing from quadratic complexity. But as I look at these benchmarks, I’m thinking about what happens when this hits production servers under load, not just on a clean test set.

The ACL 2025 Best Paper and the NSA Mechanism

At the ACL 2025 awards ceremony, a paper co-authored by Liang Wenfeng of DeepSeek as the corresponding author and jointly published with Peking University and other institutions was awarded Best Paper.

ACL 2025 saw unprecedented scale, with total submissions reaching 8,360—nearly double last year’s 4,407—resulting in exceptionally fierce competition.

DeepSeek's Next-Gen Tech Leaked Early; Liang Wenfeng Wins ACL2025 Best Paper Award — figure 2

In simple terms, their proposed Native Sparse Attention (NSA) mechanism boosts long-text processing speed by 11 times through synergistic optimization of algorithms and hardware. Remarkably, performance not only remained stable but surpassed that of traditional full attention models.

First author Yuan Jingyang presented the work at the conference, revealing that this technology can extend context length to one million tokens and will be applied to the next frontier model.

Given that the paper was published after the release of DeepSeek-R1, the experimental setup also noted that new models were fine-tuned using distillation data from DeepSeek-R1.

This has led to widespread speculation that the technology will be integrated into the next-generation DeepSeek-V4 and DeepSeek-R2.

DeepSeek's Next-Gen Tech Leaked Early; Liang Wenfeng Wins ACL2025 Best Paper Award — figure 3

I think lab benchmarks are clean, but real-world inference costs will tell if this sparse attention actually saves money. In the field, one million tokens sounds great until you see the memory overhead on consumer GPUs. I want to see how this handles noisy data, not just distilled perfection.

Slimming Down Attention Mechanisms, Speed Surges by 11x

For a long time, processing long texts with large language models has been like dancing in shackles. The computational complexity of traditional full attention mechanisms grows quadratically with sequence length; when handling text of 64k length, attention calculations account for 70-80% of total latency.

The solution proposed in this paper is ingenious: since not all word relationships are equally important, why not teach the model to “focus on the key points”?

NSA employs a dynamic hierarchical sparse strategy, operating through three parallel attention branches working in synergy:

  • Compressed Attention, responsible for capturing coarse-grained global information patterns, akin to quickly scanning the entire text to grasp the main idea;
  • Selective Attention, which focuses on the most important word blocks within the sequence, equivalent to carefully reading key paragraphs;
  • Sliding Attention, tasked with acquiring local contextual information to ensure no details are lost.

DeepSeek's Next-Gen Tech Leaked Early; Liang Wenfeng Wins ACL2025 Best Paper Award — figure 4

The most ingenious aspect of this design is that it does not simply discard information but balances computational density through carefully designed algorithms.

More importantly, the entire architecture has been deeply optimized for modern GPU hardware, achieving an end-to-end natively trainable mode.

DeepSeek's Next-Gen Tech Leaked Early; Liang Wenfeng Wins ACL2025 Best Paper Award — figure 5

In practical tests, when processing sequences of 64k length, NSA demonstrated astonishing speed advantages throughout the entire lifecycle of decoding, forward propagation, and backward propagation.

Decoding speed increased by 11.6 times, forward propagation by 9 times, and backward propagation saw a 6x acceleration, delivering tangible efficiency gains for both model inference and training.

DeepSeek's Next-Gen Tech Leaked Early; Liang Wenfeng Wins ACL2025 Best Paper Award — figure 6

Speed Isn’t the Only Metric That Matters

I read the leak carefully, and while the latency claims get the headlines, the benchmark deltas are where the actual engineering story lives. In the field, we don’t care about theoretical speed if the model hallucinates when it gets there. The 27B parameter NSA model beat the full attention baseline in 7 out of 9 general metrics, which is a solid start but not a revolution.

What I watch for is seven out of nine wins is decent, but I’d rather see consistency across all tasks than selective victories.

The real numbers that caught my eye were in reasoning-heavy benchmarks. DROP improved by 0.042 and GSM8K by 0.034. This suggests the sparse attention mechanism isn’t just skipping tokens; it’s forcing the model to prioritize key information rather than drowning in noise. That is a tangible benefit for unit economics, as you pay less for inference on cleaner signals.

DeepSeek's Next-Gen Tech Leaked Early; Liang Wenfeng Wins ACL2025 Best Paper Award — figure 7

Long-context retrieval is where this architecture claims to shine. In the “needle in a haystack” test with a 64k context, NSA achieved perfect retrieval accuracy at all positions. On LongBench, it scored an average of 0.469, surpassing the full attention baseline by +0.032 and leading other sparse methods. Perfect retrieval sounds great on a slide deck, but I’ve seen systems fail when the “needle” is buried in dense technical documentation rather than simple text.

I think perfect accuracy in a controlled haystack test doesn’t guarantee reliability in messy real-world logs.

The multi-hop reasoning results were more nuanced. NSA improved by 0.087 on HPQ and 0.051 on 2Wiki compared to full attention. It also increased by 0.069 in code understanding (LCC) and by 0.075 in passage retrieval (PassR-en). These gains are modest but consistent, suggesting the model retains structural integrity better than previous sparse attempts.

DeepSeek's Next-Gen Tech Leaked Early; Liang Wenfeng Wins ACL2025 Best Paper Award — figure 8 DeepSeek's Next-Gen Tech Leaked Early; Liang Wenfeng Wins ACL2025 Best Paper Award — figure 9

The team ran a specific experiment that I found telling: they fine-tuned the model on mathematical reasoning data from DeepSeek-R1 and tested it on AIME 24. Under an 8k context, NSA-R achieved an accuracy of 0.121, whereas the full attention model only reached 0.046. Even at 16k context, NSA-R maintained 0.146 accuracy, far exceeding the full attention model’s 0.092.

In the field, a threefold improvement in math reasoning on short contexts is impressive, but can it scale without exploding costs?

These results claim that NSA doesn’t sacrifice performance for speed, achieving a win-win in efficiency and capability. I followed the release with skepticism; usually, “efficiency” means cutting corners on accuracy. Here, the data suggests they actually improved reasoning by forcing focus. That is a rare outcome in our industry, where we often have to choose between being fast or being smart.

DeepSeek's Next-Gen Tech Leaked Early; Liang Wenfeng Wins ACL2025 Best Paper Award — figure 10

Three More Things

Four Best Papers took home awards this year, and while the headlines often chase the biggest names, these three papers dig into the messy reality of how models actually behave under pressure. I followed the release notes closely to see what stuck.

“Language Models Resist Alignment: Evidence From Data Compression” by the Peking University team.

This study investigates the “elasticity” of large language models, referring to how easily models revert to their pre-training states after subsequent fine-tuning, despite having undergone alignment training (to align with human values and reduce harmful outputs), much like a stretched spring snapping back.

This implies that current alignment methods may only superficially alter models without being robust. Future efforts require more effective alignment techniques to ensure models stably meet human needs, particularly in open-source models where malicious fine-tuning could easily compromise safety mechanisms.

What I watch for is if the model snaps back after fine-tuning, our safety layers are just window dressing.

DeepSeek's Next-Gen Tech Leaked Early; Liang Wenfeng Wins ACL2025 Best Paper Award — figure 11

“Fairness through Difference Awareness: Measuring Desired Group Discrimination in LLMs” by the Stanford team.

This paper explores a new perspective on “fairness” in large models: “difference awareness.” Simply put, models should make distinctions between different groups in appropriate scenarios rather than treating everything uniformly.

The study found that models performing well in traditional fairness tests did not score high on “difference awareness”; while stronger model capabilities (e.g., higher MMLU scores) correlated with better situational awareness, they did not necessarily improve difference awareness; existing “de-biasing” methods (such as prompting the model to “remain unbiased”) actually caused models to ignore differences more, even misidentifying correct answers.

I think blindly forcing uniformity breaks logic when real-world distinctions are necessary for accuracy.

DeepSeek's Next-Gen Tech Leaked Early; Liang Wenfeng Wins ACL2025 Best Paper Award — figure 12

“A Theory of Response Sampling in LLMs: Part Descriptive and Part Prescriptive” by the Helmholtz Information Security Center team and others.

This paper points out that the sampling mechanism used by large models when generating responses is similar to human decision-making, containing descriptive components (reflecting statistical norms of concepts) and prescriptive components (implying ideal states of concepts).

Experiments verified that whether dealing with novel or existing concepts (covering 500 concepts across 10 domains), samples generated by LLMs deviate from statistical averages toward their perceived “ideal values,” a phenomenon significantly present across 15 different models. Case studies suggest this bias could lead to skewed decisions in fields like healthcare, raising ethical concerns.

In the field, models optimizing for an “ideal” rather than the truth is a liability in high-stakes deployments.

DeepSeek's Next-Gen Tech Leaked Early; Liang Wenfeng Wins ACL2025 Best Paper Award — figure 13

Link to the DeepSeek paper: https://arxiv.org/abs/2502.11089

Comments