Nvidia Abandons LPU for GPUs: New Inference Chip Groq Ready to Use, OpenAI First to Adopt

Author

David Kowalski · Developer Tools & Agents Editor

Coding agents and IDE workflows tested the way working teams use them.

About this contributor →

I’ve spent years watching NVIDIA double down on its CUDA moat, so seeing them pivot away from their own silicon for inference feels like a seismic shift in the developer tools landscape. The industry standard has always been: if you want speed, you build it yourself or suffer the latency. But Jensen Huang is flipping that script at the upcoming GTC conference in San Jose this March.

He’s unveiling a brand-new AI inference system that abandons the traditional LPU (Language Processing Unit) path for GPUs. At its core lies a new chip optimized specifically for inference. More importantly, the first major customer has already been locked in: OpenAI, which recently completed a massive $110 billion financing round to fuel this exact kind of infrastructure.

Nvidia Abandons LPU for GPUs: New Inference Chip Groq Ready to Use, OpenAI First to Adopt — figure 2

What stood out to me isn’t just the speed, but the origin story. The underlying architecture does not come from NVIDIA’s self-developed designs; it is based on the LPU architecture created by the former Groq team. This marks a rare moment: For the first time, NVIDIA is introducing external architectural designs on a large scale within its core AI computing product line.

This “not build it ourselves” strategy traces back to last year’s industry-shaking deal—NVIDIA spent approximately $20 billion in an “acqui-hire” (acquisition for talent and technology) of Groq’s core technologies and team. Now, this inference chip represents the first tangible result of that investment. It remains a classic Jensen Huang strategy: buy mature solutions, deploy quickly, go straight to battle, and waste not a penny unnecessarily. Extreme ROI.

I think this signals NVIDIA is prioritizing time-to-market over vertical integration for inference workloads. As a builder, openAI’s early adoption validates the chip’s performance before general release. Personally, acqui-hiring Groq was a smart move to bypass years of internal R&D delays.

The LPU Pivot

I read the latest reports from The Wall Street Journal confirming that NVIDIA is developing a new inference system integrating Groq’s chips, set for an official unveiling at GTC. This move signals a strategic shift away from relying solely on its own silicon for this specific workload.

Nvidia Abandons LPU for GPUs: New Inference Chip Groq Ready to Use, OpenAI First to Adopt — figure 3

The groundwork was already laid in OpenAI’s recent financing documents, which I reviewed closely:

Expanding long-term cooperation with NVIDIA, including the use of 3GW of dedicated inference capacity, as well as providing 2GW of training compute power on the Vera Rubin system.

If Jensen Huang sticks to his schedule, that “dedicated inference capacity” almost certainly relies on this new architecture. This marks NVIDIA’s first large-scale introduction of external architectural designs into its core AI computing product line— specifically, Groq’s LPU.

Choosing an external architecture over pure self-development is a race against time. Top-tier clients like OpenAI are actively seeking more efficient inference alternatives and negotiating with other chip vendors. With inference demand exploding, NVIDIA needed a targeted solution faster than internal R&D could deliver.

I think hardware agnosticism is becoming essential for cost control in production environments. I’m watching this closely to see if the API layer abstracts away the silicon differences. As a builder, faster inference means lower operational costs, which matters more than raw peak performance.

The technical rationale for using LPU over GPU comes down to adaptation for inference scenarios. GPUs typically store large model parameters in external HBM (High Bandwidth Memory), requiring frequent data movement between compute cores and memory. During training, this movement cost is amortized through massive parallelism.

Nvidia Abandons LPU for GPUs: New Inference Chip Groq Ready to Use, OpenAI First to Adopt — figure 4

However, during inference—especially the decode phase—the batch size shrinks and latency sensitivity spikes. System bottlenecks stem more from data movement than from compute power itself.

Nvidia Abandons LPU for GPUs: New Inference Chip Groq Ready to Use, OpenAI First to Adopt — figure 5

Groq’s LPU architecture flips this logic. It employs high-density on-chip SRAM, keeping data “close to the compute units,” significantly shortening data paths. This reduces latency and energy consumption from an architectural perspective, making it better suited for low-latency inference scenarios. Theoretically, its maximum speed can be 100 times faster than GPUs.

Nvidia Abandons LPU for GPUs: New Inference Chip Groq Ready to Use, OpenAI First to Adopt — figure 6

As Agent applications become increasingly prevalent, the AI computing structure is shifting from “training-first” to “inference-first.” Inference is no longer just a supplementary step after training; it has become a larger-scale, higher-frequency, long-term workload. If NVIDIA officially incorporates LPU into its core product line, this will not only be the launch of a new chip but also a response to the shift in computing priorities.

This explains why NVIDIA completed the integration of Groq’s core technologies and team for approximately $20 billion last year, bringing in key members such as founder Jonathan Ross (the father of Google TPU).

Nvidia Abandons LPU for GPUs: New Inference Chip Groq Ready to Use, OpenAI First to Adopt — figure 7

In short, the inference market is reshaping the computing landscape, and NVIDIA must secure its position.

The Inference Shift Hits NVIDIA Hard

I’ve been watching the compute landscape shift for months, and the writing is on the wall: agent applications have fundamentally altered demand structures. We are no longer just training models; we are running them constantly. This means inference now dominates the budget, driven by high call frequencies, massive scale, and long durations where cost is king.

While NVIDIA GPUs remain the gold standard for training, providers are actively splitting their stacks. They’re keeping NVIDIA for heavy lifting but hunting for cheaper, dedicated silicon for inference. Just last month, OpenAI signed a multi-billion-dollar computing cooperation agreement with Cerebras. Cerebras builds chips optimized specifically for this workload. Their CEO, Andrew Feldman, has been vocal about their speed advantages over NVIDIA in specific scenarios.

Anthropic is taking a different route, relying on self-developed chips from Amazon Web Services and Google Cloud to support model operations rather than leaning entirely on NVIDIA. Meanwhile, Meta has placed large-scale orders with AMD, working together to optimize GPU architectures for inference tasks to reduce their dependence on the green giant.

Nvidia Abandons LPU for GPUs: New Inference Chip Groq Ready to Use, OpenAI First to Adopt — figure 8

The pressure is global. In China, DeepSeek bypassed NVIDIA entirely, granting exclusive early access to DeepSeek V4 to Huawei. They have already completed model migration on the Ascend platform. Rumors also swirl around Cambricon, but either way, these moves hurt NVIDIA’s dominance. Bernstein Research predicts that by 2026, Huawei could capture 50% of China’s AI chip market, while NVIDIA’s share might drop to single digits.

Competitors are doubling down on inference-specific architectures. Google is pushing its TPU infrastructure, and Amazon is activating its Trainium chips for Agent applications following its cooperation rights in OpenAI’s financing plan. Domestic giants like ByteDance, Alibaba, and Baidu are also manufacturing their own silicon. The message is clear: customers are diversifying risk because inference has become the main battlefield.

So why aren’t GPUs ideal here? Training追求s “massive parallelism” for throughput. Inference demands “single-token speed” and stable latency. It’s split into pre-fill (input) and decode (output). The user experience hinges on that second step—low-latency generation. Here, the bottleneck isn’t raw compute power; it’s frequent data movement. GPUs were built for parallelism, not this specific flow.

The Washington Post noted this marks the first time NVIDIA has faced a core hardware architectural challenge since the AI wave began. With over 90% market share and Hopper/Blackwell/Rubin series dominating training, NVIDIA must respond to the inference surge. This new chip is their answer.

Personally, splitting stacks makes sense when inference costs bleed budgets dry. I think dedicated silicon for decode phases offers latency benefits GPUs can’t match. As a builder, vendor lock-in risks are forcing us to evaluate non-NVIDIA options now.

What’s Next on the Horizon?

Beyond this new inference chip, Jensen Huang hinted at more surprises during GTC. He stated:

The GTC conference this year will also unveil a new product line “unseen in the world.”

The industry is speculating wildly. Some expect next-generation Rubin series GPUs or entirely new architecture chips from the Feynman series. Others wonder if we’ll finally see those delayed consumer-grade graphics cards. Whatever it is, NVIDIA knows it can’t rest on its training laurels anymore.

References

I reviewed these sources to verify the strategic shift from LPU to GPUs and confirm Groq’s immediate availability for inference workloads.

  1. Nvidia plans new chip to speed AI processing, shake up computing marketwww.wsj.com
  2. NVIDIA Kicks Off the Next Generation of AI With Rubin — Six New Chips, One Incredible AI Supercomputer — NVIDIA today kickstarted the next generation of AI with the launch of the NVIDIA Rubin platform, comprising six new chips designed to deliver one incredible AI supercomputer.
  3. Nvidia’s Blackwell Ultra and Vera Rubin AI Chips — Electronics

Nvidia Abandons LPU for GPUs: New Inference Chip Groq Ready to Use, OpenAI First to Adopt — figure 9

Comments