The AI race just hit a new speed bump, and it’s wearing an xAI logo. Just now, Elon Musk’s company unveiled Grok 3 in a livestream watched by over 3 million people, claiming this model has shattered previous performance ceilings.
According to the official evaluation from the Arena (lmarena.ai), Grok 3 is “the first model to break the 1400-point mark and ranks first in all categories.” That’s not just a win; it’s a statistical anomaly on their leaderboard.

What stands out to me is the infrastructure behind this claim. Grok 3 is reportedly the first model trained on a cluster of 100,000 H100 GPUs, later expanded to 200,000. That’s a massive compute footprint that suggests xAI isn’t just tweaking weights—they’re throwing hardware at the problem.
In his teaser for the release, Musk praised Grok 3 highly, calling it “the smartest AI on Earth.” It’s bold language, and in this industry, bold claims usually invite immediate scrutiny from competitors and skeptics alike.

Before the official release, AI researcher Andrej Karpathy gained early access. After playing with it for two hours, he published a long post detailing his impressions. His take is worth noting: Karpathy believes that Grok 3’s reasoning capabilities have reached SOTA (State of the Art), with reasoning performance comparable to o1-pro and slightly better than DeepSeek R1 and Gemini’s reasoning models.
I think that if it really matches o1-pro, it’s ready for serious coding tasks, though I’d wait for independent benchmarks before trusting these claims. As a builder, I care more about reasoning improvements than raw chat speed.
Considering Grok 3 was trained from scratch a year ago, achieving such results is truly incredible. The timeline suggests xAI moved fast to catch up with the leading edge of reasoning models.

Moreover, for classic large model “hard problems” such as comparing digits and decimals, Grok 3 correctly solved them after enabling reasoning. This is a specific pain point for many developers who have seen models fail at basic arithmetic logic in complex prompts.
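To make the decimal pitfall concrete, here is a minimal Python sketch. The pair 9.11 versus 9.9 is a commonly cited example of this failure and is an assumption here, not a prompt quoted from the livestream; the function names are hypothetical. It shows how the same two strings rank differently depending on whether they are read as decimal numbers or as dot-separated version numbers, which is one plausible explanation for why models get confused.

```python
from decimal import Decimal


def larger_as_decimal(a: str, b: str) -> str:
    """Return the larger of two strings when both are read as decimal numbers."""
    return a if Decimal(a) > Decimal(b) else b


def larger_as_version(a: str, b: str) -> str:
    """Return the 'larger' string when each dot-separated part is read as an integer."""
    def key(s: str) -> tuple:
        return tuple(int(part) for part in s.split("."))
    return a if key(a) > key(b) else b


# Illustrative inputs (assumed, not taken from the article).
print(larger_as_decimal("9.11", "9.9"))  # numeric reading: 9.9 is larger
print(larger_as_version("9.11", "9.9"))  # version-style reading: 9.11 is "larger"
```

The numeric reading is the correct one for arithmetic; the version-style reading is only correct for things like software release numbers.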

However, some have questioned Grok’s status. One netizen spoofing Nvidia’s Jensen Huang commented that even if Grok 3 is truly the strongest, it will only stay on top for at most a week. The shelf life of these benchmarks is notoriously short in this market.
Coupled with OpenAI teasing its next-generation GPT plans, another Twitter battle between Musk and Sam Altman is about to erupt. The drama is as much part of the product launch as the code itself.

Altman also tweeted last night that testing GPT-4.5 gave him a stronger “feel the AGI” sensation than expected. He’s pushing back, claiming his own model is nearing a qualitative leap in intelligence.
Netizens in the comments section stirred things up, urging him to beat Musk to the punch and livestream the release of GPT-4.5 in the morning. The community is hungry for proof, not just promises.

Back to business, let’s look at what was discussed during the livestream.
Scaling Up for Reasoning: The Grok 3 Launch
I’ve been tracking xAI’s trajectory closely, and their latest move isn’t just about releasing another model—it’s a statement on compute dominance. Musk’s ‘Strongest on Earth’ Grok 3 makes waves by topping the Arena with over 1400 points, but the real story is how they got there: a massive infrastructure build-out paired with aggressive reasoning capabilities.
The Human Element Behind the Compute
Four people took the stage for this livestream introduction. Besides Musk, the most prominent figures were two Chinese individuals seated in the center; they are founding members of xAI.
From left to right:
- Jimmy Ba, a 2023 Sloan Research Fellow and Assistant Professor at the University of Toronto, where he completed both his undergraduate and doctoral studies, the latter under Geoffrey Hinton.
- Yuhuai (Tony) Wu, a postdoctoral researcher at Stanford University, who received his Ph.D. from the University of Toronto.
The person on the far left is Igor Babuschkin, an engineer at xAI.

The four first introduced the training process of Grok 3.
Last year, Musk teased that Grok 3 was being trained on 100,000 H100 GPUs, making it the first model trained on a cluster of that scale.
At the time, netizens called this a “super factory” for neural networks.

Today at the press conference, it was revealed that by day 92 of training, the cluster had expanded to 200,000 GPUs.

Personally, I find that scale impressive, but I’m more interested in whether the inference costs will ever justify that training spend. I think xAI’s hiring of top academic talent suggests they’re betting on research depth over just brute force. As a builder, my key question remains: can developers access this power without waiting for a press conference?
Reasoning Capabilities and Benchmark Dominance
With that much compute behind it, xAI also followed the trend by introducing Chain-of-Thought reasoning capabilities in Grok 3.
At a summit in Dubai earlier, Musk proudly declared:
Grok 3 has strong reasoning capabilities and is smarter than all currently known models.

This wave of Grok 3 comes in two versions: Full and Mini. Both outperformed non-reasoning models like GPT-4o and DeepSeek-V3 on datasets for mathematics, science, and code.

Additionally, in its early stages under the alias “Chocolate,” Grok 3 topped the LMSYS leaderboard, becoming the only model to score over 1400.

Building on the base Grok 3 and Mini models, the xAI team also created two reasoning models.
The reasoning model based on Mini (Grok 3 mini Reasoning) is relatively mature, while the one based on the Full version (Grok 3 Reasoning Beta) is still in the Beta stage.
Live Demos: Physics and Game Logic
Before presenting the results, the four used Musk’s account to run two cases with Grok, related to physics and gaming respectively.
Generate code to create a 3D animated chart depicting a launch from Earth to Mars, followed by a return to Earth during the next launch window.

During the generation process, someone joked about when Grok could be installed on SpaceX rockets. Musk responded that it might take another two years.
Musk also stated that if everything goes smoothly, SpaceX plans to send the Optimus robot to Mars via Starship around November 2026, during the next Earth-Mars transfer window.
Returning to Grok, after considering Kepler’s laws and converting them into code, it ultimately generated code capable of producing such an animation:
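The animation itself is not reproduced here. As a rough illustration of the orbital arithmetic behind such a prompt (this is not the code Grok generated), the sketch below applies Kepler's third law to estimate the duration of a Hohmann transfer from Earth to Mars and the spacing between launch windows. It assumes circular, coplanar orbits and a Mars semi-major axis of 1.524 AU; all function and constant names are hypothetical.

```python
# Simplifying assumptions: circular, coplanar orbits around the Sun.
EARTH_AU = 1.0        # Earth's orbital radius in astronomical units
MARS_AU = 1.524       # Mars's semi-major axis in AU (approximate)
DAYS_PER_YEAR = 365.25


def orbital_period_years(semi_major_axis_au: float) -> float:
    """Kepler's third law for bodies orbiting the Sun: T^2 = a^3 (years, AU)."""
    return semi_major_axis_au ** 1.5


def hohmann_transfer_days(r1_au: float, r2_au: float) -> float:
    """Time of flight on a Hohmann ellipse: half the period of the transfer orbit."""
    transfer_axis = (r1_au + r2_au) / 2.0
    return 0.5 * orbital_period_years(transfer_axis) * DAYS_PER_YEAR


def synodic_period_days(r1_au: float, r2_au: float) -> float:
    """Time between successive identical alignments of the two planets."""
    t1 = orbital_period_years(r1_au)
    t2 = orbital_period_years(r2_au)
    return DAYS_PER_YEAR / abs(1.0 / t1 - 1.0 / t2)


print(f"One-way transfer: about {hohmann_transfer_days(EARTH_AU, MARS_AU):.0f} days")
print(f"Window spacing:   about {synodic_period_days(EARTH_AU, MARS_AU):.0f} days")
```

Under these assumptions a one-way trip takes roughly 260 days and launch windows recur roughly every 26 months, which is why a round trip has to wait for the next window before returning.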

The second question activated Big Brain mode, allowing the model to use more computing resources for deeper thinking.
The prompt required using the pygame library to design a game that combines Tetris and Bejeweled.
It also hinted that the code might be long, needing to be saved in a single file, and should be “insanely great.”

Grok 3 lived up to expectations, successfully combining these two games and introducing the features of the hybrid version:

When run, it looks like this: it retains Tetris’s elimination mechanics but adjusts them based on Bejeweled’s traits to require three blocks for a match.
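The "three blocks for a match" rule can be sketched in a few lines. The following is a minimal illustration of match detection on a grid, not the pygame code Grok produced in the demo; the grid representation, the `find_matches` name, and the `min_run` parameter are all assumptions made for this example.

```python
from typing import List, Optional, Set, Tuple

Cell = Tuple[int, int]
Grid = List[List[Optional[str]]]  # None marks an empty cell


def find_matches(grid: Grid, min_run: int = 3) -> Set[Cell]:
    """Return the cells that sit in a horizontal or vertical run of
    at least `min_run` identical, non-empty colors."""
    rows, cols = len(grid), len(grid[0])
    matched: Set[Cell] = set()

    def scan(line: List[Cell]) -> None:
        start = 0
        for i in range(1, len(line) + 1):
            r0, c0 = line[start]
            same = i < len(line) and grid[line[i][0]][line[i][1]] == grid[r0][c0]
            if not same:
                # The run line[start:i] just ended; keep it if long enough.
                if grid[r0][c0] is not None and i - start >= min_run:
                    matched.update(line[start:i])
                start = i

    for r in range(rows):
        scan([(r, c) for c in range(cols)])
    for c in range(cols):
        scan([(r, c) for r in range(rows)])
    return matched


board: Grid = [
    ["R", "R", "R", "G"],
    ["B", "G", "B", "G"],
    ["B", "R", "B", "G"],
]
print(sorted(find_matches(board)))
```

In a hybrid game, a function like this would run after each falling piece locks into place, with matched cells cleared and the blocks above them dropped, in place of Tetris's full-row check.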

Looking at the benchmark results, both versions achieved impressive scores in mathematics, science, and coding tasks.
Furthermore, when prompted to “think more” (the lighter shaded area above the bars), their performance surpassed DeepSeek-R1 and o3-mini (high).

However, many models are currently showing signs of “overfitting” on benchmarks. So, how does Grok 3 perform in reality?
The R&D team challenged it with questions from the recently held AIME 2025 competition. The results showed that Grok 3 Reasoning Beta and Grok 3 mini Reasoning scored 93 and 90 respectively, outperforming other reasoning models.

In addition to the Grok 3 pre-trained model and the two reasoning models, xAI also released an AI Agent called DeepSearch.

This feature can be seen as xAI’s counterpart to the Deep Research functions recently launched by OpenAI, Google, and others.
In short, DeepSearch scans the internet and X (formerly Twitter) to analyze information and provides summaries to answer questions.

Regarding access, X Premium+ users can experience Grok 3 starting today.
On the standalone app, a SuperGrok subscription is required—$30/month or $300/year.
Personally, benchmark scores are nice, but I need to know if DeepSearch actually saves me time on real research tasks. I think the pricing structure feels steep for casual users who just want quick answers without the subscription hassle.
The Release Process Was Full of Twists; Voice Mode Delayed
I watched the rollout of Grok 3 unfold with a mix of excitement and skepticism. It wasn’t just another model drop; it was a chaotic, high-stakes event that felt more like a live performance than a standard software release.
Looking back at the entire process, it was indeed full of twists and turns. Last August, during an interview with popular podcaster Lex Fridman, Musk said that Grok 3 was expected to be released by the end of that year. However, the first test instance wasn’t published until January 19 this year, and the actual release has been delayed until now.

Moreover, just over the weekend before the release, the xAI team was still urgently refining Grok 3.

An xAI employee also shared their experience, noting that on
At 11:30 PM that night (3:30 PM Beijing time on Monday, less than 24 hours before the launch), Musk posted online stating he was still pulling an all-nighter to finish his work.

Just an hour and a half before the press conference, Musk suddenly tweeted that the voice mode originally planned for this release was still unstable and would be postponed by another week.

During the live Q&A session, a netizen asked about the specific release date. The team responded that an early version would go online soon, followed by gradual iterations. However, Shivon Zilis, an executive at Musk’s Neuralink, had already spent an hour with Ara and posted her impressions earlier that morning, Beijing time.
Zilis described it as one of the most surprising and meaningful moments of her life. She discussed topics such as biology and quantum entanglement with Ara, and even asked Ara to create quiz questions to test what she had learned. Zilis answered only half of the questions correctly, but Ara patiently explained the remaining ones without dismissing any question as too foolish.

Someone later asked in the comments if Ara was a voice version, to which Shivon confirmed that it was.

As a builder, I find that late-stage feature delays erode trust in release timelines. I prefer stable text APIs over unstable voice prototypes. In my experience, rushing releases creates more technical debt than it saves.
Seeking $10 Billion in New Financing and Entering the Gaming Sector
The timing of xAI’s latest move feels calculated. Just last Friday, Bloomberg reported that xAI was seeking a new round of financing worth approximately $10 billion, valuing the company at around $75 billion (approximately 545.46 billion RMB).
Existing investors, including Sequoia Capital, Andreessen Horowitz, and Valor Equity Partners, are in talks to participate in this funding round. Since the deal has not yet been finalized, the release of the new model is likely to influence the outcome.

If these reports are confirmed, xAI’s fundraising speed is indeed astonishing. At the end of last December, the company had just completed a $6 billion Series C round, valuing it at $51 billion. In less than two months, the valuation jumped by approximately 47%. Looking further back, from the B to C rounds, the valuation doubled within six months. It can be said that xAI, established less than two years ago, has grown into a formidable rival to OpenAI.
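The growth figure can be checked with a quick calculation. This is a minimal sketch using only the numbers reported in this article ($51 billion, $75 billion, and the 545.46 billion RMB conversion); the variable and function names are hypothetical.

```python
def percent_change(old: float, new: float) -> float:
    """Relative change from old to new, expressed as a percentage."""
    return (new - old) / old * 100.0


series_c_valuation = 51.0   # billions of USD, as reported for the Series C
reported_valuation = 75.0   # billions of USD, per the Bloomberg report
reported_rmb = 545.46       # billions of RMB, the conversion quoted above

growth = percent_change(series_c_valuation, reported_valuation)
implied_rate = reported_rmb / reported_valuation  # RMB per USD implied by the article

print(f"Valuation growth: {growth:.1f}%")
print(f"Implied exchange rate: {implied_rate:.4f} RMB per USD")
```

The result rounds to about 47%, consistent with the figure quoted above.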
With ample funding, xAI announced not only continued model development but also a new strategic direction: betting on the gaming sector by establishing an AI game studio.

Musk first hinted at this last November, complaining that “too many game studios are controlled by large corporations.”

Now, Musk’s business empire is expanding once again.

I think AI-generated games are a novelty, not a workflow tool, and I’m skeptical about their immediate utility for developers. As a builder, I see this as brand expansion rather than technical innovation. The gaming angle is interesting but distant from my daily coding.
One More Thing
A few days before the Grok 3 launch, another dramatic incident sparked heated discussion. An xAI engineer (now a former employee) publicly posted a comparison of Grok 3’s coding abilities against several competitors. Although he clearly labeled this as his personal opinion, ranking his own model, Grok 3, fourth (with the top three spots taken by OpenAI models) caused controversy.

The employee later revealed that the company demanded he either delete the post or be fired, claiming the post exposed Grok 3’s existence. Upon hearing this, the engineer felt it was absurd, noting that everyone already knew about Grok 3, and even shared screenshots of Musk’s previous statements. Facing what he perceived as xAI’s petty behavior, the engineer decided to quit without hesitation, posting a lengthy explanation.
I will maintain my words and dignity, find another job, or start my own business. See you later.

Regarding this incident, Musk later responded that it was “weird,” but no further actions were reported.

More dramatically, due to a dispute over salary payments, the engineer later publicly posted again, tagging Musk:
Please do the right thing.

However, despite having “broken up,” the engineer who worked on Grok 3’s voice mode still set aside past grievances and helped promote Grok 3 multiple times. Moreover, the voice feature that Musk announced would be delayed today was indeed the work of this engineer’s team. Even after leaving, he remains proud of his contribution to the project.

That said, what do you think of this version of Grok 3? Once the next generation of GPT is released, can Musk maintain his lead?