Musk's 'Strongest on Earth' Grok 3 Makes Waves, Tops Arena with Over 1400 Points

Models & Benchmarks · Published: Feb 18, 2025 · David Kowalski · ~14 min read

Author

David Kowalski · Developer Tools & Agents Editor

Coding agents and IDE workflows tested the way working teams use them.

The AI race has a new front-runner, and it’s wearing an xAI logo. Elon Musk’s company has just unveiled Grok 3 in a livestream watched by more than 3 million people, claiming the model has shattered previous performance ceilings.

According to the official evaluation from the Arena (lmarena.ai), Grok 3 is “the first model to break the 1400-point mark and ranks first in all categories.” That’s not just a win; it’s a first for the leaderboard.

Musk's 'Strongest on Earth' Grok 3 Makes Waves, Tops Arena with Over 1400 Points — figure 2

What stands out to me is the infrastructure behind this claim. Grok 3 is reportedly the first model trained on a cluster of 100,000 H100 GPUs, later expanded to 200,000. That’s a massive compute footprint that suggests xAI isn’t just tweaking weights—they’re throwing hardware at the problem.

In his teaser for the release, Musk praised Grok 3 highly, calling it “the smartest AI on Earth.” It’s bold language, and in this industry, bold claims usually invite immediate scrutiny from competitors and skeptics alike.

Musk's 'Strongest on Earth' Grok 3 Makes Waves, Tops Arena with Over 1400 Points — figure 3

Before the official release, AI researcher Andrej Karpathy got early access. After two hours of hands-on testing, he published a long post detailing his impressions. His take is worth noting: Karpathy believes Grok 3’s reasoning capabilities have reached the state of the art (SOTA), with reasoning performance comparable to o1-pro and slightly better than DeepSeek R1 and Gemini’s reasoning models.

If it really matches o1-pro, I think it’s ready for serious coding tasks, though I’d wait for independent benchmarks before trusting these claims. As a builder, I care more about reasoning improvements than raw chat speed.

Considering that the team reportedly started from scratch only about a year ago, the result is remarkable. The timeline suggests xAI moved fast to catch up with the leading edge of reasoning models.

Musk's 'Strongest on Earth' Grok 3 Makes Waves, Tops Arena with Over 1400 Points — figure 4

Moreover, on classic “hard problems” for large models, such as comparing decimal numbers, Grok 3 answered correctly once reasoning was enabled. This is a familiar pain point for developers who have watched models fail at basic arithmetic logic in complex prompts.

Musk's 'Strongest on Earth' Grok 3 Makes Waves, Tops Arena with Over 1400 Points — figure 5
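The decimal-comparison trap is easy to reproduce outside a model. The sketch below is illustrative only: it assumes the commonly cited pair 9.11 and 9.9 (the article does not say which numbers were used in the demo), and the two helper functions are hypothetical names. It contrasts a numeric comparison with a version-number reading, one plausible source of the wrong answer.

```python
from decimal import Decimal


def compare_as_decimals(a: str, b: str) -> str:
    """Return the larger of two decimal strings by numeric value."""
    return a if Decimal(a) > Decimal(b) else b


def compare_as_versions(a: str, b: str) -> str:
    """Return the 'larger' string when each dot-separated part is read as an integer.

    This mimics the version-number reading (9.11 comes after 9.9) that can
    mislead a model that treats the digits after the dot as a whole number.
    """
    def parts(s: str) -> tuple:
        return tuple(int(p) for p in s.split("."))

    return a if parts(a) > parts(b) else b


print(compare_as_decimals("9.11", "9.9"))  # 9.9
print(compare_as_versions("9.11", "9.9"))  # 9.11
```

The two readings disagree on the same input, which is why this kind of prompt makes a handy smoke test for reasoning modes.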

However, some have questioned how long Grok will hold the top spot. One user parodying Nvidia’s Jensen Huang commented that even if Grok 3 is truly the strongest, it will stay on top for a week at most. The shelf life of these benchmarks is notoriously short in this market.

Coupled with OpenAI teasing its next-generation GPT plans, another Twitter battle between Musk and Sam Altman is about to erupt. The drama is as much part of the product launch as the code itself.

Musk's 'Strongest on Earth' Grok 3 Makes Waves, Tops Arena with Over 1400 Points — figure 6

Altman also tweeted last night that testing GPT-4.5 gave him a stronger “feel the AGI” sensation than expected. He’s pushing back, claiming his own model is nearing a qualitative leap in intelligence.

Netizens in the comments section stirred things up, urging him to beat Musk to the punch and livestream the release of GPT-4.5 in the morning. The community is hungry for proof, not just promises.

Musk's 'Strongest on Earth' Grok 3 Makes Waves, Tops Arena with Over 1400 Points — figure 7

Back to business, let’s look at what was discussed during the livestream.

Scaling Up for Reasoning: The Grok 3 Launch

I’ve been tracking xAI’s trajectory closely, and this release is more than another model drop. It’s a statement on compute dominance. Topping the Arena with over 1400 points is the headline, but the real story is how xAI got there: a massive infrastructure build-out paired with aggressive reasoning capabilities.

The Human Element Behind the Compute

Four people took the stage for this livestream introduction. Besides Musk, the most prominent figures were two Chinese individuals seated in the center; they are founding members of xAI.

From left to right:

Jimmy Ba, a 2023 Sloan Research Fellow and assistant professor at the University of Toronto, where he completed both his undergraduate and doctoral studies, the latter under Geoffrey Hinton.
Yuhuai (Tony) Wu, a former postdoctoral researcher at Stanford University, who received his Ph.D. from the University of Toronto.

The person on the far left is Igor Babuschkin, an engineer at xAI.

Musk's 'Strongest on Earth' Grok 3 Makes Waves, Tops Arena with Over 1400 Points — figure 8

The four first introduced the training process of Grok 3.

Last year, Musk teased that Grok 3 was being trained on 100,000 H100 GPUs, which would make it the first model trained on a cluster of that scale.

At the time, netizens called this a “super factory” for neural networks.

Musk's 'Strongest on Earth' Grok 3 Makes Waves, Tops Arena with Over 1400 Points — figure 9

At today’s launch event, the team revealed that by day 92 of training the cluster had expanded to 200,000 GPUs.

Musk's 'Strongest on Earth' Grok 3 Makes Waves, Tops Arena with Over 1400 Points — figure 10

That scale is impressive, but I’m more interested in whether inference costs will ever match that training spend. xAI’s hiring of top academic talent suggests to me that they’re betting on research depth, not just brute force. For developers, the key question remains: can we access this power without waiting for a press conference?

Reasoning Capabilities and Benchmark Dominance

With that much compute behind it, xAI also followed the trend and gave Grok 3 chain-of-thought reasoning capabilities.

At a summit in Dubai earlier, Musk proudly declared:

Grok 3 has strong reasoning capabilities and is smarter than all currently known models.

Musk's 'Strongest on Earth' Grok 3 Makes Waves, Tops Arena with Over 1400 Points — figure 11

Grok 3 ships in two versions: Full and Mini. Both outperformed non-reasoning models such as GPT-4o and DeepSeek-V3 on mathematics, science, and coding benchmarks.

Musk's 'Strongest on Earth' Grok 3 Makes Waves, Tops Arena with Over 1400 Points — figure 12

Additionally, an early version of Grok 3, tested under the codename “Chocolate,” topped the LMSYS Chatbot Arena leaderboard (lmarena.ai), becoming the only model to score over 1400.

Musk's 'Strongest on Earth' Grok 3 Makes Waves, Tops Arena with Over 1400 Points — figure 13
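To get a feel for what a rating gap on an Elo-style leaderboard means, the standard Elo expected-score formula converts a difference in points into an expected head-to-head win rate. This is a rough sketch under stated assumptions: the Arena's actual scoring pipeline may differ in detail, and the opponent ratings below are hypothetical, not figures from the leaderboard.

```python
def expected_win_rate(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of A against B (400-point logistic scale)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# Hypothetical opponent ratings, chosen only to illustrate the scale.
for opponent in (1400, 1380, 1350, 1300):
    rate = expected_win_rate(1400, opponent)
    print(f"1400 vs {opponent}: expected win rate {rate:.1%}")
```

On this scale a 50-point lead corresponds to an expected win rate of roughly 57%, so crossing a round number like 1400 matters less than the size of the gap to the next model.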

Building on the base Grok 3 and Mini models, the xAI team also created two reasoning models.

The reasoning model based on Mini (Grok 3 mini Reasoning) is relatively mature, while the one based on the Full version (Grok 3 Reasoning Beta) is still in the Beta stage.

Live Demos: Physics and Game Logic

Before presenting the results, the four used Musk’s account to run two cases with Grok, related to physics and gaming respectively.

The first prompt: generate code to create a 3D animated chart depicting a launch from Earth to Mars, followed by a return to Earth during the next launch window.

Musk's 'Strongest on Earth' Grok 3 Makes Waves, Tops Arena with Over 1400 Points — figure 14

During the generation process, someone joked about when Grok could be installed on SpaceX rockets. Musk responded that it might take another two years.

Musk also stated that if everything goes smoothly, SpaceX plans to send the Optimus robot to Mars via Starship around November 2026, during the next Earth-Mars transfer window.

Returning to Grok, after considering Kepler’s laws and converting them into code, it ultimately generated code capable of producing such an animation:

Musk's 'Strongest on Earth' Grok 3 Makes Waves, Tops Arena with Over 1400 Points — figure 15.gif
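The physics behind such an animation can be sketched in a few lines. The example below is not Grok's output; it is a minimal illustration that assumes circular, coplanar orbits and a Hohmann transfer, using Kepler's third law in solar units (period in years squared equals semi-major axis in AU cubed). The orbital radius for Mars is an approximate mean value, and the function names are hypothetical.

```python
EARTH_ORBIT_AU = 1.0     # mean Earth-Sun distance
MARS_ORBIT_AU = 1.524    # approximate mean Mars-Sun distance
DAYS_PER_YEAR = 365.25


def orbital_period_years(semi_major_axis_au: float) -> float:
    """Kepler's third law in solar units: T^2 = a^3."""
    return semi_major_axis_au ** 1.5


def hohmann_transfer_days(r1_au: float, r2_au: float) -> float:
    """Travel time along half of the transfer ellipse between two circular orbits."""
    transfer_axis = (r1_au + r2_au) / 2
    return orbital_period_years(transfer_axis) / 2 * DAYS_PER_YEAR


def synodic_period_days(r1_au: float, r2_au: float) -> float:
    """Time between successive launch windows for the two planets."""
    t1 = orbital_period_years(r1_au)
    t2 = orbital_period_years(r2_au)
    return DAYS_PER_YEAR / abs(1 / t1 - 1 / t2)


print(f"One-way transfer: {hohmann_transfer_days(EARTH_ORBIT_AU, MARS_ORBIT_AU):.0f} days")
print(f"Window spacing:   {synodic_period_days(EARTH_ORBIT_AU, MARS_ORBIT_AU):.0f} days")
```

Under these assumptions the one-way trip takes roughly 259 days and launch windows recur about every 780 days, which is why a round trip has to wait for the next window.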

The second question activated Big Brain mode, allowing the model to use more computing resources for deeper thinking.

The prompt required using the pygame library to design a game that combines Tetris and Bejeweled.

The prompt also noted that the code might be long, should be saved in a single file, and should be “insanely great.”

Musk's 'Strongest on Earth' Grok 3 Makes Waves, Tops Arena with Over 1400 Points — figure 16

Grok 3 lived up to expectations, successfully combining these two games and introducing the features of the hybrid version:


Musk's 'Strongest on Earth' Grok 3 Makes Waves, Tops Arena with Over 1400 Points — figure 17

When run, it looks like this: the game retains Tetris’s clearing mechanic but adapts it to Bejeweled’s rule, so blocks clear when three of them match.

Musk's 'Strongest on Earth' Grok 3 Makes Waves, Tops Arena with Over 1400 Points — figure 18
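The hybrid rule is straightforward to express without any graphics. The sketch below is not the code Grok generated; it is a minimal illustration that assumes a match means three or more same-colored blocks in a horizontal or vertical line, and that surviving blocks fall straight down after a clear. The grid representation and function names are hypothetical, and pygame rendering is left out.

```python
from typing import List, Optional, Set, Tuple

Grid = List[List[Optional[str]]]  # None marks an empty cell


def find_matches(grid: Grid, run_length: int = 3) -> Set[Tuple[int, int]]:
    """Return every (row, col) that is part of a horizontal or vertical run of one color."""
    rows, cols = len(grid), len(grid[0])
    matched: Set[Tuple[int, int]] = set()
    for r in range(rows):
        for c in range(cols):
            color = grid[r][c]
            if color is None:
                continue
            for dr, dc in ((0, 1), (1, 0)):  # scan rightwards and downwards
                cells = [(r + i * dr, c + i * dc) for i in range(run_length)]
                if all(rr < rows and cc < cols and grid[rr][cc] == color
                       for rr, cc in cells):
                    matched.update(cells)
    return matched


def clear_and_drop(grid: Grid) -> int:
    """Clear matched cells, let remaining blocks fall, and return the number cleared."""
    matched = find_matches(grid)
    for r, c in matched:
        grid[r][c] = None
    rows, cols = len(grid), len(grid[0])
    for c in range(cols):
        survivors = [grid[r][c] for r in range(rows) if grid[r][c] is not None]
        column = [None] * (rows - len(survivors)) + survivors
        for r in range(rows):
            grid[r][c] = column[r]
    return len(matched)


board: Grid = [
    [None, "R", None],
    ["G", "B", "G"],
    ["R", "R", "R"],
]
cleared = clear_and_drop(board)
print(cleared)  # 3
print(board)    # [[None, None, None], [None, 'R', None], ['G', 'B', 'G']]
```

A full game would call `clear_and_drop` in a loop after each piece lands, since falling blocks can form new matches.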

Looking at the benchmark results, both versions achieved impressive scores in mathematics, science, and coding tasks.

Furthermore, when prompted to “think more” (the lighter shaded area above the bars), their performance surpassed DeepSeek-R1 and o3-mini (high).

Musk's 'Strongest on Earth' Grok 3 Makes Waves, Tops Arena with Over 1400 Points — figure 19

However, many models are currently showing signs of “overfitting” on benchmarks. So, how does Grok 3 perform in reality?

The R&D team challenged it with questions from this year’s AIME competition (AIME 2025). Grok 3 Reasoning Beta and Grok 3 mini Reasoning scored 93 and 90 respectively, outperforming the other reasoning models.

Musk's 'Strongest on Earth' Grok 3 Makes Waves, Tops Arena with Over 1400 Points — figure 20

In addition to the Grok 3 pre-trained model and the two reasoning models, xAI also released an AI Agent called DeepSearch.

Musk's 'Strongest on Earth' Grok 3 Makes Waves, Tops Arena with Over 1400 Points — figure 21

This feature can be seen as xAI’s counterpart to the Deep Research functions recently launched by OpenAI, Google, and others.

In short, DeepSearch scans the internet and X (formerly Twitter) to analyze information and provides summaries to answer questions.

Musk's 'Strongest on Earth' Grok 3 Makes Waves, Tops Arena with Over 1400 Points — figure 22

Regarding access, X Premium+ users can experience Grok 3 starting today.

On the standalone app, a SuperGrok subscription is required—$30/month or $300/year.

Benchmark scores are nice, but I need to know whether DeepSearch actually saves me time on real research tasks. The pricing also feels steep for casual users who just want quick answers without a subscription.

The Release Process Was Full of Twists; Voice Mode Delayed

I watched the rollout of Grok 3 unfold with a mix of excitement and skepticism. It wasn’t just another model drop; it was a chaotic, high-stakes event that felt more like a live performance than a standard software release.

Looking back at the entire process, it was indeed full of twists and turns. Last August, during an interview with popular podcaster Lex Fridman, Musk said that Grok 3 was expected to be released by the end of that year. However, the first test instance wasn’t published until January 19 this year, and the actual release has been delayed until now.

Musk's 'Strongest on Earth' Grok 3 Makes Waves, Tops Arena with Over 1400 Points — figure 23

Moreover, just over the weekend before the release, the xAI team was still urgently refining Grok 3.

Musk's 'Strongest on Earth' Grok 3 Makes Waves, Tops Arena with Over 1400 Points — figure 24

An xAI employee also shared their experience, noting that on

At 11:30 PM that night (3:30 PM Beijing time on Monday, less than 24 hours before the launch), Musk posted online stating he was still pulling an all-nighter to finish his work.

Musk's 'Strongest on Earth' Grok 3 Makes Waves, Tops Arena with Over 1400 Points — figure 25

Just an hour and a half before the launch event, Musk abruptly tweeted that the voice mode planned for the release was still unstable and would be postponed by another week.

Musk's 'Strongest on Earth' Grok 3 Makes Waves, Tops Arena with Over 1400 Points — figure 26

During the live Q&A session, a netizen asked about the voice mode’s specific release date. The team responded that an early version would go online soon, followed by gradual iterations. However, Shivon Zilis, an executive at Musk’s Neuralink, had already spent an hour with the voice mode, called Ara, and posted her impressions early that morning, Beijing time.

Zilis described it as one of the most surprising and meaningful moments of her life. She discussed topics such as biology and quantum entanglement with Ara, and even asked Ara to create quiz questions to test what she had learned. Zilis answered only half of the questions correctly, but Ara patiently explained the rest without dismissing any question as too foolish.

Musk's 'Strongest on Earth' Grok 3 Makes Waves, Tops Arena with Over 1400 Points — figure 27

Someone later asked in the comments if Ara was a voice version, to which Shivon confirmed that it was.

Musk's 'Strongest on Earth' Grok 3 Makes Waves, Tops Arena with Over 1400 Points — figure 28

As a builder, I find that late-stage feature delays erode trust in release timelines. I prefer stable text APIs to unstable voice prototypes, and in my experience rushed releases create more technical debt than the time they save.

Seeking $10 Billion in New Financing and Entering the Gaming Sector

The timing of xAI’s latest move feels calculated. Just last Friday, Bloomberg reported that xAI was seeking a new round of financing worth approximately $10 billion, valuing the company at around $75 billion (approximately 545.46 billion RMB).

Existing investors, including Sequoia Capital, Andreessen Horowitz, and Valor Equity Partners, are in talks to participate in this funding round. Since the deal has not yet been finalized, the new model’s release is likely to influence the round.

Musk's 'Strongest on Earth' Grok 3 Makes Waves, Tops Arena with Over 1400 Points — figure 29

If these reports are confirmed, xAI’s fundraising speed is indeed astonishing. At the end of last December, the company had just completed a $6 billion Series C round, valuing it at $51 billion. In less than two months, the valuation jumped by approximately 47%. Looking further back, from the B to C rounds, the valuation doubled within six months. It can be said that xAI, established less than two years ago, has grown into a formidable rival to OpenAI.
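The arithmetic behind these figures is easy to check. The short sketch below uses only numbers quoted in this article (the $51 billion and $75 billion valuations and the 545.46 billion RMB conversion); the variable and function names are illustrative.

```python
def percent_change(old: float, new: float) -> float:
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100


series_c_valuation = 51.0   # USD billions, December Series C (from the article)
reported_valuation = 75.0   # USD billions, reported new round (from the article)
reported_rmb = 545.46       # RMB billions, the article's conversion

jump = percent_change(series_c_valuation, reported_valuation)
implied_rate = reported_rmb / reported_valuation

print(f"Valuation jump: {jump:.1f}%")
print(f"Implied RMB per USD: {implied_rate:.4f}")
```

The jump works out to about 47%, matching the figure above, and the RMB conversion implies an exchange rate of about 7.27 yuan per dollar.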

With ample funding, xAI announced that alongside continued model development it is pursuing another strategic direction: an AI game studio, its bet on the gaming sector.

Musk's 'Strongest on Earth' Grok 3 Makes Waves, Tops Arena with Over 1400 Points — figure 30

Musk first hinted at this last November, complaining that “too many game studios are controlled by large corporations.”

Musk's 'Strongest on Earth' Grok 3 Makes Waves, Tops Arena with Over 1400 Points — figure 31

Now, Musk’s business empire is expanding once again.

Musk's 'Strongest on Earth' Grok 3 Makes Waves, Tops Arena with Over 1400 Points — figure 32

I think AI-generated games are a novelty, not a workflow tool, and I’m skeptical about their immediate utility for developers. To me as a builder, this feels like brand expansion rather than technical innovation. The gaming angle is interesting but distant from my daily coding.

One More Thing

A few days before the Grok 3 launch, another dramatic incident sparked heated discussion. An xAI engineer (now a former employee) publicly posted a comparison of Grok 3’s coding abilities against several competitors. Although he clearly labeled this as his personal opinion, ranking his own model, Grok 3, fourth (with the top three spots taken by OpenAI models) caused controversy.

Musk's 'Strongest on Earth' Grok 3 Makes Waves, Tops Arena with Over 1400 Points — figure 33

The employee later revealed that the company demanded he either delete the post or be fired, claiming the post exposed Grok 3’s existence. Upon hearing this, the engineer felt it was absurd, noting that everyone already knew about Grok 3, and even shared screenshots of Musk’s previous statements. Facing what he perceived as xAI’s petty behavior, the engineer decided to quit without hesitation, posting a lengthy explanation.

I will maintain my words and dignity, find another job, or start my own business. See you later.

Musk's 'Strongest on Earth' Grok 3 Makes Waves, Tops Arena with Over 1400 Points — figure 34

Regarding this incident, Musk later responded that it was “weird,” but no further actions were reported.

Musk's 'Strongest on Earth' Grok 3 Makes Waves, Tops Arena with Over 1400 Points — figure 35

More dramatically, due to a dispute over salary payments, the engineer later publicly posted again, tagging Musk:

Please do the right thing.

Musk's 'Strongest on Earth' Grok 3 Makes Waves, Tops Arena with Over 1400 Points — figure 36

However, despite the “breakup,” the engineer, who worked on Grok 3’s voice mode, set aside past grievances and helped promote Grok 3 several times. The voice feature whose delay Musk announced today was in fact the work of this engineer’s team. Even after leaving, he remains proud of his contribution to the project.

Musk's 'Strongest on Earth' Grok 3 Makes Waves, Tops Arena with Over 1400 Points — figure 37

That said, what do you think of this version of Grok 3? Once the next generation of GPT is released, can Musk maintain his lead?
