Three Large Models Team Up to Challenge o1: Real-World Test Shows 360+ Model Collaboration Eliminates Prompt Engineering

Models & Benchmarks · Published: Sep 20, 2024 · David Kowalski · ~10 min read

Author

David Kowalski · Developer Tools & Agents Editor

Coding agents and IDE workflows tested the way working teams use them.

I’ve spent the last few weeks wrestling with prompt engineering, trying to coax complex reasoning out of models that, by default, just predict the next token. The promise of OpenAI’s o1 is seductive: stop tweaking prompts and let the model “think.” As NVIDIA’s Jim Fan points out, the field is shifting from training-phase investment toward inference-time computation, a trend he describes as an inference scaling law.

Fan references Rich Sutton’s The Bitter Lesson, which argues that learning and search are the only two techniques that keep scaling as compute grows. Right now, the industry is pivoting hard toward the second lever: search at inference time. Spending more compute there lets a model work through a fuller reasoning trajectory, which can yield qualitative leaps rather than just incremental accuracy gains.

This isn’t just a Silicon Valley trend. In China, Zhou Hongyi, founder of 360, is applying this same “slow thinking” philosophy to his technical architecture. He advocates for multi-model collaboration, urging large models from different vendors to “huddle together for warmth.” This approach suggests a viable path for domestic models to catch up with OpenAI’s lead by leveraging collective intelligence rather than relying on a single monolithic brain.

Observing Large Model “Slow Thinking” Through o1

While OpenAI keeps the specific mechanics of o1’s thinking process under wraps, it is clear that Chain of Thought (CoT) is central. In its report, OpenAI notes that CoT allows models to recognize errors, break down complex steps, and try alternative methods, significantly boosting reasoning capabilities.

At ICLR this year, Denny Zhou of Google’s reasoning team and Tengyu Ma of Stanford (a Tsinghua alumnus) were among the authors of a paper arguing that Chain of Thought lets transformers solve inherently serial problems that they struggle with otherwise.

At its core, CoT mirrors what Daniel Kahneman described in Thinking, Fast and Slow as “System 2”—complex, conscious reasoning—contrasted with the intuitive “System 1.” o1’s performance validates that this human cognitive model applies to large language models. However, Zhou Hongyi warns against separating these systems; they should coexist and cooperate, much like in the human brain.

Zhou believes o1 likely follows a “Dual Process Theory,” where fast and slow systems collaborate. 360, which positions itself as an early advocate of “slow thinking” and multi-system collaboration, announced plans at the ISC.AI conference to build such a system. It has since implemented the idea through an agent framework that moves models from fast to slow thinking, and shipped it in two products: 360 AI Search and 360 AI Browser.

I think multi-model agent frameworks are the pragmatic bridge until single models master deep reasoning. I’m skeptical, though, that “slow thinking” scales efficiently for real-time developer workflows. As a builder, I see vendor collaboration as a smart hedge against being locked into one proprietary inference engine.

Orchestrating a Swarm of Models

I looked at how 360 AI Search structures its in-depth answers, and the process is heavy: a single in-depth answer may involve 7 to 15 calls to large models. The workflow starts with an intent classification model that identifies what the user wants. A task routing model then sorts the problem into one of three tiers: “simple tasks,” “multi-step tasks,” or “complex tasks.” Finally, the system constructs an AI workflow in which the models operate collaboratively.
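
To make the three stages concrete, here is a minimal sketch of classify, route, and build. Everything in it is an assumption for illustration: the keyword rules stand in for the intent and routing models, and the names `classify_intent`, `route`, `build_workflow`, and the step lists are hypothetical, not 360’s implementation.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Tier(Enum):
    SIMPLE = "simple task"
    MULTI_STEP = "multi-step task"
    COMPLEX = "complex task"


@dataclass
class Plan:
    intent: str
    tier: Tier
    steps: List[str]


def classify_intent(query: str) -> str:
    """Stand-in for the intent classification model (keyword rules, assumed)."""
    q = query.lower()
    if "translate" in q:
        return "translation"
    if "compare" in q or " vs " in q:
        return "comparison"
    return "lookup"


def route(intent: str) -> Tier:
    """Stand-in for the task routing model: map an intent to one of three tiers."""
    if intent == "comparison":
        return Tier.COMPLEX
    if intent == "translation":
        return Tier.MULTI_STEP
    return Tier.SIMPLE


# Hypothetical step lists; a real system would build these dynamically.
STEPS = {
    Tier.SIMPLE: ["answer"],
    Tier.MULTI_STEP: ["draft", "reflect", "finalize"],
    Tier.COMPLEX: ["decompose", "search", "draft", "reflect", "summarize"],
}


def build_workflow(query: str) -> Plan:
    """Classify the query, pick a tier, and attach that tier's step list."""
    intent = classify_intent(query)
    tier = route(intent)
    return Plan(intent=intent, tier=tier, steps=list(STEPS[tier]))
```

The point of the split is that the cheap classification step decides how many expensive model calls the later steps are allowed to make.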

For example, translating classical Chinese poetry into English triggers several models, such as translation and reflection agents, that divide the work between them. The latest version strengthens this by making multi-model collaboration an independent response mode. Three different models play distinct roles: the Expert generates the initial answer, the Reflector checks the response, and the Summarizer provides the final answer.

In one instance, Kimi, acting as the Expert, identified the key points but lacked clarity. Guided by the critique from the reflection model, 360 ZhiNao, Doubao then re-summarized the content into a direct and precise solution. This mode combines fast and slow thinking with cross-validation among different models to improve performance.
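
The Expert, Reflector, and Summarizer hand-off can be sketched as three chained calls. This is a sketch under stated assumptions: each model is represented as a plain callable from prompt to text, and the prompt wording and the `collaborate` function are hypothetical, since the article does not describe how 360 wires the roles together.

```python
from typing import Callable, Dict

# Assumption: each model is just a function from a prompt string to a reply string.
Model = Callable[[str], str]


def collaborate(question: str, expert: Model, reflector: Model,
                summarizer: Model) -> Dict[str, str]:
    """Run one Expert -> Reflector -> Summarizer pass and keep every stage."""
    # Stage 1: the Expert produces the initial answer.
    draft = expert(question)

    # Stage 2: the Reflector sees the question and the draft, and critiques it.
    critique = reflector(
        f"Question: {question}\nDraft answer: {draft}\n"
        "List any errors, gaps, or unclear points."
    )

    # Stage 3: the Summarizer sees everything and writes the final answer.
    final = summarizer(
        f"Question: {question}\nDraft answer: {draft}\n"
        f"Critique: {critique}\nWrite the final answer."
    )
    return {"draft": draft, "critique": critique, "final": final}
```

Returning all three stages, not just the final text, makes it easier to see which role introduced or fixed a problem when debugging a chain.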

Personally, I see routing logic as the unsung hero of reliable agent chains. Cross-validation adds latency, but I think the reduction in hallucination risk is worth it.

The 54-Model Browser Ecosystem

Another product, the 360 AI Browser, brings together 54 large models from 16 vendors. This aggregation enables capabilities that traditional browsers cannot offer. For instance, it can summarize an English academic paper running to tens of thousands of words within 10 seconds, and then answer detailed follow-up questions about specific points.

It also offers immersive translation of PDF documents, with the original text and the translation scrolling in sync for easy comparison. It acts as an “AI Efficiency Expert” as well, summarizing online videos in minutes, drawing mind maps from a video’s structure, and analyzing creative styles. These analysis functions apply to local files as well as online content.

More conveniently, the 360 AI Browser has a mobile version, allowing users to leverage AI-assisted browsing on their phones anytime.

As a builder, I know that aggregating 54 models in one UI is a massive integration effort. Personally, I find sync-scroll translation a practical feature for technical readers.

Unified Dispatch via CoE Architecture

The AI Assistant (bot.360.com), which is built into the 360 AI Browser and is also based on the CoE architecture, automatically dispatches each request to the most suitable large model according to task type and model strengths. Users can converse directly with any of the 54 large models, or with more powerful hybrid models, without switching platforms. The AI Assistant also supports “multi-model collaboration.”
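
Dispatching by task type and model strengths can be pictured as a lookup over a strength table. The sketch below is illustrative only: the model names, the task types, and the scores in `STRENGTHS` are invented placeholders, and the article does not say how 360 actually measures or stores model strengths.

```python
from typing import Dict, Optional

# Hypothetical strength table: model name -> task type -> score in [0, 1].
STRENGTHS: Dict[str, Dict[str, float]] = {
    "model_a": {"translation": 0.9, "coding": 0.6, "writing": 0.7},
    "model_b": {"translation": 0.7, "coding": 0.9, "writing": 0.6},
    "model_c": {"translation": 0.6, "coding": 0.5, "writing": 0.9},
}


def dispatch(task_type: str,
             strengths: Optional[Dict[str, Dict[str, float]]] = None) -> str:
    """Return the name of the model with the highest score for this task type."""
    table = STRENGTHS if strengths is None else strengths
    # Only consider models that have a recorded score for the task type.
    candidates = {
        name: scores[task_type]
        for name, scores in table.items()
        if task_type in scores
    }
    if not candidates:
        raise ValueError(f"no model registered for task type {task_type!r}")
    return max(candidates, key=lambda name: candidates[name])
```

With this placeholder table, a coding request would go to `model_b` and a writing request to `model_c`; the user never has to pick a model by hand.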

The Strategy Behind the Swarm: Why 360 is Betting on Collaboration Over Single-Model Dominance

I think this approach shifts the burden from prompt engineering to model orchestration, which simplifies my daily workflow significantly.

The core philosophy driving both 360 AI Search and the 360 AI Browser is clear: rather than relying on a single “smartest” model, they are leveraging collective strength. By engaging in what can be described as “slow thinking,” 360 allows models to collaborate—effectively huddling together—to produce results that no single entity could achieve alone. This creates a dynamic where many hands make light work of complex tasks.

This collaborative architecture isn’t just about better user experiences; it serves as a critical incentive for large model developers. Given the enormous R&D investments required to build these models, sufficient user adoption is essential to recoup costs. By integrating dozens of vendors into its ecosystem, 360 has opened access to its massive base of 1 billion users. This exposure helps justify the heavy lifting done by developers and fosters a virtuous cycle where AI applications and underlying models promote each other’s development.

I appreciate that this model gives smaller players a fair shot at visibility without requiring them to beat GPT-4 on day one.

The breadth of participation underscores the appeal of this strategy. Major tech giants like Alibaba, Tencent, and Baidu have joined the 360 AI architecture, alongside the “Little Six Tigers” of specialized large model startups. This coalition of more than a dozen vendors demonstrates that the platform offers tangible value beyond just API access—it provides a distribution channel that is hard to ignore in today’s competitive landscape.

For those of us building agent workflows, this suggests that routing tasks across specialized models is becoming a viable production pattern.

To facilitate this ecosystem, 360 has launched the “Model Arena” within its AI Assistant (bot.360.com). This platform supports head-to-head competitions among 54 large model products, featuring modes such as team battles, anonymous showdowns, and random matches. It provides domestic large models with a venue to learn through competition and receive direct user feedback, fostering a more proactive and enterprising atmosphere within the industry.
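
An anonymous showdown with user feedback can be sketched as a blind pairing plus a vote tally. This is an assumption-heavy illustration: the article does not describe how the Model Arena scores results, so the simple win-rate tally, the `anonymize`, `record_vote`, and `win_rates` helpers, and the placeholder model names are all hypothetical.

```python
import random
from collections import Counter
from typing import Dict


def anonymize(model_x: str, model_y: str, rng: random.Random) -> Dict[str, str]:
    """Hide model identities behind the labels "A" and "B" in random order."""
    pair = [model_x, model_y]
    rng.shuffle(pair)
    return {"A": pair[0], "B": pair[1]}


def record_vote(wins: Counter, games: Counter,
                labels: Dict[str, str], winner_label: str) -> None:
    """Count one match for both models and one win for the chosen label."""
    for name in labels.values():
        games[name] += 1
    wins[labels[winner_label]] += 1


def win_rates(wins: Counter, games: Counter) -> Dict[str, float]:
    """Share of matches each model has won so far."""
    return {name: wins[name] / games[name] for name in games}


# Usage: one blind pairing; the user only ever sees "A" and "B".
rng = random.Random(0)
labels = anonymize("model_x", "model_y", rng)
wins, games = Counter(), Counter()
record_vote(wins, games, labels, "A")
```

Keeping the label mapping on the server side is what makes the vote blind: the user judges the answer, not the brand.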

Looking ahead, the scope of this collaboration is set to expand. The 360 team has indicated plans to release versions where five or more models collaborate simultaneously to complete tasks. This evolution suggests a move away from simple pairwise comparisons toward complex, multi-agent orchestration as the standard for handling nuanced real-world problems.

The “Elimination” of Prompt Engineering

As a developer, I’ve spent years tweaking prompts until the output finally aligned with my intent. It’s tedious, repetitive work that feels like speaking a foreign language just to get basic tasks done. That friction is exactly what 360’s new CoE (Collaboration-of-Experts) architecture claims to solve by automating the “how” so we can focus on the “what.”

Technically, this bridge between concept and product relies on aggregating a wide array of large language models and expert models. It achieves an organic integration of “fast thinking” and “slow thinking” through chain-of-thought reasoning and what they call “multi-system collaboration.”

The approach mirrors o1’s path but is more open. While o1 ultimately leans on OpenAI’s proprietary models, CoE is designed to pull from a broader ecosystem of LLMs and specialized expert models.

△ Schematic diagram of the CoE architecture

Crucially, this system incorporates many expert models with parameter counts in the billions or fewer. Because a small expert can handle the requests it is suited for, the system can deliver high-quality responses while saving inference resources and improving response speed.
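
One way small experts could save inference resources is a cascade: try the small model first and escalate only when it is unsure. This is a generic pattern, not something the article attributes to CoE; the confidence score, the `threshold` parameter, and the `cascade` function are assumptions made for illustration.

```python
from typing import Callable, Tuple

# Assumption: the small expert returns its answer plus a confidence in [0, 1].
SmallModel = Callable[[str], Tuple[str, float]]
LargeModel = Callable[[str], str]


def cascade(query: str, small: SmallModel, large: LargeModel,
            threshold: float = 0.8) -> Tuple[str, str]:
    """Answer with the small expert when confident, otherwise escalate.

    Returns (answer, which_model) so callers can track how often the
    expensive model was actually needed.
    """
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0 and 1")
    answer, confidence = small(query)
    if confidence >= threshold:
        return answer, "small"
    # The small expert was unsure, so pay for the large model.
    return large(query), "large"
```

The trade-off sits in `threshold`: a higher value sends more traffic to the large model and costs more, while a lower value is cheaper but trusts the small expert more often.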

Shortly after the CoE architecture was released, 360’s hybrid large model, which draws on the strengths of multiple models, was reported to surpass GPT-4o, then widely considered the strongest model.

In tests across 12 metrics, including translation and writing, the hybrid model achieved a composite score of 80.49 against GPT-4o’s 69.22, a margin of 11.27 points. It beat GPT-4o in every category except coding.

The CoE architecture embraces all models, going further than OpenAI on the path to open collaboration.

OpenAI’s o1 and 360’s CoE both point toward the same trend in large language model development: complex manual processes will be automated. For large models specifically, this means the “elimination” of prompt engineering.

At first glance, this may seem counterintuitive, because prompt quality has a decisive impact on what a large model generates. On closer reflection, though, there is no contradiction: AI applications such as large language models ultimately exist to serve humans.

Prompt engineering, conversely, requires humans to adapt to the way models work—a reversal of priorities. Therefore, while prompt engineering is undoubtedly important, it should not become an obstacle for ordinary users interacting with large models.

The solution lies in treating prompt design as just another task within a chain-of-thought process, delegating it to the large model itself. In this mode, the essence of prompt engineering remains intact but gradually fades from the user’s perspective, creating a sense of “disappearance.”
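
Delegating prompt design to the model can be sketched as two chained calls: one that rewrites the user’s plain request into a detailed prompt, and one that answers it. The sketch assumes a model is any callable from text to text; the rewrite instruction and the `answer_with_auto_prompt` function are hypothetical and do not reflect how o1 or CoE handle this internally.

```python
from typing import Callable, Dict

Model = Callable[[str], str]

# Hypothetical instruction for the rewriting step.
REWRITE_INSTRUCTION = (
    "Rewrite the following request as a detailed prompt. "
    "State the goal, the audience, the constraints, and the output format.\n\n"
    "Request: "
)


def answer_with_auto_prompt(user_request: str, model: Model) -> Dict[str, str]:
    """Let the model write its own prompt, then answer that prompt."""
    # Call 1: the model turns the plain request into an engineered prompt.
    engineered = model(REWRITE_INSTRUCTION + user_request)
    # Call 2: the model answers the prompt it just wrote.
    answer = model(engineered)
    return {"engineered_prompt": engineered, "answer": answer}
```

The user only types the plain request. The engineered prompt still exists, but it lives in an intermediate step the user never has to see or edit.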

This approach also reflects 360’s vision for the future of AI: inclusive access for more people, so that large models are no longer confined to elite circles (“high temples”) but become as ubiquitous and essential as household lights.

Personally, automating prompt design could save hours of debugging per week. I’d want to see how this handles edge cases without human intervention. I think resource efficiency matters if we’re running these models locally.