GPT-4V Only Reaches Level-2? Global First Multimodal Generalist Ranking Released, General-Level Sets New Paradigm for Evaluating Multimodal AI

Media & Embodied AI · Published: May 16, 2025 · Amara Okonkwo · ~16 min read

Author

Amara Okonkwo · Robotics & Embodied AI Editor

Humanoids, industrial robots, and what is demo vs. deployed.

I think benchmarks are marketing tools until they run on dirty hardware. I’ve seen “comprehensive” leaderboards hide critical failure modes in edge cases. In the field, 700 tasks mean nothing if the robot can’t navigate a cluttered kitchen.

The Gap Between Demo Scores and Deployed Reality

I watched the industry pivot from single-modality models to systems that juggle images, text, audio, and video simultaneously. It’s impressive on paper, but my skepticism kicks in when we talk about evaluation. We’ve moved past the “first half” of AI competitions into a phase where designing scientific evaluation mechanisms has become the core key to determining victory, as OpenAI researcher Shunyu Yao recently noted.

The problem with traditional metrics is simple: aggregating scores across multiple tasks creates a false sense of competence. Just because a model nails specific benchmarks doesn’t mean it’s approaching human-level intelligence across all domains. We need frameworks that actually reflect real-world robustness, not just test-set memorization.

GPT-4V Only Reaches Level-2? Global First Multimodal Generalist Ranking Released, General-Level S… — figure 2

This is where the paper On Path to Multimodal Generalist: General-Level and General-Bench, accepted for a Spotlight presentation at ICML ’25, enters the fray. The General-Level team proposes General-Level, a new evaluation framework, alongside General-Bench. They aren’t just tweaking existing metrics; they are trying to solve the foundational issue of how we measure multimodal generalization.

The project team has already deployed this infrastructure within the community. They constructed an ultra-large-scale benchmark and what they call the industry’s most comprehensive multimodal generalist model Leaderboard. It covers over 700 tasks, five common modalities, 29 domains, and more than 320,000 test data points.

What I watch for is scale is impressive, but does it catch the robot getting stuck on a specific texture? I want to see how these scores hold up when the lighting changes or the object moves.

General-Level Evaluation Algorithm: Five-Tier Rank System and Synergy Effects

The General-Level evaluation framework introduces a five-tier rank system, measuring the generalist capabilities of multimodal models through a “rank promotion” mechanism similar to gaming tiers.

I think gaming metaphors don’t fix broken safety protocols in deployment. I’m skeptical that synergy scores predict real-world reliability.

The core of General-Level assessment lies in Synergy, which refers to the model’s ability to transfer knowledge learned from one modality or task to enhance performance in another. In simple terms, this is the effect where 1+1 > 2.

Model ranks range from low to high as follows: Level-1 Specialist Expert, Level-2 Generalist Rookie (No Synergy), Level-3 Task-Level Synergy, Level-4 Paradigm-Level Synergy, and Level-5 Full Cross-Modal Total Synergy. A higher rank indicates stronger “general intelligence” demonstrated by the model and a deeper level of synergy achieved.

General-Level determines a model’s rank by examining synergistic effects at different levels:

GPT-4V Only Reaches Level-2? Global First Multimodal Generalist Ranking Released, General-Level S… — figure 3

Level-1 Specialist (Specialist)
- This tier includes current specialized models for individual tasks, typically SOTA (State-of-the-Art) models fine-tuned to the extreme on specific datasets or tasks.
Level-2 Entry-Level Generalist (Generalist, No Synergy)
- Reaching Level-2 means the model begins to possess “one specialty with multiple capabilities,” supporting various modalities and tasks, but has not yet demonstrated synergistic gain effects.

In the field, a generalist that can’t transfer knowledge is just a slower specialist.

Level-3 Task-Level Synergy
- Promotion to Level-3 requires the model to exhibit task-level synergistic improvements. This means that through multi-task joint learning, the model’s performance on certain tasks surpasses the SOTA of specialized models for those specific tasks.
Level-4 Paradigm-Level Synergy
- To advance to Level-4, a model must demonstrate cross-paradigm synergy, creating synergistic effects between the two major task paradigms: “understanding” and “generation.” This rank signifies that the model has begun to possess “integrated generation-understanding” reasoning capabilities, enabling knowledge transfer across differences in task formats.
Level-5 Full Cross-Modal Total Synergy (Cross-modal Total Synergy)
- This is the highest rank in the General-Level evaluation, marking that a model has achieved comprehensive synergy across modalities and tasks. It represents the ideal state of Artificial General Intelligence (AGI).

However, as of now, no model has reached Level-5.

Level-5 represents the ultimate goal on the path to AGI. Once a model enters this rank, it may signal that generalist AI has taken a critical step toward “Artificial General Intelligence.”

GPT-4V Only Reaches Level-2? Global First Multimodal Generalist Ranking Released, General-Level S… — figure 4

Overall, through this five-tier rank system, General-Level elevates the evaluation perspective from merely stacking task scores to examining the model’s internal knowledge transfer and integration capabilities.

This ranking system ensures objective quantification while providing the industry with a roadmap for progression from specialists to generalists, and finally to “all-rounders.”

The General-Bench Exam: A Stress Test for Multimodal Generalists

General-Bench positions itself as the definitive stress test for today’s multimodal models, claiming to be the largest in scale and most comprehensive in scope currently available. It isn’t just a simple quiz; it’s a panoramic evaluation system designed to measure breadth, depth, and complexity simultaneously.

What I watch for is benchmarks like this often ignore the cost of inference required to run them. I think free-form responses are great until you need deterministic outputs for safety.

In terms of breadth, General-Bench covers five core modalities—images, video, audio, 3D, and language. This achieves full-chain modality coverage from perception to understanding, and finally to generation. In the dimension of depth, it includes numerous traditional understanding tasks (such as classification, detection, question answering) alongside rich generation tasks (image, video, audio, description).

Notably, all tasks support free-form responses. They are not limited to multiple-choice or true/false questions. Instead, they are objectively evaluated based on the open metrics native to each task, filling a long-standing gap in industry evaluation blind spots.

GPT-4V Only Reaches Level-2? Global First Multimodal Generalist Ranking Released, General-Level S… — figure 5

In terms of data scale, General-Bench aggregates over 700 tasks and more than 325,000 samples, subdivided into 145 specific skills. It comprehensively covers core capability points across visual, auditory, and linguistic modalities. Behind these skills, General-Bench spans 29 interdisciplinary knowledge domains, encompassing natural sciences, engineering, healthcare, social sciences, and humanities. From image recognition to cross-modal reasoning, from speech recognition to music generation, and from 3D models to video understanding and generation, it covers everything.

In the field, covering 29 domains is impressive; executing them reliably in a factory is another story. I worry about the latency of running free-form generation across five modalities.

Furthermore, General-Bench pays special attention to model performance in higher-order capabilities such as content identification, commonsense reasoning, causal judgment, sentiment analysis, creativity, and innovation. This provides a multi-dimensional and three-dimensional evaluation space for generalist AI models. In short, General-Bench is an unprecedentedly challenging comprehensive multimodal exam paper, comprehensively testing the breadth, depth, and integrated reasoning capabilities of AI models across modalities, task paradigms, and knowledge domains.

Currently, the total number of task samples in General-Bench has reached 325,876 and will continue to grow dynamically. This openness and sustainable updating ensure that General-Bench possesses long-term vitality, capable of continuously supporting the R&D and evolution of multimodal generalist AI.

With the General-Level evaluation standards and dataset in place, a transparent leaderboard is needed to present model evaluation results and rankings. This is precisely the function of the project’s Leaderboard system.

To balance comprehensive evaluation with participation barriers, the Leaderboard designs a multi-level Scope decoupling mechanism (Scope-A/B/C/D).

Different Scopes act as sub-leaderboards with varying ranges and difficulties, allowing models with different capabilities to showcase their strengths. Covering everything from an “all-around champion competition” to “single-skill competitions,” this ensures that top-tier generalist models have a stage to compete for the all-around crown, while ordinary models can choose appropriate scopes for comparison, lowering the barrier for community participation.

GPT-4V Only Reaches Level-2? Global First Multimodal Generalist Ranking Released, General-Level S… — figure 6

What I watch for is sub-leaderboards let vendors cherry-pick metrics that hide their weaknesses. I think a model can be a “generalist” in a vacuum but useless on a factory floor.

Scope-A: Full-Spectrum Hero Board

This is the most difficult and widest-reaching main leaderboard: Participating models must undergo the full test of the General-Bench collection, meaning a complete evaluation covering all supported modalities and all domain tasks.

Scope-A aims to select truly all-around multimodal foundation models, testing their comprehensive strength in fully complex scenarios.

Scope-B: Unified Modality Hero Board

Scope-B includes several sub-leaderboards, each targeting a specific modality or a limited combination of modalities.

Specifically, Scope-B divides into seven parallel boards: four are single-modality boards (e.g., pure vision, pure audio, pure video, pure 3D), and the other three are modality-combination boards (e.g., image + text, video + text, etc.).

Participating models only need to complete multi-task evaluations within their selected modality range, without involving data from other modalities.

In the field, isolating modalities ignores how robots actually perceive the world in real time. What I watch for is pure vision scores don’t account for sensor noise or lighting changes in deployment.

Scope-C: Understanding/Generation Hero Board

Scope-C further subdivides evaluation into two major paradigms: understanding tasks and generation tasks, with separate boards for each modality. Specifically, within image, video, audio, and text modalities, there are separate “Understanding Ability Boards” and “Generation Ability Boards,” totaling eight boards.

Scope-C evaluation emphasizes cross-paradigm transfer capabilities within the same modality: For example, if a model performs excellently on the visual understanding board, it indicates shared knowledge capabilities across various understanding tasks like visual classification and detection; high scores on the visual generation board imply generalist capabilities across various generation tasks (description, drawing).

By limiting the scope of task paradigms, Scope-C has lower resource requirements (three-star difficulty), making it suitable for lightweight models or teams with limited resources.

Scope-D: Skill Specialty Board

This is the finest-grained and lowest-barrier category of boards. Scope-D clusters tasks from General-Bench by specific skills or task types, creating a separate board for each small category.

For example: “Visual Question Answering (VQA) Board,” “Image Caption Generation Board,” “Speech Recognition Board,” “3D Object Detection Board,” etc., with each board covering a group of closely related tasks.

Participating models can submit results targeting only one type of skill, allowing them to compete with other models in their narrowest area of expertise.

This skill-board mechanism encourages gradual model development: first achieving excellence in single-point skills, then progressively challenging broader multi-task and multi-modal evaluations.

The Leaderboard link is available at the end of this article.

How to Get Your Model Evaluated: Process and Fairness

I’ve watched enough demo reels where models claim “general intelligence” only to fail when asked to parse a messy PDF. The General-Level project attempts to curb that hype with a rigid submission pipeline, but the real question is whether these controlled environments reflect what happens in production.

Whether you’re an academic lab or an industrial team, getting your multimodal model onto this leaderboard requires following a strict protocol designed to prevent gaming the system. Here is how the process actually works.

1. Select Scope and Download Data

First, you must identify the specific Leaderboard ID that matches your model’s capabilities. Once selected, download the closed-set test data from the official source. Crucially, this dataset contains only input data; standard answers are withheld for formal evaluation. The project does provide an open development set for debugging, allowing you to verify output formats locally before committing to a submission.

I think debugging on open sets rarely predicts failure modes on closed, adversarial benchmarks.

2. Run Local Inference

With the test data in hand, run inference using your local model to generate outputs. Be aware that leaderboards often contain multiple task types; your result files must strictly adhere to the official directory structure and formatting guidelines. I always check the submission documentation twice before uploading—getting the zip format wrong is a common point of failure. Name your file [Model Name]-[Leaderboard ID].zip and prepare for upload.

3. Submit Results and Metadata

Upload the ZIP file to the Leaderboard portal. You must also provide necessary model information, including name, parameter scale, and a description, along with a contact email for backend processing. If you want visibility, your team can publish a technical report alongside the submission to explain your model’s highlights to the community.

4. Evaluation and Ranking

The system scores outputs in the backend, calculating metrics per task and aggregating them into a General-Level rank score. Because both the closed-set answers and scoring scripts are confidential, submitters cannot peek at unpublished data, ensuring evaluation integrity. Once complete, the Leaderboard updates in real time, displaying model name, modality category, per-modality scores, total score, rank level, and submission date. You can sort by tier to see which models have reached collaborative levels like Level-3 or Level-4.

In the field, real-time leaderboards often hide the latency costs required to achieve those high scores.

Fairness Mechanisms: Preventing Cheating

To maintain fairness and authority, the project enforces several strict rules:

Closed Testing: All datasets are closed sets. Models must not use this test data for training or fine-tuning. This is enforced via agreements and data monitoring. Since developers cannot know correct answers beforehand, the credibility of the scores remains intact.
Submission Frequency Limits: Users may submit a maximum of two times within 24 hours and four times within seven days. New submissions are blocked while previous evaluations are pending. These limits prevent reverse-engineering standard answers or overfitting to the closed set through trial-and-error speculation.
Unified Evaluation Environment: All submissions are scored in an environment standardized by the organizers. Regardless of your framework or inference acceleration, final scores use the same metric system and convert into tier scores based on the General-L

The Reality of the Benchmark

I read through the release notes for the new General-Level Leaderboard, and what stood out to me was the promise of a standardized metric. They claim this provides an open arena where algorithms can be tested against industry methods with closed evaluation to ensure credibility. It sounds like they’ve built a ruler that everyone agrees on, which is rare in this space.

What I watch for is standardized metrics are useful until the test set leaks into training data. I think closed evaluations hide the messy edge cases robots face daily. In the field, a leaderboard score doesn’t mean the robot can lift a box without dropping it.

GPT-4V Only Reaches Level-2? Global First Multimodal Generalist Ranking Released, General-Level S… — figure 7

The Reality on the Ground vs. The Leaderboard Hype

I read through the new General-Level rankings, and what stands out isn’t just the scores—it’s the gap between a model that looks good in a paper and one that actually works when you need it to. We’ve seen enough demo videos where models claim generalist prowess only to fail on basic unit economics or safety constraints.

To date, the leaderboard has recorded results from over 100 multimodal models, revealing their hierarchy in generalist capabilities according to General-Level standards.

In the initial batch of closed-set evaluation leaderboards released, there were significant differences in overall performance among models, even overturning common perceptions regarding the capability rankings of mainstream multimodal large language models.

Looking across the leaderboard, tier distributions are beginning to take shape.

Level-2 (No Collaboration)

GPT-4V Only Reaches Level-2? Global First Multimodal Generalist Ranking Released, General-Level S… — figure 8

The most prevalent tier in the leaderboard is Level-2, which includes heavyweight closed-source models such as GPT-4V. Many other commonly used open-source models are also listed here.

These models excel in their wide range of supported tasks, covering almost all evaluation tasks; however, they rarely surpass single-task State-of-the-Art (SOTA) performance on any specific task. Therefore, General-Level rates them as Level-2 generalists, representing a “passing grade across all subjects” level.

Notably, although models like GPT-4V are top-tier commercial products, they did not receive standout scores because they were not specifically optimized for the evaluation tasks and thus failed to demonstrate collaborative gains.

Conversely, some open-source models have entered Level-2 through multi-task training, achieving comprehensive performance across areas, such as SEED-LLaMA and Unified-IO. Models in this tier primarily excel in image modalities, with average single-modality scores generally ranging between 10 and 20 points, indicating significant room for improvement.

The current top three models at Level-2 are Unified-io-2-XXL, AnyGPT, and NExT-GPT-V1.5.

What I watch for is gPT-4V’s presence here confirms that broad capability doesn’t equal specialized utility in real-world deployments.

Level-3 (Task Collaboration)

GPT-4V Only Reaches Level-2? Global First Multimodal Generalist Ranking Released, General-Level S… — figure 9

The number of multimodal large models gathered at this level is significantly lower than at Level-2. These models have defeated specialized models in several tasks, demonstrating performance leaps brought about by collaborative learning.

Many new models released after 2024 have advanced to this tier, including open-source models such as Sa2VA-26B, LLaVA-One-Vision-72B, and the Qwen2-VL-72B series. These models typically possess hundreds of billions of parameters and undergo massive multimodal, multi-task training, allowing them to surpass traditional single-task SOTA results on certain benchmarks.

This proves the value of collaborative effects: unified multi-task training enables models to learn more general representations, mutually enhancing performance across related tasks.

Conversely, some closed-source large models, such as OpenAI’s GPT-4o, GPT-4V, and Anthropic’s Claude 3.5, do not rank highly at Level-3.

The overall average score range for Level-3 models is lower than that of Level-2, indicating a more difficult scoring environment at this level.

I think high parameter counts are impressive on paper, but they don’t guarantee the robustness required for edge deployment.

Level-4 (Paradigm Collaboration)

GPT-4V Only Reaches Level-2? Global First Multimodal Generalist Ranking Released, General-Level S… — figure 10

Models reaching this tier remain rare.

According to the Leaderboard (as of the evaluation date, December 2024), only a very small number of models are rated as Level-4, such as large-scale prototype open-source models like Mini-Gemini, Vitron-V1, and Emu2-37B.

These models have made breakthroughs in cross-paradigm reasoning, possessing excellent understanding and generation capabilities while integrating the two seamlessly.

For example, the Mini-Gemini model leads in both image understanding and generation, ranking prominently on the Leaderboard’s paradigm collaboration scores.

The emergence of Level-4 signifies progress toward true cross-modal reasoning AI. However, the average score for current Level-4 models is very low. This reveals the immense challenge of building AI with comprehensive paradigm synergy: achieving dual breakthroughs in both multimodal understanding and generation while maintaining balance is extremely difficult.

In the field, prototype status means these systems are likely too unstable or costly to run outside a controlled lab environment.

Level-5 (Full Modality Total Collaboration)

This tier remains empty to date, with no model having achieved it.

I read the filing and saw that true AGI is still a long way off; we are nowhere near full-modal collaboration yet.

This is not surprising, as surpassing experts across all modalities and tasks while simultaneously enhancing language intelligence currently exceeds the capabilities of existing technology.

The General-Level team speculates that the next milestone may come from a “multimodal version” of GPT-5, which could potentially demonstrate the first signs of full-modal collaboration, thereby changing the situation where Level-5 remains unclaimed.

I follow the release and see no evidence that GPT-5 will solve physics or safety constraints overnight.

However, until that day arrives, the Level-5 position on the Leaderboard will remain vacant, reminding us that we are still quite far from achieving true AGI.

The launch of the current Leaderboard has sparked enthusiastic responses within the AI research community. Many researchers believe that such a unified, multi-dimensional evaluation platform is urgently needed in the multimodal field: it is not only unprecedented in scale (covering 700+ tasks) and comprehensive in structure (with tiers and sub-categories), but also open and transparent, providing the industry with a common reference for progress.

On social media and forums, discussions have arisen regarding the leaderboard results: some are surprised that open-source models like Qwen2.5-VL-72B can defeat many closed-source giants, proving the potential of the open-source community; others analyze GPT-4V’s shortcomings in complex audio-visual tasks to explore how to address them.

I benchmarked similar setups and know that beating a closed model on paper doesn’t mean it works in a warehouse.

Leaderboard data is also being used to guide research directions: it is clear which tasks are weak points for most models and which modality combinations have not yet been well resolved.

It can be anticipated that as more models join the leaderboard, it will continue to update. This is not merely a competition but an ongoing accumulation of valuable scientific insights.

The launch of the General-Level evaluation framework and its Leaderboard marks a new stage in multimodal generalist AI research. As hoped by the authors in their paper, the assessment system constructed by this project will serve as fundamental infrastructure, helping the industry measure the progress of general artificial intelligence more scientifically.

Through unified standard tier evaluations, researchers can objectively compare the strengths and weaknesses of different models to identify directions for further improvement; through large-scale multi-task benchmarks, they can comprehensively examine a model’s shortcomings in various domains, accelerating problem discovery and iterative improvement. All of this is of significant importance for driving the next generation of multimodal foundation models and advancing toward true AGI.

More importantly, the General-Level project adheres to an attitude of open sharing, welcoming broad community participation in co-construction. Whether you have new model proposals or unique datasets at hand, you can participate: submit your model results to join the leaderboard and compete with top global models; or contribute new evaluation data to enrich the task diversity of General-Bench.

Every dataset added will receive acknowledgment on the official website homepage and be cited in technical reports.

Project Homepage: https://generalist.top/
Leaderboard: https://generalist.top/leaderboard
Paper Address: https://arxiv.org/abs/2505.04620
Benchmark: https://huggingface.co/General-Level