Tencent Unveils GameGen-O: AI Model That Generates 'Black Myth'-Style Game Videos with One Click

Media & Embodied AI · Published: Sep 14, 2024 · Elena Volkov · ~7 min read

Author

Elena Volkov · Machine Learning Research Editor

Papers, benchmarks, and training economics — with the caveats spelled out.

Tencent has unveiled GameGen-O, a Transformer-based model designed to simulate game engine functions for the generation of open-world video content. The core technical claim is that this system can produce characters, dynamic environments, complex animations, and interactive events from simple prompts, effectively automating segments of the AAA development pipeline. This would be falsified if the generated assets lack the structural coherence required for actual gameplay logic or if the “interactive control” fails to maintain temporal consistency beyond short clips.

I followed the release of this joint project between Tencent, Hong Kong University of Science and Technology (HKUST), and the University of Science and Technology of China (USTC). The paper suggests that the goal is to replace specific game development processes with AI. What stood out to me is the breadth of features demonstrated: character creation, environment generation, animation synthesis, event simulation, and multi-modal interaction via text, operational signals, and video prompts.

I think generating a visually coherent character does not equate to creating a playable entity with underlying physics or collision logic. As I read the paper, the reliance on pre-trained visual priors may also limit the model’s ability to invent truly novel game mechanics rather than remix existing ones.

“Game Studios Enter Their ChatGPT Moment”

The announcement triggered immediate excitement, with some industry figures calling it the “ChatGPT moment for game studios.” Azra Games’ co-founder and CTO explicitly used this comparison, suggesting a paradigm shift in production efficiency. I read these claims with caution; while the visual fidelity is impressive, the gap between a rendered video clip and a functional game engine remains substantial.

The demo highlights several specific capabilities. Users can generate various characters—Western cowboys, astronauts, wizards, guards—with a single click. The system also handles dynamic environments and complex animations from multiple camera perspectives. For instance, it can simulate events like tsunamis, tornadoes, or fires based on text prompts.

GameGen-O supports open-domain generation, meaning it is not limited by style, environment, or scene. It also features interactive control, allowing users to manipulate content via text (e.g., “move left,” “walk toward the dawn”), operational signals, and video prompts.

One caveat: keeping text-to-video output consistent over long durations remains a known failure mode for current Transformer architectures. As far as I can tell, evaluation metrics for “game quality” are also absent; visual realism is not the same as interactive playability.

The announcement also drew a wave of excitement on X (formerly Twitter), where users expressed amazement at the potential to democratize game creation. One user, an AI architect, said the technology could let ordinary players create games without the traditional development costs.

While the hype is palpable, I maintain that this is a significant step in generative media, not a replacement for game engines. The ability to simulate engine functions visually is distinct from generating executable code or robust interactive systems. We will need to see independent benchmarks on temporal consistency and control fidelity before accepting the “ChatGPT moment” narrative.

Data Construction and Training Architecture

The core technical claim here is that GameGen-O’s capability stems from a hybrid annotation pipeline using GPT-4o on a curated subset of gaming footage, followed by a two-stage training regimen involving a specialized VAE and an InstructNet adapter. This approach would be falsified if the model failed to generalize beyond the specific “open-world” aesthetic present in its training data or if the GPT-4o annotations introduced systematic biases that degraded temporal consistency.

Using GPT-4o for Data Annotation

To develop this model, the team stated they primarily undertook two tasks:

Constructing a proprietary dataset, OGameData, annotated with GPT-4o
Training the model in two stages

Specifically, the team first proposed a dataset construction pipeline.

The team collected 32,000 raw videos from the internet. These videos were sourced from hundreds of open-world games, ranging in length from minutes to hours, and covering genres such as role-playing, first-person shooters, racing, and action-puzzle games.

Human experts then reviewed and filtered these videos, leaving approximately 15,000 usable videos.

Next, the filtered videos were segmented into clips using scene detection technology. These video segments underwent strict sorting and filtering based on aesthetics, optical flow, and semantic content.

Subsequently, over 4,000 hours of high-quality video clips, with resolutions ranging from 720p to 4K, were meticulously annotated using GPT-4o.
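
The sorting and filtering step can be pictured as a score-and-threshold pass over the scene-detected clips. What follows is only a minimal sketch: the `Clip` fields, the threshold values, and the `keep_clip` and `filter_clips` helpers are hypothetical names chosen for illustration, and the source does not state which scoring models or cutoffs the team actually used.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Clip:
    """One scene-detected segment with hypothetical quality scores in [0, 1]."""
    clip_id: str
    aesthetic: float      # assumed: output of some aesthetic scorer
    optical_flow: float   # assumed: normalized motion magnitude
    semantic: float       # assumed: relevance of the content to gameplay


# Illustrative thresholds only; the real cutoffs are not given in the source.
MIN_AESTHETIC = 0.5
MIN_FLOW = 0.1   # drop near-static clips such as menus or loading screens
MAX_FLOW = 0.9   # drop clips with chaotic motion
MIN_SEMANTIC = 0.4


def keep_clip(clip: Clip) -> bool:
    """Return True if the clip passes all three filters."""
    return (
        clip.aesthetic >= MIN_AESTHETIC
        and MIN_FLOW <= clip.optical_flow <= MAX_FLOW
        and clip.semantic >= MIN_SEMANTIC
    )


def filter_clips(clips: list[Clip]) -> list[Clip]:
    """Keep the passing clips, sorted by aesthetic score (best first)."""
    kept = [c for c in clips if keep_clip(c)]
    return sorted(kept, key=lambda c: c.aesthetic, reverse=True)


clips = [
    Clip("a", aesthetic=0.8, optical_flow=0.5, semantic=0.9),
    Clip("b", aesthetic=0.3, optical_flow=0.5, semantic=0.9),   # fails aesthetics
    Clip("c", aesthetic=0.9, optical_flow=0.02, semantic=0.9),  # too static
    Clip("d", aesthetic=0.6, optical_flow=0.4, semantic=0.7),
]
selected = filter_clips(clips)
print([c.clip_id for c in selected])
```

In a real pipeline each score would come from a separate model, and the thresholds would be tuned against the human experts' judgments rather than fixed by hand.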

To enable interactive control, the team selected the highest-quality segments from the annotated dataset and performed decoupled labeling.

This labeling scheme describes how the state of a clip’s content changes, which gives the training data the finer-grained supervision that interactive control needs.

Regarding this collaboration between human experts and GPT-4o, some commenters noted:

“This represents a form of recursive self-improvement (human experts ensure annotation accuracy and help GPT-4o improve itself through feedback mechanisms).”

As I read the paper, relying on GPT-4o for annotation risks homogenized descriptions that may not capture nuanced game mechanics. One caveat: the “decoupled labeling” method is promising, but its effectiveness depends entirely on how precisely the state changes are defined. I also suspect that a filtered pool of roughly 15,000 videos restricts the model’s ability to handle the long-horizon temporal dependencies common in RPGs.

After completing data preparation, the team trained GameGen-O through two processes: base pre-training and instruction tuning.

In the base pre-training phase, GameGen-O used a 2+1D VAE (a variational autoencoder such as Magvit-v2) to compress video clips.

To adapt the VAE to game footage, the team fine-tuned its decoder.

The team also trained on a mix of frame rates and resolutions so that the model generalizes across both.

Additionally, the model’s overall architecture followed the principles of the Latte and OpenSora V1.2 frameworks.

By employing masked attention mechanisms, GameGen-O acquired dual capabilities: text-to-video generation and video continuation.
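
One way to see how a single model can serve both tasks is to treat them as two settings of the same per-frame mask. The source does not describe the masking scheme in detail, so the following is a toy sketch under that assumption: each latent frame is reduced to one number, and `build_frame_mask` and `add_noise_to_targets` are hypothetical helpers, not functions from the GameGen-O code.

```python
import random


def build_frame_mask(num_frames: int, num_context: int) -> list:
    """True = frame is given as clean context, False = frame must be generated."""
    if not 0 <= num_context <= num_frames:
        raise ValueError("num_context must be between 0 and num_frames")
    return [i < num_context for i in range(num_frames)]


def add_noise_to_targets(frames, mask, noise_scale, rng):
    """Noise only the frames to be generated; leave context frames untouched."""
    noised = []
    for value, is_context in zip(frames, mask):
        if is_context:
            noised.append(value)
        else:
            noised.append(value + noise_scale * rng.gauss(0.0, 1.0))
    return noised


rng = random.Random(0)
latents = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]  # toy stand-in: one scalar per latent frame

# Text-to-video: no context frames, so every frame is generated.
t2v_mask = build_frame_mask(len(latents), num_context=0)

# Video continuation: the first three frames are given, the rest are generated.
cont_mask = build_frame_mask(len(latents), num_context=3)
noised = add_noise_to_targets(latents, cont_mask, noise_scale=1.0, rng=rng)

print(t2v_mask)
print(cont_mask)
print(noised[:3])  # context frames pass through unchanged
```

Under this reading, text-to-video is simply the special case with zero context frames, which is why one training recipe can cover both capabilities.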

The team explained:

“This training method, combined with the OGameData dataset, enables the model to generate open-domain video game content stably and at high quality, laying the foundation for subsequent interactive control capabilities.”

Following this, the pre-trained model was frozen, and fine-tuning was performed using a trainable InstructNet. This allows the model to generate subsequent frames based on multimodal structural instructions.
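
The freeze-then-tune pattern can be shown with a deliberately tiny example. This is a toy sketch, not GameGen-O's training code: the single-weight “base” and “adapter”, the synthetic data, and the zero initialization of the adapter are all assumptions made for illustration.

```python
# Toy stand-in for "freeze the base, train only the control branch".
# All names and numbers here are illustrative assumptions.

base_weight = 2.0      # frozen: pretend this came from base pre-training
adapter_weight = 0.0   # trainable: zero-initialized so tuning starts from base behavior


def predict(x: float, control: float) -> float:
    """Base prediction plus a control-dependent correction from the adapter."""
    return base_weight * x + adapter_weight * control


# Synthetic data in which the target depends on the control signal: y = 2x + 3c.
data = [(x, c, 2.0 * x + 3.0 * c) for x in (0.0, 1.0, 2.0) for c in (-1.0, 0.0, 1.0)]

learning_rate = 0.1
for _ in range(200):
    for x, c, y in data:
        error = predict(x, c) - y
        # Gradient of 0.5 * error**2 with respect to adapter_weight only.
        adapter_weight -= learning_rate * error * c
        # base_weight is deliberately never updated.

print(base_weight, round(adapter_weight, 3))
```

The base weight stays exactly where pre-training left it, while the adapter alone learns how the control signal should shift the output. The appeal of this design is that instruction tuning cannot degrade the generation quality already learned by the frozen model.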

InstructNet is primarily designed to accept various multimodal inputs, including structured text, operational signals, and video prompts.

While the InstructNet branch is tuned, the current clip’s content serves as the condition, so the branch learns a mapping from current clip content to future clip content under multimodal control signals.

As a result, at inference time users can keep generating the next segment from the current one while steering it with control signals.
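
The inference loop this implies is autoregressive at the clip level: each new segment is conditioned on the previous one plus a control signal. The sketch below only illustrates that loop. `toy_next_clip` is a hypothetical stand-in for the model, a “clip” is reduced to a short list of numbers, and the two control strings are examples rather than a documented command set.

```python
from typing import Callable, List, Sequence

Clip = List[float]  # toy stand-in: one number per frame


def toy_next_clip(current: Clip, control: str) -> Clip:
    """Hypothetical model stand-in: continue from the last frame, shifted by the control."""
    step = {"move left": -1.0, "move right": 1.0}.get(control, 0.0)
    start = current[-1]
    return [start + step * (i + 1) for i in range(len(current))]


def rollout(
    first_clip: Clip,
    controls: Sequence[str],
    next_clip_fn: Callable[[Clip, str], Clip],
) -> List[Clip]:
    """Generate one clip per control signal, each conditioned on the previous clip."""
    clips = [first_clip]
    for control in controls:
        clips.append(next_clip_fn(clips[-1], control))
    return clips


generated = rollout(
    [0.0, 0.0, 0.0],
    ["move right", "move right", "move left"],
    toy_next_clip,
)
for clip in generated:
    print(clip)
```

Because every clip is conditioned on the previous output, any drift compounds across the rollout, which is why long-horizon consistency deserves independent testing.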

The team has created an official GitHub repository for GameGen-O, although the code has not yet been uploaded.

Those interested can bookmark it for now.
