SenseTime's 'Daily New 6.5' Upgrade Marks AI's Leap from Tool to Human

Industry & Startups · Published: Jul 29, 2025 · Lin Mei Huang · ~10 min read

Author

Lin Mei Huang · Multimodal & Media AI Editor

Image, video, and audio models — rights, limits, and creative workflows.

Creators are finally facing the moment where their workflows aren’t just assisted by software, but replaced by it. The shift from “tool” to “human-like agent” isn’t a metaphor anymore; it’s a business model pivot that demands immediate attention from anyone selling creative services.

On July 27, 2025, SenseTime unveiled its SenseNova V6.5 large model system (the Chinese name, Ri Ri Xin, translates literally as “Daily New”) at the WAIC 2025 Large Model Forum in Shanghai. The forum was hosted by the Artificial Intelligence Committee of the All-China Federation of Industry and Commerce (ACFIC) and organized by SenseTime under the theme “Boundless Love · Shaping the Future.” SenseTime presents the release as a significant breakthrough in multimodal foundation models and claims the upgrade lets AI leap from being merely a “productivity tool” to becoming actual “productivity.” Alongside this, SenseTime’s core product, Raccoon (Xiao Huan Xiong, literally “Little Raccoon”), has completed an agent-based upgrade, signaling a move toward autonomous interaction.

I followed the narrative carefully because it touches on a historical tension in our industry. In 1950, Alan Turing framed machine intelligence in terms of human-like capabilities through the “Imitation Game.” For decades, practical AI remained confined to the category of “tools,” often going through periods of stagnation in which its utility was limited and transactional. However, I read that in the era of large models, AI is gradually approaching the boundaries of Artificial General Intelligence (AGI). This shift is driven by breakthroughs in multimodal fusion capabilities, which allow systems to perceive, reason, and interact with a complexity that blurs the line between software and agent.

I think autonomous agents may bypass traditional freelance intermediaries entirely. On licensing, agreements must now account for AI-generated derivative works. For creators, workflow friction increases as humans manage rather than create directly.

Xu Li, Chairman and CEO of SenseTime and the first rotating chairman of the Presidium of the ACFIC Artificial Intelligence Committee, stated: “SenseTime has always sought to understand the essence of artificial intelligence. By leveraging technological innovation to unlock maximum intelligence, we are driving AI’s transition from a ‘tool’ to a ‘human,’ becoming true productivity.”

SenseNova V6.5: A Deeper Understanding of Visual Logic

I watched SenseTime roll out the SenseNova V6.5 multimodal foundation model, and what stands out is how it attempts to bridge the gap between seeing and thinking. This isn’t just another incremental update; it’s a structural shift toward “visual thinking” in large language models. The release highlights three core breakthroughs: stronger reasoning via image-text interleaved chain-of-thought (which SenseTime says puts it on par with models like Gemini 2.5 Pro and Claude 4-Sonnet), high efficiency through an optimized architecture, and robust agent capabilities for end-to-end data analysis.

What stood out to me is the claim that SenseNova V6.5 is now the first commercial-grade large model in China to implement image-text interleaved chain-of-thought technology. By moving beyond standard text-based reasoning, they are trying to replicate how humans integrate visual and logical thought. As the saying goes, “a picture is worth a thousand words,” yet most current models still rely heavily on linguistic inference for their reasoning processes, leaving gaps in graphical and spatial understanding.

I think visual thinking tools could streamline complex design workflows for creative professionals. For creators, interleaved reasoning may reduce the friction of explaining visual concepts to AI assistants.

The technical hurdle here is significant: constructing multimodal chains of thought requires generating images as nodes within a reasoning chain, which is far more difficult than pure text-based logic. SenseTime’s R&D team addressed this by first building seed data based on an understanding of human thinking processes. After supervised fine-tuning (SFT), the model learned to think with interleaved text and images, and subsequent reinforcement learning rounds significantly boosted its multimodal reasoning capabilities.
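
To make "images as nodes within a reasoning chain" concrete, here is a minimal sketch of what one seed sample might look like as a data structure. SenseTime has not published its data format, so the class names, fields, and the sample task below are all my own assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Literal


@dataclass
class Step:
    """One node in the reasoning chain: a text thought or a generated image."""
    kind: Literal["text", "image"]
    content: str  # the thought itself, or a reference to an intermediate image


@dataclass
class InterleavedChain:
    """A question, its interleaved text/image reasoning steps, and the answer."""
    question: str
    steps: List[Step] = field(default_factory=list)
    answer: str = ""

    def think(self, thought: str) -> "InterleavedChain":
        self.steps.append(Step("text", thought))
        return self

    def sketch(self, image_ref: str) -> "InterleavedChain":
        self.steps.append(Step("image", image_ref))
        return self

    def modalities(self) -> List[str]:
        return [step.kind for step in self.steps]

    def is_interleaved(self) -> bool:
        """True when the chain contains both text and image nodes."""
        return {"text", "image"} <= set(self.modalities())


# Hypothetical seed sample; the geometry task and file name are made up.
sample = (
    InterleavedChain(question="Do the two lines in the diagram intersect inside the box?")
    .think("Extend both lines to see where they would meet.")
    .sketch("step1_lines_extended.png")
    .think("The crossing point lies inside the box.")
)
sample.answer = "Yes"

print(sample.modalities())
```

The point of the sketch is only that an image sits in the middle of the chain as a first-class step, which a text-only chain of thought cannot represent.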

Simultaneously, SenseTime has overhauled the fusion architecture of its multimodal models. They employ a lighter visual encoder and a deep, narrow backbone model to promote early cross-modal integration. This design allows visual representations to align with language during the early stages of feedforward computation, resulting in more efficient perception.
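
One way to read "deep, narrow" is as a trade of width for depth at a fixed parameter budget. The sketch below uses the common rough estimate of about 12·d² weights per transformer block (4·d² for the attention projections plus 8·d² for an MLP with 4x expansion). The layer counts and widths are hypothetical numbers I picked so the totals match; SenseTime has not disclosed its actual configuration.

```python
def block_params(d_model: int) -> int:
    """Rough weight count for one transformer block (biases and norms ignored):
    4*d^2 for attention projections plus 8*d^2 for an MLP with 4x expansion."""
    return 12 * d_model * d_model


def backbone_params(n_layers: int, d_model: int) -> int:
    """Total block weights for a backbone of n_layers identical blocks."""
    return n_layers * block_params(d_model)


# Hypothetical shapes chosen to match in size; neither is SenseTime's configuration.
shallow_wide = backbone_params(n_layers=12, d_model=4096)
deep_narrow = backbone_params(n_layers=48, d_model=2048)

print(f"shallow-wide: {shallow_wide:,} weights across 12 layers")
print(f"deep-narrow:  {deep_narrow:,} weights across 48 layers")
```

Under these assumed shapes the two backbones are the same size, but the deep-narrow one has four times as many layers in which visual and text tokens can mix. Whether that is the trade-off SenseTime actually made is not stated in the release.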

For creators, faster inference means less waiting when iterating on complex multimodal prompts. I think improved cost-effectiveness could lower barriers for small studios adopting advanced AI agents.

The result is a model that has increased pre-training throughput by over 20%, reinforcement learning efficiency by 40%, and inference throughput by more than 35%. Compared to SenseNova V6.0, the cost-effectiveness of SenseNova V6.5 has tripled, achieving what they describe as an optimal balance between performance and cost.
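
It helps to translate those percentages into time saved, because a throughput gain of 35% does not mean 35% less waiting. The short calculation below is my own arithmetic on the reported figures, not something SenseTime published. It treats the reinforcement learning "efficiency" figure as a throughput gain and reads "cost-effectiveness tripled" as three times the performance per unit cost, both of which are assumptions.

```python
def time_saved(throughput_gain: float) -> float:
    """Fraction of time saved per unit of work when throughput rises by
    `throughput_gain` (0.35 means +35%)."""
    return 1.0 - 1.0 / (1.0 + throughput_gain)


# Reported gains, taken at their stated lower bounds.
reported = {
    "pre-training": 0.20,
    "reinforcement learning": 0.40,  # assumption: "efficiency" read as throughput
    "inference": 0.35,
}

for stage, gain in reported.items():
    print(f"{stage}: +{gain:.0%} throughput, about {time_saved(gain):.1%} less time per unit of work")

# "Cost-effectiveness tripled", read as performance per unit cost x3:
# at equal performance, cost falls to one third of V6.0's.
relative_cost = 1.0 / 3.0
print(f"cost at equal performance: {relative_cost:.0%} of V6.0")
```

For a creator iterating on prompts, the inference figure is the one that matters: under this reading, each generation takes roughly a quarter less time, not a third.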

The Shift from Auxiliary Tool to Autonomous Agent

Creative teams lose ground if they treat AI merely as a text generator. We are watching the industry pivot toward agents that can actually do work, not just suggest it. SenseTime’s latest move with its “SenseNova V6.5” model and the Raccoon agent signals this transition: moving from passive assistance to active productivity in office scenarios.

Large language models have long served as auxiliary tools for professionals. However, relying solely on text-based LLMs is insufficient to elevate AI from a mere “tool” to an autonomous “agent.” Human daily tasks inherently involve processing multimodal information—text, images, video, and web pages. The key transition from a productivity tool to actual productivity lies in the ability to input, process, and output this multimodal data seamlessly.

Leveraging the multimodal data analysis capabilities of its “SenseNova V6.5” model, SenseTime’s Raccoon agent has undergone a comprehensive upgrade. It can now handle complex multimodal inputs, perform deep fused analysis across modalities, and deliver professional visual outputs. SenseTime frames this as establishing “AI productivity in office scenarios,” the same leap from “productivity tool” to actual “productivity.”

For creators, agents that handle complex workflows reduce the friction of manual data entry. On licensing, we must ensure these agents respect copyright when parsing and reusing visual content from sources like Douyin.

Benchmarking Performance Against Global Giants

SenseTime also claims that Raccoon maintains world-leading capabilities in complex data analysis. In the company’s comprehensive customer scenario tests, Raccoon reportedly achieved performance comparable to Claude 4 Opus, which it treats as an international benchmark for data analysis and AI agents, and significantly outperformed models such as OpenAI’s o3. Specifically, its accuracy is said to approach 100% in tasks involving time-series calculations, data matching, mathematical computations, and anomaly detection.

In real-world office environments, data input formats are highly complex. In data analysis scenarios, documents come in various forms such as screenshots, Word files, and PDFs, with structured information and tables accounting for only about 70% of the content. Even seemingly basic Excel spreadsheets often contain complex elements like merged cells, missing values, nested sub-tables, and embedded charts, significantly increasing processing difficulty.

SenseTime Raccoon takes a multimodal approach to achieve holistic analysis. Through chain-of-thought reasoning, it engages in multi-step thinking and reflection before outputting structured results. A table may appear simple while the underlying logical causality is intricate, and SenseTime Raccoon aims to untangle such tables for users.

When a user uploads a complex Excel file containing merged cells, missing values, sub-tables, embedded charts, and external images, Raccoon accurately parses the content, establishes logical connections between sub-tables, and generates a complete analysis report.
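
To show why merged cells and missing values make "basic" spreadsheets hard, here is a small sketch of the normalization any parser has to perform before analysis. It is not Raccoon's method, which SenseTime has not described at this level. The grid, the merged-range format, and the function names are hypothetical, and the example uses plain Python lists instead of a real Excel library.

```python
from typing import List, Optional, Tuple

Cell = Optional[str]
Grid = List[List[Cell]]
# (top_row, left_col, bottom_row, right_col), zero-based and inclusive
MergedRange = Tuple[int, int, int, int]


def fill_merged(grid: Grid, merged: List[MergedRange]) -> Grid:
    """Copy each merged range's top-left value into every cell it covers."""
    out = [row[:] for row in grid]
    for top, left, bottom, right in merged:
        value = grid[top][left]
        for r in range(top, bottom + 1):
            for c in range(left, right + 1):
                out[r][c] = value
    return out


def missing_cells(grid: Grid) -> List[Tuple[int, int]]:
    """Return (row, col) positions that are still empty."""
    return [
        (r, c)
        for r, row in enumerate(grid)
        for c, value in enumerate(row)
        if value is None or value == ""
    ]


# Hypothetical sheet: "East" is a merged cell spanning rows 1-2 of column 0,
# and the Q2 sales figure is genuinely missing.
sheet = [
    ["Region", "Quarter", "Sales"],
    ["East", "Q1", "120"],
    [None, "Q2", None],
    ["West", "Q1", "95"],
]
flat = fill_merged(sheet, merged=[(1, 0, 2, 0)])

print(flat[2])
print(missing_cells(flat))
```

The distinction matters: the empty "Region" cell is a layout artifact that should be filled in, while the empty "Sales" cell is a real gap that should be flagged, not guessed.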

In another complex input scenario, a small business owner might encounter useful tabular data while browsing a video platform like Douyin (the Chinese counterpart of TikTok). After the user takes a screenshot and uploads it, Raccoon decomposes the task using image information, filters out noise, extracts the table data, and lets the user export an editable Excel file with one click. Throughout this process, from input to analysis to output, multimodal capabilities keep execution smooth.

I think direct extraction of data from video platforms raises immediate questions about licensing and fair use. For creators, workflow efficiency gains are hollow if the underlying model lacks transparency in how it processes source material.

Redefining Interaction: From Command to Collaboration

Traditional AI tools mostly play an auxiliary role, with core work still driven by the user. SenseTime Raccoon changes this interaction paradigm: the AI proactively takes on core tasks and confirms key information through precise questions, so the interaction resembles a collaborative workflow between colleagues.

The newly launched task planning feature offers an intuitive interaction mode. Take the recent surge in popularity of “Su Chao,” the Jiangsu city football league, as an example:

When a user uploads an image or table requesting an analysis of the top players in the league, Raccoon automatically gathers online information and leverages expert knowledge to generate a task list (such as defining criteria for “Top 5” players or analyzing youth training results).

The workflow is rigorous: the system conducts a systematic analysis to produce high-quality documents, which can then be exported into editable formats like Excel, PPT, or HTML.

What stood out to me was the proactive nature of Raccoon. Upon receiving a task, it organizes details and asks clear questions at key nodes—such as confirming whether to proceed according to specific points—to ensure directionality. This creates an efficient model where “AI handles the work while users make decisions.”
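
The "AI handles the work while users make decisions" pattern is easy to sketch as a plan that pauses at flagged steps. This is a generic illustration of the pattern under my own assumptions, not Raccoon's implementation. The `Task` class, `run_plan`, and the callbacks are hypothetical names, and only the two task descriptions come from SenseTime's example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Task:
    description: str
    needs_confirmation: bool = False  # True marks a "key node"


def run_plan(
    tasks: List[Task],
    confirm: Callable[[str], bool],
    execute: Callable[[str], str],
) -> List[str]:
    """Run tasks in order, pausing for a user decision at key nodes."""
    log = []
    for task in tasks:
        if task.needs_confirmation and not confirm(f"Proceed with: {task.description}?"):
            log.append(f"skipped: {task.description}")
            continue
        log.append(execute(task.description))
    return log


# Task descriptions borrowed from SenseTime's example; everything else is a stand-in.
plan = [
    Task('Define criteria for "Top 5" players', needs_confirmation=True),
    Task("Analyze youth training results"),
]

asked = []


def auto_yes(question: str) -> bool:
    """Stand-in for the user: record the question and approve it."""
    asked.append(question)
    return True


log = run_plan(plan, confirm=auto_yes, execute=lambda d: f"done: {d}")
print(log)
```

Note that the user is asked only once here, at the step where the definition of "Top 5" shapes everything downstream. Choosing which steps deserve a question is the design decision that separates collaboration from nagging.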

For creators, this shift demands learning to direct AI rather than just prompt it. I think decision-making becomes a higher-value skill when execution is automated.

Next, the system generates a task list based on expert knowledge, such as determining “Top 5” standards or analyzing youth training results, for systematic analysis, so the next steps and possible collaborative approaches are clear at a glance.

Professional data integration and tool invocation then support high-quality content generation.

Finally, it generates an analysis document that can be exported into editable formats such as Excel, PPT, or HTML.

With its strong capability in handling complex tasks, SenseTime Raccoon is accelerating industry penetration. This update introduces specialized versions for two specific sectors: Education and Finance.

The SenseTime Raccoon Education Edition can intelligently analyze student performance, course effectiveness, and learning behavior patterns. It currently serves over 500 institutions across more than 10 educational scenarios, reaching over 250,000 teachers and students. According to SenseTime, it helps improve student learning efficiency by 15–30%, has helped teaching research teams reduce academic anxiety rates by 40% in multiple schools, increases classroom participation by 2.1 times, reduces resource mismatch rates by 30%, and improves the timeliness of mental health interventions by 50%.

Educators should verify these metrics to ensure AI doesn’t replace pedagogical nuance. On licensing, data privacy in education requires strict oversight when AI handles student records.

The SenseTime Raccoon Finance Edition provides financial institutions with knowledge assistants, intelligent data querying tools, and multimodal smart claims solutions, establishing what the company calls a new paradigm for “human-machine collaborative” intelligent decision-making in the financial sector.

To date, the “SenseTime Raccoon Family” product matrix serves enterprises across multiple industries, with its user base exceeding 10 million.

SenseTime says the SenseNova large model will continue to evolve, activating AI productivity through multimodal technology and partnering with industries to accelerate the journey toward the era of Artificial General Intelligence (AGI).