Surpassing Transformers: Tsinghua and Ant Group's Pure MLP Architecture Boosts Short- and Long-Term Time Series Forecasting

Author

Lin Mei Huang · Multimodal & Media AI Editor

Image, video, and audio models — rights, limits, and creative workflows.

The creative stack just got a new contender, and it’s not another attention-based giant. Tsinghua University and Ant Group have released TimeMixer, a pure Multi-Layer Perceptron (MLP) architecture that claims to surpass Transformers in both performance and efficiency for time series forecasting.

In an era where data-driven decision-making is non-negotiable, time series forecasting has become indispensable. Yet, despite their power, Transformers struggle with high computational complexity and inefficiency when handling long sequences. TimeMixer addresses these pain points by combining temporal trend decomposition with a multi-scale hybrid design, achieving near-linear efficiency through its pure MLP structure.

A Pure MLP Architecture That Beats the Attention Mechanism

I followed the release details, and what stood out to me was the architectural shift. The model uses a multi-scale mixing design built to handle complex temporal variations. It is built entirely from MLP layers and has two core components: Past Decomposable Mixing (PDM), which extracts patterns from historical data, and Future Multipredictor Mixing (FMM), which produces the forecasts.

The PDM module extracts historical information by mixing seasonal and trend components separately across scales. Seasonal mixing runs from fine to coarse: PDM progressively aggregates detailed seasonal information into the coarser scales. Trend mixing runs the other way, using prior knowledge from the coarser scales to bring out macroscopic trend information. Together, the two give multi-scale mixing in the extraction of past information.
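
The flow described above can be sketched in a few lines of plain Python. This is only an illustration under loud assumptions: the real model learns its mixing layers, whereas here average pooling and point repetition stand in for them, the moving-average decomposition and its window size are guesses at a reasonable default, and every function name is hypothetical rather than taken from the TimeMixer code.

```python
def moving_average(series, window=5):
    """Centered moving average (odd window) with edge padding; serves as the trend."""
    half = window // 2
    padded = [series[0]] * half + list(series) + [series[-1]] * half
    return [sum(padded[i:i + window]) / window for i in range(len(series))]


def decompose(series, window=5):
    """Split a series into (seasonal, trend) so that seasonal + trend == series."""
    trend = moving_average(series, window)
    seasonal = [x - t for x, t in zip(series, trend)]
    return seasonal, trend


def downsample(series, factor=2):
    """Average pooling: one coarse point per `factor` fine points."""
    return [sum(series[i:i + factor]) / factor
            for i in range(0, len(series) - factor + 1, factor)]


def upsample(series, factor=2):
    """Repeat each coarse point `factor` times to reach the finer length."""
    return [x for x in series for _ in range(factor)]


def past_decomposable_mixing(series, num_scales=3, window=5):
    # Build the multi-scale views: index 0 is the finest, the last is the coarsest.
    scales = [list(series)]
    for _ in range(num_scales - 1):
        scales.append(downsample(scales[-1]))

    parts = [decompose(s, window) for s in scales]
    seasonal = [p[0] for p in parts]
    trend = [p[1] for p in parts]

    # Seasonal mixing, fine to coarse. ASSUMPTION: pooling replaces a learned MLP.
    for m in range(1, num_scales):
        pooled = downsample(seasonal[m - 1])
        seasonal[m] = [a + b for a, b in zip(seasonal[m], pooled)]

    # Trend mixing, coarse to fine. ASSUMPTION: repetition replaces a learned MLP.
    for m in range(num_scales - 2, -1, -1):
        expanded = upsample(trend[m + 1])
        trend[m] = [a + b for a, b in zip(trend[m], expanded)]

    # Recombine the mixed parts into one representation per scale.
    return [[s + t for s, t in zip(se, tr)] for se, tr in zip(seasonal, trend)]


history = [float(i % 4) + 0.1 * i for i in range(16)]
mixed = past_decomposable_mixing(history)
print([len(scale) for scale in mixed])  # [16, 8, 4]
```

The sketch expects the input length to be divisible by 2 ** (num_scales - 1) so that the scales line up when they are mixed.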

This approach allows for a more efficient aggregation of data without the quadratic complexity often associated with attention mechanisms. By leveraging multi-scale sequence information, the model aims to enhance forecasting performance for both short- and long-term horizons simultaneously.

I think faster inference times mean lower cloud costs for developers deploying these models at scale.

The FMM component acts as a collection of multiple predictors. Each predictor operates based on past information at different scales, enabling FMM to integrate complementary forecasting functions from mixed multi-scale sequences. This design ensures that the model can capture diverse temporal dependencies without relying on heavy attention heads.
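
To make the "collection of predictors" idea concrete, here is a minimal sketch in plain Python. It assumes each predictor is a simple linear map from one scale's past representation to the forecast horizon and that the per-scale forecasts are combined by summation. The weights below are fixed placeholders, not trained values, and the function names are hypothetical.

```python
def linear_predictor(weights, bias):
    """Return a function mapping a history vector to a forecast vector.

    `weights` has one row per forecast step; each row has one entry per
    history point at that predictor's scale.
    """
    def predict(history):
        return [sum(w * x for w, x in zip(row, history)) + b
                for row, b in zip(weights, bias)]
    return predict


def future_multipredictor_mixing(scale_features, predictors):
    """Run one predictor per scale and sum the forecasts step by step."""
    per_scale = [predict(features)
                 for predict, features in zip(predictors, scale_features)]
    return [sum(step_values) for step_values in zip(*per_scale)]


horizon = 2
scale_features = [
    [1.0, 2.0, 3.0, 4.0],  # fine scale: 4 past points
    [1.5, 3.5],            # coarse scale: 2 past points
]
# Placeholder weights (ASSUMPTION): each predictor just averages its input.
predictors = [
    linear_predictor([[1.0 / len(f)] * len(f)] * horizon, [0.0] * horizon)
    for f in scale_features
]
forecast = future_multipredictor_mixing(scale_features, predictors)
print(forecast)
```

Because every scale has its own predictor, the input lengths can differ from scale to scale while the outputs all share the same horizon, which is what allows them to be combined.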

The Efficiency Trade-off in Time Series Forecasting

What stood out to me in the release from Tsinghua University and Ant Group is who actually benefits when we strip away the Transformer’s attention mechanism. For data engineers and MLOps teams drowning in compute costs, a pure MLP architecture like TimeMixer offers a direct path to lower latency without sacrificing accuracy. But for analysts who rely on these models for interpretability, the shift raises questions about how much “black box” simplicity we are willing to accept for speed.

In short, data engineers save on compute, while analysts may lose some transparency into model decisions. Faster inference helps real-time applications, but it may leave less to work with for researchers who need deep feature attribution.

To validate TimeMixer’s performance, the team conducted experiments across 18 benchmark datasets. These covered long-term and short-term forecasting, multivariate time series, and spatiotemporal graph structures. The applications ranged from power load forecasting and meteorological predictions to stock price modeling.

The results suggest TimeMixer comprehensively outperforms current state-of-the-art Transformer models on multiple metrics:

Forecasting Accuracy: On all tested datasets, TimeMixer demonstrated higher accuracy. In power load forecasting specifically, it reduced the Mean Absolute Error (MAE) by approximately 15% and the Root Mean Square Error (RMSE) by about 12% compared to Transformer models.

Computational Efficiency: Benefiting from the efficient computational characteristics of MLP structures, TimeMixer significantly outperforms Transformers in both training and inference times. Under identical hardware conditions, it reduced training time by approximately 30% and inference time by about 25%.

Model Interpretability: By introducing Past Decomposable Mixing and Future Multipredictor Mixing techniques, TimeMixer better explains the contribution of information across different temporal scales. This makes the model’s decision-making process more transparent and easier to understand than typical attention-based black boxes.

Generalization Ability: Tested on various dataset types, TimeMixer exhibited strong generalization capabilities, adapting well to different data distributions and features. This suggests broad applicability in practical scenarios where data characteristics shift frequently.
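
For readers who want to check accuracy figures like the MAE and RMSE reductions quoted above, the metrics and the relative reduction are easy to compute. The sketch below uses only the Python standard library. The numbers in the example are made up for illustration and are not results from the paper.

```python
import math


def mae(actual, predicted):
    """Mean Absolute Error: average size of the errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root Mean Square Error: penalizes large errors more than MAE does."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def relative_reduction(baseline_error, model_error):
    """Fraction by which the model lowers the baseline's error."""
    return (baseline_error - model_error) / baseline_error


# Made-up values (ASSUMPTION): a toy load series and two sets of forecasts.
actual = [10.0, 12.0, 11.0, 13.0]
baseline_forecast = [11.0, 10.0, 12.0, 15.0]
model_forecast = [10.5, 11.0, 11.5, 14.0]

print(relative_reduction(mae(actual, baseline_forecast), mae(actual, model_forecast)))
print(relative_reduction(rmse(actual, baseline_forecast), rmse(actual, model_forecast)))
```

A reported "15% lower MAE" corresponds to relative_reduction returning roughly 0.15 when both models are scored on the same test data.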

Long-Term Forecasting: To ensure a fair comparison, the experiments used standardized parameters, aligning input lengths, batch sizes, and training epochs across models. The study also reported results from comprehensive parameter searches, acknowledging that many previously published results depend on heavy hyperparameter optimization.

Short-Term Forecasting: Multivariate Data

[Figure: short-term forecasting results on multivariate data]

Ablation Studies: To verify the effectiveness of each component, the team ran detailed ablation studies on all 18 experimental benchmarks, examining every possible design variation within the Past Decomposable Mixing and Future Multipredictor Mixing modules.

Model Efficiency: The team compared runtime memory and time during the training phase with state-of-the-art models. TimeMixer consistently demonstrated excellent efficiency in GPU memory usage and runtime across various sequence lengths (ranging from 192 to 3072), while maintaining consistent state-of-the-art performance for both long-term and short-term forecasting tasks.

Notably, as a deep learning model, TimeMixer exhibits efficiency results comparable to fully linear models. This positions it as a promising solution in scenarios requiring high model efficiency without the overhead of complex attention mechanisms.
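
A quick back-of-the-envelope calculation shows why avoiding pairwise attention matters over the sequence-length range mentioned above. The sketch counts only the entries of a full attention score matrix, which grow with the square of the length. It is not a measurement of TimeMixer or of any specific Transformer, and the intermediate lengths between 192 and 3072 are assumed to double at each step.

```python
def attention_score_entries(length):
    """Entries in a full self-attention score matrix for one head."""
    return length * length


def relative_growth(short, long):
    """How much the length and the score matrix grow from `short` to `long`."""
    length_ratio = long / short
    pairwise_ratio = attention_score_entries(long) / attention_score_entries(short)
    return length_ratio, pairwise_ratio


# Intermediate lengths are an ASSUMPTION; the article gives only the endpoints.
for length in (192, 384, 768, 1536, 3072):
    print(length, relative_growth(192, length))
```

Going from 192 to 3072 steps makes the input 16 times longer but the score matrix 256 times larger, while a model whose cost stays near-linear in length grows by roughly the first factor only.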

In summary, TimeMixer brings a new perspective to time series forecasting and demonstrates the potential of pure MLP structures on complex tasks. With further optimization and more application scenarios, it could help push the field forward and deliver value across industries.

This project was supported by NextEvo, the AI innovation R&D department under Ant Group’s Intelligent Engine Division. Ant Group’s NextEvo Optimization Intelligence Team focuses on intelligent decision-making technologies that combine o

I read the release notes for TimeMixer, and what stood out to me is how Tsinghua University and Ant Group are challenging the dominance of Transformers in time series forecasting. By introducing a pure MLP architecture, they claim to boost both short- and long-term prediction accuracy without the computational overhead typical of attention mechanisms. This isn’t just an academic exercise; it’s a direct hit on the efficiency constraints that plague real-world predictive optimization in operations research.

I think smaller teams can now deploy complex forecasting models without massive GPU budgets. Reduced inference costs mean faster iteration cycles for data-driven product features, and a pure MLP design simplifies the stack, lowering the barrier to entry for time-series integration.

The team’s work covers the R&D of algorithmic technologies, platform services, and solutions, bridging the gap between theoretical innovation and practical application. Their approach suggests that we might be over-engineering our predictive models with attention heads when simple linear projections could suffice. This shift matters for anyone building scalable AI products where latency and cost are critical metrics.

Paper Link: https://arxiv.org/abs/2405.14616v1
Code Repository: https://github.com/kwuking/TimeMixer
