A Deep Dive into Google's TSMixer
Benchmarking TSMixer: An All-MLP Architecture for Time Series Forecasting
I've recently been benchmarking TSMixer, an all-MLP architecture for time series forecasting, and I want to share my thoughts on the paper, how it works, and the results of my experiments.
Background: Why TSMixer?
TSMixer was introduced in 2023 by researchers at Google. I remember coming across it while waiting at my daughter's gymnastics class, scrolling through Medium and spotting an article about it. At the time, I found it interesting but didn't dive in right away. Now, after finally working with it properly, I can share some insights.
The Problem TSMixer Aims to Solve
Before TSMixer and other recent advances, transformer-based models were the go-to choice for long-horizon, multivariate time series forecasting. Models like Autoformer, FEDformer, and Informer were designed to capture temporal dependencies and cross-channel relationships, on the assumption that leveraging these dependencies jointly would yield more accurate forecasts.
However, research has shown that this isn't always the case. Simple univariate models that predict one channel at a time frequently outperform multivariate models that forecast all variables simultaneously. A key turning point was the emergence of linear models like DLinear and NLinear, which demonstrated that simple linear architectures could match or even surpass the performance of complex transformer-based models.
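As a concrete illustration of just how simple these linear baselines are, here is a minimal PyTorch sketch of the NLinear idea: subtract each series' last observed value, apply one shared linear map over the lookback window, and add the value back. The class name and shapes are my own illustration, not the reference code:

```python
import torch
import torch.nn as nn

class NLinearSketch(nn.Module):
    """NLinear-style forecaster: subtract each series' last observed value,
    apply one shared linear map over the lookback window, then add it back."""
    def __init__(self, lookback: int, horizon: int):
        super().__init__()
        self.proj = nn.Linear(lookback, horizon)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, T, C)
        last = x[:, -1:, :]                        # (B, 1, C) normalization anchor
        y = self.proj((x - last).transpose(1, 2))  # linear over time, per channel
        return y.transpose(1, 2) + last            # (B, horizon, C)
```

That single linear layer, plus a trivial normalization trick, is essentially the entire model that challenged the transformer-based forecasters.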
Understanding TSMixer's Architecture
TSMixer takes a different approach, relying on MLP layers rather than attention mechanisms. The core idea is to alternate between time-mixing and channel-mixing layers so that the model captures both temporal dependencies and cross-channel interactions.
A typical input to a time series forecasting model consists of a three-dimensional tensor:
B (Batch Size): Number of training samples in a batch.
T (Time Dimension): The historical context window, often called the lookback window.
C (Channel Dimension): The number of features in the dataset.
TSMixer operates by repeatedly transposing between the time and channel dimensions: one MLP is applied along the time axis to mix information across time steps, the tensor is transposed, and a second MLP is applied along the channel axis to mix information across features, with residual connections around each step. This design lets it model both temporal and cross-channel dependencies while keeping the parameter count low compared to transformer-based architectures; a minimal sketch follows.
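To make the data flow concrete, here is a minimal PyTorch sketch of a TSMixer-style block. It is my own simplified illustration (LayerNorm, one hidden layer for channel mixing, fixed dropout), not the official Google implementation, but it follows the alternating time-mixing/channel-mixing pattern described above:

```python
import torch
import torch.nn as nn

class MixerBlock(nn.Module):
    """One TSMixer-style block: a time-mixing MLP, then a channel-mixing MLP,
    each wrapped in a residual connection. Shapes follow (B, T, C)."""
    def __init__(self, lookback: int, channels: int, hidden: int, dropout: float = 0.1):
        super().__init__()
        self.time_norm = nn.LayerNorm(channels)
        self.time_mlp = nn.Linear(lookback, lookback)  # mixes along T
        self.feat_norm = nn.LayerNorm(channels)
        self.feat_mlp = nn.Sequential(                 # mixes along C
            nn.Linear(channels, hidden), nn.ReLU(),
            nn.Dropout(dropout), nn.Linear(hidden, channels),
        )
        self.drop = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, T, C)
        # Time mixing: transpose to (B, C, T), apply the MLP over time, transpose back.
        y = self.time_norm(x).transpose(1, 2)
        y = torch.relu(self.time_mlp(y)).transpose(1, 2)
        x = x + self.drop(y)
        # Channel mixing: MLP applied along the feature dimension.
        x = x + self.drop(self.feat_mlp(self.feat_norm(x)))
        return x

class TSMixerSketch(nn.Module):
    """A stack of mixer blocks plus a linear temporal projection to the horizon."""
    def __init__(self, lookback: int, channels: int, horizon: int,
                 hidden: int = 64, n_blocks: int = 2):
        super().__init__()
        self.blocks = nn.Sequential(*[MixerBlock(lookback, channels, hidden)
                                      for _ in range(n_blocks)])
        self.head = nn.Linear(lookback, horizon)  # projects T -> horizon

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, T, C)
        x = self.blocks(x)
        return self.head(x.transpose(1, 2)).transpose(1, 2)  # (B, horizon, C)
```

Feeding a `(32, 96, 7)` batch into `TSMixerSketch(lookback=96, channels=7, horizon=24)` returns a `(32, 24, 7)` tensor: one forecast per future step and channel.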
Key Advantages of TSMixer
Simplicity – The model architecture is straightforward, avoiding the complexity of multi-head attention mechanisms.
Parameter Efficiency – Unlike transformers, TSMixer doesn't require large parameter counts, making it computationally efficient (see the comparison snippet after this list).
Competitive Performance – Research has shown that TSMixer can perform on par with, or better than, transformer-based models on long-horizon forecasting tasks.
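One rough way to see the parameter-efficiency point is to count trainable parameters in PyTorch. The layers and dimensions below are made-up examples for comparison, not settings from the paper:

```python
import torch.nn as nn

def n_params(module: nn.Module) -> int:
    """Total trainable parameters in a module."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)

# A TSMixer-style time-mixing layer for a 96-step lookback window...
time_mixing = nn.Linear(96, 96)
# ...versus a single transformer encoder layer with a common hidden size.
encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)

print(f"time-mixing MLP:   {n_params(time_mixing):,} parameters")
print(f"transformer layer: {n_params(encoder_layer):,} parameters")
```

With these example sizes the time-mixing layer is a few orders of magnitude smaller; the exact savings depend on the configuration, but this is the intuition behind the efficiency claim.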
Revisiting Traditional Approaches
The paper discusses three broad categories of time series forecasting models:
Univariate Forecasting – Models that predict a single variable independently, such as ARMA, N-BEATS, and LTSF-Linear.
Multivariate Forecasting – Models that attempt to leverage cross-channel dependencies, including transformer-based architectures like Autoformer and Informer.
Multivariate Forecasting with Auxiliary Information – Models that use additional static or future time-dependent variables, such as DeepAR and Temporal Fusion Transformer (TFT).
While multivariate models theoretically have an advantage, empirical results show that univariate approaches often outperform them, even on the long-horizon benchmarks where cross-channel information should help the most.
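To see the univariate-versus-multivariate distinction in code, here is a hypothetical contrast between a channel-independent linear forecaster (one shared map over time, applied to each channel separately) and a fully channel-mixing one (every output depends on every input). Both classes are illustrative sketches, not models from the paper:

```python
import torch
import torch.nn as nn

class ChannelIndependentLinear(nn.Module):
    """Univariate-style: one shared linear map over time, applied per channel."""
    def __init__(self, lookback: int, horizon: int):
        super().__init__()
        self.proj = nn.Linear(lookback, horizon)

    def forward(self, x: torch.Tensor) -> torch.Tensor:      # x: (B, T, C)
        return self.proj(x.transpose(1, 2)).transpose(1, 2)  # (B, horizon, C)

class ChannelMixingLinear(nn.Module):
    """Multivariate-style: flattens time and channels so every output
    depends on every input value."""
    def __init__(self, lookback: int, horizon: int, channels: int):
        super().__init__()
        self.proj = nn.Linear(lookback * channels, horizon * channels)
        self.horizon, self.channels = horizon, channels

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, T, C)
        batch = x.shape[0]
        y = self.proj(x.reshape(batch, -1))
        return y.reshape(batch, self.horizon, self.channels)
```

The channel-independent model has lookback × horizon weights no matter how many channels there are, while the channel-mixing one grows quadratically with the channel count, which makes it much easier to overfit when the cross-channel signal is weak.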
My Thoughts and Benchmarking Results
From my benchmarking experiments, I’ve found that:
Linear models still hold up well, particularly for shorter time horizons.
TSMixer's efficiency is impressive; it delivers strong results without the heavy computation transformers require.
Transformers are not always the best choice; in many cases, a simpler architecture like TSMixer, or even a univariate model, is a better fit.
Conclusion
TSMixer is a promising alternative to transformer-based models for long-horizon time series forecasting. By leveraging a simple yet effective MLP-based approach, it challenges the notion that complex architectures are always necessary. If you're working with time series data, TSMixer is worth exploring as a computationally efficient, high-performing option.
Have you experimented with TSMixer or other MLP-based time series models? I'd love to hear your thoughts!