Our Top 7 Forecasting Models We Benchmarked For Monash

Introduction

The forecastingdata.org project is a benchmark of many popular statistical and machine learning forecasting models. It reports the results of each model against 40 datasets drawn from a wide range of domains, from sales and finance to tourism and traffic patterns. The datasets also vary in size and come in a range of frequencies, from 10-minute readings to yearly values. The project was initiated by a group of researchers at Monash University; it has been referenced in blog posts from Hugging Face, and the datasets are frequently used in the results sections of research papers.

The trouble is that the initial benchmarking was done a few years ago, and research has moved at such a pace that it was starting to look a little outdated. So I contacted Prof. Christoph Bergmeir from the group behind forecastingdata.org and we came up with a plan to update the benchmarks with 7 of the latest state-of-the-art models. In this post I'll tell you how they performed and give you my take on which was best.

Autoformer

Autoformer was proposed in a 2021 paper by researchers from Tsinghua University and follows on from the Informer model, which was benchmarked previously. In case you hadn't guessed from the name, like Informer it's a Transformer-based model, with a modification to the attention mechanism that uses AutoCorrelation to make it more efficient and more effective at extracting long-term patterns in the time series. It's designed as a multivariate model, which means it requires multivariate datasets, and only 14 of the 40 are.

Initially, the performance was so poor I almost didn't include it in the results. However, I suspected that the issue was related to the scaling of the input. In the paper the results are from datasets that have undergone global normalisation, whereas the Monash datasets all use raw values. To compensate for this we used RevIN (see this post for a description of what this is) to scale the inputs. This did improve the situation, but not by much, and it was still the worst performer of the models we tested. Performance 2/10
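To make the scaling issue concrete, here is a minimal numpy sketch of the core RevIN idea: normalize each input window by its own mean and standard deviation, run the model in the normalized space, then reverse the normalization on the forecast. (The real RevIN also includes learnable affine parameters, omitted here; the function names are my own, not from any library.)

```python
import numpy as np

def revin_normalize(x, eps=1e-5):
    """Normalize each series window by its own mean/std (RevIN without the
    learnable affine step), returning the stats needed to reverse it."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps), (mean, std)

def revin_denormalize(y, stats, eps=1e-5):
    """Undo the instance normalization on the model's forecast."""
    mean, std = stats
    return y * (std + eps) + mean

# toy example: one window whose raw values sit far from zero
x = np.array([[100.0, 102.0, 101.0, 103.0]])
x_norm, stats = revin_normalize(x)          # model sees zero-mean input
forecast_norm = np.zeros((1, 2))            # pretend the model forecast zeros
forecast = revin_denormalize(forecast_norm, stats)  # back in the raw scale
```

This is exactly the mismatch Autoformer hit: a model trained to expect globally normalised inputs sees raw values like 100+, and its outputs are nonsense until something like the above brings each window back to a standard scale.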

Now, transformers are hungry beasts when it comes to compute resources, and as expected the training time was long in comparison to the other models we tested; some datasets we just couldn't run at all on our CPU-based setup. In total it took around 10 times as long to train this model as the next most computationally expensive model. It's really the sort of model that requires GPU acceleration. Training time 2/10

Being a multivariate model makes it suitable only for multivariate datasets, and of the 40 datasets we benchmarked, only 14 are multivariate. In the real world most datasets tend to be univariate, and given that it performed pretty poorly on the multivariate datasets, it wouldn't be what I reach for by default. Versatility 1/10

Overall Score: 5/30

PatchTST

Like Autoformer, PatchTST is another transformer-based architecture, from a paper published at ICLR in 2023. The patching mechanism was inspired by ViT, the vision transformer: a large lookback or context window of historical timesteps is selected, giving us what's known as a large receptive field, and that input is then chunked into manageable lengths and reconstructed as a 2-d tensor rather than a 1-d vector.
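The patching step itself is simple enough to sketch in a few lines of numpy: slide a fixed-length window along the lookback with a stride, and stack the chunks into a 2-d array where each row becomes one "token" for the transformer. (Patch length and stride here are illustrative values, not the paper's defaults.)

```python
import numpy as np

def make_patches(series, patch_len, stride):
    """Chunk a 1-d lookback window into (possibly overlapping) patches,
    PatchTST-style, returning an array of shape (num_patches, patch_len)."""
    starts = range(0, len(series) - patch_len + 1, stride)
    return np.stack([series[s:s + patch_len] for s in starts])

window = np.arange(16.0)                          # lookback of 16 timesteps
patches = make_patches(window, patch_len=8, stride=4)
# patches has shape (3, 8): three tokens instead of sixteen,
# which is what shrinks the attention cost relative to per-timestep tokens
```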

The authors of PatchTST also proposed using RevIN to scale the inputs, so we didn't experience the same problems as with Autoformer. Having said that, performance was still disappointing, although it did take a commendable 2nd place on the Solar 10-minute dataset. Given that in 2024 it's really considered a state-of-the-art architecture for long-horizon multivariate datasets, you might have expected it to perform better than it did. Performance 3/10

Now, as this is another transformer, we expected the training time to be long (our implementation had close to 1M parameters). Training time was quicker than Autoformer, but still longer than the others we tested. Training time 5/10

This is another multivariate model, and so, like Autoformer, it was only suitable for testing on a subset of the datasets we had. It's a shame, because it really limits where it can be used. Versatility 2/10

Overall Score: 10/30

NLinear

NLinear comes from a "family" of models proposed in the 2022 paper "Are Transformers Effective for Time Series Forecasting?". They were proposed not so much as actual state-of-the-art models, but to prove a point: that simple linear models could outperform the complexities of transformers. I've written about them before here, so check that out for details, but in summary NLinear consists of a single linear layer with a simple calculation to scale the input.
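The whole model fits in a few lines: subtract the last value of the lookback window, apply one linear layer, then add the last value back. Here's a minimal numpy sketch with an untrained (random) weight matrix standing in for the learned layer; the dimensions are illustrative.

```python
import numpy as np

def nlinear_forecast(x, W, b):
    """NLinear: subtract the window's last value, apply a single linear
    layer, then add the last value back onto the forecast."""
    last = x[-1]
    return (x - last) @ W + b + last

lookback, horizon = 8, 4
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(lookback, horizon))  # the one trainable layer
b = np.zeros(horizon)

x = np.array([10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0])
y = nlinear_forecast(x, W, b)                        # 4-step-ahead forecast
```

The subtract/add-back trick is the "simple calculation to scale the input": it anchors every forecast to the most recent observation, which makes the model robust to the level shifts between train and test windows that trip up unnormalised models.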

Performance was actually really good: much better than the transformer-based architectures we tested, honestly better than I expected, and it topped the leaderboard on the COVID dataset. Performance 6/10

Consisting of just one linear layer, it's lightning quick to train. Things get a little more nuanced if you try to use it in its univariate independent mode, which I discuss in this post, but you'll be hard pressed to find any sort of model that will train quicker. Training time: 10/10

Because the idea was to pitch it against the multivariate transformers, it really is designed to operate on multivariate datasets, which you might think puts it at an immediate disadvantage when it comes to where it can be used. However, with some modification you can adapt it to work on univariate datasets, which is something we did very successfully. Versatility: 6/10

Overall Score: 22/30

DLinear

DLinear is NLinear's brother, and in some ways it gets more of the limelight, as it's referenced much more often in research paper results, though I can't really understand why. The difference between the two is that DLinear splits the input lookback window into a long-term trend and a seasonal component, in a way somewhat reminiscent of the decomposition used in ETS. The two signals are each passed through a single linear layer and the results are summed together to produce the forecast.
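That decomposition is just a moving average: the smoothed series is the trend, the remainder is the seasonal part, and each gets its own linear layer. A minimal numpy sketch (random weights stand in for trained ones, and the kernel size is illustrative):

```python
import numpy as np

def moving_average(x, kernel):
    """Trend component: a moving average with edge padding so the output
    has the same length as the input."""
    pad = kernel // 2
    xp = np.concatenate([np.full(pad, x[0]), x, np.full(kernel - 1 - pad, x[-1])])
    return np.convolve(xp, np.ones(kernel) / kernel, mode="valid")

def dlinear_forecast(x, W_trend, W_seasonal, kernel=3):
    """DLinear: decompose the window into trend + seasonal, forecast each
    with its own linear layer, then sum the two forecasts."""
    trend = moving_average(x, kernel)
    seasonal = x - trend
    return trend @ W_trend + seasonal @ W_seasonal

lookback, horizon = 8, 4
rng = np.random.default_rng(1)
W_t = rng.normal(scale=0.1, size=(lookback, horizon))
W_s = rng.normal(scale=0.1, size=(lookback, horizon))

x = np.sin(np.arange(8.0)) + 0.5 * np.arange(8.0)  # seasonal wiggle + trend
y = dlinear_forecast(x, W_t, W_s)
```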

Performance, training time and versatility are very similar to NLinear, so there's very little to choose between them. There are a few small differences here and there, but I think both make fantastic candidates to act as a benchmark for any forecasting project you work on.

Performance 6/10, Training time 10/10, Versatility 6/10

Overall Score 22/30

TiDE

TiDE was proposed in this paper from researchers at Google in 2023 and published in TMLR (Transactions on Machine Learning Research). The idea was to extend the long-term time series research dominated by transformers and evaluate an architecture based on an MLP (multi-layer perceptron). The unique thing about TiDE is that the researchers designed a way of incorporating covariates such as exogenous variables (features) into the architecture. This may have put it at a disadvantage in our benchmarking, since we don't use any covariates. If this is a subject that interests you, check out this post that discusses it further. Overall it's a simple design which relies heavily on residual connections, but nothing more exotic than that.
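Those residual connections come packaged as a repeated building block: a small MLP whose output is added to a linear skip of the input. Here's a stripped-down numpy sketch of one such block (the real TiDE block also has layer norm and dropout, omitted here, and the dimensions are illustrative):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def residual_block(x, W1, W2, W_skip):
    """One TiDE-style residual block: dense -> ReLU -> dense, plus a
    linear skip connection from input to output."""
    hidden = relu(x @ W1)
    return hidden @ W2 + x @ W_skip

d_in, d_hidden, d_out = 8, 16, 8
rng = np.random.default_rng(2)
W1 = rng.normal(scale=0.1, size=(d_in, d_hidden))
W2 = rng.normal(scale=0.1, size=(d_hidden, d_out))
W_skip = np.eye(d_in)      # identity skip works when d_in == d_out

x = rng.normal(size=(d_in,))
out = residual_block(x, W1, W2, W_skip)
```

The skip path means the block can fall back to a purely linear mapping when the nonlinear branch isn't helping, which is part of why this family of MLPs trains so much more easily than the transformers above.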

I think the performance is best described as steady, but not stellar. It performed well (2nd overall) on the hourly pedestrian count dataset and also on the monthly car parts sales dataset, but didn't top the class on any of the datasets.

Performance 5/10

Training time is pretty quick, but obviously, being a legit neural network with activation functions and multiple layers, it's not going to be faster than the linear models. Having said that, we trained models on all the datasets without an issue on a CPU.

Training Time 7/10

This is a global univariate model, which means we were able to test it as designed on all the datasets in the benchmark. In my book that gives it an immediate advantage over the multivariate models; it's just a bit of a shame that it didn't perform better on a number of the datasets, typically those with monthly or quarterly frequencies.

Versatility 7/10

Overall Score 19/30

N-HiTS

N-HiTS was proposed in 2022 by a mix of researchers from Carnegie Mellon University and Nixtla, the well-known time-series forecasting specialists and makers of TimeGPT. It is another MLP-based model, and you can think of it as the successor to the popular N-BEATS model that was benchmarked previously. There are similarities in its DNA, with its use of basis functions and a method to decompose the signal in a sequential manner.
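The sequential decomposition works roughly like this: each stack looks at a pooled (coarser) view of the input, produces a short low-resolution forecast, and that forecast is interpolated up to the full horizon; the per-stack forecasts are summed. Here's a heavily simplified numpy sketch of that multi-rate idea, with random weights standing in for trained MLPs and the backcast residual mechanism of the real model omitted:

```python
import numpy as np

def nhits_sketch(x, horizon, pool_sizes, rng):
    """Toy N-HiTS flavour: each "stack" max-pools the input at a different
    rate (multi-rate sampling), maps it to a short coarse forecast, and
    linear interpolation upsamples that to the full horizon; the stack
    forecasts are summed. The real model also subtracts a backcast from
    the residual between stacks, which is omitted here for brevity."""
    forecast = np.zeros(horizon)
    for pool in pool_sizes:
        usable = len(x) // pool * pool
        pooled = x[:usable].reshape(-1, pool).max(axis=1)   # coarser view
        W = rng.normal(scale=0.1, size=(len(pooled), max(horizon // pool, 1)))
        coarse = pooled @ W                                 # low-res forecast
        fine = np.interp(np.linspace(0.0, 1.0, horizon),
                         np.linspace(0.0, 1.0, len(coarse)), coarse)
        forecast += fine                                    # hierarchical sum
    return forecast

rng = np.random.default_rng(3)
x = np.sin(np.arange(16.0) * 0.5)
forecast = nhits_sketch(x, horizon=4, pool_sizes=(1, 2, 4), rng=rng)
```

The intuition is that heavily pooled stacks can only express slow-moving, low-frequency structure, while the unpooled stack fills in the fast detail, so different frequencies in the signal get handled by different parts of the model.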

It's used as a point of comparison in many research papers in the field, and with good reason. The performance of this model was truly excellent: it came top of the class across all models, including those previously benchmarked, on 4 datasets. Overall it places 2nd behind the statistical model ETS, which has a distinct advantage on smaller datasets of lower-frequency data such as yearly and monthly readings. It's a solid performer and would be my pick of the bunch for a new project.

Performance 9/10

Training time was very similar to TiDE which makes sense as there are some similarities in the type of architecture that they use.

Training Time 7/10

Like TiDE, it is a global univariate model, making it suitable for all the datasets in the benchmark. There were a few that it didn't perform well on, but it's difficult to draw any conclusions about why that should be the case.

Versatility 9/10

Overall Score: 25/30

TimesFM

There's been a lot of interest in foundation models for time series ever since Nixtla released TimeGPT back in late 2023. Think of them as being LLMs for time series forecasting. TimesFM comes from a paper titled "A decoder-only foundation model for time-series forecasting" by researchers at Google, and is a model that has been trained on a huge corpus of synthetic and real-world data. This means there is no training for us to do, and we can evaluate its zero-shot forecasting abilities. In other words, we give it the data from each dataset to make forecasts, and if the hype is correct it should return something reasonable. The model itself is a transformer with a patching mechanism designed into it, so in that regard it's somewhat similar to PatchTST, but this is a completely different beast.

I have to admit to being sceptical about the performance of this model, but I was in for a shock. On a number of the datasets it performed best overall, and it was consistently good to adequate on the rest. It really was quite an eye-opener. However, there is a caveat when looking at these results: the TimesFM paper does not explicitly declare all of the datasets that were used in its training. The authors do mention that at least some of the Monash datasets were used, such as the M4 series, so for any given dataset we don't know whether it was part of the training material. If it was, that would of course be data leakage, and the results wouldn't be reliable. For this reason I'm going to mark it down on performance.

Performance: 7/10

For zero-shot forecasting there is no training time so things don’t get any better than that.

Training Time: 10/10

Like N-HiTS, it performs consistently well across the datasets, but with the caveat that we can't rule out data leakage.

Versatility: 7/10

Overall Score: 24/30

Summary

So there we have it: a look at 7 recent state-of-the-art models benchmarked on the Monash datasets. For a full set of the results please check out the forecastingdata.org website, or you can also take a look at my F1-Score championship website, which I've updated with these latest results. We have a list of other models to benchmark over the coming months as time permits, but we'd love to hear which models you would like to see us benchmark next, so please reach out, drop us a message and let us know.

 
Gareth Davies

Gareth is an AI researcher and technology consultant specialising in time series analysis, forecasting and deep learning for commercial applications.

https://www.neuralaspect.com