LTSF-Linear: Embarrassingly simple time series forecasting models


13 Sept

You know, in a world where there are more transformers in machine learning research than there are in my son's bedroom, it was a breath of fresh air to read the 2022 paper Are Transformers Effective for Time Series Forecasting? by Zeng et al. They suggest that it's possible to achieve comparable performance to state-of-the-art transformers like Autoformer, Informer and FEDformer using what the authors describe as embarrassingly simple linear models. This struck a chord with me, partly because my relationship with transformers has been patchy to say the least, and partly because, being a simple kinda guy, I liked the idea of simple models going head to head with the heavyweights in a David vs Goliath stand-off. This post takes a dive into the paper and the models it introduces, and then we'll see how they perform in the real world. All the code for this post can be found on my nnts GitHub repo.

Transformers and Long Term Time Series Forecasting (LTSF)

The paper doesn't waste any time. From the off, the authors essentially argue that the need for transformers to learn ordering through positional encoding puts them at a disadvantage, because some ordering information will inevitably be lost. They claim that this is not such a big problem in NLP, because there is more to natural language than the precise ordering of words, but in time series this is not the case.

What the authors are referring to is the fact that a transformer architecture does not intrinsically understand the temporal relationships between the data points in a time series. Self-attention treats its input as a set of tokens: it learns the relationships between the tokens themselves and not the relationships between their positions. This is why we need to add positional encoding to the input data to give the model some idea of the order of the tokens.
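To make the "set of tokens" point concrete, here is a minimal NumPy sketch of single-head self-attention with no positional encoding. The weight matrices, sizes and function name are made up for illustration and are not taken from the paper. Under those assumptions, shuffling the time steps of the input simply shuffles the output rows in the same way, so nothing in the layer itself knows which step came first.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention, no positional encoding."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ v

rng = np.random.default_rng(0)
seq_len, d_model = 6, 4  # illustrative sizes only
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

perm = rng.permutation(seq_len)
out = self_attention(x, w_q, w_k, w_v)
out_shuffled = self_attention(x[perm], w_q, w_k, w_v)

# Shuffling the inputs only shuffles the outputs: the layer is blind to order.
order_blind = np.allclose(out[perm], out_shuffled)
```

Positional encoding is what breaks this symmetry, which is why it carries the whole burden of telling the model about temporal order.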

Instinctively this point of view makes sense to me, but I would maybe go further. I think transformers are really well suited to learning abstract concepts, such as language and vision, and that is why they excel in those fields. However, in time-series and tabular data the statistical properties that we need to make predictions or forecasts are in plain sight.

It's worth bearing in mind that when this paper was written, time-series transformers were dominated by the "formers" family, like Informer and Autoformer, which were designed to address long forecast horizons on multivariate time series. Consequently the authors focus on comparing their models primarily in this domain, although they do present some univariate results which we'll come to later.

They propose 3 models called Linear, NLinear and DLinear, which all come under the family name of LTSF-Linear. They are all single-layer linear models with no non-linear activation functions. These models are designed to be simple and useful for benchmarking, but still competitive with the state-of-the-art transformers at the time of writing.

Model Architecture

These models share some common characteristics: the input is a time series of historical values and the output is a vector of future values whose length is the forecast horizon (see figure 1 for an illustration of this). This is referred to as Direct Multi-step (DMS) forecasting, in contrast to an auto-regressive model like DeepAR, which predicts one step ahead recursively. With just one linear layer and no non-linear activation functions the model is linear, hence the names DLinear and NLinear. In other words there's no deep learning going on here. There's also no probabilistic output, so the experiments in the paper optimise the models as a regression task using the Mean Squared Error (MSE) loss function. I guess technically speaking this makes them linear regression models.

Figure 1 Linear Model Architecture showing Direct Multi-step (DMS) Forecasting
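As a rough illustration of Direct Multi-step forecasting, the sketch below maps a lookback window of L values to H future values with a single affine map and scores it with MSE. The sizes, the random weights and the function names are assumptions for the example; in practice the weights would be learned by minimising the MSE.

```python
import numpy as np

def linear_forecast(history, weight, bias):
    """Direct multi-step forecast: one affine map from L past values to H future values."""
    return weight @ history + bias

def mse(forecast, target):
    """Mean squared error between a forecast and the observed future values."""
    return float(np.mean((forecast - target) ** 2))

lookback, horizon = 8, 3  # illustrative sizes only
rng = np.random.default_rng(0)
weight = rng.normal(scale=0.1, size=(horizon, lookback))  # untrained, random weights
bias = np.zeros(horizon)

history = np.arange(lookback, dtype=float)
target = np.arange(lookback, lookback + horizon, dtype=float)

forecast = linear_forecast(history, weight, bias)  # all H steps in one shot
loss = mse(forecast, target)
```

Note that the whole horizon comes out of one matrix multiplication, with no feeding of predictions back into the model as an auto-regressive approach would do.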

DLinear

DLinear handles the input by splitting it into 2 components. The first component is the trend, which is calculated as the rolling average of the time series using a moving window whose width is set by a kernel size hyperparameter. This is then subtracted from the original time series to give the second component, which the authors refer to as the seasonal. The two components are then each passed through their own linear layer to project them onto the forecasting output space, and the two projections are summed to give the final output.

Figure 2 DLinear decomposition of Tourism Monthly

To illustrate what this means in practice, figure 2 shows the decomposition of the first series in the Tourism Monthly dataset. The plots show, from left to right, the original time series, the seasonal component, and the trend component. Note that the seasonal component is closely centered around zero and the trend component sets the initial value and the slope over time. By separating out these components we can now project each one in isolation into the future, the theory being that it is simpler to model the trend and the seasonality separately than it is to model the signal as a single thing.

If this sounds strangely familiar then you're right to think that this is not the first time this idea has been tried. Indeed decomposing a time series into separate components has been around for a hundred years or so and is the basis of the well-known Holt-Winters statistical model.
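The decomposition and the two projections can be sketched in a few lines of NumPy. This is a sketch under assumptions rather than the authors' code: the edges are padded by repeating the first and last values so that the trend keeps the same length as the input, the bias terms are left out, and the kernel size, weights and synthetic series are invented for the example.

```python
import numpy as np

def moving_average(series, kernel_size):
    """Rolling mean; edges are padded by repetition (an assumption) to keep the length."""
    front = (kernel_size - 1) // 2
    back = kernel_size - 1 - front
    padded = np.concatenate([
        np.repeat(series[0], front),
        series,
        np.repeat(series[-1], back),
    ])
    window = np.ones(kernel_size) / kernel_size
    return np.convolve(padded, window, mode="valid")

def decompose(series, kernel_size):
    """Split a series into a trend (rolling mean) and a seasonal remainder."""
    trend = moving_average(series, kernel_size)
    return trend, series - trend

def dlinear_forecast(series, w_trend, w_seasonal, kernel_size):
    """Project each component with its own linear map, then sum the projections."""
    trend, seasonal = decompose(series, kernel_size)
    return w_trend @ trend + w_seasonal @ seasonal

lookback, horizon, kernel_size = 24, 6, 5  # illustrative values only
t = np.arange(lookback, dtype=float)
series = 0.5 * t + np.sin(2 * np.pi * t / 12)  # synthetic trend plus seasonality

rng = np.random.default_rng(0)
w_trend = rng.normal(scale=0.1, size=(horizon, lookback))     # untrained weights
w_seasonal = rng.normal(scale=0.1, size=(horizon, lookback))  # untrained weights

trend, seasonal = decompose(series, kernel_size)
forecast = dlinear_forecast(series, w_trend, w_seasonal, kernel_size)
```

By construction the two components add back up to the original series, so no information is thrown away by the split.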

NLinear

If you think DLinear is simple then NLinear takes things to the next level. It subtracts the value of the most recent observation from every value in the input window, which shifts the window (rather than scaling it) so that it ends at zero, and then passes this through a single fully connected linear layer to project it onto the forecasting output space. The most recent observation is then added back to the output to give the final forecast. That's it, literally that's the model!
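A minimal sketch of that recipe, assuming a single weight matrix and bias with made-up sizes and random, untrained values:

```python
import numpy as np

def nlinear_forecast(series, weight, bias):
    """Subtract the last observation, apply one linear layer, add the observation back."""
    last = series[-1]
    centred = series - last  # the window now ends at zero
    return weight @ centred + bias + last

lookback, horizon = 8, 3  # illustrative sizes only
rng = np.random.default_rng(0)
weight = rng.normal(scale=0.1, size=(horizon, lookback))
bias = rng.normal(scale=0.1, size=horizon)
series = rng.normal(size=lookback)

forecast = nlinear_forecast(series, weight, bias)
# Moving the whole series up by 100 moves the forecast up by exactly 100.
shifted_forecast = nlinear_forecast(series + 100.0, weight, bias)
```

The practical upshot is that the forecast follows the level of the series: if the test data sits at a different level from the training data, the model shifts with it.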

Multivariate vs Univariate

Now I mentioned earlier that these are primarily multivariate models and that the authors present some univariate results. So before we go any further and discuss their experimental setup let's just clarify what we mean by multivariate and univariate time series forecasting.

Most time-series forecasting datasets consist of a single file containing multiple time series. For example the Tourism Monthly dataset has 366 individual time series, one for each region in Australia, New Zealand and Hong Kong. Each time series is a sequence of monthly observations and the number of observations in each series may be different, as the historical data may start at different times in each region. When we forecast we aim to predict the future values for each of the time series across a forecast horizon, but because the time series in the dataset do not necessarily share a common time span (e.g. one region may have started collecting data later than another) we do not attempt to model the relationships between the time series, so we forecast each time series independently of the others. This is what we refer to as Univariate. Our input is a single time series and our output is a single time series.

By contrast, Multivariate forecasting is where we have multiple time series that do share a common time span and we aim to model the association between them. This means that the input to the model is essentially a table of values containing a sequence of historical values for each time series in the dataset. Obviously this requires the dataset to be structured in such a way that all the time series are aligned. Examples include the ETTh, Electricity, Traffic and Weather datasets, and these have almost become the de facto standard for benchmarking long horizon multivariate time series forecasting models such as Informer and Autoformer. If knowing the temperature of a transformer at two o'clock next Tuesday is important to you then these are the models for you, but in my opinion they are not representative of the majority of time series forecasting problems.

In summary not all datasets are suitable for multivariate forecasting, but all datasets are suitable for univariate forecasting.

Experimental Setup

The models can be configured to operate in a channel dependent multivariate mode or a channel individual univariate mode. The input into the model always takes the form of a table (matrix) with dimensions (L, C), where L is the lookback window length of historical observations and C is the number of time series in the dataset.

Channel Dependent

In this configuration, we take our input as one lump and pass it through the linear layers. In DLinear there are two (one seasonal and one trend), and in NLinear there is just one. So effectively we are producing forecasts for all the "channels" in the dataset simultaneously with one or two big matrix multiplications. Computationally this is very efficient, which is something that the authors highlight in the paper as a key advantage of these models in comparison to transformers.
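The "one big matrix multiplication" can be sketched as follows, assuming a single weight matrix of shape (H, L) that is applied along the time axis and shared by every channel, with no bias. The shapes and random values are illustrative only.

```python
import numpy as np

lookback, horizon, channels = 8, 3, 4  # L, H, C: illustrative sizes only
rng = np.random.default_rng(0)
x = rng.normal(size=(lookback, channels))                 # input table, shape (L, C)
weight = rng.normal(scale=0.1, size=(horizon, lookback))  # one shared layer, shape (H, L)

# One matrix multiplication forecasts every channel at once: (H, L) @ (L, C) -> (H, C)
forecast = weight @ x

# The same result computed one channel at a time, for comparison
per_channel = np.stack([weight @ x[:, c] for c in range(channels)], axis=1)
```

The two results are identical; the single multiplication just gets there without a Python loop, which is where the efficiency comes from.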

Channel Individual

The channel individual mode is presented as somewhat equivalent to a univariate model. In this configuration a set of linear layers is created for each time series in the dataset. The weights of these layers are dedicated to one time series and are not shared. The univariate experiments are conducted using the ETT dataset, which has 7 time series, and follow the same methodology as many of the transformer papers, which use just one time series from the dataset to train and evaluate the model. That is to say, one time series is selected (in this case the "OT" feature), all of the other series are discarded and the model is trained and evaluated on this one time series. Now I have an issue with this approach, because in my view this makes the model equivalent to a local model (i.e. a model that is trained for one time series, in contrast to a global model, which is a single model trained to forecast multiple time series). As such I don't think the results are directly comparable to global univariate models like N-HITS, DeepAR, and N-BEATS. Oh and by the way, the authors are not the only offenders here: the same methodology has been used in many of the multivariate transformer papers.

As a side note, there is no evidence in the paper that the channel individual configuration has been tested on a dataset with more than one time series. I think had they done so they would have realised that it's not such a great idea, because handling each time series individually with lots of tiny matrix multiplications is a real drag on performance and is a major problem for datasets that have a large number of time series. I will discuss this further in a future post.
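To show what "lots of tiny matrix multiplications" looks like, here is a sketch of the channel individual configuration under the assumption that each channel owns its own (H, L) weight matrix and there is no bias. The function names are hypothetical. The batched einsum variant is included purely as one possible alternative to the loop and is not something taken from the paper.

```python
import numpy as np

def forecast_individual_loop(x, weights):
    """x: (L, C) input table; weights: (C, H, L), one matrix per channel."""
    # One small matrix-vector product per channel
    return np.stack([weights[c] @ x[:, c] for c in range(x.shape[1])], axis=1)

def forecast_individual_batched(x, weights):
    """Same computation as the loop, expressed as a single einsum call."""
    return np.einsum("chl,lc->hc", weights, x)

lookback, horizon, channels = 8, 3, 5  # illustrative sizes only
rng = np.random.default_rng(0)
x = rng.normal(size=(lookback, channels))
weights = rng.normal(scale=0.1, size=(channels, horizon, lookback))  # untrained weights

looped = forecast_individual_loop(x, weights)
batched = forecast_individual_batched(x, weights)
```

With 5 channels the loop is harmless, but the number of separate multiplications and the number of parameters both grow linearly with the number of time series.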

Performance

So how do these models perform? We're going to take a look at the results from the paper with the datasets that they use and then we're going to take a look at how they perform on some of the datasets from the Monash benchmarks and compare them to the best performing models from those benchmarks.

Paper Results

So first I've implemented the code for the DLinear and NLinear models and their data sampling strategy. I've then run some experiments on the ETTh datasets in both univariate and multivariate configurations with a forecast horizon of 336 timesteps. The MAE metrics are shown in table 1, with our results shown in brackets next to the paper's published results.

  
Dataset (horizon)	DLinear Multivariate	DLinear Univariate	NLinear Multivariate	NLinear Univariate
ETTh1 (336)	0.443 (0.439)	0.244 (0.235)	0.427 (0.429)	0.226 (0.226)
ETTh2 (336)	0.465 (0.480)	0.367 (0.369)	0.400 (0.423)	0.355 (0.356)
ETTm1 (336)	0.386	0.182	0.388	0.172
ETTm2 (336)	0.342	0.261	0.327	0.259

Table 1 MAE results for the ETT datasets with a forecast horizon of 336 timesteps. The values are from the paper; our results are shown in brackets.

So, generally speaking, we are able to do a pretty good job of reproducing the results from the paper, the one exception being the multivariate configuration on the ETTh2 dataset, where we see a slightly worse result. In the paper the results are more complete and show that DLinear or NLinear outperform the Autoformer, FEDformer, Informer, Pyraformer, and LogTrans models.

Monash Benchmarks

So far so good, we have 2 models and are confident that we can reproduce the results from the paper. Now let's see how they perform on some of the datasets from the Monash benchmarks. I've selected 6 multivariate datasets from the Monash benchmarks and we're only going to measure the performance of the Multivariate configuration, because as we've discussed the univariate configuration isn't really comparable.

  
Dataset	DLinear	NLinear	Autoformer	PatchTST	Informer	FFNN	DeepAR	N-BEATS	WaveNet	Transformer	Prophet
Carparts	0.752	1.045	1.247	1.075	-	0.747	0.747	2.836	0.754	0.746*	0.876
COVID	5.601	5.176*	7.221	8.111	-	5.459	6.895	5.858	7.835	8.941	12.77
Electricity Hourly	1.880	1.882	2.400	2.138	2.682	3.2	2.516	1.968	1.606*	2.522	2.05
Electricity Weekly	0.780	0.792	0.929	0.846	1.444	0.769*	1.005	0.800	1.250	1.770	0.924
Traffic Hourly	0.923	0.918	-	0.896	1.439	0.892	0.825	1.100	1.066	0.821*	1.316
Traffic Weekly	1.096	1.103	1.476	1.168	1.323	1.150	1.182	1.094	1.233	1.555	1.084*

Table 3 DLinear and NLinear MASE compared to various other models. PatchTST and Autoformer results have been produced on the nnts framework. Informer, FFNN, DeepAR, N-BEATS and WaveNet results are all taken directly from the Monash benchmark figures. The best error in each row is marked with an asterisk; a dash marks a missing result.
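For anyone unfamiliar with the metric, MASE (Mean Absolute Scaled Error) divides the forecast's mean absolute error by the in-sample mean absolute error of a seasonal naive forecast, so a value below 1 means the model beats that naive baseline. The sketch below follows that standard definition; the function signature, the seasonality argument and the toy numbers are assumptions for illustration, not the exact evaluation code behind the table.

```python
import numpy as np

def mase(history, actual, forecast, seasonality=1):
    """Mean absolute scaled error against an in-sample seasonal naive forecast."""
    # Error of predicting each historical value with the value one season earlier
    naive_errors = np.abs(history[seasonality:] - history[:-seasonality])
    scale = naive_errors.mean()
    return float(np.mean(np.abs(actual - forecast)) / scale)

history = np.array([10.0, 12.0, 14.0, 16.0])  # toy training data
actual = np.array([18.0, 20.0])               # toy future values
forecast = np.array([17.0, 22.0])             # toy forecast

score = mase(history, actual, forecast)  # (1 + 2) / 2 divided by 2 = 0.75
```

Because the error is scaled by each series' own naive baseline, MASE values can be averaged across series of very different magnitudes.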

So NLinear is the best model on the COVID dataset, and I think it's fair to say that DLinear and NLinear compare well with the other multivariate transformer models, but can we really say they outshine the competition? Not really, no.

Conclusion

So there we have it: a look at the LTSF-Linear models and how they perform in different scenarios. I think the authors' claim that these models are competitive with state-of-the-art multivariate transformers is justified. But the elephant in the room here is that with small to medium forecast horizons none of these multivariate models are really any better than univariate models, and in most cases they are worse. Now it may be that when you start using long forecast horizons (say in excess of 336 time steps), these models start to come into their own, but for anything shorter the question is: is it really worth it? In my opinion probably not, but hey, what would I know? I'm just a simple guy trying to avoid treading on transformers in my son's bedroom.

In the next post we'll take a closer look at the channel individual / dependent configurations and see some very interesting results.


Gareth Davies

Gareth is an AI researcher and technology consultant specialising in time series analysis, forecasting and deep learning for commercial applications.

https://www.neuralaspect.com

Dataset	DLinear	NLinear	Autoformer	PatchTST	Informer	FFNN	DeepAR	N-BEATS	WaveNet	Transformer	Prophet
Carparts	0.752	1.045	1.247	1.075		0.747	0.747	2.836	0.754	0.746	0.876
COVID	5.601	5.176	7.221	8.111		5.459	6.895	5.858	7.835	8.941	12.77
Electricity Hourly	1.880	1.882	2.400	2.138	2.682	3.2	2.516	1.968	1.606	2.522	2.05
Electricity Weekly	0.780	0.792	0.929	0.846	1.444	0.769	1.005	0.800	1.250	1.770	0.924
Traffic Hourly	0.923	0.918		0.896	1.439	0.892	0.825	1.100	1.066	0.821	1.316
Traffic Weekly	1.096	1.103	1.476	1.168	1.323	1.150	1.182	1.094	1.233	1.555	1.084