(This page is based primarily on material from this blog post, and various Wikipedia pages)
We will primarily deal with the problem of forecasting. We consider the single-variable case, although the following can be easily generalised to multiple variables.
We have training data $x_1, \dots, x_T$, and we wish to be able to predict $x_{T+1}$, $x_{T+2}$, and so on. This is significantly different from our standard supervised learning case, and is closer to the self-supervised case where we have to learn the data distribution. The main difference is that our data is not i.i.d.: each datapoint is correlated with other datapoints over time.
Here our models tend to consist of linear operations, combined with "white noise"/"innovations" terms. We assume that this noise is Gaussian, with mean zero and some fixed variance: $\varepsilon_t \sim \mathcal{N}(0, \sigma^2)$.
When considering our models we treat this white noise as known: for past observations it is simply whatever value is needed to close the gap between the observation and the linear model. When doing prediction for a future timestep $t$, we can effectively ignore its noise term $\varepsilon_t$, as it has expectation zero and we don't know what value will be sampled.
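In a moving-average model, MA(q), each observation is represented as the sum of the mean of the overall series, its innovation, and a linear combination of the previous $q$ innovations: $x_t = \mu + \varepsilon_t + \sum_{i=1}^{q} \theta_i \varepsilon_{t-i}$.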
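As a rough illustration of the moving-average description above, the sketch below simulates such a series. The names `simulate_ma`, `mu`, `thetas` and `sigma` are chosen for this example only and are not taken from the source.

```python
import random

def simulate_ma(mu, thetas, n, sigma=1.0, seed=0):
    """Simulate n observations of an MA(q) process, with q = len(thetas)."""
    rng = random.Random(seed)
    q = len(thetas)
    # Draw q extra innovations so the first observation has a full history.
    eps = [rng.gauss(0.0, sigma) for _ in range(n + q)]
    series = []
    for t in range(q, n + q):
        # thetas[i] weights the innovation i + 1 steps in the past.
        past = sum(thetas[i] * eps[t - 1 - i] for i in range(q))
        series.append(mu + eps[t] + past)
    return series

# Hypothetical parameter values, chosen only for illustration.
ma_series = simulate_ma(mu=10.0, thetas=[0.6, 0.3], n=500)
```

With the noise switched off (`sigma=0.0`) every observation collapses to the series mean, which is a quick sanity check on the structure of the model.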
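In an autoregressive model, AR(p), each observation is represented as the sum of some fixed constant, its innovation, and a linear combination of the previous $p$ observations: $x_t = c + \varepsilon_t + \sum_{i=1}^{p} \varphi_i x_{t-i}$.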
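The following minimal sketch shows how forecasting could work for an autoregressive model of this kind, with each future innovation replaced by its expectation of zero. The function name `forecast_ar` and the coefficient values are assumptions made for this example.

```python
def forecast_ar(history, c, phis, steps):
    """Forecast `steps` values ahead with an AR(p) model, p = len(phis)."""
    values = list(history)
    p = len(phis)
    for _ in range(steps):
        # phis[i] weights the observation i + 1 steps in the past.
        # The unknown future innovation is replaced by its expectation, zero.
        nxt = c + sum(phis[i] * values[-1 - i] for i in range(p))
        values.append(nxt)
    return values[len(history):]

# Hypothetical AR(2) coefficients, chosen only for illustration.
preds = forecast_ar(history=[1.0, 2.0], c=0.5, phis=[0.5, 0.25], steps=2)
# preds == [1.75, 1.875]
```

Forecasts beyond one step reuse earlier forecasts as if they were observations, so errors compound as the horizon grows.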
Notation: ARMA(p, q)
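The sum of the AR and MA terms is used to model the time series (the MA mean term $\mu$ is "absorbed" into the AR constant $c$): $x_t = c + \varepsilon_t + \sum_{i=1}^{p} \varphi_i x_{t-i} + \sum_{i=1}^{q} \theta_i \varepsilon_{t-i}$.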
Notation: ARIMA(p, d, q)
ARMA with an added "integrated" term: the series is replaced by its $d$-th-order differences before the ARMA model is applied, which helps de-trend the data.
Note that taking the logarithm of the data first can also help with exponential trends, since it turns them into linear ones.
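A small sketch of differencing, assuming the usual definition where a first-order difference is $x_t - x_{t-1}$ and a $d$-th-order difference applies this $d$ times. The helper name `difference` and the toy series are illustrative only.

```python
import math

def difference(series, d=1):
    """Apply first-order differencing d times."""
    out = list(series)
    for _ in range(d):
        out = [b - a for a, b in zip(out, out[1:])]
    return out

linear = [3 * t + 2 for t in range(6)]                      # linear trend
quadratic = [t * t for t in range(6)]                       # quadratic trend
exponential = [2.0 * math.exp(0.5 * t) for t in range(6)]   # exponential trend

d1 = difference(linear, 1)       # [3, 3, 3, 3, 3]: trend removed
d2 = difference(quadratic, 2)    # [2, 2, 2, 2]: needs two differences
# Taking logs turns the exponential trend into a linear one,
# which a single difference then removes (each value is about 0.5).
log_d1 = difference([math.log(x) for x in exponential], 1)
```

Each round of differencing shortens the series by one observation, so `d` is normally kept small.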
Notation: SARIMA(p, d, q, P, D, Q, m)
Like ARIMA, but includes a second set of parameters (P, D, Q) to model the seasonal trend. The parameter m determines the number of timesteps in one seasonal period.
Note that simpler seasonal differencing methods can also be applied.
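As one example of simple seasonal differencing, each observation can have the value from one full period earlier subtracted from it. This is a sketch under that assumption; `seasonal_difference` and the toy pattern are made up for illustration.

```python
def seasonal_difference(series, m):
    """Subtract from each observation the value one period (m steps) earlier."""
    return [series[t] - series[t - m] for t in range(m, len(series))]

pattern = [5, 1, 3, 8]                             # one period of length 4
seasonal = [pattern[t % 4] for t in range(12)]     # three repeats of the pattern
result = seasonal_difference(seasonal, m=4)
# result == [0, 0, 0, 0, 0, 0, 0, 0]: the purely seasonal component is gone
```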
Information on Wikipedia for this appears to vary slightly, so I'm not certain about what exactly is the best approach—hopefully this explanation accounts for that uncertainty.
This involves fitting the AR and MA coefficients ($\varphi_i$ and $\theta_i$). Information on this is inconsistent. For pure MA models, Wikipedia suggests that because the innovations $\varepsilon_t$ are not directly observed and so are hard to calculate, non-linear fitting methods must be used. However, in all other cases it suggests that least-squares-type approaches can be used.
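To make the least-squares idea concrete, here is a minimal sketch for the simplest autoregressive case, AR(1), where fitting reduces to ordinary linear regression of each observation on the previous one. The function name `fit_ar1` and the simulated parameters are assumptions for this example; real libraries typically use more general estimators.

```python
import random

def fit_ar1(series):
    """Least-squares estimate of (c, phi) in x_t = c + phi * x_{t-1} + eps_t."""
    x = series[:-1]   # previous observations
    y = series[1:]    # current observations
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    phi = sxy / sxx
    c = my - phi * mx
    return c, phi

# Simulate an AR(1) series with hypothetical true values c = 1.0, phi = 0.7.
rng = random.Random(0)
data = [0.0]
for _ in range(2000):
    data.append(1.0 + 0.7 * data[-1] + rng.gauss(0.0, 1.0))

c_hat, phi_hat = fit_ar1(data)   # estimates should land near 1.0 and 0.7
```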
This involves fitting the p, d, q variables etc. There are two main approaches to this:
The Akaike and Bayesian information criteria (AIC and BIC) give us a way of "scoring" a model on the training data by combining likelihood and model complexity, which balances over- and under-fitting. Lower scores are better. They are defined as:
$$\mathrm{AIC} = 2k - 2\ln(\hat{L})$$
$$\mathrm{BIC} = k\ln(n) - 2\ln(\hat{L})$$
where $k$ is the number of parameters, $n$ is the size of the dataset, and $\hat{L}$ is the maximised likelihood of the model on the dataset.
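The two criteria are simple to compute once a model's log-likelihood is available. The sketch below uses the standard definitions; the two candidate models and their log-likelihood values are invented purely to show the comparison.

```python
import math

def aic(k, log_likelihood):
    """Akaike information criterion: 2k - 2 ln(L)."""
    return 2 * k - 2 * log_likelihood

def bic(k, n, log_likelihood):
    """Bayesian information criterion: k ln(n) - 2 ln(L)."""
    return k * math.log(n) - 2 * log_likelihood

# Hypothetical fits on n = 200 points: the larger model gains little likelihood.
n = 200
aic_small, bic_small = aic(2, -520.0), bic(2, n, -520.0)
aic_large, bic_large = aic(6, -518.5), bic(6, n, -518.5)
prefer_small = aic_small < aic_large and bic_small < bic_large
```

Because $\ln(n) > 2$ once $n \geq 8$, BIC penalises extra parameters more heavily than AIC on all but tiny datasets.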
Autocorrelation is closely linked to moving-average models. MA(q) is designed so that its autocorrelation cuts off after lag q:
- For a true MA(q) process, the autocorrelation is 0 at lag > q.
- When modelling a timeseries as MA(q), we set q to the last lag whose autocorrelation is significantly different from zero, so that all autocorrelations beyond it are insignificant.
- We can tell this by plotting the autocorrelation function (ACF; see below).
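The cut-off behaviour listed above can be checked numerically. The sketch below computes a sample autocorrelation function for a simulated MA(1) series; the function name `acf` and the coefficient 0.6 are assumptions for this example.

```python
import random

def acf(series, max_lag):
    """Sample autocorrelation for lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    denom = sum(d * d for d in dev)
    return [
        sum(dev[t] * dev[t - k] for t in range(k, n)) / denom
        for k in range(max_lag + 1)
    ]

# Simulated MA(1): x_t = eps_t + 0.6 * eps_{t-1}
rng = random.Random(0)
eps = [rng.gauss(0.0, 1.0) for _ in range(2001)]
ma1 = [eps[t] + 0.6 * eps[t - 1] for t in range(1, 2001)]

rho = acf(ma1, max_lag=5)
# rho[0] is 1, rho[1] is near 0.6 / (1 + 0.6**2), and later lags are near 0.
```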
Fortunately, we have an equivalent for AR(p) models, which are designed so that the partial autocorrelation is 0 at lag > p. We find the ideal value of p in just the same way as q, using a plot of the partial autocorrelation function (PACF).
Consider time-series data such as the temperature dataset presented here (source for this section).
First we will consider the autocorrelation function plot (or correlogram). The correlation between two vectors is the normalised inner product of their mean-centred versions, equal to the cosine of the angle between them. The ACF plot records this value (y-axis) for the time series compared with increasing lag values (x-axis) of itself. The autocorrelation at lag 0 is therefore 1, and in this case it then changes cyclically because of the cyclic nature of the original data.
For the ACF plot we also draw a blue cone representing the confidence interval. Only values outside this cone are statistically significantly different from 0. Therefore, in this case the ideal value of $q$ is the last value on the x-axis that is outside this cone.
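A rough version of this selection rule can be written down directly. The sketch below assumes the common approximate 95% bound of $\pm 1.96/\sqrt{n}$ for a white-noise series, which is a flat band rather than the widening cone that plotting libraries often draw; the function name and the example correlations are hypothetical.

```python
import math

def last_significant_lag(correlations, n, z=1.96):
    """Largest lag k > 0 whose correlation lies outside +/- z / sqrt(n)."""
    bound = z / math.sqrt(n)
    significant = [
        k for k, r in enumerate(correlations) if k > 0 and abs(r) > bound
    ]
    return max(significant) if significant else 0

# Hypothetical ACF values for a series of length 400 (bound is 0.098).
example_acf = [1.0, 0.45, 0.30, 0.02, -0.01]
q_choice = last_significant_lag(example_acf, n=400)   # -> 2
```

The same helper works on partial autocorrelations when choosing the AR order.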
The partial autocorrelation function calculates the correlation between the data and a lagged version of the data (like ACF), but with the relationships with intermediate lags removed. This means that the PACF represents just the effect that one particular lag has on the correlation. This is why it tends to drop so much faster than the ACF.
Once again, we can use the confidence interval to find our ideal value of $p$.
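One standard way to obtain partial autocorrelations from ordinary autocorrelations is the Durbin-Levinson recursion. The sketch below is a minimal version of it, not necessarily the method used to produce the plots discussed here; the name `pacf_from_acf` is an assumption.

```python
def pacf_from_acf(r):
    """Partial autocorrelations from autocorrelations r[0..K], with r[0] == 1."""
    pacf = [1.0]
    prev = []   # prev[j] holds the AR coefficient for lag j + 1 at order k - 1
    for k in range(1, len(r)):
        num = r[k] - sum(prev[j] * r[k - 1 - j] for j in range(k - 1))
        den = 1.0 - sum(prev[j] * r[j + 1] for j in range(k - 1))
        phi_kk = num / den
        # Update the lower-lag coefficients for the order-k model.
        cur = [prev[j] - phi_kk * prev[k - 2 - j] for j in range(k - 1)]
        cur.append(phi_kk)
        pacf.append(phi_kk)
        prev = cur
    return pacf

# Theoretical ACF of an AR(1) process with coefficient 0.5 is 0.5**k.
theoretical_acf = [0.5 ** k for k in range(6)]
partial = pacf_from_acf(theoretical_acf)
# partial == [1.0, 0.5, 0.0, 0.0, 0.0, 0.0]: cuts off after lag p = 1
```

The ACF of this process decays geometrically and never reaches zero, while the PACF is exactly zero beyond lag 1, which is why the PACF is the tool used for choosing $p$.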
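There are two tests that can be used to check for stationarity:
ADF (augmented Dickey-Fuller):
- Null hypothesis: the series is not stationary (has a unit root)
- Alternative hypothesis: the series is stationary
KPSS: the other way round
- Null hypothesis: the series is stationary
- Alternative hypothesis: the series is not stationary (has a unit root)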
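For intuition about a unit-root test, the sketch below computes the statistic of the simple (non-augmented) Dickey-Fuller regression with a constant: the first difference is regressed on the lagged level, and the t-statistic of the slope is compared with a critical value. This is a simplification, not the full augmented test, and the 5% critical value of about -2.86 is the approximate large-sample figure for the constant-only case. All names are illustrative.

```python
import math
import random

def dickey_fuller_stat(series):
    """t-statistic for gamma in: (x_t - x_{t-1}) = a + gamma * x_{t-1} + e_t."""
    x = series[:-1]                                   # lagged levels
    y = [b - a for a, b in zip(series, series[1:])]   # first differences
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    gamma = sxy / sxx
    intercept = my - gamma * mx
    rss = sum((b - intercept - gamma * a) ** 2 for a, b in zip(x, y))
    std_err = math.sqrt(rss / (n - 2) / sxx)
    return gamma / std_err

rng = random.Random(0)
white_noise = [rng.gauss(0.0, 1.0) for _ in range(500)]   # stationary
random_walk, level = [], 0.0
for _ in range(500):                                      # has a unit root
    level += rng.gauss(0.0, 1.0)
    random_walk.append(level)

CRITICAL_5PCT = -2.86   # approximate; assumed constant-only, large sample
stat_noise = dickey_fuller_stat(white_noise)
stat_walk = dickey_fuller_stat(random_walk)
noise_looks_stationary = stat_noise < CRITICAL_5PCT   # reject the unit root
```

A statistic far below the critical value rejects the unit-root null hypothesis; for a random walk the statistic usually stays above it, so the null is not rejected.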
- Null hypothesis: residuals independently distributed
- Alternative hypothesis: residuals correlated
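One common test with these hypotheses is the Ljung-Box test, although the source does not name the test it has in mind. Its statistic is $Q = n(n+2)\sum_{k=1}^{h} \hat{\rho}_k^2 / (n-k)$, which is compared against a chi-squared distribution. The sketch below computes $Q$ and uses the tabulated 5% critical value for 10 degrees of freedom (18.307); when the input is the residuals of a fitted ARMA(p, q) model, the degrees of freedom are usually reduced by p + q, which is ignored here.

```python
def ljung_box_q(residuals, max_lag):
    """Ljung-Box Q statistic over lags 1..max_lag."""
    n = len(residuals)
    mean = sum(residuals) / n
    dev = [r - mean for r in residuals]
    denom = sum(d * d for d in dev)
    total = 0.0
    for k in range(1, max_lag + 1):
        r_k = sum(dev[t] * dev[t - k] for t in range(k, n)) / denom
        total += r_k * r_k / (n - k)
    return n * (n + 2) * total

CHI2_CRIT_10_5PCT = 18.307   # chi-squared critical value, 10 df, 5% level

# A deliberately correlated toy "residual" series: +1, -1, +1, -1, ...
alternating = [(-1.0) ** t for t in range(100)]
q_stat = ljung_box_q(alternating, max_lag=10)
reject_independence = q_stat > CHI2_CRIT_10_5PCT
```

Here the statistic is far above the critical value, so independence is rejected, as expected for a series that flips sign at every step.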