Pivot Clustering to Minimize Error in Forecasting Aggregated ARMA Demand

In this paper we compare the effects of forecasting demand using individual (disaggregated) components versus first aggregating the components either fully or into several clusters. Demand streams are assumed to follow autoregressive moving average (ARMA) processes. Using individual demand streams will always lead to a superior forecast compared to any aggregates, however we show that if several aggregated clusters are formed in a structured manner then these subaggregated clusters will lead to a forecast with minimal increase in mean-squared forecast error. We show this result based on theoretical MSFE obtained directly from the models generating the clusters as well as estimated MSFE obtained directly from simulated demand observations. We suggest a pivot-algorithm, that we call Pivot Clustering, to create these clusters. We also provide theoretical results to investigate sub-aggregation, including for special cases such as, aggregating MA(1) streams and ARMA streams with similar or same parameters.

Keywords:

Subject: Business, Economics and Management - Econometrics and Statistics

1. Introduction

Modern-day technologies not only permit firms to accurately track their point of sales data and lost sales (purchases not made by customers due to a lack of inventory) data but also gather more granular data. These data streams deluge firms with information which either can be aggregated for planning purposes or considered in its entirety or follow an in-between approach. In this paper we analyze a model in which a retailer is faced with exactly the same choices and provide guidelines for combining the data for the purpose of forecasting demand.

Consider a retailer who has access to its individual customer’s demand streams. Assume that each of these demand streams follow an ARMA model having possibly contemporaneously correlated shock sequences. The primary contribution of this research is to quantify to what extent such a retailer would benefit from forecasting each of the individual streams as opposed to the aggregate. In general, retailers forecast their aggregate demand stream since historically the retailer may only have accurate aggregate demand information and the forecasting of the individual customer demand streams is often thought of as being cumbersome and time consuming. We demonstrate that a retailer observing multiple demand streams generated by ARMA models can drastically reduce its mean squared forecasting error (MSFE) by forecasting the individual demand streams as opposed to just the aggregate demand stream as noted in [1].

Although the retailer’s MSFE is never lower when forecasting using aggregated demand compared to the individual demand streams1, there are cases when the individual streams do not reduce the MSFE. The primary situation where this occurs, as is discussed in the paper, is when various models generating the demand streams have identical ARMA parameter values. For the in-between case, our results demonstrate that the retailer’s MSFE under aggregate forecasting can be greatly reduced if the retailer forecasts various clusters of aggregated demand streams. We show by example that clustering continues to perform well in the event that ARMA models are estimated for non-ARMA data (see Figure 9).

In other words, retailers can make use of data mining and other clustering approaches in order to generate clusters of similar customers and their demand streams. We show that such clustering methods significantly enhance forecast accuracy. This is related to the study of telephone data in [2] where the authors concluded that subaggregated data can be effective for improving forecast accuracy compared to aggregate data. They also note that data is often aggregated to the level that forecasts are required. We describe here ways in which subaggregated clusters can be selected to minimize forecast error. Several examples are mentioned in [3] in the context of assigning demand allocation to different facilities.

Many researchers and practitioners focus on the need to determine clusters of similarly situated customers in order to create and provide customized and/or personalized products. This type of clustering is generally performed on specific characteristics that customers possess (see for example, [4] and the references within). On the other hand, our focus is on forecasting demand for a product by a firm’s customers, recognizing that these customers may have different preferences and hence differing demand. As we describe below, from the demand forecasting perspective, the information contained within the individual demand streams provides the optimal forecast (in terms of minimizing the MSFE and hence inventory related costs) for product demand. Nonetheless, there has been research on the use of clustering methods within a forecasting environment when customer demand data is high dimensional (see for example, [5]).

As opposed to generating clusters based upon customer preferences and customer demographics, we explain how clusters can be generated explicitly from the individual time series structure of the individual demand streams or customers. Even though, it is always optimal from a forecasting perspective to use the individual streams, clusters of similar customer streams may be very helpful to the firm for other reasons as described above. Future empirical work would be necessary to investigate to what extent and in what contexts, clusters generated based upon time series structure of the demand streams correlate to clusters based upon other customer preferences. In such a case where there exists such a relationship, firms could use clustering for simplifying their demand forecasting while identifying groups of customers to receive personalized products.

The purpose of this paper is to demonstrate that clustering based upon time series structure can be utilized within demand forecasting that is superior to forecasting aggregate demand and nearly as good as forecasting the individual demand streams. The structure of our paper is as follows. In the next section, we describe the demand framework and supply chain setting of our research problem, as well as the way that theoretical MSFE computations are determined for the various forecasts (using aggregated demand processes) included herein. In Section 3, we illustrate (through example) that there exists a particular set of subaggregated clusters which results in an MSFE that is close to the MSFE obtained from using disaggregated streams and much lower than the MSFE obtained from the fully aggregated sequence. In Section 4, we describe how to cluster demand streams generated by ARMA models using Pivot Clustering and how this compares to other clustering methods. In Section 5, we describe an objective function that can be minimized to obtain an optimal assignment of streams to clusters in terms of MSFE reduction. Finally, we obtain theoretical results on how demand streams produced by MA(1) models can be clustered in the most efficient way possible to reduce the resulting subaggregated MSFE in Section 6.

2. Model Framework

We consider a retailer with possibly many large customers. In general, retailers forecast their aggregate demand stream since the forecasting of the individual customer demand streams is often thought of as being cumbersome and time consuming. We demonstrate that a retailer observing multiple streams of demand generated by ARMA models can drastically reduce its MSFE by forecasting the individual demand streams or aggregated clusters of similar demand streams. We limit the focus of this paper on ARMA models that describe stationary demand in order to keep the exposition as clear as possible. If we were to consider ARIMA (or Seasonal ARIMA) models then differencing (or seasonal differencing) would need to be carried out on the data to apply the methodology discussed here. We further note that even simple ARMA(1,1) models appearing in Section 4 can have coefficients that produce seasonal patterns in demand realizations.

Hence, consider a retailer that observes multiple demand streams for a single product

{X_{1, t}}, {X_{2, t}}, \dots, {X_{N, t}}

. Each demand stream

{X_{k}}

is assumed to be generated by an ARMA model with respect to a sequence of shocks

{ϵ_{k, t}}

given by

Φ_{k} (B) X_{k, t} = Θ_{k} (B) ϵ_{k, t}

(1)

where

Φ_{k} (z) = 1 + Φ_{k, 1} z + \dots + Φ_{k, p_{k}} z^{p_{k}}

and

Θ_{k} (z) = 1 + Θ_{k, 1} z + \dots + Θ_{k, q_{k}} z^{q_{k}}

, such that

{X_{k, t}}

is invertible and causal with respect to

{ϵ_{k, t}}

(see Brockwell and Davis, page 77 for a definition and discussion about causality and invertibility). We denote the variance of each shock sequence

σ_{k}^{2} = E [ϵ_{k}^{2}]

. Furthermore we note that the shock sequences are potentially contemporaneously correlated with

σ_{i j} = E [ϵ_{i, t} ϵ_{j, t}]

. In general, this set up guarantees that the shocks

ϵ_{k, t}

are the retailer’s Wold shocks (see [6] pp 187-188 for a description of a Wold decomposition of a time series) and that the MSFE of one-step-ahead leadtime demand (when using the disaggregated (individual) streams) is the sum of the elements in the covariance matrix

Σ_{ϵ}

where

Σ_{i j} = σ_{i j}

such that

σ_{k}^{2} = σ_{k k}

(see Equation (7).

The focus of this paper is evaluating the difference in one-step-ahead MSFEs at time t when the forecast of leadtime demand, given by

\sum_{i = 1}^{ℓ + 1} (X_{1, t + i} + X_{2, t + i} + \dots + X_{N, t + i})

, is based on the different series described below where

C_{k, τ} = X_{1, τ}^{C_{k}} + \dots + X_{n_{C_{k}}, τ}^{C_{k}}

. Studying one-step-ahead MSFEs is mathematically simpler than those for general leadtimes since the former does not depend on model parameters.

Disaggregated (individual) sequences	${X_{1, τ}}_{τ = - \infty}^{t}, {X_{2, τ}}_{τ = - \infty}^{t}, \dots, {X_{N, τ}}_{τ = - \infty}^{t}$
Subaggregated (clustered) sequences	${C_{1, τ}}_{τ = - \infty}^{t}, {C_{2, τ}}_{τ = - \infty}^{t}, \dots, {C_{n, τ}}_{τ = - \infty}^{t}$
Aggregated (full) sequence	${D_{τ} = X_{1, τ} + X_{2, τ} + \dots + X_{N, τ}}_{τ = - \infty}^{t}$

Our problem is related to the one posed by [7] where a two-stage supply chain was considered with the retailer observing two demand streams. The focus of that paper was in evaluating information sharing between the retailer and supplier in a situation where the retailer forecasts each demand stream separately. Here we show the benefit to the retailer2 in determining the separate forecasts, while considering the existence of (possibly) more than two demand streams. Kohn [1] was the first to identify conditions under which using the individual demand streams leads to a better forecast than using the aggregated sequence, however he did not determine the MSFE in the two cases. The same conditions can be used to show that if streams are subaggregated into clusters where optimal clusters are always used, then the MSFE of the forecast decreases as the number of clusters increases. Our aim is to determine the optimal cluster assignment based on a predetermined choice of the number of clusters. The number of clusters can be based on the level of detailed data available to a firm or based on the tradeoff from lowering MSFE and increasing the complexity of the model when increasing the amount of clusters used.

In this paper we extend the results of [7] by providing formulas for computing MSFEs under the possibility of more than two streams and a general leadtime in the situation where a player’s (retailer’s) forecast can only be based on their Wold shocks. We also describe how a retailer would forecast its demand by identifying clusters of similar demand streams and then forecasting each cluster after it is aggregated. We provide a pivot algorithm, which we call Pivot Clustering, for identifying a locally optimal assignment of streams into a fixed number of clusters. The algorithm will often find the best possible assignment. We also describe a fast clustering algorithm which results in a globally optimal assignment when demand streams are generated by independent MA(1) models. Our results show that after the algorithms are carried out, much of the benefit of forecasting individual streams can be obtained by forecasting the aggregated clusters.

2.1. Forecasting Using the Disaggregated (Individual) Sequences ${X_{1, τ}}_{τ = - \infty}^{t}, \dots, {X_{n, τ}}_{τ = - \infty}^{t}$

In this section we are interested in forecasting

\sum_{i = 1}^{ℓ + 1} (X_{1, t + i} + X_{2, t + i} + \dots + X_{N, t + i})

when the forecast is based on the diaggregated individual sequences as discussed in the previous section. The forecasting contained here is a textbook multivariate time-series result along with the propagation described in [8] as seen in (5). Thus we consider processes

{X_{k, t}}, \dots, {X_{N, t}}

such that

X_{k, t}

generated by ARMA models given by

Φ_{k} (B) X_{k, t} = Θ_{k} (B) ϵ_{k, t}

(2)

with

σ_{j k} = E [ϵ_{j, t} ϵ_{k, t}]

In this case the disaggregated MSFE given by

M S F E_{i n d} = E [{(\sum_{k = 1}^{N} \sum_{i = 1}^{ℓ + 1} X_{k, t + i} - \sum_{k = 1}^{N} \hat{\sum_{i = 1}^{ℓ + 1} X_{k, t + i}})}^{2}]

(3)

where

\hat{\sum_{i = 1}^{ℓ + 1} X_{k, t + i}}

is the best linear forecast of leadtime demand at time t for stream

{X_{k, t}}

based on (2). We note that

E [(\sum_{i = 1}^{ℓ + 1} X_{j, t + i} - \hat{\sum_{i = 1}^{ℓ + 1} X_{j, t + i}}) (\sum_{i = 1}^{ℓ + 1} X_{k, t + i} - \hat{\sum_{i = 1}^{ℓ + 1} X_{k, t + i}})] = σ_{j k} \sum_{i = 0}^{ℓ} ω_{j, i} ω_{k, i}

(4)

such that

ω_{k, i} = \{\begin{matrix} 0 & i < 0 \\ ψ_{k, i} & i = 0 \\ ω_{k, i - 1} + ψ_{k, i} & 0 < i < ℓ + 1 \\ ω_{k, i - 1} + ψ_{k, i} - ψ_{k, i - ℓ - 1} & i \geq ℓ + 1 . \end{matrix}

(5)

with

ψ_{k, i}

is the

i^{t h}

coefficient appearing in the MA(∞) representation of

{X_{k, t}}

with respect to

{ϵ_{k, t}}

(see [8] for details). That is

1 + ψ_{k, 1} z + ψ_{k, 2} z^{2} + \dots = Ψ_{k} (z)

and

X_{k, t} = Ψ_{k} (B) ϵ_{k, t}

(6)

such that

Ψ_{k} (z) = \frac{Θ_{k} (z)}{Φ_{k} (z)}

Thus the MSFE of the best linear forecast (BLF) of leadtime demand when using the individual sequences

{X_{1, τ}}_{τ = - \infty}^{t}, {X_{2, τ}}_{τ = - \infty}^{t}, \dots, {X_{N, τ}}_{τ = - \infty}^{t}

is given by

M S F E i n d = \sum_{k = 1}^{N} \sum_{j = 1}^{N} \sum_{i = 0}^{ℓ} σ_{k j} ω_{j, i} ω_{k, i} .

(7)

We note that the one-step-ahead (

ℓ = 0

) disaggregated MSFE is given by

\sum_{j = 1}^{N} \sum_{k = 1}^{N} σ_{j k}

. Thus we can compare (7) with (26) below to determine the reduction in MSFE when using the individual sequences as opposed to the aggregated sequence.

2.2. ARMA Representation of a Summed Sequence ${S_{τ} = X_{1, τ} + \dots + X_{s, τ}}_{τ = - \infty}^{t}$

In this subsection we determine the ARMA representation of a series

{S_{t}}

with respect to a series of Wold shocks, where

{S_{t}}

is the sum of several ARMA-generated streams given by

{S_{t}} = {X_{1, t} + X_{2, t} + \dots + X_{s, t}}

. Furthermore we determine the variance of the Wold shocks appearing in this representation. This will allow us to determine the MSFE when forecasts are based on the fully aggregated series as well as when the forecasts are based on subaggregated clusters. In order to obtain the ARMA representation we first need to obtain the spectral density

f_{S} (λ)

and the covariance generating function

G_{S} (z)

{S_{t}}

Proposition 1.

Let

{S_{t}} = {X_{1, t} + X_{2, t} + \dots + X_{s, t}}

. The spectral density of

{S_{t}}

is given by

f_{S} (λ) = \sum_{i = 1}^{s} f_{X_{i}} (λ) + \sum_{i = 1}^{s - 1} \sum_{j = i + 1}^{s} (f_{X_{i} X_{j}} (λ) + {\bar{f}}_{X_{i} X_{j}} (λ))

(8)

where the cross-spectrum

f_{X_{i} X_{j}} (λ)

is defined

f_{X_{i}, X_{j}} (λ) = \frac{1}{2 π} \sum_{r = - \infty}^{\infty} e^{- i λ r} C_{X_{i} X_{j}} (r)

with

C_{X_{i} X_{j}} (r) = E [X_{i, t + r} X_{j, t}]

and

{\bar{f}}_{X_{i} X_{j}} (λ) = \frac{1}{2 π} \sum_{r = - \infty}^{\infty} e^{i λ r} C_{X_{i} X_{j}} (r)

Proof. We will prove this by induction. Note that when

{S_{t}} = {X_{1, t} + X_{2, t}}

f_{S} (λ)

is given by

\begin{matrix} f_{S} (λ) & = & \frac{1}{2 π} \sum_{r = - \infty}^{\infty} e^{- i λ r} E [(X_{1, t + r} + X_{2, t + r}) (X_{1, t} + X_{2, t})] \\ = & \frac{1}{2 π} \sum_{r = - \infty}^{\infty} e^{- i λ r} E [X_{1, t + r} X_{1, t}] + \frac{1}{2 π} \sum_{r = - \infty}^{\infty} e^{- i λ r} E [X_{2, t + r} X_{2, t}] \end{matrix}

(9)

\begin{matrix} + & \frac{1}{2 π} \sum_{r = - \infty}^{\infty} e^{- i λ r} E [X_{1, t + r} X_{2, t}] + \frac{1}{2 π} \sum_{r = - \infty}^{\infty} e^{- i λ r} E [X_{2, t + r} X_{1, t}] . \end{matrix}

(10)

Noting that

\sum_{r = - \infty}^{\infty} e^{- i λ r} E [X_{2, t + r} X_{1, t}] = \sum_{r = - \infty}^{\infty} e^{i λ r} E [X_{2, t} X_{1, t + r}] = {\bar{f}}_{X_{1} X_{2}} (λ)

we see that

f_{S} (λ) = f_{X_{1}} (λ) + f_{X_{2}} (λ) + f_{X_{1} X_{2}} (λ) + {\bar{f}}_{X_{1} X_{2}} (λ)

which matches representation (8).

Now suppose (8) holds for

{S_{n, t}} = {X_{1, t} + \dots + X_{n, t}}

processes. Thus

f_{S_{n}} (λ) = \sum_{i = 1}^{n} f_{X_{i}} (λ) + \sum_{i = 1}^{n - 1} \sum_{j = i + 1}^{n} (f_{X_{i} X_{j}} (λ) + {\bar{f}}_{X_{i} X_{j}} (λ))

(11)

Consider

{S_{n + 1}} = {S_{n, t} + X_{n + 1, t}}

. Since

{S_{n, t}}

follows and ARMA model, we observe that

f_{S_{n + 1}} (λ) = f_{S_{n}} (λ) + f_{X_{n + 1}} (λ) + f_{S_{n}, X_{n + 1}} (λ) + {\bar{f}}_{S_{n}, X_{n + 1}} (λ)

(12)

Note that starting with the definition of

f_{S_{n}, X_{n + 1}} (λ)

\begin{matrix} f_{S_{n}, X_{n + 1}} (λ) & = & \frac{1}{2 π} \sum_{r = - \infty}^{\infty} e^{- i λ r} E [S_{n, t} X_{n + 1, t + r}] \end{matrix}

(13)

\begin{matrix} = & \frac{1}{2 π} \sum_{r = - \infty}^{\infty} e^{- i λ r} E [X_{1, t} X_{n + 1, t + r} + \dots + X_{n, t} X_{n + 1, t + r}] \end{matrix}

(14)

\begin{matrix} = & \frac{1}{2 π} \sum_{r = - \infty}^{\infty} e^{- i λ r} E [X_{1, t} X_{n + 1, t + r}] + \dots + \frac{1}{2 π} \sum_{r = - \infty}^{\infty} e^{- i λ r} E [X_{n, t} X_{n + 1, t + r}] \end{matrix}

(15)

\begin{matrix} = & f_{X_{1} X_{n + 1}} (λ) + \dots + f_{X_{n} X_{n + 1}} (λ) \end{matrix}

(16)

Similarly,

{\bar{f}}_{S_{n}, X_{n + 1}} (λ) = {\bar{f}}_{X_{1} X_{n + 1}} (λ) + \dots + {\bar{f}}_{X_{n} X_{n + 1}} (λ)

Thus from (12),

\begin{matrix} f_{S_{n + 1}} (λ) & = & f_{S_{n}} (λ) + f_{X_{n + 1}} (λ) + f_{X_{1} X_{n + 1}} (λ) + \dots \\ + & f_{X_{n} X_{n + 1}} (λ) + {\bar{f}}_{X_{1} X_{n + 1}} (λ) + \dots + {\bar{f}}_{X_{n} X_{n + 1}} (λ) \end{matrix}

(17)

or equivalently

\begin{matrix} f_{S_{n + 1}} (λ) = \sum_{i = 1}^{n} f_{X_{i}} (λ) & + & \sum_{i = 1}^{n - 1} \sum_{j = i + 1}^{n} (f_{X_{i} X_{j}} (λ) + {\bar{f}}_{X_{i} X_{j}} (λ)) \\ + & f_{X_{n + 1}} (λ) + f_{X_{1} X_{n + 1}} (λ) + \dots \\ + & f_{X_{n} X_{n + 1}} (λ) + {\bar{f}}_{X_{1} X_{n + 1}} (λ) + \dots + {\bar{f}}_{X_{n} X_{n + 1}} (λ) \end{matrix}

which can be simply written as

f_{S_{n + 1}} (λ) = \sum_{i = 1}^{n + 1} f_{X_{i}} + \sum_{i = 1}^{n} \sum_{j = i + 1}^{n + 1} (f_{X_{i} X_{j}} (λ) + {\bar{f}}_{X_{i} X_{j}} (λ))

(18)

and the result is proved. □

Now consider the covariance generating function

G_{S} (z) = \sum_{j = - \infty}^{\infty} E [S_{t} S_{t - j}] z^{j}

{S_{t}}

. Here we use the equivalence

G_{S} (e^{- i λ}) = 2 π f_{S} (λ)

and note the following:

\begin{matrix} f_{X_{i}} (λ) & = & \frac{σ_{i}^{2}}{2 π} \frac{| Θ_{i} (e^{- i λ}) |^{2}}{| Φ_{i} (e^{- i λ}) |^{2}} \end{matrix}

(19)

\begin{matrix} f_{X_{i} X_{j}} (λ) & = & \frac{σ_{i j}}{2 π} \frac{Θ_{i} (e^{- i λ})}{Φ_{i} (e^{- i λ})} \frac{Θ_{j} (e^{i λ})}{Φ_{j} (e^{i λ})} \end{matrix}

(20)

\begin{matrix} {\bar{f}}_{X_{i} X_{j}} (λ) & = & \frac{σ_{i j}}{2 π} \frac{Θ_{i} (e^{i λ})}{Φ_{i} (e^{i λ})} \frac{Θ_{j} (e^{- i λ})}{Φ_{j} (e^{- i λ})} \end{matrix}

(21)

Thus from Proposition 1 we observe that

G_{S} (z) = \sum_{i = 1}^{s} σ_{i}^{2} \frac{Θ_{i} (z) Θ_{i} (z^{- 1})}{Φ_{i} (z) Φ_{i} (z^{- 1})} + \sum_{i = 1}^{s - 1} \sum_{j = i + 1}^{s} (σ_{i j} \frac{Θ_{i} (z)}{Φ_{i} (z)} \frac{Θ_{j} (z^{- 1})}{Φ_{j} (z^{- 1})} + σ_{i j} \frac{Θ_{i} (z^{- 1})}{Φ_{i} (z^{- 1})} \frac{Θ_{j} (z)}{Φ_{j} (z)})

(22)

As described in Theorem 5 of [7], the covariance generating function

G_{S} (z)

can be factorized as the ratio

\frac{O (z) P (z)}{Q (z)}

where

O (z)

P (z)

and

Q (z)

are Laurent polynomials, with

O (z)

having all its roots on the unit circle and

P (z)

and

Q (z)

having no roots on the unit circle. This result follows from the fact that each additive term in (77) is a ratio of Laurent Polynomials and the fact that for any Laurent polynomial

P (z)

, both

P (z) P (z^{- 1})

and

P (z) + P (z^{- 1})

will be Laurent polynomials. Furthermore if

P_{1} (z)

and

P_{2} (z)

are Laurent polynomials then

P_{1} (z) P_{2} (z)

and

P_{1} (z) + P_{2} (z)

will be Laurent polynomials as well.

We can now use the factorization provided in [9] and described in [7] to obtain the ARMA representation of

{S_{t}}

with respect to the Wold shocks

{ϵ_{t}}

(appearing in its Wold representation). It should be noted that Remark 1 is simply a restatement of Theorem 5 of [7] with the slight addition of determining the polynomials appearing in the ARMA representation of the aggregate sequence

{S_{t}}

Remark 1.

{S_{t}}

can be represented with respect to shocks

{ϵ_{t}}

using the ARMA model

Φ (B) S_{t} = Θ (B) ϵ_{t}

(23)

where

Θ (z) = \prod_{i = 1}^{m} (1 - a_{i} z)

where

{a_{i}}

are the roots of

O (z) P (z)

on or inside the unit circle and

Φ (z) = \prod_{i = 1}^{n} (1 - b_{i} z)

where

{b_{i}}

are the roots of

Q (z)

inside the unit circle. Furthermore

σ_{ϵ}^{2} = E [ϵ_{t}^{2}] = \frac{p_{m} \prod_{j = 1}^{m} (- 1 / a_{j})}{q_{n} \prod_{j = 1}^{n} (- 1 / b_{j})}

(24)

where

p_{m}

is the coefficient of

z^{m}

P (z)

and

q_{n}

is the coefficient of

z^{n}

Q (z)

2.3. Forecasting Using the Fully Aggregated Sequence ${D_{τ}}_{τ = - \infty}^{t}$ For A General Leadtime

We note that Remark 1 can be used to obtain the ARMA representation of

{D_{t}}

with respect to its Wold shocks

{ϵ_{t}}

as well as

σ_{ϵ}^{2} = E [ϵ_{t}^{2}]

. We can therefore use Lemma 1 of [8] and its proof to determine the BLF of

(\sum_{k = 1}^{ℓ + 1} D_{t + k})

and its MSFE when forecasting using the infinite past of

{D_{t}}

up to time t, namely

{D_{τ}}_{τ = - \infty}^{t}

. That is, if we consider the MA(∞) representation of

{D_{t}}

with respect to

{ϵ_{t}}

given by

D_{t} = Ψ (B) ϵ_{t}

(25)

where

Ψ (z) = 1 + Ψ_{1} z + Ψ_{2} z^{2} + \dots

, then the MSFE of the BLF when using

{D_{τ}}_{τ = - \infty}^{t}

M S F E a g g = σ_{ϵ}^{2} \sum_{i = 0}^{ℓ} ω_{i}^{2}

(26)

where

ω_{i} = \{\begin{matrix} 0 & i < 0 \\ ψ_{i} & i = 0 \\ ω_{i - 1} + ψ_{i} & 0 < i < ℓ + 1 \\ ω_{i - 1} + ψ_{i} - ψ_{i - ℓ - 1} & i \geq ℓ + 1 . \end{matrix}

(27)

We note that

Ψ (z) = \frac{Θ (z)}{Φ (z)}

2.4. Forecasting Using Subaggregated Sequences

In this section we use the results in the previous section to create a forecast and compute its MSFE when some of the individual streams are subaggregated. That is, for

k \in 1 \dots n

, let cluster

C_{k}

consist of

n_{C_{k}}

streams such that

{C_{k, t} = X_{1, t}^{C_{k}} + \dots + X_{n_{C_{k}}, t}^{C_{k}}}

. We are interested in the forecast and MSFE based on

{C_{1, τ}}_{τ = - \infty}^{t}, \dots, {C_{n, τ}}_{τ = - \infty}^{t}

Section 2.2 describes how we can obtain the ARMA representation and variance of the Wold shocks appearing in the Wold representation for each sequence

{C_{k, t}}

. This can then be used to create a forecast from each subaggregated sequence, the sum of which can be taken as the forecast for

D_{t + ℓ + 1}

. The one-step ahead MSFE of this forecast is the sum of the entries of the covariance matrix of the Wold shocks appearing in the Wold representation of

{C_{1, t}, \dots, C_{n, t}}

. Equation (7) describes how we can also use the covariance matrix of the shocks of

{C_{1, t}, \dots, C_{n, t}}

to obtain the MSFE for multi-step ahead forecasts. The remainder of this section will focus on obtaining this covariance matrix.

Without loss of generality, consider two subaggregated series

{C_{1, t} = X_{1, t} + \dots + X_{a, t}}

and

{C_{2, t} = X_{a + 1, t} + \dots + X_{b, t}}

with ARMA representations

\begin{matrix} ϕ_{1}^{★} (B) C_{1, t} & = & θ_{1}^{★} (B) ϵ_{1, t}^{★} \end{matrix}

(28)

\begin{matrix} ϕ_{2}^{★} (B) C_{2, t} & = & θ_{2}^{★} (B) ϵ_{2, t}^{★} \end{matrix}

(29)

where

{ϵ_{1, t}^{★}}

and

{ϵ_{2, t}^{★}}

are the shocks appearing in the Wold representation of

{C_{1, t}}

and

{C_{2, t}}

respectively. We note that the variances of

{ϵ_{1, t}^{★}}

and

{ϵ_{2, t}^{★}}

can be obtained using Remark 1. To obtain the covariance

σ_{12}^{★} = E [ϵ_{1, t}^{★} ϵ_{2, t}^{★}]

consider the following.

We can rewrite the ARMA representations above as

ϵ_{1, t}^{★} = \frac{ϕ_{1}^{★} (B)}{θ_{1}^{★} (B)} C_{1, t}

(30)

ϵ_{2, t}^{★} = \frac{ϕ_{2}^{★} (B)}{θ_{2}^{★} (B)} C_{2, t}

(31)

The ARMA representations of

{X_{1, t}}, \dots, {X_{b, t}}

can also be rewritten as

X_{1, t} = \frac{Θ_{1} (B)}{Φ_{1} (B)} ϵ_{1, t} \dots X_{b, t} = \frac{Θ_{b} (B)}{Φ_{b} (B)} ϵ_{b, t}

(32)

Based on the definition of

C_{1, t} = X_{1, t} + \dots + X_{a, t}

and

C_{2, t} = X_{a + 1, t} + \dots + X_{b, t}

we observe that

ϵ_{1, t}^{★} = \frac{ϕ_{1}^{★} (B)}{θ_{1}^{★} (B)} [\frac{Θ_{1} (B)}{Φ_{1} (B)} ϵ_{1, t} + \dots + \frac{Θ_{a} (B)}{Φ_{a} (B)} ϵ_{a, t}]

(33)

ϵ_{2, t}^{★} = \frac{ϕ_{2}^{★} (B)}{θ_{2}^{★} (B)} [\frac{Θ_{a + 1} (B)}{Φ_{a + 1} (B)} ϵ_{a + 1, t} + \dots + \frac{Θ_{b} (B)}{Φ_{b} (B)} ϵ_{b, t}]

(34)

To obtain

E [ϵ_{1, t}^{★} ϵ_{2, t}^{★}]

we need to compute the expectation of the the product of the right-hand-sides of these two equations. Thus we need to consider the sum of terms such as

E [\frac{ϕ_{1}^{★} (B)}{θ_{1}^{★} (B)} \frac{Θ_{i} (B)}{Φ_{i} (B)} ϵ_{i, t} \frac{ϕ_{2}^{★} (B)}{θ_{2}^{★} (B)} \frac{Θ_{j} (B)}{Φ_{j} (B)} ϵ_{j, t}] .

(35)

Note that we can write

\frac{ϕ_{1}^{★} (B)}{θ_{1}^{★} (B)} \frac{Θ_{i} (B)}{Φ_{i} (B)} ϵ_{i, t} = \sum_{k = 0}^{\infty} {\tilde{ψ}}_{i, k} ϵ_{i, t - k}

(36)

and

\frac{ϕ_{2}^{★} (B)}{θ_{2}^{★} (B)} \frac{Θ_{j} (B)}{Φ_{j} (B)} ϵ_{j, t} = \sum_{k = 0}^{\infty} {\tilde{ψ}}_{j, k} ϵ_{j, t - k}

(37)

where

{\tilde{ψ}}_{i, k}

and

{\tilde{ψ}}_{j, k}

can be obtained in the same way as the MA(∞) coefficients in (27). Hence the term in Equation (35) can be rewritten as

E [\sum_{k = 0}^{\infty} {\tilde{ψ}}_{i, k} ϵ_{i, t - k} \sum_{k = 0}^{\infty} {\tilde{ψ}}_{j, k} ϵ_{j, t - k}]

(38)

Since the shock sequences are not correlated across time by assumption, this is equivalent to

E [\sum_{k = 0}^{\infty} {\tilde{ψ}}_{i, k} {\tilde{ψ}}_{j, k} ϵ_{i, t - k} ϵ_{j, t - k}]

(39)

or equivalently

\sum_{k = 0}^{\infty} {\tilde{ψ}}_{i, k} {\tilde{ψ}}_{j, k} σ_{i j}

(40)

Adding up these terms as required would yield the covariance. Hence

σ_{12}^{★} = E [ϵ_{1, t}^{★} ϵ_{2, t}^{★}] = \sum_{i = 1}^{a} \sum_{j = a + 1}^{b} \sum_{k = 0}^{\infty} {\tilde{ψ}}_{i, k} {\tilde{ψ}}_{j, k} σ_{i j}

(41)

We note that this methodology can easily be extended to obtain any of the covariances in the covariance matrix

Σ_{ϵ}^{★} = (\begin{matrix} σ_{11}^{★} & \dots & σ_{1 n}^{★} \\ σ_{21}^{★} & ⋱ & σ_{2 n}^{★} \\ ⋮ & ⋮ \\ σ_{n 1}^{★} & \dots & σ_{n n}^{★} \end{matrix})

(42)

We will use the previous methodology for all theoretical MSFE computations found in this paper. In the next subsection, we provide an example describing the importance of forecasting leadtime demand based upon the individual sequences.

2.5. Example

Consider a retailer that observes Aggregate demand

{D_{t} = X_{1, t} + X_{2, t} + X_{3, t}}

where each individual demand stream is generated by one of the following ARMA models:

\begin{matrix} X_{1, t} & = & (1 - . 9 B) ϵ_{1, t} \end{matrix}

(43)

\begin{matrix} X_{2, t} & = & (1 + . 9 B) ϵ_{2, t} \end{matrix}

(44)

\begin{matrix} X_{3, t} & = & (1 + . 9 B) ϵ_{3, t} \end{matrix}

(45)

where the shock covariance matrix is given by

Σ = (\begin{matrix} 1.6 & - 1.4 & 0.5 \\ - 1.4 & 1.3 & - 0.8 \\ 0.5 & - . 8 & 2.0 \end{matrix})

We use the results described in Section 2.1 and Section 2.2 to compare the MSFE of the forecasts of leadtime demand

\sum_{j = 1}^{ℓ + 1} D_{t + j}

when using the individual sequences

{X_{1, t}}_{τ = - \infty}^{t}, {X_{2, t}}_{τ = - \infty}^{t}, {X_{3, t}}_{τ = - \infty}^{t}

versus when using the aggregate sequence

{D_{τ}}_{τ = - \infty}^{t}

The covariance generating function of

{D_{t}}

is given by

G_{S} (z) = \frac{0.090 z^{- 1} + 5.631 + 0.090 z}{1}

(46)

and the ARMA representation of

{D_{t}}

is given by

D_{t} = (1 + . 01598704 B) ϵ_{t}

(47)

where

σ_{ϵ}^{2} = 5.629561

. We can check that the ARMA representation of

{D_{t}}

and its covariance generating function match up by noting that

G_{S} (z) = σ_{ϵ}^{2} \frac{Θ (z) Θ (z^{- 1})}{Φ (z) Φ (z^{- 1})} = 5.629561 (1 + . 01598704 z) (1 + . 01598704 z^{- 1}) = 0.090 z^{- 1} + 5.631 + 0.090 z

(48)

with

Θ (z) = (1 + . 01598704 z)

and

Φ (z) = 1

being the MA and AR polynomials appearing in the ARMA representation of

{D_{t}}

as defined in Remark 1.

We note that the one-step ahead forecast then has a MSFE=5.629561 and a two-step ahead forecast (of

D_{t + 1} + D_{t + 2}

) has a MSFE=11.44056.

Using equation (7) and noting that

ω_{k, 0} = ψ_{k, 0} = 1

in equation (5) we can obtain the MSFE of the forecast based on the individual demand streams. In this case the MSFE is 1.5, which is 4.129561 (73.3%) lower that the forecast error when using the aggregated series. We can likewise compute the elements

ω_{k, i}

when forecasting

D_{t + 1} + D_{t + 2}

. In this case,

ω_{k, 0} = ψ_{k, 0} = 1

and

ω_{1, 1} = . 1

ω_{2, 1} = 1.9

, and

ω_{3, 1} = 1.9

. From (7) this implies that the MSFE=7.311, which is also 4.129561 (36.1%) lower3 than the forecast error when using the aggregated series. In the next section we demonstrate, with example, the benefit of forecasting using subaggregated clusters (which would have been identified using a suitable technique).

3. The Benefit of Forecasting Using Subaggregated Clusters

In this section, we consider a retailer that has ten demand streams, which aggregate into three clusters consisting of similar streams. The models generating these streams are specifically chosen to provide a clear separation between "good" clusters leading to low MSFE and "bad" clusters leading to high MSFE. Intuition gleaned from Section 6 and Section 7 hint that streams generated from ARMA models with similar coefficients should be clustered together. Hence for our example we consider three groups of models having similar sets of coefficients within each group. Later, in Section 4, we will randomly assign coefficients to streams and still observe a sharp drop in MSFE when streams are clustered to minimize MSFE.

Suppose the retailer observes Aggregate demand

{D_{t} = X_{1, t} + X_{2, t} + \dots + X_{10, t}}

where each individual demand stream is generated by one of the following ARMA processes:

\begin{matrix} (1 - . 3 B - . 6 B^{2}) X_{1, t} & = & (1 - . 6 B - . 2 B^{2}) ϵ_{1, t} \end{matrix}

(49)

\begin{matrix} (1 - . 35 B - . 5 B^{2}) X_{2, t} & = & (1 - . 65 B - . 15 B^{2}) ϵ_{2, t} \end{matrix}

(50)

\begin{matrix} (1 - . 27 B - . 55 B^{2}) X_{3, t} & = & (1 - . 63 B - . 17 B^{2}) ϵ_{3, t} \end{matrix}

(51)

\begin{matrix} (1 - . 8 B) X_{4, t} & = & ϵ_{4, t} \end{matrix}

(52)

\begin{matrix} (1 - . 9 B) X_{5, t} & = & ϵ_{5, t} \end{matrix}

(53)

(54)

\begin{matrix} (1 - . 75 B) X_{6, t} & = & ϵ_{6, t} \end{matrix}

(55)

\begin{matrix} (1 + . 77 B) X_{7, t} & = & (1 + . 6 B) ϵ_{7, t} \end{matrix}

(56)

\begin{matrix} (1 + . 68 B) X_{8, t} & = & (1 + . 55 B) ϵ_{8, t} \end{matrix}

(57)

\begin{matrix} (1 + . 73 B) X_{9, t} & = & (1 + . 52 B) ϵ_{9, t} \end{matrix}

(58)

\begin{matrix} (1 + . 7 B) X_{10, t} & = & (1 + . 5 B) ϵ_{10, t} \end{matrix}

(59)

where the shock covariance matrix is given by

Σ = (\begin{matrix} 2 & 1 & 0.8 & - 0.9 & - 1.2 & - 1.5 & 0.8 & 0.9 & 0.95 & 1 \\ 1 & 2.1 & 0.7 & - 0.6 & - 0.5 & - 0.4 & 0.21 & 0.31 & 0.36 & 0.39 \\ 0.8 & 0.7 & 2.2 & - 0.5 & - 1.3 & - 1 & 0.4 & 0.8 & 1 & 1.1 \\ - 0.9 & - 0.6 & - 0.5 & 3 & 1.8 & 1.9 & - 2 & - 2.1 & - 2.2 & - 2.3 \\ - 1.2 & - 0.5 & - 1.3 & 1.8 & 3.2 & 2 & - 1.9 & - 1.8 & - 1.7 & - 1.5 \\ - 1.5 & - 0.4 & - 1 & 1.9 & 2 & 3.3 & - 2.2 & - 2.3 & - 2.4 & - 2.5 \\ 0.8 & 0.21 & 0.4 & - 2 & - 1.9 & - 2.2 & 5 & 1 & 1.25 & 1.5 \\ 0.9 & 0.31 & 0.8 & - 2.1 & - 1.8 & - 2.3 & 1 & 5.1 & 1.3 & 1.6 \\ 0.95 & 0.36 & 1 & - 2.2 & - 1.7 & - 2.4 & 1.25 & 1.3 & 5.7 & 1.8 \\ 1 & 0.39 & 1.1 & - 2.3 & - 1.5 & - 2.5 & 1.5 & 1.6 & 1.8 & 5.9 \end{matrix})

(60)

Consider the three natural clusters in the above ten demand streams, namely, streams 1-3, 4-6, and 7-10. It can be shown that indeed this grouping is optimal out of any other possible choice of three clusters simply by looking at all possible combinations and computing their MSFEs. Our analysis of the MA(1) case in Section 6 also points to these being the correct clusters based on the proximity of the ARMA coefficients of the models generating the demand streams. We demonstrate that even though the best the retailer can do in such a situation is forecast all ten streams individually, if the retailer would correctly cluster the demand streams as mentioned, the results are similar. All MSFEs stated in this section are obtained using the methods described in Section 2.

Specifically in this case if the retailer were to forecast the individual streams, its one-step ahead theoretical MSFE would be 21.64. If the retailer were to consider the subaggregated processes consisting of the correct clusters and forecast these separately then the MSFE would be 21.74. This is in stark contrast to a MSFE of 61.39 when forecasting using the aggregate process of all ten streams. In other words, it is sufficient to determine clusters of similar customers as opposed to forecasting each individual stream in order to keep inventory related costs down.

To see the impact of choosing the correct clusters we consider the case that the retailer incorrectly clusters the streams as 1-2, 3-5, and 6-10. In this case the MSFE rises to 33.4. Similarly, if the clusters chosen are

1 & 4, 2 & 3 & 5

, and 6-10 then the MSFE is 45.04. Assigning streams randomly to three clusters consisting of three, two and five streams yields the following table.

Table 1. The MSFEs for various clusters of three, two and five streams.

MSFE	Clusters
52.34495576	6,10,9 and 1,2 and 8,3,4,5,7
51.90912188	2,10,5 and 3,8 and 4,1,7,6,9
31.40789218	7,2,10 and 8,9 and 5,4,6,3,1
44.15962369	10,1,7 and 3,4 and 8,2,9,6,5
50.32525078	5,8,3 and 1,10 and 4,2,7,6,9
39.31100769	6,9,5 and 10,7 and 1,3,8,4,2
45.09358141	3,1,10 and 5,8 and 2,4,9,7,6
51.54828609	9,5,10 and 4,2 and 7,6,8,1,3
34.21154829	8,7,2 and 1,9 and 6,4,5,10,3
55.21445794	6,1,10 and 2,4 and 7,3,8,5,9

We note that there could be substantial reduction in MSFE even when multiple streams are clustered incorrectly. In the next section, we demonstrate how a retailer would be able to generate clusters of its individual demand streams using Pivot Clustering.

4. TS Clustering Algorithms and Pivot Clustering, Empirical Evaluation

We have shown that the when the retailer forecasts using clusters of its demand streams as opposed to the individual streams, its MSFE can vary greatly from close to the optimal value (when forecasting using the individual streams) to close to the MSFE of a forecast based on the aggregate of all the streams. Hence, if the retailer wishes to forecast using clusters of its demand streams, the selection of the clusters is important. In general, a retailer might have customer related information that could be used to generate the clusters. The benefit of generating the clusters and forecasting them is that there could be situations where it would be cumbersome for the retailer to collect the individual demand streams and service them individually. Forecasting clusters would therefore be a second best option.

We propose Pivot Clustering for determining clusters which usually results in a relatively low MSFE among all choices of streams to clusters. We consider two ways to obtain the subaggregated MSFE based on some clustering assignment. The first is to use the individual ARMA demand models appearing in (2) to compute the subaggregated theoretical MSFE as per Section 2.2, Section 2.3 and Section 2.4. We note that if there is only one cluster then the MSFE is the one computed for a forecast based on the aggregate of the all the streams while if the number of clusters is equal to the number of streams, the MSFE is for a forecast based on the disaggregated (individual) sequences. We also estimate the MSFE by generating demand realizations for each stream based on (2). Once a demand realizations are simulated for each stream, we subaggregate the realizations based on our choice of clusters. So if some cluster is made up of streams

{X_{i, t}}

and

{X_{j, t}}

, we say that the cluster has realization

{X_{i, t} + X_{j, t}}

. We then estimate an ARMA(5,5)4 model using each cluster’s realization. Finally we use the estimated models to obtain in-sample forecast errors and compute the covariance matrix of these forecast errors to estimate the MSFE for a particular assignment of streams to clusters. In the analysis below we see that the estimated MSFEs are close to theoretical ones and often lead to similar choices of clusters. Based on a predetermined number of clusters n, Pivot Clustering works as follows.

For each stream, randomly assign it to a cluster.

For each cluster,

For each stream in the cluster,

Compute or estimate the MSFE for the current assignment along

with the resulting MSFEs if the stream was moved to each of the other clusters.

# MSFE can be either estimated based on realizations of the demand streams or

# computed using Equations (28), (29), (41) and (42).

Move each stream in the cluster to a cluster which leads to largest overall MSFE-

reduction among all choice of clusters.

In the remainder of this section we perform various simulations to assess the efficacy of Pivot clustering. We focus on ARMA(1,1) models as these do not require too much runtime for Pivot clustering based on theoretical MSFE and are complex enough to describe demand data such as in [10]. Additionally forecasting an aggregate of ARMA(1,1) demand sequences has been studied by [11] where forecasts were based on exponential smoothing. The methods herein are generally applicable however to higher-order ARMA models. From a computational standpoint, it is possible to determine theoretical MSFE based on the aggregate of up to twenty demand streams generated by ARMA(1,1) models. The burden lies in having to find roots of large degree polynomials in order to determine the ARMA model and shock variance that describes the aggregated sequence. To understand the computational requirements we check the runtimes of Pivot Clustering when determining theoretical and estimated MSFEs. Based on a given number of streams (between 10 and 20) we carry out Pivot Clustering for twenty different combinations of ARMA(1,1) models and plot the average of the runtimes in Figure 1 in assigning the streams to four clusters. If using estimated MSFE then many more streams can be clustered and Pivot Clustering has faster runtimes for larger amounts of streams. Upon checking, Pivot Clustering with estimated MSFEs for 200 streams takes approximately 20 minutes.

We can check the efficacy of Pivot Clustering through simulation. We begin by randomly assigning coefficients to twenty ARMA(1,1) models to produce twenty demand streams as well as the covariance matrix of the shock sequences. We make sure that each assignment results in causal and invertible demand with respect to the shocks and that the resulting covariance matrix is positive definite. The AR and MA coefficients and covariance matrix can be found in https://github.com/vkovtun84/Pivot-Clustering-of-Demand-Streams under Models.csv and covarmat.csv.

After randomly assigning streams to one of four clusters we compute both the estimated and theoretical one-step-ahead MSFEs based on this random assignment and use it to start Pivot Clustering. We output the clusters found by Pivot as well as the MSFE of the forecast based on this set of clusters. We iterate this procedure 50 times to study how much the MSFE improves based on Pivot Clustering for the starting allocations. The MSFEs of the final clusters and random clusters can be found under MSFEresults.csv in https://github.com/vkovtun84/Pivot-Clustering-of-Demand-Streams. These can also be compared with the MSFEs of the forecast based on the individual (disaggregated) demand streams and the forecast based on fully aggregating the streams.

For the twenty demand streams and models used, the theoretical and estimated MSFEs when forecasting based on individual (disaggregated) streams are 102.1 and 96.2. The theoretical and estimated MSFEs when forecasting based on the fully aggregated streams are 231.3 and 220.6. For the 50 simulations of assigning streams to random clusters (used in the initialization step of Pivot) the average of the theoretical and estimated MSFEs based on the subaggregated random clusters are 202.2 and 194.2. After Pivot Clustering is carried out to obtain a better set of subaggregated clusters in each of the 50 simulations, the averages of the theoretical and estimated MSFEs are 109.4 and 101.0. The various theoretical and estimated MSFEs for the different initializations are provided in Figure 2 and Figure 3. We note that regardless of the initial random assignment of streams to clusters, Pivot Clustering leads to the clustering of streams such that the subaggregated MSFE is very low. In fact, typically Pivot Clustering results in clusters for which the subaggregated MSFE ends up very close to the MSFE obtained when forecasts are based on the individual (disaggregated) streams.

We can compare our results with existing time-series clustering methods. Two distance measures that can be computed for time-series realizations are available in the TSclust package for R, namely AR.PIC and AR.LPC.CEPS. These distances can be used to perform hierarchical clustering such as average-linkage clustering. The final groups determined by these methods lead to MSFEs of 123.4 and 108.8 respectively, higher than those found by Pivot starting from random assignments. We note that the cluster assignments found by these methods can also be used in the initialization of Pivot Clustering, potentially leading to even better clusters.

Since the previous simulations were carried out on only one set of twenty ARMA(1,1) demand models, we should also check the efficacy of Pivot Clustering for other sets of models as well. As such, we consider twenty simulations where within each simulation a new set of twenty demand models is considered. We compare the estimated and theoretical MSFEs of one random assignment of streams to four clusters with the estimated and theoretical MSFEs of the four clusters obtained by Pivot Clustering. In each simulation we also compute the MSFEs that would be found when fully aggregating the streams or when considering forecasts based on individual streams as well as the MSFEs that would be found using the AR.PIC and AR.LPC.CEPS distances for hierarchical clustering streams into four clusters. The results of these simulations are displayed in Figure 4 and Figure 5. We note that if forecasts are to be based on four clusters, the lowest MSFEs are obtained when clusters are formed using Pivot Clustering. Furthermore, Pivot Clustering leads to forecasts whose MSFE is very close to the MSFE of the forecast based on the individual streams in every simulation.

We continue with twenty simulations where in each simulation we consider a separate set of 20 streams being subaggregated into four clusters with 10 random initializations of Pivot Clustering. The means of the various theoretical and estimated MSFEs under different clustering approaches are displayed in Figure 6 and Figure 7. We note again that in every set of twenty streams, the averaged subaggregated MSFEs are very close to the disaggregated MSFEs when averaged for different initial random assignments of streams to clusters.

We continue by assessing how well Pivot clustering does when compared against an exhaustive algorithm which checks all possible assignments of streams to clusters using theoretical MSFE calculations. To do so, we consider twenty simulations where within each simulation we randomly generate 10 ARMA(1,1) streams5 and compute the lowest MSFE possible among all choices of streams to 3 clusters. Furthermore we consider 10 random initializations of Pivot each time new streams are considered. The results are displayed in Figure 8. We note in the first row of that table that for the first simulation of twenty streams 8 out of 10 initializations of our algorithm led Pivot clustering to find the optimal solution. Among the 2 initializations that did not lead to the optimal solution, the ratio of optimal MSFE to the MSFE of the grouping found by pivot is 96.81%. The median was 96.81% while the minimum ratio was 96.24%. In some instances Pivot clustering never found an optimal solution (such as in the 6th simulation), however the average MSFE of the optimal solution was around 99.6% of the MSFE of the groupings found by Pivot. In the worst performance of Pivot (simulation 15), the best possible grouping led to an MSFE that was 80.27464% lower than the MSFE found by Pivot.

Finally, we consider the robustness of the Pivot algorithm to cases where the data generation process is not ARMA. To do so we perform ten simulations where in each simulation we simulate twenty demand stream realizations such that stream

X_{k}

follows an ARFIMA(

0, d_{k}, 0

) model given by

{(1 - B)}^{d_{k}} X_{k, t} = ϵ_{k, t}

(61)

where

- . 4 < d_{k} < . 4

and

C o v (ϵ_{k, t}, ϵ_{j, t})

may be nonzero. Each realization, consisting of 1500 time periods is used to fit an ARMA(5,5) model to compute an estimated one-step-ahead MSFE for the disaggregated series (appearing as a blue dot in Figure 9). Summing the realizations together to fit an ARMA(5,5) model yields an estimated MSFE for the aggregated series (appearing as a black dot in Figure 9). Finally the Pivot algorithm is carried out using five different random initializations of assigning streams to one of four clusters. The MSFEs for the subaggregated random clusters and Pivot clusters appear as red and green dots in Figure 9. We note that in the subaggregated case the number of ARMA models that needs to be estimated is equal to the number of clusters.

As before, we note that Pivot clustering provides a sharp reduction in MSFE compared to random cluster assignments as well as compared to the aggregated case. We also observe that when fitting ARMA models to non-ARMA data it is possible for Pivot clustering to yield clusters which lead to a subaggregated MSFE that is lower than the MSFE using the individual (disaggregated) series. The exact cause of this is unclear, however it is possible it has to do with the extra number of misspecified ARMA models that are fit in the disaggregated case.

Figure 9. Estimated MSFEs when using aggregated, disaggregated, and subaggregated series to forecast one-step-ahead demand for series that are not generated using an ARMA process. A total of 10 simulations is performed where within each simulation twenty separate demand realizations are generated according to the ARFIMA(

0, d, 0

) model with a different d for each realization such that the shocks appearing in the ARFIMA model are contemporaneously correlated. Pivot clustering is carried out for 5 random initialization of assigning the streams to one of four clusters. We compute the estimated MSFEs for the disaggregated series (blue), aggregated series (black), subaggregated clusters generated using random assignment (red) and subaggregated clusters generated using the result of Pivot clustering. We note the supremacy of Pivot clustering in all ten simulations.

0, d, 0

5. Clustering Demand Streams Through Minimizing An Objective Function Based on Subaggregated MSFE

In this section we describe how to determine the optimal assignment of streams to clusters by identifying and minimizing an objective function which computes the overall MSFE given a particular assignment of streams to clusters. We begin by assuming that the desired number of clusters is known to be n. For

α \in 1, 2, \dots n

, let subaggregated cluster series

{C_{α, t}} = {X_{1, t} y_{1, α} + \dots + X_{N, t} y_{N, α}}

, where

y_{i, α} = 1

if stream

{X_{i, t}}

is in cluster

{C_{α, t}}

and 0 otherwise, have the ARMA representation

ϕ_{α}^{★} (B) C_{α, t} = θ_{α}^{★} (B) ϵ_{α, t}^{★} .

(62)

For

α \in 1, 2, \dots n

, the shocks

ϵ_{α, t}^{★}

have covariance matrix

Σ_{ϵ}^{★}

given in equation (42). We define the number of demand streams in cluster

{C_{α, t}}

to be

n_{C_{α}} = \sum_{i = 0}^{N} y_{i, α} > 0

for all

α \in 1, 2, \dots n

. Furthermore if

y_{i, α} = 1

then

y_{i, β} = 0

for all

α \neq β \in 1, 2, \dots n

. For each of

n_{C_{α}}

streams

{X_{i, t}^{C_{α}}}

in cluster

{C_{α, t}}

we adapt the notation of its ARMA representation to be

Φ_{i, α} (B) X_{i, t}^{C_{α}} = Θ_{i, α} (B) ϵ_{α, i, t} .

(63)

Lemma 1.

An optimal set of clusters can be found by minimizing the subaggregated MSFE given by

M S F E s u b a g g = \sum_{α = 1}^{n} \sum_{β = 1}^{n} \sum_{l = 0}^{ℓ} ω_{β, l}^{★} ω_{α, l}^{★} \cdot \sum_{i = 1}^{n_{C_{α}}} \sum_{j = 1}^{n_{C_{β}}} \sum_{k = 0}^{\infty} {\tilde{ψ}}_{α, i, k} {\tilde{ψ}}_{β, j, k} σ_{i j}

(64)

where

{\tilde{ψ}}_{α, i, k}

and

{\tilde{ψ}}_{β, j, k}

are obtained from the equivalence

\frac{ϕ_{α}^{★} (z)}{θ_{α}^{★} (z)} \frac{Θ_{i, α} (z)}{Φ_{i, α} (z)} \equiv \sum_{k = 0}^{\infty} {\tilde{ψ}}_{α, i, k} z^{k}

(65)

and

\frac{ϕ_{β}^{★} (z)}{θ_{β}^{★} (z)} \frac{Θ_{j, β} (z)}{Φ_{j, β} (z)} \equiv \sum_{k = 0}^{\infty} {\tilde{ψ}}_{β, j, k} z^{k} .

(66)

where the key terms are defined in the proof below.

Alternatively, the objective can be stated as finding the optimal set of

{y_{1}, \dots, y_{N}}

such that we minimize

M S F E s u b a g g = \sum_{α = 1}^{n} \sum_{β = 1}^{n} \sum_{l = 0}^{ℓ} ω_{β, l}^{★} ω_{α, l}^{★} \cdot \sum_{i = 1}^{N} \sum_{j = 1}^{N} \sum_{k = 0}^{\infty} y_{i, α} y_{j, β} {\tilde{ψ}}_{α, i, k} {\tilde{ψ}}_{β, j, k} σ_{i j}

(67)

where

ω_{α, l}^{★}

and

ψ_{α, l}^{★}

in equation (70) are obtained through the equivalence

\frac{θ_{α}^{★} (z)}{ϕ_{α}^{★} (z)} = \sum_{l = 0}^{\infty} ψ_{α, l}^{★} z^{l}

(68)

where again the key terms are defined in subsequent proof.

Proof

From equation (7) we note that a forecast based on the clusters would have MSFE given by

M S F E s u b a g g = \sum_{α = 1}^{n} \sum_{β = 1}^{n} \sum_{l = 0}^{ℓ} σ_{α β}^{★} ω_{β, l}^{★} ω_{α, l}^{★}

(69)

where

ω_{α, l}^{★} = \{\begin{matrix} 0 & i < 0 \\ ψ_{α, l}^{★} & l = 0 \\ ω_{α, l - 1}^{★} + ψ_{α, l}^{★} & 0 < l < ℓ + 1 \\ ω_{α, l - 1} + ψ_{α, l}^{★} - ψ_{α, l - ℓ - 1}^{★} & l \geq ℓ + 1 . \end{matrix}

(70)

where

ψ_{α, l}^{★}

is the

l^{t h}

coefficient appearing in the MA(∞) representation of

{C_{α, t}}

with respect to

{ϵ_{α, t}^{★}}

. From equation (41) we note that for any two subaggregated clusters

{C_{α, t}}

and

{C_{β, t}}

consisting of

n_{C_{α}}

and

n_{C_{β}}

streams respectively, the corresponding shock series

{ϵ_{α, t}^{★}}

and

{ϵ_{β, t}^{★}}

, the covariance

E [ϵ_{α, t}^{★} ϵ_{β, t}^{★}]

is expressed by

σ_{α β}^{★} = \sum_{i = 1}^{n_{C_{α}}} \sum_{j = 1}^{n_{C_{β}}} \sum_{k = 0}^{\infty} {\tilde{ψ}}_{α, i, k} {\tilde{ψ}}_{β, j, k} σ_{i j} .

(71)

Therefore the objective is to assign streams to clusters such that we minimize the MSFE

M S F E s u b a g g = \sum_{α = 1}^{n} \sum_{β = 1}^{n} \sum_{l = 0}^{ℓ} ω_{β, l}^{★} ω_{α, l}^{★} \cdot \sum_{i = 1}^{n_{C_{α}}} \sum_{j = 1}^{n_{C_{β}}} \sum_{k = 0}^{\infty} {\tilde{ψ}}_{α, i, k} {\tilde{ψ}}_{β, j, k} σ_{i j}

(72)

where

{\tilde{ψ}}_{α, i, k}

and

{\tilde{ψ}}_{β, j, k}

are obtained from the equivalence

\frac{ϕ_{α}^{★} (z)}{θ_{α}^{★} (z)} \frac{Θ_{i, α} (z)}{Φ_{i, α} (z)} \equiv \sum_{k = 0}^{\infty} {\tilde{ψ}}_{α, i, k} z^{k}

(73)

and

\frac{ϕ_{β}^{★} (z)}{θ_{β}^{★} (z)} \frac{Θ_{j, β} (z)}{Φ_{j, β} (z)} \equiv \sum_{k = 0}^{\infty} {\tilde{ψ}}_{β, j, k} z^{k} .

(74)

Alternatively, we can say that we are finding the optimal set of

{y_{1}, \dots, y_{N}}

such that we minimize

M S F E s u b a g g = \sum_{α = 1}^{n} \sum_{β = 1}^{n} \sum_{l = 0}^{ℓ} ω_{β, l}^{★} ω_{α, l}^{★} \cdot \sum_{i = 1}^{N} \sum_{j = 1}^{N} \sum_{k = 0}^{\infty} y_{i, α} y_{j, β} {\tilde{ψ}}_{α, i, k} {\tilde{ψ}}_{β, j, k} σ_{i j}

(75)

where

ω_{α, l}^{★}

and

ψ_{α, l}^{★}

in equation (70) are obtained through the equivalence

\frac{θ_{α}^{★} (z)}{ϕ_{α}^{★} (z)} = \sum_{l = 0}^{\infty} ψ_{α, l}^{★} z^{l}

(76)

with

θ_{α}^{★} (z)

and

ϕ_{α}^{★} (z)

found using Remark 1 where Laurent polynomials

O (z), P (z)

and

Q (z)

are obtained from the covariance generating function

G_{C_{α}} (z)

given by

\begin{matrix} G_{C_{α}} (z) & = & \sum_{i = 1}^{N} y_{i, α} σ_{i}^{2} \frac{Θ_{i} (z) Θ_{i} (z^{- 1})}{Φ_{i} (z) Φ_{i} (z^{- 1})} \\ + & \sum_{i = 1}^{N - 1} \sum_{j = i + 1}^{N} y_{i, α} y_{j, α} (σ_{i j} \frac{Θ_{i} (z)}{Φ_{i} (z)} \frac{Θ_{j} (z^{- 1})}{Φ_{j} (z^{- 1})} + σ_{i j} \frac{Θ_{i} (z^{- 1})}{Φ_{i} (z^{- 1})} \frac{Θ_{j} (z)}{Φ_{j} (z)}) . \end{matrix}

(77)

□

We note that it is impossible to offer an explicit solution because of the dependence of coefficients

ω_{β, l}^{★}, ω_{α, l}^{★}, {\tilde{ψ}}_{α, i, k}

and

{\tilde{ψ}}_{β, j, k}

on the selection of clusters. In the next section we consider a much simpler case of demand streams being generated by MA(1) models which leads to a much simpler objective function. This allows us to find several theoretical results, culminating in the fact that optimal clusters can be found in this case by identifying streams having the closest MA coefficients with one another.

6. MA(1) Streams

In this section we consider the case that the demand streams being considered are independent MA(1). As we demonstrate below this leads to a simpler objective function. We will use this fact to show how we can use non-linear optimization to assign clusters to streams and to come up with an efficient way to cluster independent MA(1) streams based on segmenting the coefficient space into intervals. The focus on MA(1) streams here allows us to observe that streams with their MA coefficients close to each other should be clustered together. At the end of the section, we provide a lemma on aggregating streams produced by models having identical ARMA coefficients.

Lemma 2.

Suppose

{X_{1, t}}, {X_{2, t}}, \dots, {X_{N, t}}

are MA(1) with MA coefficients

θ_{1}, θ_{2}, \dots, θ_{N}

. Optimal clusters can be found by assigning

y_{j k}

as an indicator variable for stream

X_{j}

being in cluster

C_{k}

such that we minimize

\sum_{k = 1}^{n} \sqrt{(b_{k} + 2 a_{k}) (b_{k} - 2 a_{k})}

(78)

where

b_{k} = \sum_{j = 1}^{N} σ_{j}^{2} (1 + θ_{j}^{2}) y_{j k}

(79)

and

a_{k} = \sum_{j = 1}^{N} σ_{j}^{2} θ_{j} y_{j k} .

(80)

Alternatively, the objective function (78) can be written as

\sum_{k = 1}^{n} \sqrt{\sum_{j = 1}^{N} \sum_{i = 1}^{N} σ_{j}^{2} σ_{i}^{2} {(1 + θ_{j})}^{2} {(1 - θ_{i})}^{2} y_{j k} y_{i k}} .

(81)

Proof uppose

{X_{1, t}}, {X_{2, t}}, \dots, {X_{N, t}}

are MA(1) with MA coefficients

θ_{1}, θ_{2}, \dots, θ_{N}

. Suppose cluster

C_{α, t}

consists of streams

{X_{1, t}}, \dots, {X_{α, t}}

. The covariance generating function of

C_{α, t}

simplifies to

G_{C_{α}} (z) = σ_{1}^{2} (1 + θ_{1} z) (1 + θ_{1} z^{- 1}) + \dots + σ_{α}^{2} (1 + θ_{α} z) (1 + θ_{α} z^{- 1}) .

(82)

In order to determine the variance of the shocks of

C_{α, t}

we need to find the roots of equation (82). Note that it can be rewritten as

(σ_{1}^{2} θ_{1} + \dots + σ_{α}^{2} θ_{α}) z^{- 1} + (σ_{1}^{2} (1 + θ_{1}^{2}) + \dots + σ_{α}^{2} (1 + θ_{α}^{2})) + (σ_{1}^{2} θ_{1} + \dots + σ_{α}^{2} θ_{α}) z .

(83)

We can find the roots of equation (83) using the quadratic formula

\frac{- b \pm \sqrt{b^{2} - 4 a^{2}}}{2 a}

where

\begin{matrix} a & = & σ_{1}^{2} θ_{1} + \dots + σ_{α}^{2} θ_{α} = \sum_{j = 1}^{α} σ_{j}^{2} θ_{j} \end{matrix}

(84)

\begin{matrix} b & = & σ_{1}^{2} (1 + θ_{1}^{2}) + \dots + σ_{α}^{2} (1 + θ_{α}^{2}) = \sum_{j = 1}^{α} σ_{j}^{2} (1 + θ_{j}^{2}) \end{matrix}

(85)

We note that one roots

r_{1}

will be outside the unit circle and from Remark 1 that the variance of the shocks of

C_{α, t}

is equal to

- a r_{1}

. Since

b > 0

, the variance is then given by

σ_{α}^{2} = \frac{b + \sqrt{(b + 2 a) (b - 2 a)}}{2}

(86)

Thus if we are subaggregating into n clusters, the MSFE is given by

M S F E_{s u b a g g} = \frac{1}{2} (\sum_{k = 1}^{n} b_{k} + \sum_{k = 1}^{n} \sqrt{(b_{k} + 2 a_{k}) (b_{k} - 2 a_{k})})

(87)

Since

\sum_{k = 1}^{n} b_{k}

will be the same regardless of choice of clusters, we are minimizing the objective function given by

\sum_{k = 1}^{n} \sqrt{(b_{k} + 2 a_{k}) (b_{k} - 2 a_{k})}

(88)

where

b_{k} = \sum_{j = 1}^{N} σ_{j}^{2} (1 + θ_{j}^{2}) y_{j k}

(89)

and

a_{k} = \sum_{j = 1}^{N} σ_{j}^{2} θ_{j} y_{j k}

(90)

where

y_{j k}

is an indicator variable for stream

X_{j}

being in cluster

C_{k}

Noting that

\begin{matrix} (b_{k} + 2 a_{k}) & = & \sum_{j = 1}^{N} σ_{j}^{2} {(1 + θ_{j})}^{2} y_{j k} \end{matrix}

(91)

\begin{matrix} (b_{k} - 2 a_{k}) & = & \sum_{j = 1}^{N} σ_{j}^{2} {(1 - θ_{j})}^{2} y_{j k} \end{matrix}

(92)

The objective function can also be rewritten as

\sum_{k = 1}^{n} \sqrt{\sum_{j = 1}^{N} \sum_{i = 1}^{N} σ_{j}^{2} σ_{i}^{2} {(1 + θ_{j})}^{2} {(1 - θ_{i})}^{2} y_{j k} y_{i k}} .

(93)

□

We have results (not shown here) where we use equation (93) in a non-linear optimization algorithm to come up with clusters. When using 10 streams with 3 clusters, the approach actually leads to a globally optimal solution almost every time.

We note that equation (93) can also be rewritten as

\sum_{k = 1}^{n} \sqrt{\sum_{X_{i}, X_{j} \in C_{k}} σ_{j}^{2} σ_{i}^{2} {(1 + θ_{j})}^{2} {(1 - θ_{i})}^{2}} .

(94)

where the inner sum is taken over all pairs of streams in each cluster, including pairs of streams with themselves.

If stream

X_{p}

is moved from cluster k to cluster

κ

then the change in the objective function is

\begin{matrix} \sqrt{σ_{p}^{2} {(1 + θ_{p})}^{2} \sum_{X_{i} \in C_{κ}} σ_{i}^{2} {(1 - θ_{i})}^{2} + σ_{p}^{2} {(1 - θ_{p})}^{2} \sum_{X_{i} \in C_{κ}} σ_{i}^{2} {(1 + θ_{i})}^{2}} \\ - & \sqrt{σ_{p}^{2} {(1 + θ_{p})}^{2} \sum_{X_{i} \in C_{k}} σ_{i}^{2} {(1 - θ_{i})}^{2} + σ_{p}^{2} {(1 - θ_{p})}^{2} \sum_{X_{i} \in C_{k}} σ_{i}^{2} {(1 + θ_{i})}^{2}} \end{matrix}

(95)

which provides an alternative way to cluster streams by identifying and moving the stream from one cluster to another which yields the largest drop in the objective function.

6.1. Aggregate of Two MA(1) Streams

In this subsection we consider two MA(1) streams whose variance of shocks is unitary. We demonstrate that the MA coefficient6 of their aggregate process is always between the two MA coefficients of the individual streams. Furthermore we show that as the two coefficients of the two MA(1) streams are moved further apart from each other the variance of the shocks appearing in the Aggregated process increases. These two facts imply that if we are studying N individual MA(1) streams (with unit shock variance) and would like to cluster them into n clusters then the globally optimal clustering assignment will cluster the streams along intervals. That is, the assignment will have split the clusters into groups of streams whose MA coefficients are next to each other in a sorted arrangement. This implies that an efficient algorithm for obtaining a globally optimal cluster assignment consists of arranging the MA coefficients in increasing order and checking all possible "interval" clusters, without worrying that if two streams are clustered together, another stream with an MA coefficient between the two is assigned to a different cluster. This algorithm would need to check

(\binom{N}{n})

possible cluster assignments to find the globally optimal arrangement.

Theorem 1.

Consider two streams whose ARMA representations are given by

X_{1, t} = (1 + θ_{1} B) ϵ_{1, t}

X_{2, t} = (1 + θ_{2} B) ϵ_{2, t}

with

v a r (ϵ_{1, t}) = v a r (ϵ_{2, t}) = 1

and

σ_{12} = 0

and

θ_{1} < θ_{2}

. Note that

θ_{1}

and

θ_{2}

are allowed to equal 0.

The Aggregated process

{X_{1, t} + X_{2, t}}

is described by the MA(1) model

X_{1, t} + X_{2, t} = (1 + θ B) ϵ_{t}

such that

θ_{1} < θ < θ_{2}

Proof. Note that the covariance generating function of the aggregate process is

G_{S} (z) = (θ_{1} + θ_{2}) z^{- 1} + (2 + θ_{1}^{2} + θ_{2}^{2}) + (θ_{1} + θ_{2}) z .

(96)

As long as7

θ_{1} \neq - θ_{2}

, this polynomial has roots

a_{1}

and

1 / a_{1}

. Suppose

a_{1}

is inside the unit circle, then according to Remark 1,

θ = - a_{1}

and

v a r (ϵ_{t}) = σ_{ϵ}^{2} = - (θ_{1} + θ_{2}) (1 / a_{1})

We can note that the Laurent polynomial in Equation (96) has roots given by

\frac{- (2 + θ_{1}^{2} + θ_{2}^{2}) \pm \sqrt{{(2 + θ_{1}^{2} + θ_{2}^{2})}^{2} - 4 {(θ_{1} + θ_{2})}^{2}}}{2 (θ_{1} + θ_{2})}

(97)

This implies that

a_{1} = \frac{- (2 + θ_{1}^{2} + θ_{2}^{2}) + \sqrt{{(2 + θ_{1}^{2} + θ_{2}^{2})}^{2} - 4 {(θ_{1} + θ_{2})}^{2}}}{2 (θ_{1} + θ_{2})}

(98)

and

θ = \frac{(2 + θ_{1}^{2} + θ_{2}^{2}) - \sqrt{{(2 + θ_{1}^{2} + θ_{2}^{2})}^{2} - 4 {(θ_{1} + θ_{2})}^{2}}}{2 (θ_{1} + θ_{2})}

(99)

In the remainder of this section we will prove that

θ_{1} < \frac{(2 + θ_{1}^{2} + θ_{2}^{2}) - \sqrt{{(2 + θ_{1}^{2} + θ_{2}^{2})}^{2} - 4 {(θ_{1} + θ_{2})}^{2}}}{2 (θ_{1} + θ_{2})} < θ_{2}

(100)

Suppose first that

θ_{1} + θ_{2} > 0

Then we can rewrite equation (100) as

2 θ_{1}^{2} + 2 θ_{2} θ_{1} < (2 + θ_{1}^{2} + θ_{2}^{2}) - \sqrt{{(2 + θ_{1}^{2} + θ_{2}^{2})}^{2} - 4 {(θ_{1} + θ_{2})}^{2}} < 2 θ_{2}^{2} + 2 θ_{2} θ_{1}

(101)

or equivalently,

θ_{1}^{2} - θ_{2}^{2} + 2 θ_{2} θ_{1} - 2 < - \sqrt{{(2 + θ_{1}^{2} + θ_{2}^{2})}^{2} - 4 {(θ_{1} + θ_{2})}^{2}} < θ_{2}^{2} - θ_{1}^{2} + 2 θ_{1} θ_{2} - 2

(102)

θ_{1}^{2} - θ_{2}^{2} - 2 θ_{2} θ_{1} + 2 < \sqrt{{(2 + θ_{1}^{2} + θ_{2}^{2})}^{2} - 4 {(θ_{1} + θ_{2})}^{2}} < θ_{2}^{2} - θ_{1}^{2} - 2 θ_{1} θ_{2} + 2

(103)

Noting that the left and right-hand sides of the inequality are always larger than zero, (103) holds if and only if the following inequality holds as well

{(θ_{1}^{2} - θ_{2}^{2} - 2 θ_{2} θ_{1} + 2)}^{2} < {(2 + θ_{1}^{2} + θ_{2}^{2})}^{2} - 4 {(θ_{1} + θ_{2})}^{2} < {(θ_{2}^{2} - θ_{1}^{2} - 2 θ_{1} θ_{2} + 2)}^{2} .

(104)

Labeling the three sides of this inequality as

A < B < C

, we observe that

\begin{matrix} A & = & θ_{1}^{4} - θ_{1}^{2} θ_{2}^{2} - 2 θ_{1}^{3} θ_{2} + 2 θ_{1}^{2} - θ_{1}^{2} θ_{2}^{2} + θ_{2}^{4} + 2 θ_{2}^{3} θ_{1} - 2 θ_{2}^{2} \\ - & 2 θ_{1}^{3} θ_{2} + 2 θ_{2}^{3} θ_{1} + 4 θ_{1}^{2} θ_{2}^{2} - 4 θ_{2} θ_{1} + 2 θ_{1}^{2} - 2 θ_{2}^{2} - 4 θ_{2} θ_{1} + 4 \\ B & = & 4 + 2 θ_{1}^{2} + 2 θ_{2}^{2} + 2 θ_{1}^{2} + θ_{1}^{4} + θ_{1}^{2} θ_{2}^{2} + 2 θ_{2}^{2} + θ_{1}^{2} θ_{2}^{2} + θ_{2}^{4} - 4 (θ_{1}^{2} + 2 θ_{1} θ_{2} + θ_{2}^{2}) \\ C & = & θ_{2}^{4} - θ_{2}^{2} θ_{1}^{2} - 2 θ_{2}^{3} θ_{1} + 2 θ_{2}^{2} - θ_{2}^{2} θ_{1}^{2} + θ_{1}^{4} + 2 θ_{1}^{3} θ_{2} - 2 θ_{1}^{2} \\ - & 2 θ_{2}^{3} θ_{1} + 2 θ_{1}^{3} θ_{2} + 4 θ_{2}^{2} θ_{1}^{2} - 4 θ_{1} θ_{2} + 2 θ_{2}^{2} - 2 θ_{1}^{2} - 4 θ_{1} θ_{2} + 4 \end{matrix}

(105)

Removing equivalent terms and combining like terms, we observe

\begin{matrix} 2 θ_{1}^{2} θ_{2}^{2} - 4 θ_{1}^{3} θ_{2} + 4 θ_{1}^{2} + 4 θ_{2}^{3} θ_{1} - 4 θ_{2}^{2} - 8 θ_{1} θ_{2} \\ < & 2 θ_{1}^{2} θ_{2}^{2} - 8 θ_{1} θ_{2} \\ < & θ_{2}^{2} θ_{1}^{2} - 4 θ_{2}^{3} θ_{1} + 4 θ_{2}^{2} + 4 θ_{1}^{3} θ_{2} - 4 θ_{1}^{2} - 8 θ_{2} θ_{1} \end{matrix}

which can be rewritten as

- 4 θ_{1}^{3} θ_{2} + 4 θ_{1}^{2} + 4 θ_{1} θ_{2}^{3} - 4 θ_{2}^{2} < 0 < - 4 θ_{1} θ_{2}^{3} + 4 θ_{2}^{2} + 4 θ_{1}^{3} θ_{2} - 4 θ_{1}^{2} .

(106)

Noting that the left and right-hand sides of this inequality are additive inverses, we see that this inequality holds8 and therefore inequality (100) holds.

Finally, if

θ_{1} + θ_{2} < 0

in (100) then the direction of the inequalities is reversed in (101) and a similar sequence of steps would lead us to observe

- 4 θ_{1}^{3} θ_{2} + 4 θ_{1}^{2} + 4 θ_{1} θ_{2}^{3} - 4 θ_{2}^{2} > 0 > - 4 θ_{1} θ_{2}^{3} + 4 θ_{2}^{2} + 4 θ_{1}^{3} θ_{2} - 4 θ_{1}^{2}

(107)

and by the same argument we see that (100) holds and the theorem is proved. □.

Theorem 2.

Consider two streams whose ARMA representations are given by

X_{1, t} = (1 + θ_{1} B) ϵ_{1, t}

X_{2, t} = (1 + θ_{2} B) ϵ_{2, t}

with

σ_{1}^{2} = σ_{2}^{2} = 1

and

σ_{12} = 0

and

θ_{1} < θ_{2}

. Note that

θ_{1}

and

θ_{2}

may be 0.

As the distance between

θ_{1}

and

θ_{2}

increases,

v a r (ϵ_{t}) = σ_{ϵ}^{2}

increases.

Proof

From Equation (96) we see that the root

1 / a_{1}

G_{S} (z)

, which is outside the unit circle, is given by

\frac{- (2 + θ_{1}^{2} + θ_{2}^{2}) - \sqrt{{(2 + θ_{1}^{2} + θ_{2}^{2})}^{2} - 4 {(θ_{1} + θ_{2})}^{2}}}{2 (θ_{1} + θ_{2})} .

(108)

Since

σ_{ϵ}^{2} = - (θ_{1} + θ_{2}) / a_{1}

, we have

σ_{ϵ}^{2} = \frac{(2 + θ_{1}^{2} + θ_{2}^{2}) + \sqrt{{(2 + θ_{1}^{2} + θ_{2}^{2})}^{2} - 4 {(θ_{1} + θ_{2})}^{2}}}{2}

(109)

To show that this is increasing in

θ_{2}

, consider the derivative of the above with respect to

θ_{2}

\frac{\partial σ_{ϵ}^{2}}{\partial θ_{2}} = θ_{2} + \frac{θ_{2} (2 + θ_{1}^{2} + θ_{2}^{2}) - 2 (θ_{1} + θ_{2})}{\sqrt{{(2 + θ_{1}^{2} + θ_{2}^{2})}^{2} - 4 {(θ_{1} + θ_{2})}^{2}} .}

(110)

We need to show that (110) is larger than zero, thus consider

\frac{\partial σ_{ϵ}^{2}}{\partial θ_{2}} = θ_{2} + \frac{θ_{2} (2 + θ_{1}^{2} + θ_{2}^{2}) - 2 (θ_{1} + θ_{2})}{\sqrt{{(2 + θ_{1}^{2} + θ_{2}^{2})}^{2} - 4 {(θ_{1} + θ_{2})}^{2}}} > 0

(111)

or equivalently

θ_{2} (2 + θ_{1}^{2} + θ_{2}^{2}) - 2 (θ_{1} + θ_{2}) > - θ_{2} \sqrt{{(2 + θ_{1}^{2} + θ_{2}^{2})}^{2} - 4 {(θ_{1} + θ_{2})}^{2}}

(112)

which simplifies to

2 (θ_{1} + θ_{2}) - θ_{2} (2 + θ_{1}^{2} + θ_{2}^{2}) < θ_{2} \sqrt{{(2 + θ_{1}^{2} + θ_{2}^{2})}^{2} - 4 {(θ_{1} + θ_{2})}^{2}}

(113)

2 θ_{1} - θ_{2} θ_{1}^{2} - θ_{2}^{3} < θ_{2} \sqrt{{(2 + θ_{1}^{2} + θ_{2}^{2})}^{2} - 4 {(θ_{1} + θ_{2})}^{2}} .

(114)

We will refer to the left and right hand sides of the inequality in (114) as

L H S

and

R H S

. Note that squaring both yields

\begin{matrix} L H S^{2} & = & 4 θ_{1}^{2} - 4 θ_{1}^{3} θ_{2} - 4 θ_{1} θ_{2}^{3} + θ_{1}^{4} θ_{2}^{2} + 2 θ_{1}^{2} θ_{2}^{4} + θ_{2}^{6} \\ R H S^{2} & = & 4 θ_{2}^{2} + 4 θ_{1}^{2} θ 2^{2} + θ_{1}^{4} θ_{2}^{2} + 2 θ_{1}^{2} θ_{2}^{4} + θ_{2}^{6} - 4 θ_{1}^{2} θ_{2}^{2} - 8 θ_{1} θ_{2}^{3} \end{matrix}

(115)

Note that furthermore

L H S^{2} - R H S^{2} = 4 θ_{1}^{2} - 4 θ_{2}^{2} - 4 θ_{1}^{3} θ_{2} + 4 θ_{1} θ_{2}^{3} = 4 (1 - θ_{1} θ_{2}) (θ_{1}^{2} - θ_{2}^{2})

(116)

Suppose first that

θ_{1} < θ_{2} = 0

. Note that (114) reduces to

2 θ_{1} < 0

, which holds, and therefore (111) holds as well.

Next suppose that

0 = θ_{1} < θ_{2}

. Note that (114) reduces to

- θ_{2}^{3} < θ_{2} \sqrt{{(2 + θ_{2}^{2})}^{2} - 4 θ_{2}^{2}}

(117)

which holds as well since the left-hand side is negative in this case.

In the remainder of the proof we assume that

θ_{1} \neq 0

and

θ_{2} \neq 0

. Consider the case that

| θ_{2} | > | θ_{1} |

. Note that this implies that

θ_{2}

>0 (since

θ_{2} > θ_{1}

) and that (116) is less than zero. Thus in this case

L H S^{2} < R H S^{2}

and

L H S < R H S

in (114). Therefore (111) holds in this case as well.

Now suppose that

| θ_{1} | > | θ_{2} |

and that

θ_{2} > 0

. The former implies that

L H S^{2} - R H S^{2} > 0

and the latter implies that

R H S > 0

. Therefore

L H S < 0

and therefore

L H S < R H S

. Therefore (111) holds in this case as well.

Finally suppose that

| θ_{1} | > | θ_{2} |

and that

θ_{2} < 0

. The former again implies that

L H S^{2} - R H S^{2} > 0

while the latter implies that

R H S < 0

. Therefore

L H S < R H S

and (111) holds in all cases. □.

7. Demand Streams Produced by Identical ARMA models

Note that Theorem 1 and Theorem 2 imply that the prescribed algorithm at the top of the previous subsection always leads to an optimal solution. The following lemma establishes that in the event that two streams are generated by the same ARMA model, the aggregate will also follow the same ARMA model. Therefore, we can greatly reduce the dimensionality of the number of streams that need to be assigned to clusters by first aggregating demand streams from equivalent models.

Lemma 3.

Consider two sequences

{X_{1, t}}

and

{X_{2, t}}

that have the same ARMA representation with respect to Wold shock sequences

{ϵ_{1, t}}

and

{ϵ_{2, t}}

given by

\begin{matrix} Φ^{★} (B) X_{1, t} & = & Θ^{★} (B) ϵ_{1, t} \end{matrix}

(118)

\begin{matrix} Φ^{★} (B) X_{2, t} & = & Θ^{★} (B) ϵ_{2, t} \end{matrix}

(119)

such that the variance of the shock sequences are

σ_{1}^{2}

and

σ_{2}^{2}

with covariance

σ_{12}

The aggregate

{S_{t} = X_{1, t} + X_{2, t}}

also has the same ARMA representation with respect to its Wold shocks

{ϵ_{t}}

given by

Φ^{★} (B) S_{t} = Θ^{★} (B) ϵ_{t}

(120)

such that the variance of

{ϵ_{t}}

is given by

σ_{1}^{2} + σ_{2}^{2} + 2 σ_{12}

Proof. From Remark 1 we note that the ARMA representation of

{S_{t}}

is given by

Φ (B) S_{t} = Θ (B) ϵ_{t}

(121)

such that

Θ (z) = \prod_{i = 1}^{m} (1 - a_{i} z)

where

{a_{i}}

are the roots of

O (z) P (z)

on or inside the unit circle and

Φ (z) = \prod_{i = 1}^{n} (1 - b_{i} z)

where

{b_{i}}

are the roots of

Q (z)

inside the unit circle with

O (z)

P (z)

and

Q (z)

are obtained from the covariance generating function

G_{S} (z)

given by

G_{S} (z) = (σ_{1}^{2} + σ_{2}^{2}) \frac{Θ^{★} (z) Θ^{★} (z^{- 1})}{Φ^{★} (z) Φ^{★} (z^{- 1})} + 2 σ_{12} \frac{Θ^{★} (z) Θ^{★} (z^{- 1})}{Φ^{★} (z) Φ^{★} (z^{- 1})}

(122)

as per (77). This can further be simplified as

G_{S} (z) = (σ_{1}^{2} + σ_{2}^{2} + 2 σ_{12}) \frac{Θ^{★} (z) Θ^{★} (z^{- 1})}{Φ^{★} (z) Φ^{★} (z^{- 1})}

(123)

and therefore

Φ (z) = Φ^{★} (z)

and

Θ (z) = Θ^{★} (z)

and the result is proved. □

Lemma 3 shows us that if we have n demand sequences

X_{1, t}, \dots, X_{n, t}

, generated by models with the same ARMA coefficients with respect to their Wold shocks, their aggregate will have the same ARMA coefficients. Therefore if the customer base of a firm is comprised of many demand streams having the same ARMA representation, it is possible to greatly reduce the number of streams that need to be considered for clustering by first aggregating these equivalent streams.

8. Extensions and Other Questions

In this paper we compare theoretical MSFEs of a firm forecasting its leadtime demand based on disaggregated (individual) demand streams and subaggregated clusters formed from those streams. We highlight examples that illustrate that the MSFE based on subaggregates need not be much larger than the MSFE based on the disaggregated streams as long as those clusters are well-formed. We propose a Pivot algorithm to form clusters which minimize the MSFE among all cluster assignments. We end with some theoretical results when the demand streams are generated by MA(1). Here we show that clusters resulting in the lowest MSFE are formed by grouping streams by the proximity of their MA coefficient.

The MA(1) case hints that in a general ARMA case, "best" clusters would be formed based on proximity of the ARMA coefficients between models generating the various streams (or equivalently based on the proximity of roots of the AR and MA polynomials). Alternatively, best cluster assignments may result from grouping streams with most similar coefficients appearing in the MA(∞) representation. Future work can be done to establish the best approach.

Our current theoretical approach based on general ARMA models is limited in that root-finding algorithms are unstable once the degree of a polynomial gets too large. In our study, this begins to occur when we consider the aggregate of around twenty streams with at least one AR coefficient. It is possible to greatly reduce the dimensionality however by first aggregating demand streams that are produced by identical or nearly-identical ARMA models. This is another direction for future research.

References

Kohn, R. When is an Aggregate of a Time Series Efficiently Forecast by its Past. Journal of Econometrics 1982, 18, 337–349. [Google Scholar] [CrossRef]
D. M. Dunn, W.H.W.; Dechaine, T.L. Aggregate versus Subaggregate Models in Local Area Forecasting. Journal of the American Statistical Association 1976, 71, 68–71. [CrossRef]
Benjaafar, S.; ElHafsi, M.; de Véricourt, F. Demand Allocation in Multiple-Product, Multiple-Facility, Make-to-Stock Systems. Management Science 2004, 50, 1431–1448. [Google Scholar] [CrossRef]
Bernstein, F.; Modaresi, S.; Sauré, D. A Dynamic Clustering Approach to Data-Driven Assortment Personalization. Management Science 2018, 65, 1949–2443. [Google Scholar] [CrossRef]
Ma, S.; Fildes, R.; Huang, T. The Case of SKU Retail Sales Forecasting with Intra- and Inter-Category Promotional Information. Lancaster Management School, working paper 2014. [Google Scholar]
Brockwell, P.J.; Davis, R.A. Time Series: Theory and Methods, 2nd ed.; Springer-Verlag, 1991.
Kovtun, V.; Giloni, A.; Hurvich, C. The Value of Sharing Disaggregated Information in Supply Chains. European Journal of Operational Research 2019, 277, 469–478. [Google Scholar] [CrossRef]
Giloni, A.; Hurvich, C.; Seshadri, S. Forecasting and Information Sharing in Supply Chains Under ARMA Demand. IIE Transactions 2014, 46, 35–54. [Google Scholar] [CrossRef]
Sayed, A.H.; Kailath, T. A Survey of Spectral Factorization Methods. Numerical Linear Algebra with Applications 2001, 8, 467–496. [Google Scholar] [CrossRef]
Yu, L. Evaluation and Analysis of Electric Power in China Based on the ARMA Model. Hindawi 2022. [Google Scholar] [CrossRef]
Rostami-Tabar, B.; Babai, M.Z.; Syntetos, A.; Ducq, Y. Forecasting aggregate ARMA(1,1) Demands: theoretical analysis of top-down versus bottom-up. International Conference on Industrial Engineering and Systems Management (Proceedings) 2013. [Google Scholar]

1	(This hold for the theoretical case when model coefficients are known and need not be estimated).
2	The situation is really identical for any player in the supply chain that might be observing multiple demand streams, where information sharing does not take place
3	In this case the reduction in MSFE appears the same (regardless of leadtime), however this is due to the series being generated by MA(1) models. For higher order ARMA models, the reduction in MSFE may be dependent on leadtime.
4	The AR and MA degrees were chosen with the understanding that these degrees increase with the number of streams subaggregated into a particular cluster, while trying to limit the complexity of the models being estimated.
5	We reduced the number of streams and clusters here due to the fact that an exhaustive algorithm requires $O (k^{N})$ iterative steps to check all possible cluster assignments where k is the number of clusters and N is the number of streams. We note that Pivot clustering has a complexity of $O (k N)$ in the event that each stream is only allowed to change clusters once.
6	The aggregate of two MA(1) streams is always MA(1)
7	If $θ_{1} = - θ_{2}$ , then $θ = 0$ and the result still holds.
8	The given direction of the inequalities must hold, otherwise $θ_{1} > θ_{2}$ in (100)

Figure 1. Runtimes of Pivot Clustering for Assigning Streams to Four Clusters Using Theoretical and Estimated MSFEs. The graph contains the average runtime of Pivot clustering for each specific number of streams being assigned to four clusters. That is, for N streams (between 10 and 20) we consider twenty randomly assigned ARMA(1,1) models for each stream. Pivot Clustering (with theoretical and estimated MSFE) is then used to obtain the four optimal clusters. The average runtime of Pivot is approximately 63 seconds when twenty streams are assigned to clusters using theoretical MSFE and 40 seconds using estimated MSFE.

Figure 2. Theoretical Subaggregated MSFE for Random Initialization of Pivot. Theoretical MSFEs are computed on the four clusters obtained by Pivot Clustering for different random initializations. The MSFE of the initial random assignment is provided as well as the MSFE that is obtained by Pivot Clustering. Horizontal lines are drawn to represent the MSFE based on the fully aggregated demand sequence (top) and the MSFE based on the fully disaggregated demand sequences (bottom).

Figure 3. Estimated Subaggregated MSFE for Random Initialization of Pivot. Estimated MSFEs are computed for different initializations of Pivot Clustering. The MSFE of the initial random assignment is provided as well as the MSFE that is obtained by Pivot Clustering. Horizontal lines are drawn to represent the MSFE based on the fully aggregated demand sequence (top) and the MSFE based on the fully disaggregated demand sequences (bottom).

Figure 4. Theoretical Subaggregated MSFE Found by Pivot for Different Sets of Streams. Theoretical MSFEs are computed for twenty simulations using different sets of twenty streams in each simulation. We note that using individual streams to forecasts leads to the lowest MSFE while basing the forecast on the aggregate of the streams always leads to the highest MSFE. If subaggregated clusters are formed from the streams, the lowest MSFEs are obtained when clusters are based on Pivot Clustering.

Figure 5. Estimated Subaggregated MSFE Found by Pivot for Different Sets of Streams. Estimated MSFEs are computed for twenty simulations using different sets of twenty streams in each simulation. We note that using individual streams to forecasts leads to the lowest MSFE while basing the forecast on the aggregate of the streams always leads to the highest MSFE. If subaggregated clusters are formed from the streams, the lowest MSFEs are obtained when clusters are based on Pivot Clustering.

Figure 6. Theoretical Subaggregated MSFE Found by Pivot for Different Sets of Streams. Theoretical MSFEs are computed for twenty simulations using different sets of twenty streams in each simulation and different initial assignments of streams to clusters. When averaging the final MSFEs based on the different initializations for each set of streams, we note that using individual streams to forecasts leads to the lowest averaged MSFE while basing the forecast on the aggregate of the streams always leads to the highest averaged MSFE. If subaggregated clusters are formed from the streams, the lowest averaged MSFEs are obtained when clusters are based on Pivot Clustering.

Figure 7. Estimated Subaggregated MSFE Found by Pivot for Different Sets of Streams. Estimated MSFEs are computed for twenty simulations using different sets of twenty streams in each simulation and different initial assignments of streams to clusters. When averaging the final MSFEs based on the different initializations for each set of streams, we note that using individual streams to forecasts leads to the lowest averaged MSFE while basing the forecast on the aggregate of the streams always leads to the highest averaged MSFE. If subaggregated clusters are formed from the streams, the lowest averaged MSFEs are obtained when clusters are based on Pivot Clustering.

Figure 8. Twenty simulations are carried out where within each simulation we select a new set of ARMA(1,1) coefficients for each of 10 streams. The streams are then clustered using Pivot clustering using theoretical MSFE based on 10 different starting groups. We also obtain the optimal (minimum MSFE) clustering assignment based on an exhaustive search of all possible assignments of streams to clusters. Each row corresponds to a new simulation. The three columns contain the mean, median and minimum ratios of global optimal MSFE to the MSFE obtained by Pivot for the different initializations in the event that Pivot clustering did not find the optimal solution.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permit the free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.

MDPI Initiatives

Important Links

Choose an area of interest and we will send you notifications of new preprints at your preferred frequency.

Disclaimer

Pivot Clustering to Minimize Error in Forecasting Aggregated ARMA Demand

Abstract

1. Introduction

2. Model Framework

2.1. Forecasting Using the Disaggregated (Individual) Sequences { X 1 , τ } τ = − ∞ t , … , { X n , τ } τ = − ∞ t

2.2. ARMA Representation of a Summed Sequence { S τ = X 1 , τ + … + X s , τ } τ = − ∞ t

2.3. Forecasting Using the Fully Aggregated Sequence { D τ } τ = − ∞ t For A General Leadtime

2.4. Forecasting Using Subaggregated Sequences

2.5. Example

3. The Benefit of Forecasting Using Subaggregated Clusters

4. TS Clustering Algorithms and Pivot Clustering, Empirical Evaluation

5. Clustering Demand Streams Through Minimizing An Objective Function Based on Subaggregated MSFE

6. MA(1) Streams

6.1. Aggregate of Two MA(1) Streams

7. Demand Streams Produced by Identical ARMA models

8. Extensions and Other Questions

References

MDPI Initiatives

Important Links

Subscribe

2.1. Forecasting Using the Disaggregated (Individual) Sequences ${X_{1, τ}}_{τ = - \infty}^{t}, \dots, {X_{n, τ}}_{τ = - \infty}^{t}$

2.2. ARMA Representation of a Summed Sequence ${S_{τ} = X_{1, τ} + \dots + X_{s, τ}}_{τ = - \infty}^{t}$

2.3. Forecasting Using the Fully Aggregated Sequence ${D_{τ}}_{τ = - \infty}^{t}$ For A General Leadtime