Preprint
Article

Pivot Clustering to Minimize Error in Forecasting Aggregated ARMA Demand

This version is not peer-reviewed.

Submitted:

19 September 2023

Posted:

10 October 2023

Read the latest preprint version here

A peer-reviewed article of this preprint also exists.

Abstract
In this paper we compare the effects of forecasting demand using individual (disaggregated) components versus first aggregating the components either fully or into several clusters. Demand streams are assumed to follow autoregressive moving average (ARMA) processes. Using individual demand streams will always lead to a superior forecast compared to any aggregates, however we show that if several aggregated clusters are formed in a structured manner then these subaggregated clusters will lead to a forecast with minimal increase in mean-squared forecast error. We show this result based on theoretical MSFE obtained directly from the models generating the clusters as well as estimated MSFE obtained directly from simulated demand observations. We suggest a pivot-algorithm, that we call Pivot Clustering, to create these clusters. We also provide theoretical results to investigate sub-aggregation, including for special cases such as, aggregating MA(1) streams and ARMA streams with similar or same parameters.
Keywords: 
Subject: 
Business, Economics and Management  -   Econometrics and Statistics

1. Introduction

Modern-day technologies not only permit firms to accurately track their point of sales data and lost sales data but also gather more granular data. These data streams deluge firms with information which either can be aggregated for planning purposes or considered in its entirety or follow an in-between approach. In this paper we analyze a model in which a retailer is faced with exactly the same choices and provide guidelines for combining the data for the purpose of forecasting demand.
Consider a retailer who has access to its individual customer’s demand streams. Assume that each of these demand streams are ARMA with possibly contemporaneously correlated shock sequences. The primary contribution of this research is to quantify to what extent such a retailer would benefit from forecasting each of the individual streams as opposed to the aggregate. In general, retailers forecast their aggregate demand stream since historically the retailer may only have accurate aggregate demand information and the forecasting of the individual customer demand streams is often thought of as being cumbersome and time consuming. We demonstrate that a retailer observing multiple streams of ARMA demand can drastically reduce its mean squared forecasting error (MSFE) by forecasting the individual demand streams as opposed to just the aggregate demand stream as noted in [4].
Although the retailer’s MSFE is never lower by forecasting aggregate ARMA demand compared to the individual ARMA streams, there are cases when the individual ARMA streams do not reduce the MSFE. The primary situation where this occurs, as is discussed in the paper, is when various demand streams have identical ARMA parameter values. For the in-between case, our results demonstrate that the retailer’s MSFE under aggregate forecasting can be greatly reduced if the retailer forecasts various clusters of aggregated demand streams. In other words, retailers can make use of data mining and other clustering approaches in order to generate clusters of similar customers and their demand streams. We show that such clustering methods significantly enhance forecast accuracy.
Many researchers and practitioners focus on the need to determine clusters of similarly situated customers in order to create and provide customized and/or personalized products. This type of clustering is generally performed on specific characteristics that customers possess (see for example, [1] and the references within). On the other hand, our focus is on forecasting demand for a product by a firm’s customers, recognizing that these customers may have different preferences and hence differing demand. We demonstrate that using forecasting efficiency is an alternative approach to generating the clusters or segments of the firm’s customers. As we describe below, from the demand forecasting perspective, the information contained within the individual demand streams provides the optimal forecast (in terms of minimizing the MSFE and hence inventory related costs) for product demand. Nonetheless, there has been research on the use of clustering methods within a forecasting environment when customer demand data is high dimensional (see for example, [6]).
As opposed to generating clusters based upon customer preferences and customer demographics, we explain how clusters can be generated explicitly from the individual time series structure of the individual demand streams or customers. Even though, it is always optimal from a forecasting perspective to use the individual streams, clusters of similar customer streams may be very helpful to the firm for other reasons as described above. Future empirical work would be necessary to investigate to what extent and in what contexts, clusters generated based upon time series structure of the demand streams correlate to clusters based upon other customer preferences. In such a case where there exists such a relationship, firms could use clustering for simplifying their demand forecasting while identifying groups of customers to receive personalized products.
The purpose of this paper is to demonstrate that clustering based upon time series structure can be utilized within demand forecasting that is superior to forecasting aggregate demand and nearly as good as forecasting the individual demand streams. The structure of our paper is as follows. In the next section, we describe the demand framework and supply chain setting of our research problem, as well as the way that theoretical MSFE computations are determined for the various forecasts (using aggregated demand processes) included herein. In Section 3, we illustrate (through example) that there exists a particular set of subaggregated clusters which results in an MSFE that is close to the MSFE obtained from using disaggregated streams and much lower than the MSFE obtained from the fully aggregated sequence. In Section 4, we describe how to cluster ARMA demand series using Pivot Clustering and how this compares to other clustering methods. In Section 5, we describe an objective function that can be minimized to obtain an optimal assignment of streams to clusters in terms of MSFE reduction. Finally, we obtain theoretical results on how MA(1) demand streams can be clustered in the most efficient way possible to reduce the resulting subaggregated MSFE in Section 6.

2. Model Framework

We consider a retailer with possibly many large customers. In general, retailers forecast their aggregate demand stream since the forecasting of the individual customer demand streams is often thought of as being cumbersome and time consuming. We demonstrate that a retailer observing multiple streams of ARMA demand can drastically reduce its Mean Squared Forecasting Error by forecasting the individual demand streams or aggregated clusters of similar demand streams. Specifically, consider a retailer that observes multiple demand streams for a single product { X 1 , t } , { X 2 , t } , , { X N , t } . Each demand stream { X k } is ARMA with respect to a sequence of shocks { ϵ k , t } given by
Φ k ( B ) X k , t = Θ k ( B ) ϵ k , t
such that { X k , t } is invertible and causal with respect to { ϵ k , t } (see Brockwell and Davis, page 77 for a definition and discussion about causal and invertible ARMA models). We denote the variance of each shock sequence σ k 2 = E [ ϵ k 2 ] . Furthermore we note that the shock sequences are potentially contemporaneously correlated with σ i j = E [ ϵ i , t ϵ j , t ] . In general, this set up guarantees that the shocks ϵ k , t are the retailer’s Wold shocks and that the one-step-ahead MSFE of one-step-ahead leadtime demand (when using the disaggregated (individual) streams) is the sum of the elements in the covariance matrix Σ ϵ where Σ i j = σ i j such that σ k 2 = σ k k (see Equation (10).
The focus of this paper is evaluating the difference in one-step-ahead MSFEs at time t when the forecast of leadtime demand, given by i = 1 + 1 ( X 1 , t + i + X 2 , t + i + + X N , t + i ) , is based on the disaggregated (individual) sequences { X 1 , τ } τ = t , { X 2 , τ } τ = t , , { X N , τ } τ = t versus when the forecast is based on the aggregated sequence { D τ = X 1 , τ + X 2 , τ + + X N , τ } τ = t versus when the forecast is based on subaggregated clusters { C 1 , τ } τ = t , { C 2 , τ } τ = t , , { C n , τ } τ = t where C k , τ = X 1 , τ C k + + X n C k , τ C k and the superscript C k indicates that { X t } belongs to cluster C k . Studying one-step-ahead MSFEs is mathematically simpler than those for general leadtimes since the former does not depend on model parameters.
Our problem is related to the one posed by [5] where a two-stage supply chain was considered with the retailer observing two demand streams. The focus of that paper was in evaluating information sharing between the retailer and supplier in a situation where the retailer forecasts each demand stream separately. Here we show the benefit to the retailer1 in determining the separate forecasts, while considering the existence of (possibly) more than two demand streams. Kohn [4] was the first to identify conditions under which using the separate ARMA streams leads to a better forecast than using the aggregate sequence, however he did not determine the MSFE in the two cases.
In this paper we extend the results of [5] by providing formulas for computing MSFEs under the possibility of more than two streams and a general leadtime in the situation where a player’s (retailer’s) forecast can only be based on their Wold shocks (see [2] pp 187-188 for a description of a Wold decomposition of a time series). We also describe how a retailer would forecast its demand by identifying clusters of similar demand streams and then forecasting each cluster after it is aggregated. Assuming general ARMA demand streams, we provide a pivot algorithm, which we call Pivot Clustering, for identifying a locally optimal assignment of streams into a fixed number of clusters. The algorithm will often find the best possible assignment. We also describe a fast clustering algorithm for assigning independent MA(1) streams to clusters which results in a globally optimal assignment. Our results show that after the algorithms are carried out, much of the benefit of forecasting individual streams can be obtained by forecasting the aggregated clusters.

2.1. Forecasting Using the Disaggregated (Individual) Sequences { X 1 , τ } τ = t , , { X n , τ } τ = t

In this section we are interested in forecasting i = 1 + 1 ( X 1 , t + i + X 2 , t + i + + X N , t + i ) when the forecast is based on the individual sequences { X 1 , τ } τ = t , { X 2 , τ } τ = t , , { X N , τ } τ = t . The forecasting contained here is a textbook multivariate time-series result along with the propagation described in [3] as seen in (7) .
Thus we consider the ARMA processes given by
Φ 1 ( B ) 0 0 0 Φ 2 ( B ) 0 0 0 0 0 0 0 Φ N ( B ) X 1 , t X 2 , t X N , t = Θ 1 ( B ) 0 0 0 Θ 2 ( B ) 0 0 0 0 0 0 0 Θ N ( B ) ϵ 1 , t ϵ 2 , t ϵ N , t
with covariance matrix of the shock sequences given by
Σ ϵ = σ 1 2 σ 12 σ 1 N σ 21 σ 2 2 σ 2 N σ 3 N σ N 1 σ N 2 σ N 2
such that σ i j = σ j i .
The total MSFE of the BLF is given as the sum of the elements in the NxN matrix
E i = 1 + 1 X 1 , t + i i = 1 + 1 X 1 , t + i ^ , , i = 1 + 1 X N , t + i i = 1 + 1 X N , t + i ^ i = 1 + 1 X 1 , t + i i = 1 + 1 X 1 , t + i ^ , , i = 1 + 1 X N , t + i i = 1 + 1 X N , t + i ^
which takes the form
E i = 1 + 1 X 1 , t + i i = 1 + 1 X 1 , t + i ^ 2 E i = 1 + 1 X 1 , t + i i = 1 + 1 X 1 , t + i ^ i = 1 + 1 X N , t + i i = 1 + 1 X N , t + i ^ E i = 1 + 1 X N , t + i i = 1 + 1 X N , t + i ^ i = 1 + 1 X 1 , t + i i = 1 + 1 X 1 , t + i ^ E i = 1 + 1 X N , t + i i = 1 + 1 X N , t + i ^ 2
We note that the diagonal elements are those obtained in [3]. Specifically,
E i = 1 + 1 X k , t + i i = 1 + 1 X k , t + i ^ 2 = σ k 2 i = 0 ω k , i 2
where
ω k , i = 0 i < 0 ψ k , i i = 0 ω k , i 1 + ψ k , i 0 < i < + 1 ω k , i 1 + ψ k , i ψ k , i 1 i + 1 .
where ψ k , i is the i t h coefficient appearing in the MA() representation of { X k , t } with respect to { ϵ k , t } . That is 1 + ψ k , 1 z + ψ k , 2 z 2 + = Ψ k ( z ) and
X k , t = Ψ k ( B ) ϵ k , t .
We note that Ψ k ( z ) = Θ k ( z ) Φ k ( z ) .
Remark 1. 
The off-diagonal elements of (5) can be summarized as
E i = 1 + 1 X j , t + i i = 1 + 1 X j , t + i ^ i = 1 + 1 X k , t + i i = 1 + 1 X k , t + i ^ = σ j k i = 0 ω j , i ω k , i
Thus the MSFE of the BLF of leadtime demand when using the individual sequences
{ X 1 , τ } τ = t , { X 2 , τ } τ = t , , { X N , τ } τ = t
is given by
M S F E i n d = k = 1 N j = 1 N i = 0 σ k j ω j , i ω k , i
where we observe the equivalence σ k k = σ k 2 .
Thus we compare (10) with (27) below to determine the value of using the individual sequences as opposed to the aggregate.

2.2. ARMA Representation of a Summed Sequence { S τ = X 1 , τ + + X s , τ } τ = t

In this subsection we determine the ARMA representation and the variance of the shocks appearing in the Wold representation of a series { S t } , which is the sum of several ARMA streams given by { S t } = { X 1 , t + X 2 , t + + X s , t } . This will allow us to determine the MSFE when forecasts are based on the fully aggregated series as well as when the forecasts are based on subaggregated clusters. In order to obtain the ARMA representation we first need to obtain the spectral density f S ( λ ) and the covariance generating function G S ( z ) of { S t } .
Proposition 1. 
Let { S t } = { X 1 , t + X 2 , t + + X s , t } . The spectral density of { S t } is given by
f S ( λ ) = i = 1 s f X i ( λ ) + i = 1 s 1 j = i + 1 s f X i X j ( λ ) + f ¯ X i X j ( λ )
where the cross-spectrum f X i X j ( λ ) is defined f X i , X j ( λ ) = 1 2 π r = e i λ r C X i X j ( r ) with C X i X j ( r ) = E [ X i , t + r X j , t ] .
Proof. 
We will prove this by induction. Note that when { S t } = { X 1 , t + X 2 , t } , f S ( λ ) is given by
f S ( λ ) = f X 1 ( λ ) + f X 2 ( λ ) + f X 1 X 2 ( λ ) + f ¯ X 1 X 2 ( λ )
and { S t } is ARMA. This matches representation (11).
Now suppose (11) holds for { S n , t } = { X 1 , t + + X n , t } processes. Thus
f S n ( λ ) = i = 1 n f X i ( λ ) + i = 1 n 1 j = i + 1 n f X i X j ( λ ) + f ¯ X i X j ( λ )
Consider { S n + 1 } = { S n , t + X n + 1 , t } . Since { S n , t } is ARMA, we observe that
f S n + 1 ( λ ) = f S n ( λ ) + f X n + 1 ( λ ) + f S n , X n + 1 ( λ ) + f ¯ S n , X n + 1 ( λ )
Note that starting with the definition of f S n , X n + 1 ( λ ) ,
f S n , X n + 1 ( λ ) = 1 2 π r = e i λ r E [ S n , t X n + 1 , t + r ]
= 1 2 π r = e i λ r E [ X 1 , t X n + 1 , t + r + + X n , t X n + 1 , t + r ]
= 1 2 π r = e i λ r E [ X 1 , t X n + 1 , t + r ] + + 1 2 π r = e i λ r E [ X n , t X n + 1 , t + r ]
= f X 1 X n + 1 ( λ ) + + f X n X n + 1 ( λ )
Similarly,
f ¯ S n , X n + 1 ( λ ) = f ¯ X 1 X n + 1 ( λ ) + + f ¯ X n X n + 1 ( λ )
Thus from (13),
f S n + 1 ( λ ) = f S n ( λ ) + f X n + 1 ( λ ) + f X 1 X n + 1 ( λ ) + + f X n X n + 1 ( λ ) + f ¯ X 1 X n + 1 ( λ ) + + f ¯ X n X n + 1 ( λ )
or equivalently
f S n + 1 ( λ ) = i = 1 n f X i ( λ ) + i = 1 n 1 j = i + 1 n f X i X j ( λ ) + f ¯ X i X j ( λ ) + f X n + 1 ( λ ) + f X 1 X n + 1 ( λ ) + + f X n X n + 1 ( λ ) + f ¯ X 1 X n + 1 ( λ ) + + f ¯ X n X n + 1 ( λ )
which can be simply written as
f S n + 1 ( λ ) = i = 1 n + 1 f X i + i = 1 n j = i + 1 n + 1 f X i X j ( λ ) + f ¯ X i X j ( λ )
and the result is proved. □
Now consider the covariance generating function G S ( z ) = j = E [ S t S t j ] z j of { S t } . Here we use the equivalence G S ( e i λ ) = 2 π f S ( λ ) and note the following:
f X i ( λ ) = σ i 2 2 π | Θ i ( e i λ ) | 2 | Φ i ( e i λ ) | 2
f X i X j ( λ ) = σ i j 2 π Θ i ( e i λ ) Φ i ( e i λ ) Θ j ( e i λ ) Φ j ( e i λ )
f ¯ X i X j ( λ ) = σ i j 2 π Θ i ( e i λ ) Φ i ( e i λ ) Θ j ( e i λ ) Φ j ( e i λ )
Thus from Proposition 1 we observe that
G S ( z ) = i = 1 s σ i 2 Θ i ( z ) Θ i ( z 1 ) Φ i ( z ) Φ i ( z 1 ) + i = 1 s 1 j = i + 1 s σ i j Θ i ( z ) Φ i ( z ) Θ j ( z 1 ) Φ j ( z 1 ) + σ i j Θ i ( z 1 ) Φ i ( z 1 ) Θ j ( z ) Φ j ( z )
As described in Theorem 5 of [5], the covariance generating function G S ( z ) can be factorized as the ratio O ( z ) P ( z ) Q ( z ) where O ( z ) , P ( z ) and Q ( z ) are Laurent polynomials, with O ( z ) having all its roots on the unit circle and P ( z ) and Q ( z ) having no roots on the unit circle. This result follows from the fact that each additive term in (77) is a ratio of Laurent Polynomials and the fact that for any Laurent polynomial P ( z ) , both P ( z ) P ( z 1 ) and P ( z ) + P ( z 1 ) will be Laurent polynomials. Furthermore if P 1 ( z ) and P 2 ( z ) are Laurent polynomials then P 1 ( z ) P 2 ( z ) and P 1 ( z ) + P 2 ( z ) will be Laurent polynomials as well.
We can now use the factorization provided in [7] and described in [5] to obtain the ARMA representation of { S t } with respect to the shocks { ϵ t } appearing in its Wold representation. It should be noted that Remark 2 is simply a restatement of Theorem 5 of [5] with the slight addition of determining the polynomials appearing in the ARMA representation of the aggregate sequence { S t } .
Remark 2. 
{ S t } can be represented with respect to shocks { ϵ t } as the ARMA model
Φ ( B ) S t = Θ ( B ) ϵ t
where Θ ( z ) = i = 1 m ( 1 a i z ) where { a i } are the roots of O ( z ) P ( z ) on or inside the unit circle and Φ ( z ) = i = 1 n ( 1 b i z ) where { b i } are the roots of Q ( z ) inside the unit circle. Furthermore
σ ϵ 2 = E [ ϵ t 2 ] = p m j = 1 m ( 1 / a j ) q n j = 1 n ( 1 / b j )
where p m is the coefficient of z m in P ( z ) and q n is the coefficient of z n in Q ( z ) .

2.3. Forecasting Using the Fully Aggregated Sequence { D τ } τ = t for A General Leadtime

We note that Remark 2 can be used to obtain the ARMA representation of { D t } with respect to its Wold shocks { ϵ t } as well as σ ϵ 2 = E [ ϵ t 2 ] . We can therefore use Lemma 1 of [3] and its proof to determine the best linear forecast (BLF) of k = 1 + 1 D t + k and its MSFE when forecasting using { D τ } τ = t . That is, if we consider the MA() representation of { D t } with respect to { ϵ t } given by
D t = Ψ ( B ) ϵ t
where Ψ ( z ) = 1 + Ψ 1 z + Ψ 2 z 2 + , then the MSFE of the BLF when using { D τ } τ = t is
M S F E a g g = σ ϵ 2 i = 0 ω i 2
where
ω i = 0 i < 0 ψ i i = 0 ω i 1 + ψ i 0 < i < + 1 ω i 1 + ψ i ψ i 1 i + 1 .
We note that Ψ ( z ) = Θ ( z ) Φ ( z ) .

2.4. Forecasting Using Subaggregated Sequences

In this section we use the results in the previous section to create a forecast and compute its MSFE when some of the individual streams are subaggregated. That is, for k 1 n , let cluster C k consist of n C k streams such that { C k , t = X 1 , t C k + + X n C k , t C k } . We are interested in the forecast and MSFE based on { C 1 , τ } τ = t , , { C n , τ } τ = t .
Section 2.2 describes how we can obtain the ARMA representation and variance of shocks appearing in the Wold representation for each sequence { C k , t } . This can then be used to create a forecast from each subaggregated sequence, the sum of which can be taken as the forecast for D t + + 1 . The one-step ahead MSFE of this forecast is the sum of the entries of the covariance matrix of the shocks appearing in the Wold representation of { ( C 1 , t , , C n , t } . Equation (10) describes how we can also use the covariance matrix of the shocks of { ( C 1 , t , , C n , t } to obtain the MSFE for multi-step ahead forecasts. The remainder of this section will focus on obtaining this covariance matrix.
Without loss of generality, consider two subaggregated series { C 1 , t = X 1 , t + + X a , t } and { C 2 , t = X a + 1 , t + + X b , t } with ARMA representations
ϕ 1 ( B ) C 1 , t = θ 1 ( B ) ϵ 1 , t
ϕ 2 ( B ) C 2 , t = θ 2 ( B ) ϵ 2 , t
where { ϵ 1 , t } and { ϵ 2 , t } are the shocks appearing in the Wold representation of { C 1 , t } and { C 2 , t } respectively. We note that the variances of { ϵ 1 , t } and { ϵ 2 , t } can be obtained using Remark 2. To obtain the covariance σ 12 = E [ ϵ 1 , t ϵ 2 , t ] consider the following.
We can rewrite the ARMA representations above as
ϵ 1 , t = ϕ 1 ( B ) θ 1 ( B ) C 1 , t
ϵ 2 , t = ϕ 2 ( B ) θ 2 ( B ) C 2 , t
The ARMA representations of { X 1 , t } , , { X b , t } can also be rewritten as
X 1 , t = Θ 1 ( B ) Φ 1 ( B ) ϵ 1 , t X b , t = Θ b ( B ) Φ b ( B ) ϵ b , t
Based on the definition of C 1 , t = X 1 , t + + X a , t and C 2 , t = X a + 1 , t + + X b , t we observe that
ϵ 1 , t = ϕ 1 ( B ) θ 1 ( B ) Θ 1 ( B ) Φ 1 ( B ) ϵ 1 , t + + Θ a ( B ) Φ a ( B ) ϵ a , t
ϵ 2 , t = ϕ 2 ( B ) θ 2 ( B ) Θ a + 1 ( B ) Φ a + 1 ( B ) ϵ a + 1 , t + + Θ b ( B ) Φ b ( B ) ϵ b , t
To obtain E [ ϵ 1 , t ϵ 2 , t ] we need to compute the expectation of the the product of the right-hand-sides of these two equations. Thus we need to consider the sum of terms such as
E ϕ 1 ( B ) θ 1 ( B ) Θ i ( B ) Φ i ( B ) ϵ i , t ϕ 2 ( B ) θ 2 ( B ) Θ j ( B ) Φ j ( B ) ϵ j , t .
Note that we can write
ϕ 1 ( B ) θ 1 ( B ) Θ i ( B ) Φ i ( B ) ϵ i , t = k = 0 ψ ˜ i , k ϵ i , t k
and
ϕ 2 ( B ) θ 2 ( B ) Θ j ( B ) Φ j ( B ) ϵ j , t = k = 0 ψ ˜ j , k ϵ j , t k
where ψ ˜ i , k and ψ ˜ j , k can be obtained in the same way as the MA() coefficients in (28). Hence the term in Equation (36) can be rewritten as
E k = 0 ψ ˜ i , k ϵ i , t k k = 0 ψ ˜ j , k ϵ j , t k
Since the shock sequences are not correlated across time by assumption, this is equivalent to
E k = 0 ψ ˜ i , k ψ ˜ j , k ϵ i , t k ϵ j , t k
or equivalently
k = 0 ψ ˜ i , k ψ ˜ j , k σ i j
Adding up these terms as required would yield the covariance. Hence
σ 12 = E [ ϵ 1 , t ϵ 2 , t ] = i = 1 a j = a + 1 b k = 0 ψ ˜ i , k ψ ˜ j , k σ i j
We note that this methodology can easily be extended to obtain any of the covariances in the covariance matrix
Σ ϵ = σ 11 σ 1 n σ 21 σ 2 n σ n 1 σ n n
We will use the previous methodology for all theoretical MSFE computations found in this paper. In the next subsection, we provide an example describing the importance of forecasting leadtime demand based upon the individual sequences.

2.5. Example

Consider a retailer that observes Aggregate demand { D t = X 1 , t + X 2 , t + X 3 , t } where each individual demand stream is given by the following ARMA processes:
X 1 , t = ( 1 . 9 B ) ϵ 1 , t
X 2 , t = ( 1 + . 9 B ) ϵ 2 , t
X 3 , t = ( 1 + . 9 B ) ϵ 3 , t
where the shock covariance matrix is given by Σ = 1.6 1.4 0.5 1.4 1.3 0.8 0.5 . 8 2.0
We use the results described in Section 2.1 and Section 2.2 to compare the MSFE of the forecasts of leadtime demand j = 1 + 1 D t + j when using the individual sequences { X 1 , t } τ = t , { X 2 , t } τ = t , { X 3 , t } τ = t versus when using the aggregate sequence { D τ } τ = t .
The covariance generating function of { D t } is given by
S D ( z ) = 0.090 z 1 + 5.631 + 0.090 z 1
and the ARMA representation of { D t } is given by
D t = ( 1 + . 01598704 B ) ϵ t
where σ ϵ 2 = 5.629561 . We can check that the ARMA representation of { D t } and its covariance generating function match up by noting that
S D ( z ) = σ ϵ 2 Θ ( z ) Θ ( z 1 ) Φ ( z ) Φ ( z 1 ) = 5.629561 ( 1 + . 01598704 z ) ( 1 + . 01598704 z 1 ) = 0.090 z 1 + 5.631 + 0.090 z
with Θ ( z ) = ( 1 + . 01598704 z ) and Φ ( z ) = 1 being the MA and AR polynomials appearing in the ARMA representation of { D t } as defined in Remark 2.
We note that the one-step ahead forecast then has a MSFE=5.629561 and a two-step ahead forecast (of D t + 1 + D t + 2 ) has a MSFE=11.44056.
To obtain the MSFE when using the individual sequences we consider the multivariate-ARMA representation
1 0 0 0 1 0 0 0 1 X 1 , t X 2 , t X 3 , t = 1 . 9 B 0 0 0 1 + . 9 B 0 0 0 1 + . 9 B ϵ 1 , t ϵ 2 , t ϵ 3 , t
with the shock-covariance matrix as stated earlier. Using Equation (10) and noting that ω k , 0 = ψ k , 0 = 1 in Equation (7) we see that the one-step ahead forecast has a MSFE=1.5. We note that this forecast error is 4.129561 (73.3%) lower that the forecast error when using the aggregated series. We can likewise compute the list of elements ω k , i when forecasting D t + 1 + D t + 2 . In this case, ω k , 0 = ψ k , 0 = 1 and ω 1 , 1 = . 1 , ω 2 , 1 = 1.9 , and ω 3 , 1 = 1.9 . From (10) this implies that the MSFE=7.311, which is also 4.129561 (36.1%) lower than the forecast error when using the aggregated series. We note that in general the difference in MSFE when forecasting using the aggregate or the individual series will be different for different lead-times however when considering individual series other than MA(1). In the next section we demonstrate, with example, the benefit of forecasting using subaggregated clusters (which would have been identified using a suitable technique).

3. The Benefit of Forecasting Using Subaggregated Clusters

In this section, we consider a retailer that has ten demand streams, which aggregate into three clusters consisting of similar streams. Specifically, suppose the retailer observes Aggregate demand { D t = X 1 , t + X 2 , t + + X 10 , t } where each individual demand stream is given by the following ARMA processes:
( 1 . 3 B . 6 B 2 ) X 1 , t = ( 1 . 6 B . 2 B 2 ) ϵ 1 , t
( 1 . 35 B . 5 B 2 ) X 2 , t = ( 1 . 65 B . 15 B 2 ) ϵ 2 , t
( 1 . 27 B . 55 B 2 ) X 3 , t = ( 1 . 63 B . 17 B 2 ) ϵ 3 , t
( 1 . 8 B ) X 4 , t = ϵ 4 , t
( 1 . 9 B ) X 5 , t = ϵ 5 , t
( 1 . 75 B ) X 6 , t = ϵ 6 , t
( 1 + . 77 B ) X 7 , t = ( 1 + . 6 B ) ϵ 7 , t
( 1 + . 68 B ) X 8 , t = ( 1 + . 55 B ) ϵ 8 , t
( 1 + . 73 B ) X 9 , t = ( 1 + . 52 B ) ϵ 9 , t
( 1 + . 7 B ) X 10 , t = ( 1 + . 5 B ) ϵ 10 , t
where the shock covariance matrix is given by
Σ = 2 1 0.8 0.9 1.2 1.5 0.8 0.9 0.95 1 1 2.1 0.7 0.6 0.5 0.4 0.21 0.31 0.36 0.39 0.8 0.7 2.2 0.5 1.3 1 0.4 0.8 1 1.1 0.9 0.6 0.5 3 1.8 1.9 2 2.1 2.2 2.3 1.2 0.5 1.3 1.8 3.2 2 1.9 1.8 1.7 1.5 1.5 0.4 1 1.9 2 3.3 2.2 2.3 2.4 2.5 0.8 0.21 0.4 2 1.9 2.2 5 1 1.25 1.5 0.9 0.31 0.8 2.1 1.8 2.3 1 5.1 1.3 1.6 0.95 0.36 1 2.2 1.7 2.4 1.25 1.3 5.7 1.8 1 0.39 1.1 2.3 1.5 2.5 1.5 1.6 1.8 5.9
Consider the three natural clusters in the above ten demand streams, namely, streams 1-3, 4-6, and 7-10. It can be shown that indeed this grouping is optimal out of any other possible choice of three clusters simply by looking at all possible combinations and computing their MSFEs. Our analysis of the MA(1) case in Section 6 also points to these being the correct clusters based on the proximity of the ARMA coefficients between the demand streams. We demonstrate that even though the best the retailer can do in such a situation is forecast all ten streams individually, if the retailer would correctly cluster the demand streams as mentioned, the results are similar. All MSFEs stated in this section are obtained using the methods described in Section 2.
Specifically in this case if the retailer were to forecast the individual streams, its one-step ahead theoretical MSFE would be 21.64. If the retailer were to consider the subaggregated processes consisting of the correct clusters and forecast these separately then the MSFE would be 21.74. This is in stark contrast to a MSFE of 61.39 when forecasting using the aggregate process of all ten streams. In other words, it is sufficient to determine clusters of similar customers as opposed to forecasting each individual stream in order to keep inventory related costs down.
To see the impact of choosing the correct clusters we consider the case that the retailer incorrectly clusters the streams as 1-2, 3-5, and 6-10. In this case the MSFE rises to 33.4. Similarly, if the clusters chosen are 1 & 4 , 2 & 3 & 5 , and 6-10 then the MSFE is 45.04. Assigning streams randomly to three clusters consisting of three, two and five streams yields the following table.
MSFE Clusters
52.34495576 6,10,9 and 1,2 and 8,3,4,5,7
51.90912188 2,10,5 and 3,8 and 4,1,7,6,9
31.40789218 7,2,10 and 8,9 and 5,4,6,3,1
44.15962369 10,1,7 and 3,4 and 8,2,9,6,5
50.32525078 5,8,3 and 1,10 and 4,2,7,6,9
39.31100769 6,9,5 and 10,7 and 1,3,8,4,2
45.09358141 3,1,10 and 5,8 and 2,4,9,7,6
51.54828609 9,5,10 and 4,2 and 7,6,8,1,3
34.21154829 8,7,2 and 1,9 and 6,4,5,10,3
55.21445794 6,1,10 and 2,4 and 7,3,8,5,9
We note that there could be substantial reduction in MSFE even when multiple streams are clustered incorrectly. In the next section, we demonstrate how a retailer would be able to generate clusters of its individual demand streams using Pivot Clustering.

4. TS Clustering Algorithms and Pivot Clustering, Empirical Evaluation

We have shown that the when the retailer forecasts using clusters of its demand streams as opposed to the individual streams, its MSFE can vary greatly from close to the optimal value (when forecasting using the individual streams) to close to the MSFE of a forecast based on the aggregate of all the streams. Hence, if the retailer wishes to forecast using clusters of its demand streams, the selection of the clusters is important. In general, a retailer might have customer related information that could be used to generate the clusters. The benefit of generating the clusters and forecasting them is that there could be situations where it would be cumbersome for the retailer to collect the individual demand streams and service them individually. Forecasting clusters would therefore be a second best option.
We propose Pivot Clustering for determining clusters which usually results in a relatively low MSFE among all choices of streams to clusters. We consider two ways to obtain the subaggregated MSFE based on some clustering assignment. The first is to use the individual ARMA demand models appearing in (2) to compute the subaggregated theoretical MSFE as per Section 2.2, Section 2.3 and Section 2.4. We note that if there is only one cluster then the MSFE is the one computed for a forecast based on the aggregate of the all the streams while if the number of clusters is equal to the number of streams, the MSFE is for a forecast based on the disaggregated (individual) sequences. We also estimate the MSFE by generating demand realizations for each stream based on (2) and (3). We then subaggregate the realizations to determine the subaggregated realizations for each cluster of demand streams and use these to estimate ARMA(5,5)2 models anytime it is necessary to obtain the ARMA model for a certain cluster and compute the in-sample forecast errors based on these realizations. The covariance matrix of these forecast errors is then used to estimate the MSFE for a particular assignment of streams to clusters. In the analysis below we see that the estimated MSFEs are close to theoretical ones and often lead to similar choices of clusters.
Based on a predetermined number of clusters n, Pivot Clustering works as follows.
For each stream, randomly assign it to a cluster.
  • For each cluster,
    • For each stream in the cluster,
      • Compute or estimate the MSFE for the current assignment along
      • with the resulting MSFEs if the stream was moved to each of the other clusters.
      • # MSFE can be either estimated based on realizations of the demand streams or
      • # computed using Equations (29), (30), (42) and (43).
    • Move each stream in the cluster to a cluster which leads to largest overall
    • MSFE reduction among all choice of clusters.
From a computational standpoint, it is possible to determine theoretical MSFE based on the aggregate of up to twenty ARMA(1,1) demand streams. The burden lies in having to find roots of large degree polynomials in order to determine the ARMA model and shock variance that describes the aggregated sequence. To understand the computational requirements we check the runtimes of Pivot Clustering when determining theoretical and estimated MSFEs. Based on a given number of streams (between 10 and 20) we carry out Pivot Clustering for twenty different combinations of ARMA(1,1) models and plot the average of the runtimes in Figure 1 in assigning the streams to four clusters. If using estimated MSFE then many more streams can be clustered and Pivot Clustering has faster runtimes for larger amounts of streams. Upon checking, Pivot Clustering with estimated MSFEs for 200 streams takes approximately 20 minutes.
We can check the efficacy of Pivot Clustering through simulation. We begin by randomly assigning coefficients to twenty ARMA(1,1) models to produce twenty demand streams as well as the covariance matrix of the shock sequences. We make sure that each assignment results in causal and invertible demand with respect to the shocks and that the resulting covariance matrix is positive definite. The AR and MA coefficients and covariance matrix can be found in https://github.com/vkovtun84/Pivot-Clustering-of-Demand-Streams under Models.csv and covarmat.csv.
After randomly assigning streams to one of four clusters we compute both the estimated and theoretical one-step-ahead MSFEs based on this random assignment and use it to start Pivot Clustering. We output the clusters found by Pivot as well as the MSFE of the forecast based on this set of clusters. We iterate this procedure 50 times to study how much the MSFE improves based on Pivot Clustering for the starting allocations. The MSFEs of the final clusters and random clusters can be found under MSFEresults.csv in https://github.com/vkovtun84/Pivot-Clustering-of-Demand-Streams. These can also be compared with the MSFEs of the forecast based on the individual (disaggregated) demand streams and the forecast based on fully aggregating the streams.
For the twenty demand streams and models used, the theoretical and estimated MSFEs when forecasting based on individual (disaggregated) streams are 102.1 and 96.2. The theoretical and estimated MSFEs when forecasting based on the fully aggregated streams are 231.3 and 220.6. For the 50 simulations of assigning streams to random clusters (used in the initialization step of Pivot) the average of the theoretical and estimated MSFEs based on the subaggregated random clusters are 202.2 and 194.2. After Pivot Clustering is carried out to obtain a better set of subaggregated clusters in each of the 50 simulations, the averages of the theoretical and estimated MSFEs are 109.4 and 101.0. The various theoretical and estimated MSFEs for the different initializations are provided in Figure 2 and Figure 3. We note that regardless of the initial random assignment of streams to clusters, Pivot Clustering leads to the clustering of streams such that the subaggregated MSFE is very low. In fact, typically Pivot Clustering results in clusters for which the subaggregated MSFE ends up very close to the MSFE obtained when forecasts are based on the individual (disaggregated) streams.
We can compare our results with existing time-series clustering methods. Two distance measures that can be computed for time-series realizations are available in the TSclust package for R, namely AR.PIC and AR.LPC.CEPS. These distances can be used to perform hierarchical clustering such as average-linkage clustering. The final groups determined by these methods lead to MSFEs of 123.4 and 108.8 respectively, higher than those found by Pivot starting from random assignments. We note that the cluster assignments found by these methods can also be used in the initialization of Pivot Clustering, potentially leading to even better clusters.
Since the previous simulations were carried out on only one set of twenty ARMA(1,1) demand streams, we should also check the efficacy of Pivot Clustering for other sets of streams. As such, we consider twenty simulations where within each simulation a new set of twenty demand streams is considered. We compare the estimated and theoretical MSFEs of one random assignment of streams to four clusters with the estimated and theoretical MSFEs of the four clusters obtained by Pivot Clustering. In each simulation we also compute the MSFEs that would be found when fully aggregating the streams or when considering forecasts based on individual streams as well as the MSFEs that would be found using the AR.PIC and AR.LPC.CEPS distances for hierarchical clustering streams into four clusters. The results of these simulations are displayed in Figure 4 and Figure 5. We note that if forecasts are to be based on four clusters, the lowest MSFEs are obtained when clusters are formed using Pivot Clustering. Furthermore, Pivot Clustering leads to forecasts whose MSFE is very close to the MSFE of the forecast based on the individual streams in every simulation.
We conclude with twenty simulations where in each simulation we consider a separate set of 20 streams being subaggregated into four clusters with 10 random initializations of Pivot Clustering. The means of the various theoretical and estimated MSFEs under different clustering approaches are displayed in Figure 6 and Figure 7. We note again that in every set of twenty streams, the averaged subaggregated MSFEs are very close to the disaggregated MSFEs when averaged for different initial random assignments of streams to clusters.

5. Clustering ARMA Demand Streams Through Minimizing An Objective Function Based on Subaggregated MSFE

In this section we describe how to determine the optimal assignment of streams to clusters by identifying and minimizing an objective function which computes the overall MSFE given a particular assignment of streams to clusters. We begin by assuming that the desired number of clusters is known to be n. For α 1 , 2 , n , let subaggregated cluster series { C α , t } = { X 1 , t y 1 , α + + X N , t y N , α } , where y i , α = 1 if stream { X i , t } is in cluster { C α , t } and 0 otherwise, have the ARMA representation
ϕ α ( B ) C α , t = θ α ( B ) ϵ α , t .
For α 1 , 2 , n , the shocks ϵ α , t have covariance matrix Σ ϵ given in Equation (43). We define the number of demand streams in cluster { C α , t } to be n C α = i = 0 N y i , α > 0 for all α 1 , 2 , n . Furthermore if y i , α = 1 then y i , β = 0 for all α β 1 , 2 , n . For each of n C α streams { X i , t C α } in cluster { C α , t } we adapt the notation of its ARMA representation to be
Φ i , α ( B ) X i , t C α = Θ i , α ( B ) ϵ α , i , t .
Lemma 1. 
An optimal set of clusters can be found by minimizing the subaggregated MSFE given by
M S F E s u b a g g = α = 1 n β = 1 n l = 0 ω β , l ω α , l · i = 1 n C α j = 1 n C β k = 0 ψ ˜ α , i , k ψ ˜ β , j , k σ i j
where ψ ˜ α , i , k and ψ ˜ β , j , k are obtained from the equivalence
ϕ α ( z ) θ α ( z ) Θ i , α ( z ) Φ i , α ( z ) k = 0 ψ ˜ α , i , k z k
and
ϕ β ( z ) θ β ( z ) Θ j , β ( z ) Φ j , β ( z ) k = 0 ψ ˜ β , j , k z k .
where the key terms are defined in the proof below.
Alternatively, the objective can be stated as finding the optimal set of { y 1 , , y N } such that we minimize
M S F E s u b a g g = α = 1 n β = 1 n l = 0 ω β , l ω α , l · i = 1 N j = 1 N k = 0 y i , α y j , β ψ ˜ α , i , k ψ ˜ β , j , k σ i j
where ω α , l and ψ α , l in Equation (70) are obtained through the equivalence
θ α ( z ) ϕ α ( z ) = l = 0 ψ α , l z l
where again the key terms are defined in subsequent proof.
Proof 
From Equation (10) we note that a forecast based on the clusters would have MSFE given by
M S F E s u b a g g = α = 1 n β = 1 n l = 0 σ α β ω β , l ω α , l
where
ω α , l = 0 i < 0 ψ α , l l = 0 ω α , l 1 + ψ α , l 0 < l < + 1 ω α , l 1 + ψ α , l ψ α , l 1 l + 1 .
where ψ α , l is the l t h coefficient appearing in the MA() representation of { C α , t } with respect to { ϵ α , t } . From Equation (42) we note that for any two subaggregated clusters { C α , t } and { C β , t } consisting of n C α and n C β streams respectively, the corresponding shock series { ϵ α , t } and { ϵ β , t } , the covariance E [ ϵ α , t ϵ β , t ] is expressed by
σ α β = i = 1 n C α j = 1 n C β k = 0 ψ ˜ α , i , k ψ ˜ β , j , k σ i j .
Therefore the objective is to assign streams to clusters such that we minimize the MSFE
M S F E s u b a g g = α = 1 n β = 1 n l = 0 ω β , l ω α , l · i = 1 n C α j = 1 n C β k = 0 ψ ˜ α , i , k ψ ˜ β , j , k σ i j
where ψ ˜ α , i , k and ψ ˜ β , j , k are obtained from the equivalence
ϕ α ( z ) θ α ( z ) Θ i , α ( z ) Φ i , α ( z ) k = 0 ψ ˜ α , i , k z k
and
ϕ β ( z ) θ β ( z ) Θ j , β ( z ) Φ j , β ( z ) k = 0 ψ ˜ β , j , k z k .
Alternatively, we can say that we are finding the optimal set of { y 1 , , y N } such that we minimize
M S F E s u b a g g = α = 1 n β = 1 n l = 0 ω β , l ω α , l · i = 1 N j = 1 N k = 0 y i , α y j , β ψ ˜ α , i , k ψ ˜ β , j , k σ i j
where ω α , l and ψ α , l in Equation (70) are obtained through the equivalence
θ α ( z ) ϕ α ( z ) = l = 0 ψ α , l z l
with θ α ( z ) and ϕ α ( z ) found using Remark 2 where Laurent polynomials O ( z ) , P ( z ) and Q ( z ) are obtained from the covariance generating function G C α ( z ) given by
G C α ( z ) = i = 1 N y i , α σ i 2 Θ i ( z ) Θ i ( z 1 ) Φ i ( z ) Φ i ( z 1 ) + i = 1 N 1 j = i + 1 N y i , α y j , α σ i j Θ i ( z ) Φ i ( z ) Θ j ( z 1 ) Φ j ( z 1 ) + σ i j Θ i ( z 1 ) Φ i ( z 1 ) Θ j ( z ) Φ j ( z ) .
We note that it is impossible to offer an explicit solution because of the dependence of coefficients ω β , l , ω α , l , ψ ˜ α , i , k and ψ ˜ β , j , k on the selection of clusters. In the next section we consider a much simpler case of demand streams being generated by MA(1) models which leads to a much simpler objective function. This allows us to find several theoretical results, culminating in the fact that optimal clusters can be found in this case by identifying streams having the closest MA coefficients with one another.

6. MA(1) Streams

In this section we consider the case that the demand streams being considered are independent MA(1). As we demonstrate below this leads to a simpler objective function. We will use this fact to show how we can use non-linear optimization to assign clusters to streams and to come up with an efficient way to cluster independent MA(1) streams based on segmenting the coefficient space into intervals. At the end of the section, we provide a lemma on aggregating ARMA streams having identical ARMA coefficients.
Lemma 2. 
Suppose { X 1 , t } , { X 2 , t } , , { X N , t } are MA(1) with MA coefficients θ 1 , θ 2 , , θ N . Optimal clusters can be found by assigning y j k as an indicator variable for stream X j being in cluster C k such that we minimize
k = 1 n ( b k + 2 a k ) ( b k 2 a k )
where
b k = j = 1 N σ j 2 ( 1 + θ j 2 ) y j k
and
a k = j = 1 N σ j 2 θ j y j k .
Alternatively, the objective function (78) can be written as
k = 1 n j = 1 N i = 1 N σ j 2 σ i 2 ( 1 + θ j ) 2 ( 1 θ i ) 2 y j k y i k .
Proof 
uppose { X 1 , t } , { X 2 , t } , , { X N , t } are MA(1) with MA coefficients θ 1 , θ 2 , , θ N . Suppose cluster C α , t consists of streams { X 1 , t } , , { X α , t } . The covariance generating function of C α , t simplifies to
G C α ( z ) = σ 1 2 ( 1 + θ 1 z ) ( 1 + θ 1 z 1 ) + + σ α 2 ( 1 + θ α z ) ( 1 + θ α z 1 ) .
In order to determine the variance of the shocks of C α , t we need to find the roots of Equation (82). Note that it can be rewritten as
σ 1 2 θ 1 + + σ α 2 θ α z 1 + σ 1 2 ( 1 + θ 1 2 ) + + σ α 2 ( 1 + θ α 2 ) + σ 1 2 θ 1 + + σ α 2 θ α z .
We can find the roots of Equation (83) using the quadratic formula b ± b 2 4 a 2 2 a where
a = σ 1 2 θ 1 + + σ α 2 θ α = j = 1 α σ j 2 θ j
b = σ 1 2 ( 1 + θ 1 2 ) + + σ α 2 ( 1 + θ α 2 ) = j = 1 α σ j 2 ( 1 + θ j 2 )
We note that one roots r 1 will be outside the unit circle and from Remark 2 that the variance of the shocks of C α , t is equal to a r 1 . Since b > 0 , the variance is then given by
σ α 2 = b + ( b + 2 a ) ( b 2 a ) 2
Thus if we are subaggregating into n clusters, the MSFE is given by
M S F E s u b a g g = 1 2 k = 1 n b k + k = 1 n ( b k + 2 a k ) ( b k 2 a k )
Since k = 1 n b k will be the same regardless of choice of clusters, we are minimizing the objective function given by
k = 1 n ( b k + 2 a k ) ( b k 2 a k )
where
b k = j = 1 N σ j 2 ( 1 + θ j 2 ) y j k
and
a k = j = 1 N σ j 2 θ j y j k
where y j k is an indicator variable for stream X j being in cluster C k .
Noting that
( b k + 2 a k ) = j = 1 N σ j 2 ( 1 + θ j ) 2 y j k
( b k 2 a k ) = j = 1 N σ j 2 ( 1 θ j ) 2 y j k
The objective function can also be rewritten as
k = 1 n j = 1 N i = 1 N σ j 2 σ i 2 ( 1 + θ j ) 2 ( 1 θ i ) 2 y j k y i k .
We have results (not shown here) where we use Equation (93) in a non-linear optimization algorithm to come up with clusters. When using 10 streams with 3 clusters, the approach actually leads to a globally optimal solution almost every time.
We note that Equation (93) can also be rewritten as
k = 1 n X i , X j C k σ j 2 σ i 2 ( 1 + θ j ) 2 ( 1 θ i ) 2 .
where the inner sum is taken over all pairs of streams in each cluster, including pairs of streams with themselves.
If stream X p is moved from cluster k to cluster κ then the change in the objective function is
σ p 2 ( 1 + θ p ) 2 X i C κ σ i 2 ( 1 θ i ) 2 + σ p 2 ( 1 θ p ) 2 X i C κ σ i 2 ( 1 + θ i ) 2 σ p 2 ( 1 + θ p ) 2 X i C k σ i 2 ( 1 θ i ) 2 + σ p 2 ( 1 θ p ) 2 X i C k σ i 2 ( 1 + θ i ) 2
which provides an alternative way to cluster streams by identifying and moving the stream from one cluster to another which yields the largest drop in the objective function.

6.1. Aggregate of Two MA(1) Streams

In this subsection we consider two MA(1) streams whose variance of shocks is unitary. We demonstrate that the MA coefficient3 of their aggregate process is always between the two MA coefficients of the individual streams. Furthermore we show that as the two coefficients of the two MA(1) streams are moved further apart from each other the variance of the shocks appearing in the Aggregated process increases. These two facts imply that if we are studying N individual MA(1) streams (with unit shock variance) and would like to cluster them into n clusters then the globally optimal clustering assignment will cluster the streams along intervals. That is, the assignment will have split the clusters into groups of streams whose MA coefficients are next to each other in a sorted arrangement. This implies that an efficient algorithm for obtaining a globally optimal cluster assignment consists of arranging the MA coefficients in increasing order and checking all possible "interval" clusters, without worrying that if two streams are clustered together, another stream with an MA coefficient between the two is assigned to a different cluster. This algorithm would need to check N n possible cluster assignments to find the globally optimal arrangement.
Theorem 1. 
Consider two streams whose ARMA representations are given by
X 1 , t = ( 1 + θ 1 B ) ϵ 1 , t
X 2 , t = ( 1 + θ 2 B ) ϵ 2 , t
with v a r ( ϵ 1 , t ) = v a r ( ϵ 2 , t ) = 1 and σ 12 = 0 and θ 1 < θ 2 . Note that θ 1 and θ 2 are allowed to equal 0.
The Aggregated process { X 1 , t + X 2 , t } is described by the MA(1) model
X 1 , t + X 2 , t = ( 1 + θ B ) ϵ t
such that θ 1 < θ < θ 2 .
Proof. 
Note that the covariance generating function of the aggregate process is
G S ( z ) = ( θ 1 + θ 2 ) z 1 + ( 2 + θ 1 2 + θ 2 2 ) + ( θ 1 + θ 2 ) z .
As long as4 θ 1 θ 2 , this polynomial has roots a 1 and 1 / a 1 . Suppose a 1 is inside the unit circle, then according to Remark 2, θ = a 1 and v a r ( ϵ t ) = σ ϵ 2 = ( θ 1 + θ 2 ) ( 1 / a 1 ) .
We can note that the Laurent polynomial in Equation (96) has roots given by
( 2 + θ 1 2 + θ 2 2 ) ± ( 2 + θ 1 2 + θ 2 2 ) 2 4 ( θ 1 + θ 2 ) 2 2 ( θ 1 + θ 2 )
This implies that
a 1 = ( 2 + θ 1 2 + θ 2 2 ) + ( 2 + θ 1 2 + θ 2 2 ) 2 4 ( θ 1 + θ 2 ) 2 2 ( θ 1 + θ 2 )
and
θ = ( 2 + θ 1 2 + θ 2 2 ) ( 2 + θ 1 2 + θ 2 2 ) 2 4 ( θ 1 + θ 2 ) 2 2 ( θ 1 + θ 2 )
In the remainder of this section we will prove that
θ 1 < ( 2 + θ 1 2 + θ 2 2 ) ( 2 + θ 1 2 + θ 2 2 ) 2 4 ( θ 1 + θ 2 ) 2 2 ( θ 1 + θ 2 ) < θ 2
Suppose first that θ 1 + θ 2 > 0 .
Then we can rewrite Equation (100) as
2 θ 1 2 + 2 θ 2 θ 1 < ( 2 + θ 1 2 + θ 2 2 ) ( 2 + θ 1 2 + θ 2 2 ) 2 4 ( θ 1 + θ 2 ) 2 < 2 θ 2 2 + 2 θ 2 θ 1
or equivalently,
θ 1 2 θ 2 2 + 2 θ 2 θ 1 2 < ( 2 + θ 1 2 + θ 2 2 ) 2 4 ( θ 1 + θ 2 ) 2 < θ 2 2 θ 1 2 + 2 θ 1 θ 2 2
or
θ 1 2 θ 2 2 2 θ 2 θ 1 + 2 < ( 2 + θ 1 2 + θ 2 2 ) 2 4 ( θ 1 + θ 2 ) 2 < θ 2 2 θ 1 2 2 θ 1 θ 2 + 2
Noting that the left and right-hand sides of the inequality are always larger than zero, (103) holds if and only if the following inequality holds as well
( θ 1 2 θ 2 2 2 θ 2 θ 1 + 2 ) 2 < ( 2 + θ 1 2 + θ 2 2 ) 2 4 ( θ 1 + θ 2 ) 2 < ( θ 2 2 θ 1 2 2 θ 1 θ 2 + 2 ) 2 .
Labeling the three sides of this inequality as A < B < C , we observe that
A = θ 1 4 θ 1 2 θ 2 2 2 θ 1 3 θ 2 + 2 θ 1 2 θ 1 2 θ 2 2 + θ 2 4 + 2 θ 2 3 θ 1 2 θ 2 2 2 θ 1 3 θ 2 + 2 θ 2 3 θ 1 + 4 θ 1 2 θ 2 2 4 θ 2 θ 1 + 2 θ 1 2 2 θ 2 2 4 θ 2 θ 1 + 4 B = 4 + 2 θ 1 2 + 2 θ 2 2 + 2 θ 1 2 + θ 1 4 + θ 1 2 θ 2 2 + 2 θ 2 2 + θ 1 2 θ 2 2 + θ 2 4 4 ( θ 1 2 + 2 θ 1 θ 2 + θ 2 2 ) C = θ 2 4 θ 2 2 θ 1 2 2 θ 2 3 θ 1 + 2 θ 2 2 θ 2 2 θ 1 2 + θ 1 4 + 2 θ 1 3 θ 2 2 θ 1 2 2 θ 2 3 θ 1 + 2 θ 1 3 θ 2 + 4 θ 2 2 θ 1 2 4 θ 1 θ 2 + 2 θ 2 2 2 θ 1 2 4 θ 1 θ 2 + 4
Removing equivalent terms and combining like terms, we observe
2 θ 1 2 θ 2 2 4 θ 1 3 θ 2 + 4 θ 1 2 + 4 θ 2 3 θ 1 4 θ 2 2 8 θ 1 θ 2 < 2 θ 1 2 θ 2 2 8 θ 1 θ 2 < θ 2 2 θ 1 2 4 θ 2 3 θ 1 + 4 θ 2 2 + 4 θ 1 3 θ 2 4 θ 1 2 8 θ 2 θ 1
which can be rewritten as
4 θ 1 3 θ 2 + 4 θ 1 2 + 4 θ 1 θ 2 3 4 θ 2 2 < 0 < 4 θ 1 θ 2 3 + 4 θ 2 2 + 4 θ 1 3 θ 2 4 θ 1 2 .
Noting that the left and right-hand sides of this inequality are additive inverses, we see that this inequality holds5 and therefore inequality (100) holds.
Finally, if θ 1 + θ 2 < 0 in (100) then the direction of the inequalities is reversed in (101) and a similar sequence of steps would lead us to observe
4 θ 1 3 θ 2 + 4 θ 1 2 + 4 θ 1 θ 2 3 4 θ 2 2 > 0 > 4 θ 1 θ 2 3 + 4 θ 2 2 + 4 θ 1 3 θ 2 4 θ 1 2
and by the same argument we see that (100) holds and the theorem is proved. □.
Theorem 2. 
Consider two streams whose ARMA representations are given by
X 1 , t = ( 1 + θ 1 B ) ϵ 1 , t
X 2 , t = ( 1 + θ 2 B ) ϵ 2 , t
with σ 1 2 = σ 2 2 = 1 and σ 12 = 0 and θ 1 < θ 2 . Note that θ 1 and θ 2 may be 0.
As the distance between θ 1 and θ 2 increases, v a r ( ϵ t ) = σ ϵ 2 increases.
Proof 
From Equation (96) we see that the root 1 / a 1 of G S ( z ) , which is outside the unit circle, is given by
( 2 + θ 1 2 + θ 2 2 ) ( 2 + θ 1 2 + θ 2 2 ) 2 4 ( θ 1 + θ 2 ) 2 2 ( θ 1 + θ 2 ) .
Since σ ϵ 2 = ( θ 1 + θ 2 ) / a 1 , we have
σ ϵ 2 = ( 2 + θ 1 2 + θ 2 2 ) + ( 2 + θ 1 2 + θ 2 2 ) 2 4 ( θ 1 + θ 2 ) 2 2
To show that this is increasing in θ 2 , consider the derivative of the above with respect to θ 2 :
σ ϵ 2 θ 2 = θ 2 + θ 2 ( 2 + θ 1 2 + θ 2 2 ) 2 ( θ 1 + θ 2 ) ( 2 + θ 1 2 + θ 2 2 ) 2 4 ( θ 1 + θ 2 ) 2 .
We need to show that (110) is larger than zero, thus consider
σ ϵ 2 θ 2 = θ 2 + θ 2 ( 2 + θ 1 2 + θ 2 2 ) 2 ( θ 1 + θ 2 ) ( 2 + θ 1 2 + θ 2 2 ) 2 4 ( θ 1 + θ 2 ) 2 > 0
or equivalently
θ 2 ( 2 + θ 1 2 + θ 2 2 ) 2 ( θ 1 + θ 2 ) > θ 2 ( 2 + θ 1 2 + θ 2 2 ) 2 4 ( θ 1 + θ 2 ) 2
which simplifies to
2 ( θ 1 + θ 2 ) θ 2 ( 2 + θ 1 2 + θ 2 2 ) < θ 2 ( 2 + θ 1 2 + θ 2 2 ) 2 4 ( θ 1 + θ 2 ) 2
or
2 θ 1 θ 2 θ 1 2 θ 2 3 < θ 2 ( 2 + θ 1 2 + θ 2 2 ) 2 4 ( θ 1 + θ 2 ) 2 .
We will refer to the left and right hand sides of the inequality in (114) as L H S and R H S . Note that squaring both yields
L H S 2 = 4 θ 1 2 4 θ 1 3 θ 2 4 θ 1 θ 2 3 + θ 1 4 θ 2 2 + 2 θ 1 2 θ 2 4 + θ 2 6 R H S 2 = 4 θ 2 2 + 4 θ 1 2 θ 2 2 + θ 1 4 θ 2 2 + 2 θ 1 2 θ 2 4 + θ 2 6 4 θ 1 2 θ 2 2 8 θ 1 θ 2 3
Note that furthermore
L H S 2 R H S 2 = 4 θ 1 2 4 θ 2 2 4 θ 1 3 θ 2 + 4 θ 1 θ 2 3 = 4 ( 1 θ 1 θ 2 ) ( θ 1 2 θ 2 2 )
Suppose first that θ 1 < θ 2 = 0 . Note that (114) reduces to 2 θ 1 < 0 , which holds, and therefore (111) holds as well.
Next suppose that 0 = θ 1 < θ 2 . Note that (114) reduces to
θ 2 3 < θ 2 ( 2 + θ 2 2 ) 2 4 θ 2 2
which holds as well since the left-hand side is negative in this case.
In the remainder of the proof we assume that θ 1 0 and θ 2 0 . Consider the case that | θ 2 | > | θ 1 | . Note that this implies that θ 2 >0 (since θ 2 > θ 1 ) and that (116) is less than zero. Thus in this case L H S 2 < R H S 2 and L H S < R H S in (114). Therefore (111) holds in this case as well.
Now suppose that | θ 1 | > | θ 2 | and that θ 2 > 0 . The former implies that L H S 2 R H S 2 > 0 and the latter implies that R H S > 0 . Therefore L H S < 0 and therefore L H S < R H S . Therefore (111) holds in this case as well.
Finally suppose that | θ 1 | > | θ 2 | and that θ 2 < 0 . The former again implies that L H S 2 R H S 2 > 0 while the latter implies that R H S < 0 . Therefore L H S < R H S and (111) holds in all cases. □.

6.2. Demand Streams Produced by Identical ARMA models

Note that Theorem 1 and Theorem 2 imply that the prescribed algorithm at the top of the previous subsection always leads to an optimal solution. The following lemma establishes that in the event that two streams are generated by the same ARMA model, the aggregate will also follow the same ARMA model. Therefore, we can greatly reduce the dimensionality of the number of streams that need to be assigned to clusters by first aggregating demand streams from equivalent models.
Lemma 3. 
Consider two ARMA sequences { X 1 , t } and { X 2 , t } that have the same ARMA representation with respect to Wold shock sequences { ϵ 1 , t } and { ϵ 2 , t } given by
Φ ( B ) X 1 , t = Θ ( B ) ϵ 1 , t
Φ ( B ) X 2 , t = Θ ( B ) ϵ 2 , t
such that the variance of the shock sequences are σ 1 2 and σ 2 2 with covariance σ 12 .
The aggregate { S t = X 1 , t + X 2 , t } also has the same ARMA representation with respect to its Wold shocks { ϵ t } given by
Φ ( B ) S t = Θ ( B ) ϵ t
such that the variance of { ϵ t } is given by σ 1 2 + σ 2 2 + 2 σ 12 .
Proof. 
From Remark 2 we note that the ARMA representation of { S t } is given by
Φ ( B ) S t = Θ ( B ) ϵ t
such that Θ ( z ) = i = 1 m ( 1 a i z ) where { a i } are the roots of O ( z ) P ( z ) on or inside the unit circle and Φ ( z ) = i = 1 n ( 1 b i z ) where { b i } are the roots of Q ( z ) inside the unit circle with O ( z ) , P ( z ) and Q ( z ) are obtained from the covariance generating function G S ( z ) given by
G S ( z ) = ( σ 1 2 + σ 2 2 ) Θ ( z ) Θ ( z 1 ) Φ ( z ) Φ ( z 1 ) + 2 σ 12 Θ ( z ) Θ ( z 1 ) Φ ( z ) Φ ( z 1 )
as per (77). This can further be simplified as
G S ( z ) = ( σ 1 2 + σ 2 2 + 2 σ 12 ) Θ ( z ) Θ ( z 1 ) Φ ( z ) Φ ( z 1 )
and therefore Φ ( z ) = Φ ( z ) and Θ ( z ) = Θ ( z ) and the result is proved. □
Lemma 3 shows us that if we have n demand sequences X 1 , t , , X n , t , which have the same ARMA coefficients with respect to their Wold shocks, their aggregate will have the same ARMA coefficients. Therefore if the customer base of a firm is comprised of many demand streams having the same ARMA representation, it is possible to greatly reduce the number of streams that need to be considered for clustering by first aggregating these equivalent streams.

7. Extensions and Other Questions

In this paper we compare theoretical MSFEs of a firm forecasting its leadtime demand based on disaggregated (individual) demand streams and subaggregated clusters formed from those streams. We highlight examples that illustrate that the MSFE based on subaggregates need not be much larger than the MSFE based on the disaggregated streams as long as those clusters are well-formed. We propose a Pivot algorithm to form clusters which minimize the MSFE among all cluster assignments. We end with some theoretical results when the demand streams are generated by MA(1). Here we show that clusters resulting in the lowest MSFE are formed by grouping streams by the proximity of their MA coefficient.
The MA(1) case hints that in a general ARMA case, "best" clusters would be formed based on proximity of the ARMA coefficients between models generating the various streams (or equivalently based on the proximity of roots of the AR and MA polynomials). Alternatively, best cluster assignments may result from grouping streams with most similar coefficients appearing in the MA() representation. Future work can be done to establish the best approach.
Our current theoretical approach based on general ARMA models is limited in that root-finding algorithms are unstable once the degree of a polynomial gets too large. In our study, this begins to occur when we consider the aggregate of around twenty streams with at least one AR coefficient. It is possible to greatly reduce the dimensionality however by first aggregating demand streams that are produced by identical or nearly-identical ARMA models. This is another direction for future research.
1
The situation is really identical for any player in the supply chain that might be observing multiple demand streams, where information sharing does not take place
2
The AR and MA degrees were chosen with the understanding that these degrees increase with the number of ARMA(1,1) streams subaggregated into a particular cluster, while trying to limit the complexity of the models being estimated.
3
The aggregate of two MA(1) streams is always MA(1)
4
If θ 1 = θ 2 , then θ = 0 and the result still holds.
5
The given direction of the inequalities must hold, otherwise θ 1 > θ 2 in (100)

References

  1. Fernando Bernstein, Sajad Modaresi, and Denis Saur´e. A dynamic clustering approach to data- driven assortment personalization. Management Science 2018, 65, 1949–2443.
  2. Peter, J. Brockwell and Richard A. Davis. Time Series: Theory and Methods, 1991. [Google Scholar]
  3. Avi Giloni, Clifford Hurvich, and Sridhar Seshadri. Forecasting and information sharing in supply chains under arma demand. IIE Transactions 2014, 46, 35–54. [CrossRef]
  4. Robert Kohn. When is an aggregate of a time series efficiently forecast by its past. Journal of Econometrics 1982, 18, 337–349. [CrossRef]
  5. Vladimir Kovtun, Avi Giloni, and Clifford Hurvich. The value of sharing disaggregated infor- mation in supply chains. European Journal of Operational Research 2019, 277, 469–478. [CrossRef]
  6. Shaohui Ma, Robert Fildes, and Tao Huang. The Case of SKU Retail Sales Forecasting with Intra- and Inter-Category Promotional Information. Lancaster Management School, working paper.
  7. Ali, H. Sayed and Thomas Kailath. A survey of spectral factorization methods. Numerical Linear Algebra with Applications 2001, 8, 467–496. [Google Scholar]
Figure 1. Runtimes of Pivot Clustering for Assigning Streams to Four Clusters Using Theoretical and Estimated MSFEs. The graph contains the average runtime of Pivot clustering for each specific number of streams being assigned to four clusters. That is, for N streams (between 10 and 20) we consider twenty randomly assigned ARMA(1,1) models for each stream. Pivot Clustering (with theoretical and estimated MSFE) is then used to obtain the four optimal clusters. The average runtime of Pivot is approximately 63 seconds when twenty streams are assigned to clusters using theoretical MSFE and 40 seconds using estimated MSFE.
Figure 1. Runtimes of Pivot Clustering for Assigning Streams to Four Clusters Using Theoretical and Estimated MSFEs. The graph contains the average runtime of Pivot clustering for each specific number of streams being assigned to four clusters. That is, for N streams (between 10 and 20) we consider twenty randomly assigned ARMA(1,1) models for each stream. Pivot Clustering (with theoretical and estimated MSFE) is then used to obtain the four optimal clusters. The average runtime of Pivot is approximately 63 seconds when twenty streams are assigned to clusters using theoretical MSFE and 40 seconds using estimated MSFE.
Preprints 85614 g001
Figure 2. Theoretical Subaggregated MSFE for Random Initialization of Pivot. Theoretical MSFEs are computed on the four clusters obtained by Pivot Clustering for different random initializations. The MSFE of the initial random assignment is provided as well as the MSFE that is obtained by Pivot Clustering. Horizontal lines are drawn to represent the MSFE based on the fully aggregated demand sequence (top) and the MSFE based on the fully disaggregated demand sequences (bottom).
Figure 2. Theoretical Subaggregated MSFE for Random Initialization of Pivot. Theoretical MSFEs are computed on the four clusters obtained by Pivot Clustering for different random initializations. The MSFE of the initial random assignment is provided as well as the MSFE that is obtained by Pivot Clustering. Horizontal lines are drawn to represent the MSFE based on the fully aggregated demand sequence (top) and the MSFE based on the fully disaggregated demand sequences (bottom).
Preprints 85614 g002
Figure 3. Estimated Subaggregated MSFE for Random Initialization of Pivot. Estimated MSFEs are computed for different initializations of Pivot Clustering. The MSFE of the initial random assignment is provided as well as the MSFE that is obtained by Pivot Clustering. Horizontal lines are drawn to represent the MSFE based on the fully aggregated demand sequence (top) and the MSFE based on the fully disaggregated demand sequences (bottom).
Figure 3. Estimated Subaggregated MSFE for Random Initialization of Pivot. Estimated MSFEs are computed for different initializations of Pivot Clustering. The MSFE of the initial random assignment is provided as well as the MSFE that is obtained by Pivot Clustering. Horizontal lines are drawn to represent the MSFE based on the fully aggregated demand sequence (top) and the MSFE based on the fully disaggregated demand sequences (bottom).
Preprints 85614 g003
Figure 4. Theoretical Subaggregated MSFE Found by Pivot for Different Sets of Streams. Theoretical MSFEs are computed for twenty simulations using different sets of twenty streams in each simulation. We note that using individual streams to forecasts leads to the lowest MSFE while basing the forecast on the aggregate of the streams always leads to the highest MSFE. If subaggregated clusters are formed from the streams, the lowest MSFEs are obtained when clusters are based on Pivot Clustering.
Figure 4. Theoretical Subaggregated MSFE Found by Pivot for Different Sets of Streams. Theoretical MSFEs are computed for twenty simulations using different sets of twenty streams in each simulation. We note that using individual streams to forecasts leads to the lowest MSFE while basing the forecast on the aggregate of the streams always leads to the highest MSFE. If subaggregated clusters are formed from the streams, the lowest MSFEs are obtained when clusters are based on Pivot Clustering.
Preprints 85614 g004
Figure 5. Estimated Subaggregated MSFE Found by Pivot for Different Sets of Streams. Estimated MSFEs are computed for twenty simulations using different sets of twenty streams in each simulation. We note that using individual streams to forecasts leads to the lowest MSFE while basing the forecast on the aggregate of the streams always leads to the highest MSFE. If subaggregated clusters are formed from the streams, the lowest MSFEs are obtained when clusters are based on Pivot Clustering.
Figure 5. Estimated Subaggregated MSFE Found by Pivot for Different Sets of Streams. Estimated MSFEs are computed for twenty simulations using different sets of twenty streams in each simulation. We note that using individual streams to forecasts leads to the lowest MSFE while basing the forecast on the aggregate of the streams always leads to the highest MSFE. If subaggregated clusters are formed from the streams, the lowest MSFEs are obtained when clusters are based on Pivot Clustering.
Preprints 85614 g005
Figure 6. Theoretical Subaggregated MSFE Found by Pivot for Different Sets of Streams. Theoretical MSFEs are computed for twenty simulations using different sets of twenty streams in each simulation and different initial assignments of streams to clusters. When averaging the final MSFEs based on the different initializations for each set of streams, we note that using individual streams to forecasts leads to the lowest averaged MSFE while basing the forecast on the aggregate of the streams always leads to the highest averaged MSFE. If subaggregated clusters are formed from the streams, the lowest averaged MSFEs are obtained when clusters are based on Pivot Clustering.
Figure 6. Theoretical Subaggregated MSFE Found by Pivot for Different Sets of Streams. Theoretical MSFEs are computed for twenty simulations using different sets of twenty streams in each simulation and different initial assignments of streams to clusters. When averaging the final MSFEs based on the different initializations for each set of streams, we note that using individual streams to forecasts leads to the lowest averaged MSFE while basing the forecast on the aggregate of the streams always leads to the highest averaged MSFE. If subaggregated clusters are formed from the streams, the lowest averaged MSFEs are obtained when clusters are based on Pivot Clustering.
Preprints 85614 g006
Figure 7. Estimated Subaggregated MSFE Found by Pivot for Different Sets of Streams. Estimated MSFEs are computed for twenty simulations using different sets of twenty streams in each simulation and different initial assignments of streams to clusters. When averaging the final MSFEs based on the different initializations for each set of streams, we note that using individual streams to forecasts leads to the lowest averaged MSFE while basing the forecast on the aggregate of the streams always leads to the highest averaged MSFE. If subaggregated clusters are formed from the streams, the lowest averaged MSFEs are obtained when clusters are based on Pivot Clustering.
Figure 7. Estimated Subaggregated MSFE Found by Pivot for Different Sets of Streams. Estimated MSFEs are computed for twenty simulations using different sets of twenty streams in each simulation and different initial assignments of streams to clusters. When averaging the final MSFEs based on the different initializations for each set of streams, we note that using individual streams to forecasts leads to the lowest averaged MSFE while basing the forecast on the aggregate of the streams always leads to the highest averaged MSFE. If subaggregated clusters are formed from the streams, the lowest averaged MSFEs are obtained when clusters are based on Pivot Clustering.
Preprints 85614 g007
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permit the free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.
Alerts
Prerpints.org logo

Preprints.org is a free preprint server supported by MDPI in Basel, Switzerland.

Subscribe

© 2025 MDPI (Basel, Switzerland) unless otherwise stated