Definition of Optimal Time Intervals in the Queues’ Analysis: The Use of Epsilon-Entropy and Epsilon-Capacity

Abstract
In the paper, we suggest a method for calculating optimal time intervals in queue analysis. The suggested method constructs partitions of the time interval and utilizes the epsilon-entropy and epsilon-capacity of a partition to find an optimal one; optimality of a partition is specified based on its epsilon-information. The method is illustrated by defining the intervals for histograms of differently distributed samples, and its effectiveness is demonstrated in comparison with existing methods.
Subject: Computer Science and Mathematics  -   Applied Mathematics

MSC:  60K25; 90B22

1. Introduction

The use of queueing models implies knowledge of the arrival and departure rates, which in turn require well-defined time intervals [3].
For example, consider a clerk serving the clients of some office during the day. The arrival rate $\lambda$ and the departure rate $\mu$ per day completely describe the state of the system at the end of the day but do not provide any information about the system during the day. On the other hand, specifying the rates $\lambda$ and $\mu$, for example, per minute is also useless, since neither the clients nor the clerk act at such rates.
Incorrect specification of the time intervals leads to incorrect treatment of processes with unsteady arrivals or departures. In many cases such situations are resolved using queues with time-dependent rates [5,8], but even in such considerations the time intervals over which the rates are defined have to be specified.
A similar problem appears in statistics when plotting a histogram of a data sample $X$, where it is required to define the length $\delta$ of the bins. Since there is no strictly proven formula that defines the bin length with respect to the number $n$ of data counts or the distribution of the sample, heuristic formulas are used.
For example, the simplest heuristic defines the bin length as
$\delta_1 = \frac{\max X - \min X}{\sqrt{n}}$. (1)
The Sturges rule [15] defines the number of bins as $\log_2 n + 1$. Then the bin length is
$\delta_2 = \frac{\max X - \min X}{\log_2 n + 1}$. (2)
The Scott rule [12] defines the bin length
$\delta_3 = \frac{3.49\,s}{\sqrt[3]{n}}$ (3)
with respect to the standard deviation $s$ of the sample. Finally, the Freedman-Diaconis formula [2] uses the interquartile range instead of the standard deviation and defines the bin length
$\delta_4 = \frac{2(Q_3 - Q_1)}{\sqrt[3]{n}}$, (4)
where $Q_3$ is the third quartile and $Q_1$ is the first quartile of the sample.
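For a quick comparison of these heuristics, the following MATLAB sketch computes all four bin lengths for a sample vector X (the sample and the variable names delta1-delta4 are illustrative; quantile requires the Statistics and Machine Learning Toolbox):

    % Sketch: the four heuristic bin lengths (1)-(4) for a sample vector X.
    X = randn(100, 1);                            % example sample
    n = numel(X);
    delta1 = (max(X) - min(X)) / sqrt(n);         % the simplest rule (1)
    delta2 = (max(X) - min(X)) / (log2(n) + 1);   % the Sturges rule (2)
    delta3 = 3.49 * std(X) / n^(1/3);             % the Scott rule (3)
    q = quantile(X, [0.25, 0.75]);                % first and third quartiles
    delta4 = 2 * (q(2) - q(1)) / n^(1/3);         % the Freedman-Diaconis formula (4)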
In general, this problem can be considered in terms of the discretization of stochastic processes [4], where it is required to build a discretization scheme, that is, a sequence $t_i$, $i = 0, 1, 2, \ldots$, of stopping times such that $\Delta = t_{i+1} - t_i$ and $t_i = i\Delta$. But if $t_i$ is a random variable, then the length $\Delta$ of the time intervals is also random and depends on the distribution of the considered process. Similarly, if the discretization scheme is regular with a constant interval length $\Delta$, then the increments of the process at the times $t_i$ are random.
For example, let $W_t$ be a Wiener process on the time interval $T = [0, t_m]$ starting at $W_0 = 0$. In such a process the increments $dW_t$ are independent, and for any $t_i$ and $t_j > t_i$ the differences $W_{t_j} - W_{t_i}$ have normal distribution $N(0, \sigma_t^2)$ with variance $\sigma_t^2 = t_j - t_i$. Assume that the interval $T$ is divided into $n$ sub-intervals of length $\Delta = t_m/n$. Then the stopping times are $t_i = i\Delta$, $i = 0, 1, 2, \ldots$, and the increments $dW_t = W_{t_{i+1}} - W_{t_i}$ are normally distributed with $\sigma_t^2 = \Delta$.
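A minimal MATLAB sketch of this regular discretization (the values of tm and n are arbitrary illustrations):

    % Sketch: regular discretization of a Wiener process on [0, tm].
    % With the constant step Delta the increments are i.i.d. N(0, Delta).
    tm = 10; n = 1000;
    Delta = tm / n;                     % constant interval length
    dW = sqrt(Delta) * randn(n, 1);     % increments at the stopping times
    W = [0; cumsum(dW)];                % W(t_i) at t_i = i*Delta, with W(0) = 0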
In this paper, we seek the answer to the following question, formulated by Yaakov Reis [9]: given a total period and a sequence of clients arriving at the times $t_0, t_1, t_2, \ldots, t_m$ at a service point, what is the optimal length $\Delta$ of the time interval over which the arrival rate $\lambda$ and the service rate $\mu$ (which is a departure rate) have to be defined?
An immediate answer to this question follows from the heuristics used for defining the bin length $\delta$ in histograms. However, such heuristics cannot be considered the best method, and their results are not strictly proven approximations.
To find an optimal length $\Delta$, we follow the line of the Schwarz information criterion [11] and apply the well-known concepts of $\varepsilon$-entropy and $\varepsilon$-capacity introduced by Kolmogorov and Tikhomirov [6]. The calculations of the optimal interval are also based on the concept of the entropy of a partition introduced by Rokhlin [10].
Initially, $\varepsilon$-entropy and $\varepsilon$-capacity were used for the analysis of functions and functional spaces and then, together with the entropy of partition, were applied to the study of dynamical systems. For many examples of the application of these concepts and their relationship with the Shannon entropy [13], see the paper by Dinaburg [1] and the books by Vitushkin [16] and Sinai [14].

2. Problem Formulation

Let $T = [t_0, t_m]$ be a time interval of length $t_m - t_0 > 0$ and assume that during this interval $m+1$ events $a_0, a_1, a_2, \ldots, a_m$ occur sequentially at the times $t_0 \le t_1 \le t_2 \le \ldots \le t_m$, respectively.
The problem is to define a length $\Delta$ of the time interval or, which is the same, the stopping times $t_i = i\Delta$, $i = 0, 1, 2, \ldots, n$, such that the $n$ intervals $T_i = [t_i, t_{i+1}]$ cover the interval $T$ and represent as well as possible the times $t_j$, $j = 0, 1, 2, \ldots, m$, at which the considered events occurred.
To illustrate the problem, let us consider a simple example of a non-steady supply process. Assume that the clerk mentioned above serves clients at the rate $\mu = 5$ clients/hour, and that 24 clients arrive during a workday of 8 hours. Then the arrival rate of the clients defined over the workday is $\lambda = 24/8 = 3$ clients/hour, and the traffic intensity is $\rho = \lambda/\mu = 3/5 < 1$, which should guarantee that by the end of the workday all clients will be served.
Additionally, assume that 12 clients arrive in the morning, during the first two hours of the day; during the next four hours no clients arrive; and in the evening, during the last two hours of the day, the remaining 12 clients arrive. Thus, in the morning and in the evening the arrival rate is $\lambda = 12/2 = 6$ clients/hour, which means that the first 12 clients will wait in the queue and the last 12 clients will not all be served by the end of the workday.
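The arithmetic of this example can be reproduced by the following MATLAB sketch (the hourly counts are our encoding of the described arrival "waves"):

    % Sketch: hourly arrival counts for the workday described above.
    arrivals = [6 6 0 0 0 0 6 6];           % 12 clients in the first two hours,
                                            % none at midday, 12 in the evening
    mu = 5;                                 % service rate, clients/hour
    lambda_day = sum(arrivals) / 8;         % 3 clients/hour, rho = 3/5 < 1
    lambda_wave = sum(arrivals(1:2)) / 2;   % 6 clients/hour > mu during the waves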
Certainly, such phenomena are well known; in queueing theory they are handled using state-dependent and time-dependent arrival rates [3,5,8], and in practice they are overcome by adding clerks in the morning and in the evening and by suspending the service at midday. However, a prior definition of appropriate time intervals can simplify further analysis and even decrease the expected number of varying rates.
Finally, note that the considered problem is essentially a discrete problem, in which it is required to split a discrete dataset. At the same time, since it is closely related to discretization problems dealing with continuous functions, below we also make some remarks on such problems.

3. Methods

The suggested solution of the problem is based on the concepts of $\varepsilon$-entropy and $\varepsilon$-capacity, which were introduced by Kolmogorov and Tikhomirov in the mid-1950s and presented in detail in their paper [6]. In addition, it uses the multiplication of partitions as implemented by Rokhlin [10] and Sinai [14] in the study of dynamical systems.

3.1. ε-Entropy and ε-Capacity

Let $U \subset R$ be a non-empty bounded set in a metric space $R$ and let $\varepsilon > 0$ be a real number.
The set $\alpha = \{A : A \subset R\}$ is called an $\varepsilon$-covering of the set $U$ if $U \subseteq \bigcup_{A \in \alpha} A$ and the diameter of every $A \in \alpha$ is not greater than $2\varepsilon$.
The set $U$ is said to be $\varepsilon$-distinguishable if any two of its distinct points are located at a distance greater than $\varepsilon$.
Given a bounded set $U \subset R$, for any $\varepsilon > 0$ there exists a finite $\varepsilon$-covering of $U$, and every $\varepsilon$-distinguishable subset of $U$ is finite.
Denote by $N_\varepsilon(U)$ the minimal number of sets in an $\varepsilon$-covering $\alpha$ of the set $U$, and by $M_\varepsilon(U)$ the maximal number of points in an $\varepsilon$-distinguishable subset of the set $U$.
The value
$H_\varepsilon(U) = \log_2 N_\varepsilon(U)$ (5)
is called the $\varepsilon$-entropy of the set $U$, and the value
$E_\varepsilon(U) = \log_2 M_\varepsilon(U)$ (6)
is called the $\varepsilon$-capacity of the set $U$.
These values are interpreted as follows: the $\varepsilon$-entropy $H_\varepsilon(U)$ is the minimal number of bits required to transmit the set $U$ with precision $\varepsilon$, and the $\varepsilon$-capacity $E_\varepsilon(U)$ is the maximal number of bits that can be memorized by $U$ with precision $\varepsilon$.
Among the properties of the $\varepsilon$-entropy $H_\varepsilon(U)$ and $\varepsilon$-capacity $E_\varepsilon(U)$ we will use the following fact [6]: given a bounded set $U$, both the $\varepsilon$-entropy and the $\varepsilon$-capacity, as functions of $\varepsilon$, are non-increasing in $\varepsilon$.
Examples of calculation of the ε -entropy and ε -capacity of the sets in different metric spaces can be found in the paper by Kolmogorov and Tikhomirov [6] and in the book by Vitushkin [16].

3.2. ε-Entropy of Partition

Let $\beta = \{B : B \subseteq U\}$ be a partition of the set $U \subset R$, that is, $U = \bigcup_{B \in \beta} B$ and $B' \cap B'' = \emptyset$ for any two distinct sets $B', B'' \in \beta$.
The entropy of a partition is defined as follows [10,14]. Let $\mu$ be a non-negative measure on the set $U$ such that $\mu(\emptyset) = 0$ and $\mu(U) = 1$. Then $\mu(B) \in [0, 1]$ for any $B \in \beta$. The value
$H_\mu(\beta) = -\sum_{B \in \beta} \mu(B) \log_2 \mu(B)$ (7)
is called the entropy of the partition. If $\mu$ is a probability measure on $U$, then the sets $B \in \beta$ can be interpreted as events and the entropy $H_\mu$ is equivalent to the Shannon entropy [13].
Assume that the partition $\beta$ is finite and that the number of sets in $\beta$ is $N$. Define the measure $\mu$ on the set $U$ as follows:
$\mu(B) = 0$ if $B = \emptyset$; $\mu(B) = 1/N$ if $B \neq \emptyset$, $B \subsetneq U$; $\mu(B) = 1$ if $B = U$. (8)
Then the entropy of the partition reduces to the value
$H_\mu(\beta) = \log_2 N$. (9)
Finally, if the diameter of every set $B \in \beta$ is not greater than $2\varepsilon$, then the partition $\beta$ is an $\varepsilon$-covering, called an $\varepsilon$-partition. In this case the entropy $H_\mu(\beta)$ of the partition is equivalent to the $\varepsilon$-entropy $H_\varepsilon(U)$ of the set $U = \bigcup_{B \in \beta} B$ defined by equation (5).
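As a small numeric illustration (a sketch with arbitrary measures), the entropy (7) can be computed directly, and for the uniform measure (8) it reduces to (9):

    % Sketch: entropy of a finite partition from the measures of its sets.
    mu_B = [0.25 0.25 0.5];              % measures of the sets B, summing to 1
    H = -sum(mu_B .* log2(mu_B));        % entropy (7), here 1.5 bits
    % For the uniform measure mu(B) = 1/N over N sets, (7) reduces to log2(N):
    N = 4;
    H_uniform = -N * (1/N) * log2(1/N);  % = log2(N) = 2 bits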
Let $\beta = \{B : B \subseteq U\}$ be an $\varepsilon$-partition of the set $U$ with $\varepsilon = \varepsilon_B$ and let $\gamma = \{C : C \subseteq U\}$ be another $\varepsilon$-partition of the set $U$ with $\varepsilon = \varepsilon_C$. The multiplication of the partitions $\beta$ and $\gamma$ is the partition
$\beta \vee \gamma = \{D = B \cap C : B \in \beta, C \in \gamma\}$. (10)
Each set $D \in \beta \vee \gamma$ is a subset of some set $B \in \beta$ and of some set $C \in \gamma$. Then $\beta \vee \gamma$ is said to be a refinement of both $\beta$ and $\gamma$; this fact is denoted by $\beta \preceq \beta \vee \gamma$ and $\gamma \preceq \beta \vee \gamma$. Hence, following the properties of the entropy of a partition,
$H_\mu(\beta) \le H_\mu(\beta \vee \gamma)$ and $H_\mu(\gamma) \le H_\mu(\beta \vee \gamma)$. (11)
Moreover, the entropy $H_\mu(\beta \vee \gamma)$ of the multiplication $\beta \vee \gamma$ of the partitions $\beta$ and $\gamma$ is the $\varepsilon$-entropy $H_\varepsilon(U)$ of the set $U$ with some $\varepsilon \in [\min(\varepsilon_B, \varepsilon_C), \max(\varepsilon_B, \varepsilon_C)]$.
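For partitions of an interval into sub-intervals, which is the case considered below, the multiplication (10) amounts to taking the union of the breakpoints; a minimal MATLAB sketch with arbitrary example breakpoints:

    % Sketch: multiplication of two interval partitions of [0, 10] given by
    % their breakpoints; the breakpoints of the refinement are the union.
    b_beta = [0 2 5 10];                % partition beta: [0,2], [2,5], [5,10]
    b_gamma = [0 3 6 10];               % partition gamma: [0,3], [3,6], [6,10]
    b_mult = union(b_beta, b_gamma);    % beta v gamma: [0 2 3 5 6 10]
    N = numel(b_mult) - 1;              % 5 sets in the refinement
    H = log2(N);                        % its entropy under the measure (8)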
For other properties of the entropy $H_\mu$ and its applications to the analysis of dynamical systems, see the paper [10] and the book [14].

4. Suggested Solution

Let $T = [t_0, t_m]$, $t_m > t_0$, be a time interval and let $T = \{t_0, t_1, t_2, \ldots, t_m\}$ be the set of moments at which certain events occur. We assume that the moments $t_j$ are in increasing order, $t_j < t_{j+1}$, $j = 0, 1, 2, \ldots, m-1$.
As follows from the formulation of the problem, in the considerations below the interval $T$ plays the role of the set $U$, and the intervals $[t_j, t_{j+1}]$, $j = 0, 1, 2, \ldots, m-1$, are considered as the elements of $\varepsilon$-partitions of the interval $T$.
Given $\varepsilon > 0$, the minimal number of sets in an $\varepsilon$-covering $\alpha$ of the set $T$ is
$N_\varepsilon(T) = \lceil (t_m - t_0)/(2\varepsilon) \rceil$. (12)
Then the $\varepsilon$-entropy of the set $T$ is
$H_\varepsilon(T) = \log_2 N_\varepsilon(T) = \log_2 \lceil (t_m - t_0)/(2\varepsilon) \rceil$. (13)
Let
$\varepsilon_{min} = \frac{1}{2} \cdot \frac{t_m - t_0}{m^2}$ (14)
be the minimal value of $\varepsilon$ for the set $T$. Then the value
$H_{\varepsilon_{min}}(T) = \log_2 \frac{t_m - t_0}{2\varepsilon_{min}} = 2\log_2 m$ (15)
is the maximal $\varepsilon$-entropy of the set $T$.
Finally, assume that on the interval $T$ two sets of moments $T_1 = \{t_{1,0}, t_{1,1}, t_{1,2}, \ldots, t_{1,m_1}\}$ and $T_2 = \{t_{2,0}, t_{2,1}, t_{2,2}, \ldots, t_{2,m_2}\}$ are defined, with $t_{1,0} = t_{2,0}$ and $t_{1,m_1} = t_{2,m_2}$. Denote by $\tau_1 = \{[t_{1,0}, t_{1,1}], [t_{1,1}, t_{1,2}], \ldots, [t_{1,m_1-1}, t_{1,m_1}]\}$ the partition of the interval $T$ corresponding to the set $T_1$, and by $\tau_2 = \{[t_{2,0}, t_{2,1}], [t_{2,1}, t_{2,2}], \ldots, [t_{2,m_2-1}, t_{2,m_2}]\}$ the partition of the interval corresponding to the set $T_2$. The number of intervals in the partition $\tau_1$ is $m_1$ and the number of intervals in the partition $\tau_2$ is $m_2$.
Then, since the multiplication $\tau_1 \vee \tau_2$ is a refinement of each of the partitions $\tau_1$ and $\tau_2$, the size $m_{12}$ of the partition $\tau_1 \vee \tau_2$ satisfies $m_{12} \ge \max(m_1, m_2)$, and the entropy $H_\mu(\tau_1 \vee \tau_2)$ of the multiplication $\tau_1 \vee \tau_2$ is not smaller than the entropies $H_\mu(\tau_1)$ and $H_\mu(\tau_2)$ of the partitions $\tau_1$ and $\tau_2$.
Hence, if
$2\varepsilon \ge \max_{j = 0, 1, \ldots, m_1 - 1} (t_{1,j+1} - t_{1,j})$ and $2\varepsilon \ge \max_{j = 0, 1, \ldots, m_2 - 1} (t_{2,j+1} - t_{2,j})$, (16)
then, following equation (11),
$H_\varepsilon(T_1) \le H_\varepsilon(T_1 \vee T_2)$ and $H_\varepsilon(T_2) \le H_\varepsilon(T_1 \vee T_2)$. (17)
Following the line of the Schwarz information criterion [11], let us define the $\varepsilon$-information of the set $T$.
Let $\tau$ be the partition corresponding to the set $T$ and let $\tau_\varepsilon$ be the partition corresponding to the set $T_\varepsilon = \{t_{\varepsilon,0}, t_{\varepsilon,1}, t_{\varepsilon,2}, \ldots, t_{\varepsilon,m_\varepsilon}\}$, where $t_{\varepsilon,0} = t_0$, $t_{\varepsilon,m_\varepsilon} = t_m$, and $t_{\varepsilon,j+1} - t_{\varepsilon,j} = 2\varepsilon$ for $j = 0, 1, 2, \ldots, m_\varepsilon - 2$. In the partition $\tau_\varepsilon$ all intervals except possibly the last are of length $2\varepsilon$.
Denote by $T \vee T_\varepsilon$ the set of moments corresponding to the multiplication $\tau \vee \tau_\varepsilon$ of the partitions $\tau$ and $\tau_\varepsilon$. Then the $\varepsilon$-information of the set $T$ is defined as
$I_\varepsilon(T) = H_{\varepsilon_{min}}(T) - H_\varepsilon(T) - H_\varepsilon(T \vee T_\varepsilon)$. (18)
In this formula, the first term represents the number of bits required to transmit the set $T$ with maximal precision, the second term the number of bits required to transmit the set $T$ with precision $\varepsilon$, and the last term the number of bits required to transmit the set $T$ with precision $\varepsilon$ using the additional set $T_\varepsilon$ generated with precision $\varepsilon$. Thus, the value $I_\varepsilon(T)$ is the number of bits remaining after the transmission of the set $T$ with precision $\varepsilon$. In other words, the $\varepsilon$-information of the set $T$ characterizes the part of the set that cannot be transmitted with precision $\varepsilon$.
Using equations (13) and (15), formula (18) for the $\varepsilon$-information can be simplified and written in the form
$I_\varepsilon(T) = 2\log_2 m - \log_2 \lceil (t_m - t_0)/(2\varepsilon) \rceil - H_\varepsilon(T \vee T_\varepsilon)$. (19)
The value of the entropy $H_\varepsilon(T \vee T_\varepsilon)$ depends on the distribution of the time moments $t_j \in T$, $j = 0, 1, 2, \ldots, m$, over the interval $T$. If the moments $t_j$ are distributed evenly, then $T \vee T_\varepsilon = T$ and
$H_\varepsilon(T \vee T_\varepsilon) = H_\varepsilon(T) = \log_2 \lceil (t_m - t_0)/(2\varepsilon) \rceil$. (20)
Note that in the general case equation (20) does not hold, and the entropy of the multiplication of partitions is computed according to the algorithm presented as Function 1 (see Section 5).
Similarly, the value of the $\varepsilon$-capacity $E_\varepsilon(T)$ depends on the distribution of the time moments $t_j$ over the interval $T$. If the moments $t_j$ are distributed evenly such that $t_{j+1} - t_j = t_{j+2} - t_{j+1}$ and $t_{j+1} - t_j > \varepsilon$ for any $j = 0, 1, 2, \ldots, m-2$, then
$M_\varepsilon(T) = \lceil (t_m - t_0)/\varepsilon \rceil$ (21)
and
$E_\varepsilon(T) = \log_2 M_\varepsilon(T) = \log_2 \lceil (t_m - t_0)/\varepsilon \rceil$. (22)
If the distribution of the moments $t_j$ is such that $t_{m-1} - t_0 \le \varepsilon$, which means that all the moments except $t_m$ are located between $t_0$ and $t_{m-1}$, while $t_m - t_0 > \varepsilon$, then
$M_\varepsilon(T) = 2$ (23)
and
$E_\varepsilon(T) = \log_2 2 = 1$. (24)
Finally, if $t_m - t_0 \le \varepsilon$, then the set $T$ does not contain an $\varepsilon$-distinguishable subset, and we assume that
$M_\varepsilon(T) = 1$ (25)
and
$E_\varepsilon(T) = \log_2 1 = 0$. (26)
The calculation of the $\varepsilon$-capacity in the general case follows the algorithm of Function 2 (see Section 5).
The length $\Delta$ of the time interval, which defines the stopping times $t_i = i\Delta$, $i = 0, 1, 2, \ldots, n$, is defined as
$\Delta = 2\varepsilon$, (27)
where $\varepsilon$ is the value for which the $\varepsilon$-information $I_\varepsilon(T)$ of the set $T$ is as close as possible to the $\varepsilon$-capacity $E_\varepsilon(T)$ of this set.
Note that, given the set $T$, the entropy $H_{\varepsilon_{min}}(T)$ is constant and both entropies $H_\varepsilon(T)$ and $H_\varepsilon(T \vee T_\varepsilon)$, as functions of $\varepsilon$, decrease. Thus, the $\varepsilon$-information $I_\varepsilon(T)$ increases with $\varepsilon$. At the same time, the $\varepsilon$-capacity $E_\varepsilon(T)$, as a function of $\varepsilon$, decreases.
Hence, the problem of finding the length $\Delta$ is formulated as follows: given the set $T$, find the value of $\varepsilon$ such that
$|I_\varepsilon(T) - E_\varepsilon(T)| \to \min$. (28)
To illustrate the calculation of the length $\Delta$, let us consider a simple example. Assume that the considered time interval is $T = [t_0, t_m]$ and the set $T = \{t_0, t_1, t_2, \ldots, t_m\}$ consists of evenly distributed moments $t_j$ such that $t_{j+1} - t_j = t_{j+2} - t_{j+1}$ for any $j = 0, 1, 2, \ldots, m-2$. Then (omitting, for simplicity, the ceiling operation),
$I_\varepsilon(T) - E_\varepsilon(T) = 2\log_2 m - 2\log_2 \frac{t_m - t_0}{2\varepsilon} - \log_2 \frac{t_m - t_0}{\varepsilon} = \log_2 m^2 - \log_2 \frac{(t_m - t_0)^3}{4\varepsilon^3}$. (29)
Hence, according to criterion (28), it is required to specify the value of $\varepsilon$ such that
$\frac{(t_m - t_0)^3}{4\varepsilon^3} = m^2$, (30)
which gives
$\varepsilon = \frac{t_m - t_0}{\sqrt[3]{4m^2}}$ (31)
and finally
$\Delta = \frac{2(t_m - t_0)}{\sqrt[3]{4m^2}}$. (32)
For example, if $T = [0, 10]$ and $A = \{0, 1, 2, \ldots, 10\}$, then
$\Delta = \frac{2(10 - 0)}{\sqrt[3]{4 \times 11^2}} = 2.55$. (33)
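The numeric value above follows directly from formula (32); a one-line MATLAB check (a sketch in which, as in the text, 11 is substituted for m):

    % Sketch: closed-form interval length (32) for evenly spaced moments.
    t0 = 0; tm = 10; m = 11;
    Delta = 2 * (tm - t0) / (4 * m^2)^(1/3);   % = 2.55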
For comparison, the methods (1)-(4) of specifying the bin length in histograms indicated above result in the following values:
- the simplest rule: $\delta_1 = \frac{\max A - \min A}{\sqrt{m}} = \frac{10 - 0}{\sqrt{11}} = 3.01$,
- the Sturges rule [15]: $\delta_2 = \frac{\max A - \min A}{\log_2 m + 1} = \frac{10 - 0}{\log_2 11 + 1} = 2.24$,
- the Scott rule [12]: $\delta_3 = \frac{3.49\,s}{\sqrt[3]{m}} = \frac{3.49 \times 3.32}{\sqrt[3]{11}} = 5.20$,
- the Freedman-Diaconis formula [2]: $\delta_4 = \frac{2(Q_3 - Q_1)}{\sqrt[3]{m}} = \frac{2 \times (8 - 3)}{\sqrt[3]{11}} = 4.94$.
In the considered example, the interval length calculated using the suggested method is comparable with the bin lengths obtained using the histogram methods, but for other distributions the interval lengths can differ substantially.
Note again that in the general case the interval lengths cannot be calculated using closed formulas. In the next section we summarize the suggested method in the form of an algorithm applicable to arbitrary data.

5. Algorithmic Implementation

We summarize the suggested solution in the form of an algorithm which can be directly implemented in any high-level programming language. In our trials we used the MATLAB® environment.
Algorithm 1. Computing an optimal interval length
Input: Set $T = \{t_0, t_1, t_2, \ldots, t_m\}$ of time moments, $t_j < t_{j+1}$, $j = 0, 1, 2, \ldots, m-1$; step $s > 0$.
Output: Optimal interval length $\Delta$.
1. Calculate $\varepsilon_{min} = (t_m - t_0)/(2m^2)$ {minimal value of $\varepsilon$, equation (14)}.
2. Calculate $H_{\varepsilon_{min}}(T) = 2\log_2 m$ {maximal $\varepsilon$-entropy, equation (15)}.
3. For $\varepsilon = \varepsilon_{min}$ to $(t_m - t_0)/2$ with step $s$ do:
4.   Calculate $H_\varepsilon(T) = \log_2 \lceil (t_m - t_0)/(2\varepsilon) \rceil$ {$\varepsilon$-entropy, equation (13)}.
5.   Create the set $T_\varepsilon = \{t_{\varepsilon,0}, t_{\varepsilon,1}, t_{\varepsilon,2}, \ldots, t_{\varepsilon,m_\varepsilon}\}$ such that $t_{\varepsilon,j} < t_{\varepsilon,j+1}$, $j = 0, 1, 2, \ldots, m_\varepsilon - 1$, and $t_{\varepsilon,j+1} - t_{\varepsilon,j} = 2\varepsilon$, $j = 0, 1, 2, \ldots, m_\varepsilon - 2$.
6.   Compute $H_\varepsilon(T \vee T_\varepsilon) = eps\_entropy(T, T_\varepsilon)$ {entropy of $T \vee T_\varepsilon$, Function 1}.
7.   Calculate $I_\varepsilon(T) = H_{\varepsilon_{min}}(T) - H_\varepsilon(T) - H_\varepsilon(T \vee T_\varepsilon)$ {$\varepsilon$-information, equation (18)}.
8.   Compute $E_\varepsilon(T) = eps\_capacity(T, \varepsilon)$ {$\varepsilon$-capacity, Function 2}.
9.   If $I_\varepsilon(T) > E_\varepsilon(T)$ then
10.    Break.
11.  End if.
12. End for.
13. Return $\Delta = 2\varepsilon$.
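A minimal MATLAB transcription of Algorithm 1 may look as follows (a sketch rather than the exact code used in our trials; eps_entropy and eps_capacity are the functions defined below):

    function Delta = optimal_interval(T, s)
    % Sketch of Algorithm 1: T is the sorted vector [t_0, ..., t_m] of time
    % moments, s > 0 is the step over e; returns the interval length Delta.
    m = numel(T) - 1;
    t0 = T(1); tm = T(end);
    e_min = (tm - t0) / (2 * m^2);                % equation (14)
    H_min = 2 * log2(m);                          % equation (15)
    for e = e_min : s : (tm - t0) / 2             % e plays the role of epsilon
        H_e = log2(ceil((tm - t0) / (2 * e)));    % equation (13)
        Te = unique([t0 : 2*e : tm, tm]);         % regular partition moments
        I_e = H_min - H_e - eps_entropy(T, Te);   % equation (18), Function 1
        E_e = eps_capacity(T, e);                 % Function 2
        if I_e > E_e                              % crossing point reached
            break
        end
    end
    Delta = 2 * e;
    end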
The algorithm includes two functions, $eps\_entropy(T, T_\varepsilon)$ and $eps\_capacity(T, \varepsilon)$, which are defined as follows.
Function 1. $eps\_entropy(T, T_\varepsilon)$
Input: Set $T = \{t_0, t_1, t_2, \ldots, t_m\}$ of time moments, $t_j < t_{j+1}$, $j = 0, 1, 2, \ldots, m-1$; set $T_\varepsilon = \{t_{\varepsilon,0}, t_{\varepsilon,1}, t_{\varepsilon,2}, \ldots, t_{\varepsilon,m_\varepsilon}\}$ of time moments, $t_{\varepsilon,j} < t_{\varepsilon,j+1}$, $j = 0, 1, 2, \ldots, m_\varepsilon - 1$.
Output: $\varepsilon$-entropy $H_\varepsilon(T \vee T_\varepsilon)$ of the set $T \vee T_\varepsilon$.
  • Join the sets $T$ and $T_\varepsilon$: $T_{joint} = T \cup T_\varepsilon$.
  • Find the number $N(T_{joint})$ of elements in the set $T_{joint}$.
  • Set $N_\varepsilon(T_{joint}) = N(T_{joint}) - 1$.
  • Set $H_\varepsilon(T \vee T_\varepsilon) = \log_2 N_\varepsilon(T_{joint})$.
  • Return $H_\varepsilon(T \vee T_\varepsilon)$.
The function $eps\_entropy$ was implemented in MATLAB® by concatenating the sets $T$ and $T_\varepsilon$ using the function cat and then removing duplicate elements using the function unique.
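Accordingly, a MATLAB sketch of Function 1 (row vectors are assumed) may read:

    function H = eps_entropy(T, Te)
    % Sketch of Function 1: entropy of the multiplication T v Te of the
    % partitions generated by the moment vectors T and Te (row vectors).
    Tjoint = unique(cat(2, T, Te));   % join the sets and drop duplicates
    Neps = numel(Tjoint) - 1;         % number of intervals in the refinement
    H = log2(Neps);
    end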
Function 2. $eps\_capacity(T, \varepsilon)$
Input: Set $T = \{t_0, t_1, t_2, \ldots, t_m\}$ of time moments, $t_j < t_{j+1}$, $j = 0, 1, 2, \ldots, m-1$; radius $\varepsilon > 0$.
Output: $\varepsilon$-capacity $E_\varepsilon(T)$ of the set $T$.
1. If $t_m - t_0 \le \varepsilon$ then
2.   Set $M_\varepsilon(T) = 1$.
3. Else
4.   Set $M_\varepsilon(T) = 2$.
5.   Set $j = 0$.
6.   For $i = 1$ to $m - 1$ do:
7.     If $t_i - t_j \le \varepsilon$ or $t_m - t_i \le \varepsilon$ then
8.       Continue.
9.     Else
10.      Set $M_\varepsilon(T) = M_\varepsilon(T) + 1$.
11.      Set $j = i$.
12.    End if.
13.  End for.
14. End if.
15. Set $E_\varepsilon(T) = \log_2 M_\varepsilon(T)$.
16. Return $E_\varepsilon(T)$.
The function $eps\_capacity$ computes the number $M_\varepsilon(T)$ of $\varepsilon$-distinguishable elements in the set $T$ for a given $\varepsilon$ and then computes $E_\varepsilon(T)$ as the base-2 logarithm of this number.
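This pseudocode translates to MATLAB almost line by line; a sketch (with indices shifted by one, since MATLAB arrays are 1-based) may read:

    function E = eps_capacity(T, e)
    % Sketch of Function 2: greedily counts a maximal e-distinguishable
    % subset of the sorted vector T = [t_0, ..., t_m] and returns its log2.
    m = numel(T) - 1;
    if T(end) - T(1) <= e
        M = 1;                              % no e-distinguishable pair exists
    else
        M = 2;                              % t_0 and t_m are always kept
        j = 1;                              % index of the last kept point
        for i = 2 : m                       % interior points t_1, ..., t_{m-1}
            if T(i) - T(j) > e && T(end) - T(i) > e
                M = M + 1;                  % t_i is far enough from both
                j = i;
            end
        end
    end
    E = log2(M);
    end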
The time complexity $C$ of Algorithm 1 includes the following terms: $O(1)$ for lines 1-4, $O(m)$ for line 5, $O(m \log m)$ for line 6, $O(1)$ for line 7, $O(m)$ for line 8, and $O(1)$ for lines 9-13. Thus, the time complexity of each iteration of the algorithm is $O(m \log m)$. The maximal number of iterations is $n = (t_m - t_0)/(2s)$; hence, the complexity of Algorithm 1 is
$C = n \times O(m \log m)$.
Convergence of Algorithm 1 is guaranteed by the fact indicated above that the $\varepsilon$-information $I_\varepsilon(T)$ increases with increasing $\varepsilon$ while the $\varepsilon$-capacity $E_\varepsilon(T)$ decreases with increasing $\varepsilon$. Since the interval $T = [0, t_m]$ is bounded, the difference between the increasing $\varepsilon$-information and the decreasing $\varepsilon$-capacity attains its minimum in $T$, which provides the terminating point of the algorithm.
The dependence of the functions $I_\varepsilon(T)$ and $E_\varepsilon(T)$ on the interval length $\Delta = 2\varepsilon$ is illustrated in Figure 1.
The computed interval is $\Delta = 14.21$. For this interval and $\varepsilon = \Delta/2 = 7.10$, the values of the $\varepsilon$-information and $\varepsilon$-capacity are $I_\varepsilon(T) = E_\varepsilon(T) \approx 3.7$ bits. Note that the accuracy of computing the interval $\Delta$ increases as the step $s$ decreases.

6. Examples

First, let us consider examples of computing the interval lengths for differently distributed time moments. In all considered cases we assume that the length of the time interval $T = [0, t_m]$ is $t_m = 100$ and that $m = 100$.
The data were generated by the MATLAB® function random with respect to the distribution created by the MATLAB® function makedist. In the examples, we used the uniform distribution with $a = 0$ and $b = t_m$, the normal distribution with $\mu = t_m/2$ and $\sigma = t_m/6$, and the exponential distribution with $\mu = 2$.
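The data generation can be sketched in MATLAB as follows (makedist and random belong to the Statistics and Machine Learning Toolbox; the normal case is shown):

    % Sketch: generating m = 100 moments for the normal case.
    tm = 100; m = 100;
    pd = makedist('Normal', 'mu', tm/2, 'sigma', tm/6);
    T = sort(random(pd, 1, m));   % sorted moments
    % The other cases: makedist('Uniform', 'lower', 0, 'upper', tm) and
    % makedist('Exponential', 'mu', 2).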
The obtained interval lengths $\Delta$ were used as bin lengths $\delta$ in the histograms. For comparison, we present the histograms plotted with the bin lengths $\delta$ calculated using the Scott rule (see equation (3)), which also underlies the default method in MATLAB®. The resulting histograms are shown in Figure 2.
The values of the interval lengths Δ are:
- evenly distributed data: Δ = 14.21, δ1 = 9.90, δ2 = 12.95, δ3 = 21.81 and δ4 = 21.54;
- uniform distribution with a = 0 and b = t_m: Δ = 14.11, δ1 = 9.83, δ2 = 12.87, δ3 = 22.69 and δ4 = 22.65;
- normal distribution with μ = t_m/2 and σ = t_m/6: Δ = 14.91, δ1 = 8.27, δ2 = 10.83, δ3 = 13.11 and δ4 = 10.38;
- exponential distribution with μ = 2: Δ = 1.20, δ1 = 1.07, δ2 = 1.41, δ3 = 1.35 and δ4 = 0.91.
The suggested method results in interval lengths Δ that are close to the bin lengths δ provided by the conventional methods, with respect to the distribution of the data. In fact, for evenly and uniformly distributed data the interval length Δ is close to the length δ2 given by the Sturges method; for the normal distribution δ2 < Δ < δ3; and for the exponential distribution δ1 < Δ < δ2.
Now let us consider the use of the suggested algorithm for the specification of the arrival rates λ and the corresponding service rates μ. Assume that the office where the above-mentioned clerk works serves 480 clients during an 8-hour day, that is, T = 8 × 60 = 480 minutes. Also assume that the clients arrive in three "waves": in the morning, at midday and in the evening. The histogram of the number of clients during the day is shown in Figure 3a; in this histogram the bin length is computed by the Scott rule (value δ3 below).
The values of the interval length Δ and of the bin lengths for this distribution are:
Δ = 22.0, δ1 = 21.91, δ2 = 48.45, δ3 = 67.99 and δ4 = 66.41.
The histogram of the number of clients during the day with the bin length δ = Δ computed by the suggested algorithm is shown in Figure 3b. The dependence of the functions $I_\varepsilon(A)$ and $E_\varepsilon(A)$ on ε for this distribution is shown in Figure 3c.
From the results of the computations of the interval length Δ and the bin lengths δ it follows that, according to the suggested algorithm, the arrival rates during a day should be calculated every 22 minutes, while by the Scott rule they should be calculated every 68 minutes. Thus, for a multimodal distribution the suggested algorithm results in shorter intervals, which provides a more exact representation of the data.

7. Conclusion

In the paper, we suggested a method for calculating the optimal time intervals required for the definition of arrival and departure rates. The method is also useful for specifying the bin lengths in histograms, especially for data with multimodal distributions.
The method utilizes the Kolmogorov-Tikhomirov ε-entropy and ε-capacity and the Rokhlin entropy of partition. Optimality of a partition is defined based on its ε-information.
The procedure is presented in the form of a ready-to-use algorithm, which was compared with the known methods for calculating interval lengths in histograms and demonstrated robustness and correct sensitivity to the data.

Funding

This research has not received any grant from funding agencies in the public, commercial, or non-profit sectors.

Competing interests

The authors declare no competing interests.

References

  1. Dinaburg, E.I. On the relations among various entropy characteristics of dynamical systems. Math. USSR Izvestija 1971, 5, 337–378.
  2. Freedman, D.; Diaconis, P. On the histogram as a density estimator: L2 theory. Zeit. Wahrscheinlichkeitstheorie und Verwandte Gebiete 1981, 57, 453–476.
  3. Gross, D.; Shortle, J.F.; Thompson, J.M.; Harris, C.M. Fundamentals of Queueing Theory, 4th ed.; John Wiley & Sons: Hoboken, NJ, 2008.
  4. Jacod, J.; Protter, P. Discretization of Processes; Springer: Berlin, 2012.
  5. Keller, J.B. Time-dependent queues. SIAM Review 1982, 24, 401–412.
  6. Kolmogorov, A.N.; Tikhomirov, V.M. ε-entropy and ε-capacity of sets in functional spaces. Amer. Mathematical Society Translations, Ser. 2 1961, 17, 277–364.
  7. Lawler, G.F. Introduction to Stochastic Processes; Chapman & Hall: New York, 1995.
  8. Newell, G.F. Queues with time-dependent arrival rates (I-III). J. Applied Probability 1968, 5(2), 436–451 (I); 5(3), 579–590 (II); 5(3), 591–606 (III).
  9. Reis, Y. Private conversation. Ariel University, Ariel, March 2021.
  10. Rokhlin, V.A. New progress in the theory of transformations with invariant measure. Russian Mathematical Surveys 1960, 15, 1–22.
  11. Schwarz, G. Estimating the dimension of a model. Annals of Statistics 1978, 6, 461–464.
  12. Scott, D.W. On optimal and data-based histograms. Biometrika 1979, 66, 605–610.
  13. Shannon, C. A mathematical theory of communication. The Bell System Technical Journal 1948, 27, 379–423.
  14. Sinai, Y.G. Topics in Ergodic Theory; Princeton University Press: Princeton, 1993.
  15. Sturges, H. The choice of a class-interval. J. Amer. Statistical Association 1926, 21, 65–66.
  16. Vitushkin, A.G. Theory of Transmission and Processing of Information; Pergamon Press: New York, 1961.
Figure 1. Dependence of ε-information $I_\varepsilon(T)$ and ε-capacity $E_\varepsilon(T)$ on the interval length Δ = 2ε for the set T of m = 100 evenly distributed time moments; T = [0, t_m], t_m = 100 and s = 1.
Figure 2. Histograms of the data plotted using the bin lengths computed by the Scott rule (figures (a) for each distribution) and using the bin lengths δ = Δ computed by the suggested algorithm (figures (b) for each distribution).
Figure 3. Arrivals of the clients during a day: (a) histogram with the bin length computed by the Scott rule; (b) histogram with the bin length δ = Δ computed by the suggested algorithm; (c) dependence of ε-information $I_\varepsilon(A)$ and ε-capacity $E_\varepsilon(A)$ on ε for the set of arrival times.