3.1. Density Approximation
The GMM expansion aims to approximate densities without relying on the specific form of the target distributions.
Figure 1 shows two cases of density approximation. The left side of
Figure 1 displays the target densities, while the right side shows these target densities alongside their approximations obtained with our GMM method. In the upper plot, the target density is a smooth, continuous mixture of five normal distributions. The lower plot features a target distribution that combines normal, T, and uniform distributions.
In terms of approximation accuracy, the proposed method performed exceptionally well in both the smooth and non-smooth cases. The overall and most crucial features of the target densities were well preserved, although some details of the non-smooth densities were difficult to recover. Considering that Gaussian distributions are inherently smooth, we expected the GMM to perform better when approximating smooth target densities. Geometrically, when a target distribution consists of components with bell-shaped structures, the GMM is a particularly suitable choice for approximation. However, in cases like the target distribution in the lower plot, which includes a uniform distribution among its components, the GMM may struggle to capture fine details unless a sufficiently large number of components is used. According to the GMM expansion idea, the GMM is expected to capture finer structures, such as deep cuts and straight-line segments, with a larger set of basis components. This expectation is consistent with the theoretical proof provided in
Section 2. As long as the GMM captures most of the critical features with a certain level of accuracy, the approximation can be considered good. Further numerical justification is provided later in this section.
The learning process is illustrated in
Figure 2. Initially, before training, the density of our GMM closely resembles a uniform distribution. As training progresses and the dataset is processed, the density gradually conforms to the original target density. Given that the base distributions are fine and dense, a good approximation can be achieved without learning the parameters $\mu$ and $\sigma$ of each Gaussian component. Unlike traditional approaches in which the GMM must learn $\mu$ and $\sigma$, fixing $\mu$ and $\sigma$ as bases and adjusting the density only through the weights $w_i$ offers advantages, particularly in neural network applications, as discussed in
Section 4. Additional approximation examples are shown in Appendix
Figure A5.
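To make the fixed-basis idea concrete, the following is a minimal Python sketch, not the paper's exact training procedure: it places a dense, evenly spaced grid of Gaussian bases over the data range, keeps $\mu$ and $\sigma$ fixed, and estimates only the weights with a simple EM-style responsibility average. The helper names (`make_bases`, `fit_weights`, `gmm_density`), the grid construction, and the weight-update rule are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

def make_bases(x, n_components=200, sigma_scale=1.5):
    # Evenly spaced means over the data range and one shared sigma tied to
    # the grid spacing (an illustrative choice, not the paper's rule).
    mus = np.linspace(x.min(), x.max(), n_components)
    sigma = sigma_scale * (mus[1] - mus[0])
    return mus, sigma

def fit_weights(x, mus, sigma, n_iter=50):
    # Keep mu and sigma fixed; update only the weights w_i via averaged
    # responsibilities (an EM-style stand-in for the learning algorithm).
    w = np.full(len(mus), 1.0 / len(mus))                        # near-uniform start
    dens = norm.pdf(x[:, None], loc=mus[None, :], scale=sigma)   # shape (n, K)
    for _ in range(n_iter):
        resp = w[None, :] * dens
        resp = resp / resp.sum(axis=1, keepdims=True)
        w = resp.mean(axis=0)
    return w

def gmm_density(grid, mus, sigma, w):
    # Approximated density: weighted sum of the fixed Gaussian bases.
    return norm.pdf(grid[:, None], loc=mus[None, :], scale=sigma) @ w

# Example: approximate a bimodal target from 5000 samples.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 0.5, 2500), rng.normal(3.0, 1.0, 2500)])
mus, sigma = make_bases(x)
w = fit_weights(x, mus, sigma)
approx = gmm_density(np.linspace(x.min(), x.max(), 400), mus, sigma, w)
```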
After graphically demonstrating the GMM expansion and learning algorithm, a quantitative study was conducted to further examine the properties of the proposed method. In these experiments, the total variation distance (TVD) from
Section 2 was applied. Since the parameters of the basis components were predetermined, the TVD is denoted simply as $L$ and is calculated as follows:

$L = \frac{1}{2}\sum_{i=1}^{N} \left| p_i - q_i \right|,$ (16)

where $p_i$ and $q_i$ denote the discrete probabilities of the target distribution and of the approximated GMM, respectively, over the $i$-th interval of the partition. Measuring the estimation accuracy using Equation (16) has several advantages. First, it is easy to calculate. Second, even when the target density is not available in closed form, the loss can still be computed from a discrete statistical estimate. The minimum and maximum values of the loss, 0 and 1, are also clearly evident: since $\sum_i p_i = \sum_i q_i = 1$ and $|p_i - q_i| \le p_i + q_i$, we have $0 \le \frac{1}{2}\sum_i |p_i - q_i| \le \frac{1}{2}\sum_i (p_i + q_i) = 1$. That means $0 \le L \le 1$.
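As an illustration (not the authors' code), the following sketch computes the loss in Equation (16) from samples: the target probabilities $p_i$ are estimated from bin counts, the GMM probabilities $q_i$ from the component CDFs, and the partition is taken as the midpoints between consecutive component means, which is an assumed, illustrative choice. It reuses `x`, `mus`, `sigma`, and `w` from the previous sketch.

```python
import numpy as np
from scipy.stats import norm

def tvd_loss(x, mus, sigma, w):
    # Discrete TVD of Equation (16): compare empirical bin probabilities of the
    # data with the probability mass of the fitted GMM on the same bins.
    # Bin edges: midpoints between consecutive component means (illustrative).
    edges = np.empty(len(mus) + 1)
    edges[1:-1] = (mus[:-1] + mus[1:]) / 2.0
    edges[0], edges[-1] = x.min() - 1e-9, x.max() + 1e-9
    counts, _ = np.histogram(x, bins=edges)
    p = counts / counts.sum()                                      # target estimate p_i
    cdf = norm.cdf(edges[:, None], loc=mus[None, :], scale=sigma)
    q = (cdf[1:] - cdf[:-1]) @ w                                   # GMM mass q_i
    return 0.5 * np.abs(p - q).sum()                               # 0 <= L <= 1

# Example (reusing x, mus, sigma, w from the previous sketch):
# L = tvd_loss(x, mus, sigma, w)
```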
In this quantitative study, the proposed method was used to approximate two sets of randomly generated distributions, calculating the average TVD using Equation (
16) between the target distribution and the approximated GMM.
Table 1 and
Table 2 present the results of using our method to learn two types of distributions. Similar to
Figure 1, the target distributions in
Table 1 are randomly generated Gaussian mixture distributions. We sampled 5000 data points and used a GMM with 200 components for learning. More approximation examples are shown in Appendix
Figure A5 subplots (1)–(3).
Table 2 presents the results of approximating target distributions containing Normal, T, and uniform distributions, with examples shown in Appendix
Figure A5 subplots (4)–(6). Since $0 \le L \le 1$, the small average $L$ values obtained for both types of distributions indicate that the approximation error is considerably low. Later in this section, we present a direct comparison with other methods. The results also indicate that the GMM generally performs better at approximating smooth target densities. Another important factor in the algorithm is the hyperparameter $\sigma$.
Table 1 and
Table 2 show how changes in $\sigma$ affect the accuracy of the algorithm. The approximation accuracy benefits from a slightly larger $\sigma$ for smooth densities, whereas a smaller $\sigma$ works better for non-smooth densities. However, the overall improvement in $L$ was small, indicating that the benefits are not significant. This result supports treating $\sigma$ as a hyperparameter. Depending on the situation, $\sigma$ can be set smaller if a closer match to the original dataset is required, or larger for a smoother approximated distribution.
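Under the illustrative helpers from the earlier sketches (where the shared standard deviation is controlled by `sigma_scale`), the effect of this hyperparameter can be probed by sweeping it and comparing the resulting loss; the values below are arbitrary examples, not the settings used in the tables.

```python
# Continuation of the earlier sketches: make_bases, fit_weights, tvd_loss, x.
for sigma_scale in (0.5, 1.0, 1.5, 3.0):   # arbitrary illustrative values
    mus, sigma = make_bases(x, sigma_scale=sigma_scale)
    w = fit_weights(x, mus, sigma)
    print(sigma_scale, tvd_loss(x, mus, sigma, w))
```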
The above experiments examined approximation accuracy with a fixed number of components but varying $\sigma$. It is also important to address how the number of component distributions and the size of the dataset affect the approximation accuracy. The experimental results shown in
Table 3 demonstrate the performance of the proposed method under various setups. Similar to the experiments in
Table 2 and
Table 1, the average
$L$ over approximations of 30 different randomly generated mixture densities was calculated. GMMs with component sizes ranging from 10 to 1000 were tested.
Table 1,
Table 2, and
Table 3 show how increasing the data size from 5,000 to 50,000 affects approximation accuracy. The findings indicate that the proposed method benefits from a large dataset. Results from
Table 1 and
Table 2 show that, with 5,000 data points and 200 components, the approximation accuracy improved from 0.04369 to 0.01475 when the data size increased to 20,000. Further increasing the number of data points to 50,000 provided only a slight additional benefit.
Table 3 indicates that increasing the component size from 10 to 200 delivers significant improvements. However, increasing the component size to 500 or 1000 does not necessarily result in better performance. An approximation with 500 components and 50,000 data points achieved the best accuracy, but its improvement over 200 components was minimal. These findings align with our intuitive understanding of GMM expansion, confirming that a certain number of basis components is necessary to ensure a good approximation. Increasing data points and components helps capture finer details, but determining the optimal number of components relative to the dataset size lacks a clear formula. Nonetheless, our experiments show that the proposed method is robust with 200 components.
The aforementioned experiments were conducted in one-dimensional settings. The proposed method is also applicable in high-dimensional cases; a two-dimensional example is shown in
Figure 3. In Fourier expansion, increasing the dimension exponentially enlarges the set of base frequency components. The proposed method faces a similar problem: handling higher dimensions requires more bases to capture inter-dimensional correlations. To address this, clustering methods such as K-means are used with the EM algorithm, where cluster sizes are predetermined. The BVI method attempts to solve this problem by optimizing the posterior distribution through techniques such as mean-field approximation or Markov chain Monte Carlo (MCMC). However, each of these methods has limitations. For example, the EM algorithm can suffer from singularities, and the BVI approach requires prior and posterior distributions, which introduces implicit biases. Higher-dimensional density estimation, for our method as for existing methods, is more complex due to the correlation between dimensions. We plan to develop this aspect of the method further in future work. In the next subsection, we directly compare the proposed method with state-of-the-art distribution approximation techniques, particularly EM and BVI.
3.2. Comparison Study
EM and BVI are state-of-the-art density approximation methods [
33,
35,
37].
Scikit-learn provides a well-built Python package implementing these methods, along with detailed explanations and summaries. In this subsection, we compare our proposed method with EM and BVI. The implementations of both EM and BVI were obtained from Scikit-learn.
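For reference, a minimal sketch of how the two baselines can be fitted with Scikit-learn; the data below is an illustrative stand-in for the randomly generated mixtures of Figure 4, and the exact experimental settings (component sizes, initialization details) are not reproduced here. The "randomly selected data points" initialization may correspond to `init_params='random_from_data'` in recent Scikit-learn versions; `'random'` is used below for broader compatibility.

```python
import numpy as np
from sklearn.mixture import GaussianMixture, BayesianGaussianMixture

# Illustrative training data (stand-in for the randomly generated mixtures).
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-2.0, 0.5, 1000),
                    rng.normal(3.0, 1.0, 1000)]).reshape(-1, 1)

# EM with k-means initialization, EM with random initialization, and variational BVI.
em_kmeans = GaussianMixture(n_components=200, init_params='kmeans').fit(X)
em_random = GaussianMixture(n_components=200, init_params='random').fit(X)
bvi = BayesianGaussianMixture(n_components=200, max_iter=500).fit(X)

# Log-density of a fitted model on a grid (used for the TVD estimate below).
grid = np.linspace(X.min(), X.max(), 1000).reshape(-1, 1)
log_density = em_kmeans.score_samples(grid)
```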
Figure 4. Experiment Density Set.
The comprehensive performance comparison results for the proposed method, EM, and BVI are presented in
Table 4. Accuracy and training time were compared, with all values reported as mean and standard deviation. Two variations of the EM algorithm were tested: EM K-means, which initializes the parameters using the K-means method, and EM-Random, which initializes from randomly selected data points. Similar to the experiments in
Table 1,
Table 2, and
Table 3, a set of mixture distributions was randomly generated as illustrated in
Figure 4. For each mixture distribution, 2000 data points were sampled for training. The component sizes ranged from 20 to 1000. We used the TVD to evaluate approximation accuracy. Since the cumulative distribution function is not provided in
Scikit-learn, we estimated the TVD with a slightly modified approach: the fitted density $\hat{f}$ and the target density $f$ are evaluated on a fine grid $\{x_k\}$ with spacing $\Delta x$, and the TVD is approximated as

$\hat{L} = \frac{1}{2}\sum_{k} \left| \hat{f}(x_k) - f(x_k) \right| \Delta x .$
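A sketch of this grid-based estimate, assuming the fitted Scikit-learn model exposes `score_samples` (log-density) and that the target density is available as a callable; the function name and grid size are illustrative.

```python
import numpy as np
from scipy.stats import norm

def tvd_on_grid(model, target_pdf, lo, hi, n_grid=2000):
    # Riemann-sum approximation of 0.5 * integral |f_model - f_target| dx,
    # using Scikit-learn's score_samples (log-density) for the fitted model.
    grid = np.linspace(lo, hi, n_grid).reshape(-1, 1)
    dx = grid[1, 0] - grid[0, 0]
    f_model = np.exp(model.score_samples(grid))
    f_target = target_pdf(grid.ravel())
    return 0.5 * np.sum(np.abs(f_model - f_target)) * dx

# Example, reusing em_kmeans from the previous sketch and its two-component target:
# tvd_on_grid(em_kmeans,
#             lambda t: 0.5 * norm.pdf(t, -2.0, 0.5) + 0.5 * norm.pdf(t, 3.0, 1.0),
#             lo=-4.5, hi=7.5)
```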
This experiment demonstrates that the proposed method is superior to both EM and BVI. Our method, with a component size of 200, had the lowest TVD as well as the lowest training time compared with the other methods. Various factors can cause EM and BVI to underperform when the number of Gaussians increases. We consider one of the most important factors to be $\sigma$; how we handle $\sigma$ is also the most notable difference between our method and the others. Learning $\sigma$ throughout the training process can cause degeneration: when there are insufficiently many points for a mixture component, $\sigma$ may tend toward zero, the likelihood may become infinite, or the mixture component may collapse into a singularity. This requires artificial regularization of the covariance and causes instability when the GMM is applied within other models, such as neural networks. In the next section, we describe four applications that demonstrate how a neural network can benefit from the proposed method.