Preprint
Concept Paper

A Geometric Interpretation of the Multivariate Gaussian Distribution and its Entropy and Mutual Information

This version is not peer-reviewed

Submitted: 25 May 2023; Posted: 26 May 2023

Abstract
The fundamental objective is to study the application of multivariate data sets under the Gaussian distribution. This paper examines broad measures of structure for Gaussian and non-Gaussian distributions and shows that they can be described in terms of the information-theoretic divergence (relative entropy) between the given covariance matrix and correlated random variables. To develop the multivariate Gaussian distribution together with its entropy and mutual information, several key methodologies are presented and supported by illustrations, both technically and statistically. The material allows readers to better perceive the concepts, comprehend the techniques, and properly implement software programs for future study of the topic's science and implementations; it also helps readers grasp the themes' fundamental concepts. Involving relative entropy and mutual information as well as covariance analysis of correlated variables based on differential entropy, a wide range of material is addressed, from basic to application concerns.
Keywords: 
Subject: Computer Science and Mathematics  -   Artificial Intelligence and Machine Learning

1. Introduction

Understanding how knowledge about an external variable, or the mutual information among its parts, is distributed across the parts of a multivariate system can help characterize and infer the underlying mechanics and function of the system. This goal has driven the development of several techniques for dissecting the components of a set of variables' joint entropy, or for dissecting the contributions of a set of variables to the mutual information about a variable of interest. In fact, this association and its modifications exist for any input signal and for the widest range of Gaussian channels, comprising discrete-time and continuous-time channels in scalar or vector form.
More generally, mutual information and the minimum mean-square error (MMSE) are fundamental concepts of information theory and estimation theory, respectively. In contrast to the MMSE, which determines how precisely each input sample can be recovered from the channel's outputs, the input-output mutual information measures whether information can be reliably delivered over a channel given a specific input signal. The close relevance of mutual information to estimation and filtering provides an operational characterization of mutual information. Therefore, the significance of the identity is not only obvious, but the link is also fascinating and merits an in-depth explanation [1,2,3]. Relations between the MMSE of the estimate of the output given the input and the local behavior of the mutual information at vanishing SNR are presented in [4]. Reference [6] develops a general likelihood-ratio formula for signal detection in Gaussian noise. Furthermore, whether in a continuous-time [5,6,7] or discrete-time [8] setting, the likelihood ratio is central to the relationship between detection and estimation [9].
Considering the specific instance of parametric estimation (or Gaussian inputs), relations between causal and non-causal estimation errors have been investigated in [10,11], in which a bound on the loss owing to the causality restriction is specified. Knowing how data pertaining to an external parameter, or the mutual information within its parts, is distributed across the parts of a multivariate system can help categorize and determine the fundamental mechanics and functionality of the structure. This goal served as the impetus for the development of various techniques for decomposing the elements of a set of parameters' joint entropy [12,13] or for decomposing the contributions of a set of elements to the mutual information about a target variable [14]. These techniques can be used to examine a variety of intricate systems, including those in the physical and biological domains, such as gene networks [15] or neural coding [16], as well as those in the social domain, such as interacting agents [17] and community behavior [18]. They can also be used to analyze artificial agents [19]. Additionally, some newer proposals diverge more significantly from the original framework, either by adopting novel principles, by considering the presence of detrimental components linked to misinformation, or by implementing joint entropy decompositions in place of mutual information [20,21].
In the multivariate scenario, the challenges of breaking mutual information down into redundant and complementary (synergistic) components are nevertheless significantly greater. The redundancy measures that were initially developed are only defined for the bivariate situation [24,25], or allow negative components [26], whereas measures of synergy are more readily extended to the multivariate case, especially when using the maximum entropy framework [22,23]. By utilizing either the associations between lattices formed by different numbers of parameters or the multiple interactions between redundancy lattices and information-loss lattices, for which synergy terms are more naturally defined, the study in [27] established two analogous techniques for constructing multivariate redundancy measures. The maximum entropy framework allows for a more straightforward generalization of the synergy measures to the multivariate case [24,25].
In the present study, we propose an extension of the bivariate Gaussian distribution technique to calculate multivariate redundancy measures within the maximum entropy framework. The importance of the maximum entropy approach in the bivariate scenario, where it offers bounds for the actual redundancy, unique information, and synergy terms under logical assumptions shared by other criteria, motivates this particular focus [24]. Specifically, the maximum entropy measures offer a lower bound for the actual synergy and redundancy terms and an upper bound for the actual unique information if it is presumed that a bivariate non-negative decomposition exists and that redundancy can be calculated from the bivariate distributions of the desired outcome with every source. Furthermore, if these bivariate distributions are consistent with possibly having no synergy under the previous hypotheses, then the maximum entropy decomposition returns not only bounds but also the precise actual terms. Here, we demonstrate that, under similar presumptions, the maximum entropy decomposition also plays this dominant role in the multivariate situation.
The remainder of this paper is organized as follows. Section 2 briefly reviews the geometry of the Gaussian distribution. The subsequent three sections deal with important topics in information entropy, with illustrative examples and an emphasis on visualization and discussion. Section 3 presents continuous (differential) entropy. Section 4 presents the relative entropy (Kullback-Leibler divergence). Section 5 presents mutual information. Conclusions are given in Section 6.

2. Geometry of the Gaussian Distribution

In this section, the background relations of the Gaussian distribution are discussed from different parametric points of view. The fundamental objective of exploratory analysis is to identify "the structure" in multivariate data sets. Ordinary least-squares regression and principal component analysis (PCA) offer typical measures of dependency (the predicted connection between particular components) and compactness (the degree of concentration of the probability density function (pdf) around a low-dimensional axis), respectively, for bivariate Gaussian distributions. Mutual information, an established measure of dependency, is not an accurate indicator of compactness since it is not invariant under a rotation of the variables. For bivariate Gaussian distributions, a suitable rotation-invariant compactness measure can be constructed and shown to reduce to the corresponding PCA measure.
The Gaussian pdf in case (a) has no structure in either of the above senses: it represents independent variables without any concentration around a lower-dimensional region. The Gaussian pdf in case (b), on the other hand, has greater variance along one axis than another; although the variables are independent, their joint pdf is compact. The Gaussian pdf in case (c) is as concentrated around one dimension as (b), although its variables are correlated and therefore also exhibit dependency.

2.1. Standard Parametric Representation of an Ellipse

If the data are uncorrelated and therefore have zero covariance, the ellipse is not rotated and is axis-aligned; the radii of the ellipse in the two directions are then determined by the standard deviations. Geometrically, a non-rotated ellipse centered at the point (0, 0) with radii $a$ and $b$ in the $x_1$- and $x_2$-directions is described by
$$\left(\frac{x}{a}\right)^{2} + \left(\frac{y}{b}\right)^{2} = 1 \tag{1}$$
Figure 1 illustrates the construction of single points of an ellipse, which is due to de La Hire; it is based on the standard parametric representation.
The general probability density function for the multivariate Gaussian is given by
$$f_{\mathbf{X}}(\mathbf{x}\mid\boldsymbol{\mu},\boldsymbol{\Sigma}) = \frac{1}{\sqrt{(2\pi)^{n}|\boldsymbol{\Sigma}|}}\, e^{-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{T}\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})} \tag{2}$$
where $\boldsymbol{\mu} = E[\mathbf{X}]$ and $\boldsymbol{\Sigma} = \mathrm{Cov}(\mathbf{X}) = E[(\mathbf{X}-\boldsymbol{\mu})(\mathbf{X}-\boldsymbol{\mu})^{T}]$ is a symmetric, positive semi-definite matrix. If $\boldsymbol{\Sigma}$ is the identity matrix, then the Mahalanobis distance reduces to the standard Euclidean distance between $\mathbf{X}$ and $\boldsymbol{\mu}$.
For bivariate Gaussian distributions, the pdf can be expressed as
$$f_{\mathbf{X}}(\mathbf{x}\mid\boldsymbol{\mu},\boldsymbol{\Sigma}) = \frac{1}{2\pi\sqrt{|\boldsymbol{\Sigma}|}}\, e^{-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{T}\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})} \tag{3}$$
and mean and covariance matrix are given by
$$\boldsymbol{\mu} = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix};\qquad \boldsymbol{\Sigma} = \begin{bmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2 \end{bmatrix} = \begin{bmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{bmatrix} \tag{4}$$
respectively, where the linear correlation coefficient satisfies $|\rho| \le 1$.
Variance measures the variation of a single random variable, whereas covariance measures how much two random variables vary together. With the covariances we can populate the entries of the covariance matrix, which is a square, symmetric matrix. The diagonal entries of the covariance matrix are the variances, while the off-diagonal entries are the covariances. For this reason, the covariance matrix is often called the variance-covariance matrix.
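As a quick numerical illustration (a Python/NumPy sketch; the means, standard deviations, and correlation below are assumed example values, not taken from the paper), the variance-covariance matrix and the correlation coefficient can be estimated from sampled data as follows:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed example parameters: means, standard deviations, and correlation.
mu = np.array([1.0, 2.0])
sigma1, sigma2, rho = 2.0, 3.0, 0.5
Sigma = np.array([[sigma1**2, rho*sigma1*sigma2],
                  [rho*sigma1*sigma2, sigma2**2]])

# Draw samples and estimate the variance-covariance matrix from the data.
X = rng.multivariate_normal(mu, Sigma, size=100_000)   # shape (N, 2)
S = np.cov(X, rowvar=False)                            # sample covariance, 2 x 2 and symmetric

print(S)                                      # close to Sigma
print(S[0, 1] / np.sqrt(S[0, 0] * S[1, 1]))   # sample correlation, close to rho
```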

2.2. The Confidence Ellipse

A typical way to visualize two-dimensional Gaussian distributed data is to plot a confidence ellipse. The squared Mahalanobis distance $d_M^2 = (\mathbf{X}-\boldsymbol{\mu})^{T}\boldsymbol{\Sigma}^{-1}(\mathbf{X}-\boldsymbol{\mu})$ is a random variable that follows the chi-squared distribution with $k$ degrees of freedom, denoted $\chi_k^2$:
$$P\!\left[(\mathbf{X}-\boldsymbol{\mu})^{T}\boldsymbol{\Sigma}^{-1}(\mathbf{X}-\boldsymbol{\mu}) \le \chi_k^2(\alpha)\right] = 1 - \alpha \tag{5}$$
where $k$ is the number of degrees of freedom and $\chi_k^2(\alpha)$ is the critical value associated with the given probability. For example, with $\alpha = 0.05$ the 95% confidence ellipse is obtained. Extending Equation (1), the radius in each direction is the standard deviation ($\sigma_1$ and $\sigma_2$) parameterized by a scale factor $s$, known as the Mahalanobis radius of the ellipsoid:
$$\left(\frac{x_1}{\sigma_1}\right)^{2} + \left(\frac{x_2}{\sigma_2}\right)^{2} = s \tag{6}$$
The goal is to determine the scale $s$ such that a confidence level $p$ is met. Since the data are multivariate Gaussian distributed, the left-hand side of the equation is a sum of squares of standardized Gaussian samples, which follows a $\chi^2$ distribution. A $\chi^2$ distribution is defined by its degrees of freedom, and since we have two dimensions the number of degrees of freedom is also two. We now want to know the probability that the sum, and therefore $s$, does not exceed a certain value under the $\chi^2$ distribution.
This ellipse, also a probability contour, defines the region of minimum area (or volume in the multivariate case) containing a given probability under the Gaussian assumption. The equation can be solved using a $\chi^2$ table or simply using the relation $s = -2\ln(1-p)$; conversely, the confidence level follows from $p = 1 - e^{-s/2}$. For $s = 1$ we have $p = 1 - e^{-0.5} \approx 0.3935$. Furthermore, typical values include $s = 2.279$, $s = 4.605$, $s = 5.991$, and $s = 9.210$ for $p = 0.68$, $p = 0.90$, $p = 0.95$, and $p = 0.99$, respectively. The ellipse can then be drawn with radii $\sigma_1\sqrt{s}$ and $\sigma_2\sqrt{s}$.
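The relation between $p$ and $s$ is easy to verify numerically. The following sketch (using SciPy's chi-square quantile function; purely illustrative) reproduces the values quoted above:

```python
import numpy as np
from scipy.stats import chi2

# Scale factor s for a given confidence p with 2 degrees of freedom,
# and the inverse relation p = 1 - exp(-s/2).
for p in (0.68, 0.90, 0.95, 0.99):
    s = chi2.ppf(p, df=2)          # equals -2*ln(1 - p) for df = 2
    print(f"p = {p:.2f} -> s = {s:.3f}, back to p = {1 - np.exp(-s/2):.2f}")
```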
Figure 2. Relation of the confidence interval and the scale factor s.
The Mahalanobis distance accounts for the variance of each variable and the covariance between variables.
$$(\mathbf{X}-\boldsymbol{\mu})^{T}\boldsymbol{\Sigma}^{-1}(\mathbf{X}-\boldsymbol{\mu}) = \begin{bmatrix} x_1-\mu_1 & x_2-\mu_2 \end{bmatrix} \begin{bmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{bmatrix}^{-1} \begin{bmatrix} x_1-\mu_1 \\ x_2-\mu_2 \end{bmatrix} = \begin{bmatrix} x_1-\mu_1 & x_2-\mu_2 \end{bmatrix} \frac{1}{\sigma_1^2\sigma_2^2(1-\rho^2)} \begin{bmatrix} \sigma_2^2 & -\rho\sigma_1\sigma_2 \\ -\rho\sigma_1\sigma_2 & \sigma_1^2 \end{bmatrix} \begin{bmatrix} x_1-\mu_1 \\ x_2-\mu_2 \end{bmatrix} = \frac{1}{1-\rho^2}\left[ \frac{(x_1-\mu_1)^2}{\sigma_1^2} - \frac{2\rho(x_1-\mu_1)(x_2-\mu_2)}{\sigma_1\sigma_2} + \frac{(x_2-\mu_2)^2}{\sigma_2^2} \right] \tag{7}$$
Geometrically, it does this by transforming the data into standardized uncorrelated data and computing the ordinary Euclidean distance for the transformed data. In this way, the Mahalanobis distance is like a univariate z-score: it provides a way to measure distances that takes into account the scale of the data.
In the general case, the covariances $\sigma_{12}$ and $\sigma_{21}$ are not zero and therefore the ellipse coordinate system is not axis-aligned. In such a case, instead of using the variances as spread indicators, we use the eigenvalues of the covariance matrix. The eigenvalues represent the spread in the directions of the eigenvectors; they are the variances under a rotated coordinate system. By definition a covariance matrix is positive (semi-)definite, so all eigenvalues are non-negative, and the matrix can be seen as a linear transformation of the data. The actual radii of the ellipse are $\sqrt{\lambda_1}$ and $\sqrt{\lambda_2}$ for the two eigenvalues $\lambda_1$ and $\lambda_2$ of the scaled covariance matrix $s\boldsymbol{\Sigma}$.
Based on Equations (3) and (7), the bivariate Gaussian distributions can be represented as
$$f(x_1,x_2) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}}\, \exp\!\left\{ -\frac{1}{2(1-\rho^2)} \left[ \frac{(x_1-\mu_1)^2}{\sigma_1^2} - \frac{2\rho(x_1-\mu_1)(x_2-\mu_2)}{\sigma_1\sigma_2} + \frac{(x_2-\mu_2)^2}{\sigma_2^2} \right] \right\} \tag{8}$$
Level surfaces of $f(x_1,x_2)$ are concentric ellipses
$$\frac{(x_1-\mu_1)^2}{\sigma_1^2} - \frac{2\rho(x_1-\mu_1)(x_2-\mu_2)}{\sigma_1\sigma_2} + \frac{(x_2-\mu_2)^2}{\sigma_2^2} = c$$
where the constant $c$ is related to the squared Mahalanobis distance, which possesses the following properties:
  • It accounts for the fact that the variances in each direction are different.
  • It accounts for the covariance between variables.
  • It reduces to the familiar Euclidean distance for uncorrelated variables with unit variance.
The lengths of the ellipse axes are a function of the given probability of the chi-squared distribution with 2 degrees of freedom, $\chi_2^2(\alpha)$, the eigenvalues $\boldsymbol{\lambda} = \begin{bmatrix}\lambda_1 & \lambda_2\end{bmatrix}^{T}$, and the linear correlation coefficient $\rho$. For $\alpha = 0.05$, the 95% confidence ellipse is defined by
$$\begin{bmatrix} x_1-\mu_1 & x_2-\mu_2 \end{bmatrix} \boldsymbol{\Sigma}^{-1} \begin{bmatrix} x_1-\mu_1 \\ x_2-\mu_2 \end{bmatrix} \le \chi_2^2(0.05)$$
where
$$\boldsymbol{\Sigma}^{-1} = \frac{1}{\sigma_1^2\sigma_2^2(1-\rho^2)} \begin{bmatrix} \sigma_2^2 & -\rho\sigma_1\sigma_2 \\ -\rho\sigma_1\sigma_2 & \sigma_1^2 \end{bmatrix}$$
Since $\boldsymbol{\Sigma}$ is a symmetric matrix, its eigenvectors are linearly independent (indeed orthogonal).
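For concreteness, the following NumPy sketch (with an assumed covariance matrix) computes the eigenvalues, eigenvectors, inclination angle, and the 95% confidence-ellipse axis lengths $2\sqrt{\lambda_i\,\chi_2^2(0.05)}$:

```python
import numpy as np
from scipy.stats import chi2

# Assumed example covariance matrix (sigma1 = 2, sigma2 = 1, rho = 0.5).
sigma1, sigma2, rho = 2.0, 1.0, 0.5
Sigma = np.array([[sigma1**2, rho*sigma1*sigma2],
                  [rho*sigma1*sigma2, sigma2**2]])

lam, U = np.linalg.eigh(Sigma)       # eigenvalues (ascending) and orthonormal eigenvectors
lam, U = lam[::-1], U[:, ::-1]       # reorder so that lambda_1 >= lambda_2

s = chi2.ppf(0.95, df=2)             # chi-square critical value, about 5.991
axis_lengths = 2 * np.sqrt(lam * s)  # full lengths of the major and minor axes
theta = np.degrees(np.arctan2(U[1, 0], U[0, 0]))  # inclination of the major axis (eigenvector sign is arbitrary)

print(lam, axis_lengths, theta)
```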

2.3. Similarity Transform

The simplest similarity transformation method for eigenvalue computation is the Jacobi method, which deals with standard eigenproblems. For the multivariate Gaussian distribution, the covariance matrix can be expressed in terms of its eigenvectors and eigenvalues as
$$\boldsymbol{\Sigma} = \mathbf{U}\boldsymbol{\Lambda}\mathbf{U}^{-1} = \mathbf{U}\boldsymbol{\Lambda}\mathbf{U}^{T} = \begin{bmatrix} \mathbf{u}_1 & \mathbf{u}_2 \end{bmatrix} \begin{bmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{bmatrix} \begin{bmatrix} \mathbf{u}_1^{T} \\ \mathbf{u}_2^{T} \end{bmatrix}$$
where $\mathbf{U} = \begin{bmatrix}\mathbf{u}_1 & \mathbf{u}_2\end{bmatrix}$ contains the eigenvectors of $\boldsymbol{\Sigma}$ and $\boldsymbol{\Lambda}$ is the diagonal matrix of the eigenvalues $\boldsymbol{\lambda} = \begin{bmatrix}\lambda_1 & \lambda_2\end{bmatrix}^{T}$:
$$\boldsymbol{\Lambda} = \begin{bmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{bmatrix}$$
Replacing $\boldsymbol{\Sigma}^{-1}$ by $\boldsymbol{\Sigma}^{-1} = \mathbf{U}\boldsymbol{\Lambda}^{-1}\mathbf{U}^{-1}$, the quadratic form (squared Mahalanobis distance) can be written as:
$$\begin{bmatrix} x_1-\mu_1 & x_2-\mu_2 \end{bmatrix} \mathbf{U}\boldsymbol{\Lambda}^{-1}\mathbf{U}^{-1} \begin{bmatrix} x_1-\mu_1 \\ x_2-\mu_2 \end{bmatrix} \le \chi_2^2(0.05)$$
since $\mathbf{U}^{T} = \mathbf{U}^{-1}$. Denoting
$$\begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \mathbf{U}^{-1} \begin{bmatrix} x_1-\mu_1 \\ x_2-\mu_2 \end{bmatrix}$$
the quadratic form can then be expressed as:
$$\begin{bmatrix} y_1 & y_2 \end{bmatrix} \begin{bmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{bmatrix}^{-1} \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} \le \chi_2^2(0.05)$$
If the above equation is further evaluated, the result is the equation of an ellipse aligned with the axes $y_1$ and $y_2$ of the new coordinate system:
$$\frac{y_1^2}{\chi_2^2(0.05)\,\lambda_1} + \frac{y_2^2}{\chi_2^2(0.05)\,\lambda_2} \le 1$$
The axes of the ellipse lie along the $y_1$ and $y_2$ directions, with lengths $2\sqrt{\lambda_1\,\chi_2^2(0.05)}$ and $2\sqrt{\lambda_2\,\chi_2^2(0.05)}$, respectively.
When $\rho = 0$, the eigenvalues equal $\lambda_1 = \sigma_1^2$ and $\lambda_2 = \sigma_2^2$. Also, the matrix $\mathbf{U}$, whose columns are the eigenvectors of $\boldsymbol{\Sigma}$, becomes the identity matrix. The final equation of the ellipse is then defined by
$$\frac{(x_1-\mu_1)^2}{\chi_2^2(0.05)\,\lambda_1} + \frac{(x_2-\mu_2)^2}{\chi_2^2(0.05)\,\lambda_2} \le 1$$
It is clear from the above equation that the axes of the ellipse are parallel to the coordinate axes. The lengths of the axes of the ellipse are then $2\sqrt{\sigma_1^2\,\chi_2^2(0.05)}$ and $2\sqrt{\sigma_2^2\,\chi_2^2(0.05)}$.
The covariance matrix can be represented by its eigenvectors and eigenvalues: $\boldsymbol{\Sigma}\mathbf{U} = \mathbf{U}\boldsymbol{\Lambda}$, where $\mathbf{U}$ is the matrix whose columns are the eigenvectors of $\boldsymbol{\Sigma}$ and $\boldsymbol{\Lambda}$ is the diagonal matrix with diagonal elements given by the eigenvalues of $\boldsymbol{\Sigma}$. The transformation is performed in three steps involving scaling, rotation, and translation.
1. Scaling
The covariance matrix can be written as $\boldsymbol{\Sigma} = \mathbf{U}\boldsymbol{\Lambda}\mathbf{U}^{-1} = \mathbf{U}\mathbf{S}\mathbf{S}\mathbf{U}^{-1}$, where $\mathbf{S}$ is a diagonal scaling matrix, $\mathbf{S} = \boldsymbol{\Lambda}^{1/2} = \mathbf{S}^{T}$.
2. Rotation
The rotation matrix $\mathbf{U}$ is formed from the normalized eigenvectors of the covariance matrix $\boldsymbol{\Sigma}$:
$$\mathbf{U} = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}$$
Note that $\mathbf{U}$ is an orthogonal matrix, $\mathbf{U}^{-1} = \mathbf{U}^{T}$ and $|\mathbf{U}| = 1$. Define the combined rotation and scaling matrix $\mathbf{T} = \mathbf{U}\mathbf{S}$, so that $\mathbf{T}^{T} = (\mathbf{U}\mathbf{S})^{T} = \mathbf{S}^{T}\mathbf{U}^{T} = \mathbf{S}\mathbf{U}^{-1}$. The covariance matrix can thus be written as $\boldsymbol{\Sigma} = \mathbf{T}\mathbf{T}^{T}$, and $\mathbf{U}^{T}\boldsymbol{\Sigma}\mathbf{U} = \boldsymbol{\Lambda}$ is diagonal with eigenvalues $\lambda_i$. Since $\mathbf{T} = \mathbf{U}\mathbf{S}$, we have $\mathbf{Y} = \mathbf{T}\mathbf{X} = \mathbf{U}\mathbf{S}\mathbf{X} = \mathbf{U}\boldsymbol{\Lambda}^{1/2}\mathbf{X}$.
$$\begin{bmatrix} x_1(t) \\ x_2(t) \end{bmatrix} = \begin{bmatrix} u_{1x} & u_{2x} \\ u_{1y} & u_{2y} \end{bmatrix} \begin{bmatrix} \sqrt{\lambda_1}\cos t \\ \sqrt{\lambda_2}\sin t \end{bmatrix} = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix} \begin{bmatrix} \sqrt{\lambda_1}\cos t \\ \sqrt{\lambda_2}\sin t \end{bmatrix}$$
The similarity transform is applied to obtain the relation $\mathbf{X}^{T}\boldsymbol{\Sigma}^{-1}\mathbf{X} = \mathbf{Y}^{T}\mathbf{U}^{T}\boldsymbol{\Sigma}^{-1}\mathbf{U}\mathbf{Y} = \mathbf{Y}^{T}\boldsymbol{\Lambda}^{-1}\mathbf{Y}$, and the pdf of the vector $\mathbf{Y}$ can be found to be
$$f_{\mathbf{Y}}(\mathbf{y}) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\lambda_i}}\, e^{-\frac{1}{2}\frac{y_i^2}{\lambda_i}}$$
The ellipse in the transformed frame can be represented as
$$\frac{y_1^2}{\lambda_1} + \frac{y_2^2}{\lambda_2} = c$$
where the eigenvalues $\lambda_1$ and $\lambda_2$ are the variances in the transformed frame (equal to $\sigma_1^2$ and $\sigma_2^2$ when $\rho = 0$).
3. Translation
$$x_1(t) = \sqrt{\lambda_1}\cos\theta\cos t - \sqrt{\lambda_2}\sin\theta\sin t + \mu_1$$
$$x_2(t) = \sqrt{\lambda_1}\sin\theta\cos t + \sqrt{\lambda_2}\cos\theta\sin t + \mu_2$$
The eigenvalues $\boldsymbol{\lambda} = \begin{bmatrix}\lambda_1 & \lambda_2\end{bmatrix}^{T}$ can be calculated through
$$\lambda_1 = \frac{1}{2}\left[ \sigma_1^2 + \sigma_2^2 + \sqrt{(\sigma_1^2-\sigma_2^2)^2 + 4\rho^2\sigma_1^2\sigma_2^2} \right];\qquad \lambda_2 = \frac{1}{2}\left[ \sigma_1^2 + \sigma_2^2 - \sqrt{(\sigma_1^2-\sigma_2^2)^2 + 4\rho^2\sigma_1^2\sigma_2^2} \right]$$
and thus
$$|\boldsymbol{\Sigma}| = \lambda_1\lambda_2 = \sigma_1^2\sigma_2^2(1-\rho^2)$$
From another viewpoint, the covariance matrix can be calculated as
$$\boldsymbol{\Sigma} = \mathbf{U}\boldsymbol{\Lambda}\mathbf{U}^{T} = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix} \begin{bmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{bmatrix} \begin{bmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{bmatrix} = \begin{bmatrix} \lambda_1\cos\theta & -\lambda_2\sin\theta \\ \lambda_1\sin\theta & \lambda_2\cos\theta \end{bmatrix} \begin{bmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{bmatrix} = \begin{bmatrix} \lambda_1\cos^2\theta + \lambda_2\sin^2\theta & (\lambda_1-\lambda_2)\sin\theta\cos\theta \\ \mathrm{sym.} & \lambda_1\sin^2\theta + \lambda_2\cos^2\theta \end{bmatrix}$$
Calculation of the determinant of the covariance matrix above gives the same result, and the inverse is
$$\boldsymbol{\Sigma}^{-1} = \frac{1}{\lambda_1\lambda_2} \begin{bmatrix} \lambda_1\sin^2\theta + \lambda_2\cos^2\theta & (\lambda_2-\lambda_1)\sin\theta\cos\theta \\ \mathrm{sym.} & \lambda_1\cos^2\theta + \lambda_2\sin^2\theta \end{bmatrix} = \begin{bmatrix} \dfrac{\sin^2\theta}{\lambda_2} + \dfrac{\cos^2\theta}{\lambda_1} & \sin\theta\cos\theta\left(\dfrac{1}{\lambda_1} - \dfrac{1}{\lambda_2}\right) \\ \mathrm{sym.} & \dfrac{\sin^2\theta}{\lambda_1} + \dfrac{\cos^2\theta}{\lambda_2} \end{bmatrix}$$
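The scaling-rotation-translation construction above can be sketched directly in code. The following NumPy snippet (the covariance matrix, mean, and confidence level are assumed example values) generates the parametric ellipse points $\mathbf{x}(t) = \mathbf{U}\sqrt{s\boldsymbol{\Lambda}}\,[\cos t\ \ \sin t]^{T} + \boldsymbol{\mu}$:

```python
import numpy as np

def ellipse_points(mu, Sigma, s=5.991, num=200):
    """Parametric confidence-ellipse points: scale by sqrt(s*lambda_i), rotate by U, translate by mu."""
    lam, U = np.linalg.eigh(Sigma)                        # eigenvalues and eigenvectors of Sigma
    t = np.linspace(0.0, 2.0 * np.pi, num)
    circle = np.vstack([np.cos(t), np.sin(t)])            # points on the unit circle
    return (U @ (np.sqrt(s * lam)[:, None] * circle)) + np.asarray(mu)[:, None]

# Assumed example: sigma1 = 4, sigma2 = 2, rho = 0.5, mean (1, 2), 95% ellipse (s ~ 5.991).
Sigma = np.array([[16.0, 4.0], [4.0, 4.0]])
pts = ellipse_points([1.0, 2.0], Sigma)                   # shape (2, num), ready for plotting
```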

2.4. Simulation with a Given Variance-covariance Matrix

Given data $\mathbf{X}\sim N(\boldsymbol{\mu},\boldsymbol{\Sigma})$, an ellipse representing the confidence $p$ can be plotted by calculating the radii of the ellipse, its center, and its rotation. One may specify $\theta$ (from which $\mathbf{U}$ is obtained) and $\mathbf{S}$ to generate the covariance matrix $\boldsymbol{\Sigma}$, from which $\rho$ can be derived. The inclination angle is calculated through:
$$\theta = \begin{cases} 0 & \text{if } \sigma_{12} = 0 \ \text{and}\ \sigma_1^2 \ge \sigma_2^2 \\ \pi/2 & \text{if } \sigma_{12} = 0 \ \text{and}\ \sigma_1^2 < \sigma_2^2 \\ \operatorname{atan2}\!\left(\lambda_1 - \sigma_1^2,\ \sigma_{12}\right) & \text{otherwise} \end{cases}$$
which can be used in the calculation of $\mathbf{U}$,
$$\mathbf{U} = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}$$
and the covariance matrix can be evaluated as $\boldsymbol{\Sigma} = \mathbf{U}\boldsymbol{\Lambda}\mathbf{U}^{T} = \mathbf{U}\mathbf{S}\mathbf{S}\mathbf{U}^{T}$ if $\mathbf{S}$ is specified. Alternatively, given the correlation coefficient $\rho$ and the variances for generating the covariance matrix $\boldsymbol{\Sigma}$, the angle $\theta$ can be obtained.
To generate sampling points that meet a specified correlation, the following procedure can be used. Given two random variables $X_1$ and $X_2$, consider their linear combination $Y = \alpha X_1 + \beta X_2$. For the generation of correlated random variables, if we have two uncorrelated Gaussian random variables $X_1$ and $X_2$, then we can create two correlated random variables using the formula
$$Y = \rho X_1 + \sqrt{1-\rho^2}\, X_2$$
and then Y will have a correlation ρ with X 1 :
$$\rho = \sigma_{12}/(\sigma_1\sigma_2)$$
Based on the relation $\mathbf{X} = \mathbf{A}\mathbf{Z} + \boldsymbol{\mu}$ with $\mathbf{Z}\sim N(\mathbf{0},\mathbf{I})$, the following expression can be employed to generate the sampling points for the scatter plots using the MATLAB software:
X = A * randn(2, K) + mu * ones(1, K)
where $\mathbf{A}$ is the lower triangular matrix from the Cholesky decomposition of $\boldsymbol{\Sigma}$, i.e., $\boldsymbol{\Sigma} = \mathbf{A}\mathbf{A}^{T}$, and $\boldsymbol{\mu}$ is the vector of mean values.
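An equivalent Python/NumPy sketch mirroring the MATLAB expression above (the mean vector and covariance matrix are assumed example values) is:

```python
import numpy as np

rng = np.random.default_rng(1)
K = 5000
mu = np.array([[1.0], [2.0]])                 # column vector of means
Sigma = np.array([[4.0, 2.0], [2.0, 4.0]])    # assumed covariance: sigma1 = sigma2 = 2, rho = 0.5

A = np.linalg.cholesky(Sigma)                 # lower-triangular A with A @ A.T = Sigma
Z = rng.standard_normal((2, K))               # Z ~ N(0, I)
X = A @ Z + mu                                # X = A*Z + mu, so Cov(X) is approximately Sigma

print(np.cov(X))                              # sample covariance, close to Sigma
```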
When $\rho = 0$, the axes of the ellipse are parallel to the original coordinate system, and when $\rho \ne 0$, the axes of the ellipse are aligned with the rotated axes of the transformed coordinate system. Figure 3 and Figure 4 display ellipses drawn for various confidence levels. The plots illustrate confidence (error) ellipses with different confidence levels (i.e., 68%, $s = 2.279$; 90%, $s = 4.605$; 95%, $s = 5.991$; 99%, $s = 9.210$, from inner to outer ellipses), considering the cases where the random variables are (1) positively correlated ($\rho > 0$), (2) negatively correlated ($\rho < 0$), and (3) independent ($\rho = 0$). More specifically, Figure 3 shows the position of the ellipse for various correlation coefficients given the angle of inclination, i.e., $\theta$ is specified to obtain $\rho = \sigma_{12}/(\sigma_1\sigma_2)$: (a) $\theta = 30°$, $\rho \approx 0.55$; (b) $\theta = 0°$, $\rho = 0$; (c) $\theta = 150°$, $\rho \approx -0.55$. On the other hand, Figure 4 shows the position of the ellipse for various values of the correlation coefficient given the angle of inclination, i.e., $\rho$ is specified to obtain $\theta$: (a) $\rho = 0.95$, $\theta = 45°$; (b) $\rho = 0$, $\theta = 0°$; (c) $\rho = -0.95$, $\theta = 135°$. The rotation angle is measured over $0° \le \theta \le 180°$ with respect to the positive axis. When $\rho > 0$ the angle lies in the first quadrant, and when $\rho < 0$ it lies in the second quadrant.
In the following, several scenarios with further illustrations are examined.
(1) Equal variances for two random variables with nonzero ρ :
Case 1: Fixed correlation coefficient. As an example, when $\rho = 0.5$ and the variances $\sigma_1 = \sigma_2 = \sigma$ range from 2 to 5, the results are as shown in Figure 5. As can be seen, the contours and the scatter plots are ellipses instead of circles.
$$\boldsymbol{\Sigma} = \begin{bmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{21} & \sigma_2^2 \end{bmatrix} = \begin{bmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{bmatrix} = \begin{bmatrix} 4^2 & 0.5(4)(2) \\ 0.5(4)(2) & 2^2 \end{bmatrix} = \begin{bmatrix} 16 & 4 \\ 4 & 4 \end{bmatrix}$$
Subplot (a) in Figure 6 shows the ellipses for $\rho = 0.5$ with varying variances. In the present and subsequent illustrations, 95% confidence levels are shown.
Case 2: Increasing correlation coefficient $\rho$ from zero. With fixed variance $\sigma_1 = \sigma_2 = \sigma$, the contour is initially a circle when $\rho = 0$ and becomes an ellipse as $\rho$ increases from zero. Subplot (b) in Figure 6 provides the contours with scatter plots for $\rho = 0, 0.5, 0.9, 0.99$, respectively, when $\sigma_1 = \sigma_2 = 2$. The eccentricity of the ellipses increases with increasing $\rho$.
(2) Unequal variances for two random variables, $\sigma_1 \ne \sigma_2$, with fixed correlation coefficient $\rho = 0.5$. Case 1: $\sigma_1 > \sigma_2$. The variation of the three-dimensional surfaces and ellipses with increasing $\sigma_1/\sigma_2$ is presented in Figure 7 and Figure 8a, where $\sigma_1 = 2 \sim 5$ and $\sigma_2 = 2$.
Case 2: $\sigma_2 > \sigma_1$. The variation of the ellipses with increasing $\sigma_2/\sigma_1$ is presented in Figure 8b, where $\sigma_2 = 2 \sim 5$ and $\sigma_1 = 2$. Figure 9 shows the variation of the inclination angle $\theta$ as a function of $\sigma_1$ and $\sigma_2$ for $\rho = 0$ and $\rho = 0.5$, providing further insight into how $\theta$ varies with $\sigma_1$ and $\sigma_2$.
(3) Variation of the ellipses for various positive and negative correlations. For a given pair of variances, once $\rho$ is specified, the eigenvalues and the inclination angle are obtained accordingly. Figure 10 presents results for the cases $\sigma_1 > \sigma_2$ ($\sigma_1 = 4$, $\sigma_2 = 2$ in this example) and $\sigma_2 > \sigma_1$ ($\sigma_1 = 2$, $\sigma_2 = 4$ in this example) with various correlation coefficients (positive, zero, and negative), namely $\rho = 0, 0.5, 0.9, 0.99$ and $\rho = 0, -0.5, -0.9, -0.99$. In the figure, $\sigma_1 = 4$, $\sigma_2 = 2$ apply to the top plots, while $\sigma_1 = 2$, $\sigma_2 = 4$ apply to the bottom plots; $\rho = 0, 0.5, 0.9, 0.99$ apply to the left plots, while $\rho = 0, -0.5, -0.9, -0.99$ apply to the right plots. Furthermore, Figure 11 provides a comparison of the ellipses for various $\sigma_1$ and $\sigma_2$ for the following cases: (i) $\sigma_1 = 2$, $\sigma_2 = 4$; (ii) $\sigma_1 = 4$, $\sigma_2 = 2$; (iii) $\sigma_1 = \sigma_2 = 2$; (iv) $\sigma_1 = \sigma_2 = 4$, with fixed $\rho = 0.5$.

3. Continuous Entropy/Differential Entropy

Differential entropy (also referred to as continuous entropy) is a concept in information theory that originated in Claude Shannon's attempt to extend the idea of (Shannon) entropy, a measure of the average surprisal of a random variable, to continuous probability distributions. Shannon did not actually derive this formula; he simply assumed it was the correct continuous analogue of discrete entropy, which it is not. The actual continuous analogue of discrete entropy is the limiting density of discrete points (LDDP). Differential entropy (described here) is commonly encountered in the literature, but it is a limiting case of the LDDP, one that loses its fundamental association with discrete entropy.
In the following discussion, differential entropy and relative entropy are measured in bits, since $\log_2$ is used in the definitions. If ln is used instead, they are measured in nats; the only difference in the expressions is a factor of $\log_2 e$.

3.1. Entropy of a Univariate Gaussian Distribution

If we have a continuous random variable $X$ with a probability density function (pdf) $f_X(x)$, the differential entropy of $X$ in bits is expressed as
$$h(X) = -E[\log_2 f_X(x)] = -\int_{-\infty}^{\infty} f_X(x)\log_2 f_X(x)\,dx$$
Let $X$ be a Gaussian random variable, $X \sim N(\mu, \sigma^2)$:
$$f_X(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$$
The differential entropy for this univariate Gaussian distribution can be evaluated as (Appendix A)
$$h(X) = -E[\log_2 f_X(x)] = -\int_{-\infty}^{\infty} f_X(x)\log_2\!\left[\frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}\right] dx = \frac{1}{2}\log_2(2\pi e\sigma^2)$$
Figure 12 shows the differential entropy as a function of $\sigma^2$ for the univariate Gaussian variable; it is concave downward and grows quickly at first and then much more slowly at high values of $\sigma^2$.
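The closed form is easy to cross-check numerically. Below is a small Python/SciPy sketch (purely illustrative; the $\sigma$ values are assumed) comparing the formula with SciPy's built-in differential entropy, converted from nats to bits:

```python
import numpy as np
from scipy.stats import norm

# Differential entropy of a univariate Gaussian: h(X) = 0.5*log2(2*pi*e*sigma^2) bits.
for sigma in (0.5, 1.0, 2.0, 4.0):
    h_bits = 0.5 * np.log2(2 * np.pi * np.e * sigma**2)
    h_scipy_bits = norm(scale=sigma).entropy() / np.log(2)   # SciPy returns nats; convert to bits
    print(f"sigma = {sigma}: {h_bits:.4f} bits (closed form), {h_scipy_bits:.4f} bits (SciPy)")
```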

3.2. Entropy of a Multivariate Gaussian Distribution

Let $\mathbf{X}$ follow a multivariate Gaussian distribution, $\mathbf{X}\sim N(\boldsymbol{\mu},\boldsymbol{\Sigma})$, as given by Equation (2). The differential entropy of $\mathbf{X}$ in bits is
$$h(\mathbf{X}) = -E[\log_2 f_{\mathbf{X}}(\mathbf{x})] = -\int f_{\mathbf{X}}(\mathbf{x})\log_2 f_{\mathbf{X}}(\mathbf{x})\,d\mathbf{x}$$
and the differential entropy is given by (Appendix B)
$$h(\mathbf{X}) = \frac{1}{2}\log_2\!\left((2\pi e)^{n}|\boldsymbol{\Sigma}|\right)$$
The above calculation involves evaluating the expectation of the squared Mahalanobis distance (Appendix C):
$$E\left[(\mathbf{x}-\boldsymbol{\mu})^{T}\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right] = n$$
For a fixed variance, the Gaussian distribution is the pdf that maximizes entropy. Let $\mathbf{X} = \begin{bmatrix}X_1 & X_2\end{bmatrix}^{T}$ be a 2-D Gaussian vector; the entropy of $\mathbf{X}$ can be calculated to be
$$h(\mathbf{X}) = h(X_1, X_2) = \frac{1}{2}\log_2\!\left[(2\pi e)^{2}|\boldsymbol{\Sigma}|\right] = \log_2\!\left(2\pi e\,\sigma_1\sigma_2\sqrt{1-\rho^2}\right)$$
with covariance matrix
$$\boldsymbol{\Sigma} = \begin{bmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2 \end{bmatrix} = \begin{bmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{bmatrix}$$
If σ 1 = σ 2 = σ , this becomes
$$h(X_1, X_2) = \log_2\!\left(2\pi e\,\sigma^2\sqrt{1-\rho^2}\right)$$
which is a concave (downward) function of $\rho^2$ that decreases slowly at first and then very rapidly as $\rho^2$ approaches 1, as shown in Figure 13.
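The bivariate entropy formula can likewise be verified against SciPy's multivariate normal entropy. The sketch below uses assumed example values of $\sigma_1$, $\sigma_2$, and $\rho$:

```python
import numpy as np
from scipy.stats import multivariate_normal

# h(X1, X2) = log2(2*pi*e*sigma1*sigma2*sqrt(1 - rho^2)) bits for a bivariate Gaussian.
sigma1, sigma2, rho = 2.0, 3.0, 0.7
Sigma = np.array([[sigma1**2, rho*sigma1*sigma2],
                  [rho*sigma1*sigma2, sigma2**2]])

h_bits = np.log2(2 * np.pi * np.e * sigma1 * sigma2 * np.sqrt(1 - rho**2))
h_scipy_bits = multivariate_normal(mean=[0, 0], cov=Sigma).entropy() / np.log(2)  # nats -> bits
print(h_bits, h_scipy_bits)   # the two values agree
```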

3.3. The Differential Entropy in the Transformed Frame

The differential entropy is invariant to a translation (change in the mean of the pdf)
$$h(X + a) = h(X)$$
and
$$h(bX) = h(X) + \log_2|b|$$
For a random vector, the differential entropy in the transformed frame remains the same as in the original frame. In general it can be shown that
$$h(\mathbf{Y}) = h(\mathbf{U}\mathbf{X}) = h(\mathbf{X}) + \log_2|\mathbf{U}| = h(\mathbf{X})$$
For the case of multivariate Gaussian distribution, we have
$$h(\mathbf{X}) = \frac{1}{2}\log_2\!\left[(2\pi e)^{n}|\boldsymbol{\Sigma}|\right] = \frac{n}{2}\log_2(2\pi e) + \frac{1}{2}\log_2|\boldsymbol{\Sigma}| = \frac{n}{2}\log_2(2\pi e) + \sum_{i=1}^{n}\frac{1}{2}\log_2\lambda_i$$
It is known that the determinant of the covariance matrix is equal to the product of its eigenvalues:
$$|\boldsymbol{\Sigma}| = \prod_{i=1}^{n}\lambda_i$$
For the case of bivariate Gaussian distribution, n = 2 , we have
$$f_{\mathbf{Y}}(\mathbf{y}) = \prod_{i=1}^{2}\frac{1}{\sqrt{2\pi\lambda_i}}\, e^{-\frac{1}{2}\frac{y_i^2}{\lambda_i}} = \frac{1}{\sqrt{2\pi\lambda_1}}\, e^{-\frac{1}{2}\frac{y_1^2}{\lambda_1}}\cdot \frac{1}{\sqrt{2\pi\lambda_2}}\, e^{-\frac{1}{2}\frac{y_2^2}{\lambda_2}} = \frac{1}{2\pi\sqrt{\lambda_1\lambda_2}}\, e^{-\frac{1}{2}\left(\frac{y_1^2}{\lambda_1} + \frac{y_2^2}{\lambda_2}\right)}$$
It can be shown that the entropy in the transformed frame is given by
$$h(\mathbf{Y}) = \frac{2}{2}\log_2(2\pi e) + \frac{1}{2}\sum_{i=1}^{2}\log_2\lambda_i = \log_2(2\pi e) + \log_2\sqrt{\lambda_1\lambda_2}$$
The detailed derivation is provided in Appendix D. As discussed, the determinant of the covariance matrix is equal to the product of its eigenvalues:
$$|\boldsymbol{\Sigma}| = \lambda_1\lambda_2 = \frac{1}{2}\left[\sigma_1^2+\sigma_2^2+\sqrt{(\sigma_1^2-\sigma_2^2)^2+4\sigma_1^2\sigma_2^2\rho^2}\right]\cdot\frac{1}{2}\left[\sigma_1^2+\sigma_2^2-\sqrt{(\sigma_1^2-\sigma_2^2)^2+4\sigma_1^2\sigma_2^2\rho^2}\right] = \sigma_1^2\sigma_2^2(1-\rho^2)$$
and thus the entropy can be presented as
$$h(Y_1, Y_2) = \frac{1}{2}\log_2\!\left[(2\pi e)^{2}|\boldsymbol{\Sigma}|\right] = \frac{1}{2}\log_2\!\left[(2\pi e)^{2}\sigma_1^2\sigma_2^2(1-\rho^2)\right] = \log_2\!\left(2\pi e\,\sigma_1\sigma_2\sqrt{1-\rho^2}\right)$$
The result confirms the statement that the differential entropy remains unchanged in the transformed frame.
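This invariance can be checked directly in a few lines of NumPy (the covariance parameters below are assumed example values): the entropy computed from $|\boldsymbol{\Sigma}|$ in the original frame matches the entropy computed from the eigenvalues in the rotated frame.

```python
import numpy as np

# Compare h from |Sigma| (original frame) with h from the eigenvalues (transformed frame).
sigma1, sigma2, rho = 2.0, 3.0, 0.6
Sigma = np.array([[sigma1**2, rho*sigma1*sigma2],
                  [rho*sigma1*sigma2, sigma2**2]])
lam = np.linalg.eigvalsh(Sigma)

h_original = 0.5 * np.log2((2 * np.pi * np.e)**2 * np.linalg.det(Sigma))
h_transformed = np.log2(2 * np.pi * np.e) + 0.5 * np.sum(np.log2(lam))
print(h_original, h_transformed)   # identical up to round-off
```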

4. Relative Entropy (Kullback-Leibler Divergence)

In this section, various important issues regarding the relative entropy (Kullback-Leibler divergence) will be delivered. Despite the aforementioned flaws, a useful information theory exists in the continuous case. A key result is that the definitions of relative entropy and mutual information follow naturally from the discrete case and retain their usefulness.
The relative entropy is a type of statistical distance that measures how one probability distribution $f_X$ differs from a second, reference probability distribution $g_X$; it is denoted by
$$D_{KL}(f\,\|\,g) = \int_{-\infty}^{\infty} f_X(x)\log_2\frac{f_X(x)}{g_X(x)}\,dx$$
The detailed derivation is provided in Appendix E. The relative entropy between two Gaussian distributions with different means and variances is given by
$$D_{KL}(f\,\|\,g) = \frac{1}{2}\left[\ln\frac{\sigma_2^2}{\sigma_1^2} + \frac{\sigma_1^2}{\sigma_2^2} + \left(\frac{\mu_1-\mu_2}{\sigma_2}\right)^2 - 1\right]\log_2 e$$
Notice that the relative entropy here is measured in bits, since $\log_2$ is used in the definition. If ln is used instead, it is measured in nats; the only difference in the expression is the factor $\log_2 e$. Several special cases are discussed below.
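Before examining the special cases, the general formula can be checked numerically. The sketch below (parameters assumed for illustration) compares the closed form with direct numerical integration of the defining integral:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def kl_gauss_bits(mu1, s1, mu2, s2):
    """Closed-form KL divergence D(f||g) in bits between N(mu1, s1^2) and N(mu2, s2^2)."""
    nats = 0.5 * (np.log(s2**2 / s1**2) + s1**2 / s2**2 + (mu1 - mu2)**2 / s2**2 - 1)
    return nats / np.log(2)

# Numerical check by integrating f(x) * log2(f(x)/g(x)) over a wide range.
mu1, s1, mu2, s2 = 0.0, 1.0, 1.5, 2.0       # assumed example values
f, g = norm(mu1, s1), norm(mu2, s2)
integrand = lambda x: f.pdf(x) * np.log2(f.pdf(x) / g.pdf(x))
numeric, _ = quad(integrand, -20, 20)

print(kl_gauss_bits(mu1, s1, mu2, s2), numeric)   # the two values agree closely
```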
(1) If $\sigma_1 = \sigma_2 = \sigma$, then $D_{KL}(f\,\|\,g) = \frac{1}{2}\left(\frac{\mu_1-\mu_2}{\sigma}\right)^2\log_2 e$, which is 0 when $\mu_1 = \mu_2$.
Figure 14. Relative entropy as a function of σ and μ1 − μ2 when σ1 = σ2 = σ: (a) three-dimensional surface; (b) contour with entropy gradient.
(2) If $\sigma_1 = \sigma_2 = 1$, then $D_{KL}(f\,\|\,g) = \frac{1}{2}(\mu_1-\mu_2)^2\log_2 e$, which is an even function of $\mu_1-\mu_2$ with a minimum value of 0 when $\mu_1 = \mu_2$.
Figure 15. Variations of relative entropy when σ1 = σ2 = 1: (a) three-dimensional surface as a function of μ1 and μ2; (b) as a function of μ1 − μ2.
- If $\mu_2 = 0$, $D_{KL}(f\,\|\,g) = \frac{1}{2}\mu_1^2\log_2 e$, which is a concave-upward function of $\mu_1$.
- If $\mu_1 = 0$, $D_{KL}(f\,\|\,g) = \frac{1}{2}\mu_2^2\log_2 e$, which is a concave-upward function of $\mu_2$.
(3) If $\mu_1 = \mu_2$, then $D_{KL}(f\,\|\,g) = \frac{1}{2}\left[\ln\frac{\sigma_2^2}{\sigma_1^2} + \frac{\sigma_1^2}{\sigma_2^2} - 1\right]\log_2 e$.
Figure 17. Relative entropy as a function of σ1 and σ2 when μ1 = μ2: (a) the three-dimensional surface; (b) contour with entropy gradient.
- When $\sigma_2 = 1$, $D_{KL}(f\,\|\,g) = \frac{1}{2}\left[\ln\frac{1}{\sigma_1^2} + \sigma_1^2 - 1\right]\log_2 e$.
- When $\sigma_1 = 1$, $D_{KL}(f\,\|\,g) = \frac{1}{2}\left[\ln\sigma_2^2 + \frac{1}{\sigma_2^2} - 1\right]\log_2 e$.
Figure 18. Variations of relative entropy as a function of (a) σ1 when σ2 = 1 is fixed and (b) σ2 when σ1 = 1 is fixed, respectively (μ1 = μ2).
Next, a sensitivity analysis of the relative entropy with respect to changes in the variances and means is performed. The gradient of $D_{KL}(f\,\|\,g)$, given by
$$\nabla D_{KL}(\sigma_1,\sigma_2,\mu_1,\mu_2) = \begin{bmatrix} \dfrac{\partial D_{KL}}{\partial\sigma_1} & \dfrac{\partial D_{KL}}{\partial\sigma_2} & \dfrac{\partial D_{KL}}{\partial\mu_1} & \dfrac{\partial D_{KL}}{\partial\mu_2} \end{bmatrix}^{T}$$
can be calculated using partial derivatives, where the chain rule is involved. Based on the relation $\frac{d}{dx}\ln x = \frac{1}{x}$, we have
$$\frac{\partial}{\partial\sigma_1}\ln\frac{\sigma_2^2}{\sigma_1^2} = \frac{\sigma_1^2}{\sigma_2^2}\cdot\left(-\frac{2\sigma_2^2}{\sigma_1^3}\right) = -\frac{2}{\sigma_1}$$
and the following derivatives are obtained.
$$\text{(1)}\quad \frac{\partial D_{KL}}{\partial\sigma_1} = \frac{\partial}{\partial\sigma_1}\left[\ln\frac{\sigma_2^2}{\sigma_1^2} + \frac{\sigma_1^2}{\sigma_2^2} - 1\right]\frac{\log_2 e}{2} = \left(\frac{\sigma_1}{\sigma_2^2} - \frac{1}{\sigma_1}\right)\log_2 e$$
$$\text{(2)}\quad \frac{\partial D_{KL}}{\partial\sigma_2} = \frac{\partial}{\partial\sigma_2}\left[\ln\frac{\sigma_2^2}{\sigma_1^2} + \frac{\sigma_1^2}{\sigma_2^2} + \frac{(\mu_1-\mu_2)^2}{\sigma_2^2} - 1\right]\frac{\log_2 e}{2} = \left[\frac{1}{\sigma_2} - \frac{\sigma_1^2}{\sigma_2^3} - \frac{(\mu_1-\mu_2)^2}{\sigma_2^3}\right]\log_2 e$$
$$\text{(3)}\quad \frac{\partial D_{KL}}{\partial\mu_1} = \frac{\partial}{\partial\mu_1}\left[\frac{(\mu_1-\mu_2)^2}{\sigma_2^2}\right]\frac{\log_2 e}{2} = \frac{\mu_1-\mu_2}{\sigma_2^2}\log_2 e$$
$$\text{(4)}\quad \frac{\partial D_{KL}}{\partial\mu_2} = \frac{\partial}{\partial\mu_2}\left[\frac{(\mu_1-\mu_2)^2}{\sigma_2^2}\right]\frac{\log_2 e}{2} = \frac{\mu_2-\mu_1}{\sigma_2^2}\log_2 e$$
For optimality in each of the above cases, we have
$$\frac{\partial D_{KL}}{\partial\sigma_1} = \left(\frac{\sigma_1}{\sigma_2^2} - \frac{1}{\sigma_1}\right)\log_2 e = 0 \quad\text{when}\quad \sigma_1^2 = \sigma_2^2$$
$$\frac{\partial D_{KL}}{\partial\sigma_2} = \left[\frac{1}{\sigma_2} - \frac{\sigma_1^2}{\sigma_2^3} - \frac{(\mu_1-\mu_2)^2}{\sigma_2^3}\right]\log_2 e = 0 \quad\text{when}\quad \sigma_2^2 = \sigma_1^2 + (\mu_1-\mu_2)^2$$
$$\frac{\partial D_{KL}}{\partial\mu_1} = \frac{\mu_1-\mu_2}{\sigma_2^2}\log_2 e = 0 \quad\text{when}\quad \mu_1 = \mu_2$$
$$\frac{\partial D_{KL}}{\partial\mu_2} = \frac{\mu_2-\mu_1}{\sigma_2^2}\log_2 e = 0 \quad\text{when}\quad \mu_1 = \mu_2$$
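The stationary conditions can be verified numerically with a finite-difference check (a small sketch under assumed parameter values, not part of the original analysis):

```python
import numpy as np

def kl_bits(s1, s2, m1, m2):
    """KL divergence in bits between N(m1, s1^2) and N(m2, s2^2)."""
    return 0.5 * (np.log(s2**2 / s1**2) + s1**2 / s2**2 + (m1 - m2)**2 / s2**2 - 1) / np.log(2)

# Check that dD/dsigma2 vanishes at sigma2^2 = sigma1^2 + (mu1 - mu2)^2.
s1, m1, m2 = 1.5, 0.0, 2.0                   # assumed example values
s2_star = np.sqrt(s1**2 + (m1 - m2)**2)
eps = 1e-6
dD = (kl_bits(s1, s2_star + eps, m1, m2) - kl_bits(s1, s2_star - eps, m1, m2)) / (2 * eps)
print(dD)                                    # approximately zero at the stationary point
```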

5. Mutual Information

Mutual information is one of many quantities that measure how much one random variable tells us about another. It is a dimensionless quantity, generally expressed in units of bits, and can be thought of as the reduction in uncertainty about one random variable given knowledge of another. The mutual information $I(X;Y)$ between two variables with joint pdf $f_{XY}(x,y)$ is given by
$$I(X;Y) = E\left[\log_2\frac{f_{XY}(x,y)}{f_X(x)f_Y(y)}\right] = \int\!\!\int f_{XY}(x,y)\log_2\frac{f_{XY}(x,y)}{f_X(x)f_Y(y)}\,dx\,dy$$
The mutual information between the random variables X and Y has the following relation
$$I(X;Y) = I(Y;X)$$
where
$$I(X;Y) = h(X) - h(X\mid Y) \ge 0$$
and
$$I(Y;X) = h(Y) - h(Y\mid X) \ge 0$$
implying that $h(X) \ge h(X\mid Y)$ and $h(Y) \ge h(Y\mid X)$. The mutual information of a random variable with itself is the self-information, which is the entropy. High mutual information indicates a large reduction in uncertainty; low mutual information indicates a small reduction; and zero mutual information between two random variables, $I(X;Y) = 0$, means that the variables are independent, in which case $h(X) = h(X\mid Y)$ and $h(Y) = h(Y\mid X)$.
Let’s consider the mutual information between the correlated Gaussian variables X and Y given by
$$I(X;Y) = h(X) + h(Y) - h(X,Y) = \frac{1}{2}\log_2\!\left(2\pi e\,\sigma_x^2\right) + \frac{1}{2}\log_2\!\left(2\pi e\,\sigma_y^2\right) - \frac{1}{2}\log_2\!\left[(2\pi e)^2\sigma_x^2\sigma_y^2(1-\rho^2)\right] = -\frac{1}{2}\log_2(1-\rho^2)$$
Figure 19 presents the mutual information versus $\rho^2$; it grows slowly at first and then very rapidly for high values of $\rho^2$. If $\rho = \pm 1$, the random variables $X$ and $Y$ are perfectly correlated and the mutual information is infinite. It can be seen that $I(X;Y) = 0$ for $\rho = 0$ and that $I(X;Y) \to \infty$ as $\rho \to \pm 1$.
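A short NumPy sketch (assumed example values of $\sigma_x$, $\sigma_y$, and $\rho$) cross-checks the closed form $-\frac{1}{2}\log_2(1-\rho^2)$ against $h(X)+h(Y)-h(X,Y)$ and illustrates the growth with $\rho$:

```python
import numpy as np

def mi_bits(rho):
    """Mutual information between two jointly Gaussian variables with correlation rho, in bits."""
    return -0.5 * np.log2(1 - rho**2)

# Cross-check against I = h(X) + h(Y) - h(X, Y) for assumed sigma_x, sigma_y.
sx, sy, rho = 1.0, 2.0, 0.8
hX = 0.5 * np.log2(2 * np.pi * np.e * sx**2)
hY = 0.5 * np.log2(2 * np.pi * np.e * sy**2)
hXY = 0.5 * np.log2((2 * np.pi * np.e)**2 * sx**2 * sy**2 * (1 - rho**2))
print(mi_bits(rho), hX + hY - hXY)       # the two values agree

for r in (0.0, 0.5, 0.9, 0.99):
    print(r, mi_bits(r))                 # grows rapidly as rho approaches 1
```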
On the other hand, consider the additive white Gaussian noise (AWGN) channel shown in Figure 20; the mutual information is given by
$$I(X;Y) = h(Y) - h(Y\mid X) = \frac{1}{2}\log_2\frac{2\pi e(\sigma_x^2+\sigma_n^2)}{2\pi e\,\sigma_n^2} = \frac{1}{2}\log_2\!\left(1 + \frac{\sigma_x^2}{\sigma_n^2}\right)$$
where
$$h(Y\mid X) = h(N) = h(X,Y) - h(X),$$
and
$$h(Y) = \frac{1}{2}\log_2\!\left[2\pi e(\sigma_x^2+\sigma_n^2)\right];\qquad h(Y\mid X) = h(N) = \frac{1}{2}\log_2\!\left(2\pi e\,\sigma_n^2\right)$$
Mutual information for the additive white Gaussian noise (AWGN) channel is shown in Figure 21, including the three-dimensional surface as a function of $\sigma_x^2$ and $\sigma_n^2$, and also in terms of the signal-to-noise ratio SNR $= \sigma_x^2/\sigma_n^2$. It can be seen that the mutual information grows quickly at first and then much more slowly for high values of the signal-to-noise ratio.
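The AWGN relation $I(X;Y) = \frac{1}{2}\log_2(1+\mathrm{SNR})$ is summarized in the sketch below (the noise and signal powers are assumed example values):

```python
import numpy as np

def awgn_mi_bits(sigma_x2, sigma_n2):
    """Mutual information of the AWGN channel Y = X + N in bits: 0.5*log2(1 + SNR)."""
    return 0.5 * np.log2(1 + sigma_x2 / sigma_n2)

# Assumed example: unit noise power, several signal powers.
for sigma_x2 in (0.1, 1.0, 10.0, 100.0):
    print(f"SNR = {sigma_x2:6.1f} -> I(X;Y) = {awgn_mi_bits(sigma_x2, 1.0):.3f} bits")
```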

6. Conclusions

This paper is intended to serve readers as a supplementary note on the geometric interpretation of the multivariate Gaussian distribution and its entropy, relative entropy, and mutual information. Illustrative examples are employed to provide further insight into this geometric interpretation, enabling readers to interpret the theory correctly for future designs. The fundamental objective is to study the application of multivariate data sets under the Gaussian distribution. The paper examines broad measures of structure for Gaussian distributions and shows that they can be described in terms of the information-theoretic divergence (relative entropy) between the given covariance matrix and correlated random variables. To develop the multivariate Gaussian distribution together with its entropy and mutual information, several key methodologies are presented and supported by illustrations, both technically and statistically. The material allows readers to better perceive the concepts, comprehend the techniques, and properly implement software programs for future study of the topic's science and implementations; it also helps readers grasp the themes' fundamental concepts. Involving relative entropy and mutual information as well as covariance analysis of correlated variables based on differential entropy, a wide range of material is addressed, from basic to application concerns.

Author Contributions

Conceptualization, D.-J.J.; methodology, D.-J.J.; software, D.-J.J.; validation, D.-J.J. and T.-S.C.; writing—original draft preparation, D.-J.J. and T.-S.C.; writing—review and editing, D.-J.J., T.-S.C. and A. B.; supervision, D.-J.J. All authors have read and agreed to the published version of the manuscript.

Funding

The author gratefully acknowledges the support of the National Science and Technology Council, Taiwan under grant number NSTC 111-2221-E-019-047.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Derivation of the differential entropy for the univariate Gaussian distribution

$$\begin{aligned} h(X) &= -E[\log_2 f_X(x)] = -\int_{-\infty}^{\infty} f_X(x)\log_2 f_X(x)\,dx = -\int_{-\infty}^{\infty} f_X(x)\log_2\!\left[\frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}\right] dx \\ &= -\int_{-\infty}^{\infty} f_X(x)\left[-\frac{1}{2}\log_2(2\pi\sigma^2) - \frac{(x-\mu)^2}{2\sigma^2}\log_2 e\right] dx \\ &= \frac{1}{2}\log_2(2\pi\sigma^2)\int_{-\infty}^{\infty} f_X(x)\,dx + \frac{\log_2 e}{2\sigma^2}\int_{-\infty}^{\infty}(x-\mu)^2 f_X(x)\,dx \\ &= \frac{1}{2}\log_2(2\pi\sigma^2) + \frac{\sigma^2}{2\sigma^2}\log_2 e = \frac{1}{2}\log_2(2\pi e\sigma^2) \end{aligned}$$

Appendix B. Derivation of the differential entropy for the multivariate Gaussian distribution

$$\begin{aligned} h(\mathbf{X}) &= -E\left[\log_2 f_{\mathbf{X}}(\mathbf{x})\right] = -E\left[\log_2\!\left(\frac{1}{\sqrt{(2\pi)^n|\boldsymbol{\Sigma}|}}\, e^{-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})}\right)\right] \\ &= -E\left[-\frac{n}{2}\log_2(2\pi) - \frac{1}{2}\log_2|\boldsymbol{\Sigma}| - \frac{\log_2 e}{2}(\mathbf{x}-\boldsymbol{\mu})^T\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right] \\ &= \frac{n}{2}\log_2(2\pi) + \frac{1}{2}\log_2|\boldsymbol{\Sigma}| + \frac{\log_2 e}{2}\,E\left[(\mathbf{x}-\boldsymbol{\mu})^T\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right] \\ &= \frac{n}{2}\log_2(2\pi) + \frac{1}{2}\log_2|\boldsymbol{\Sigma}| + \frac{n}{2}\log_2 e = \frac{1}{2}\log_2\!\left[(2\pi e)^n|\boldsymbol{\Sigma}|\right] \end{aligned}$$
The calculation involves the evaluation of expectation of the Mahalanobis distance.

Appendix C. Evaluation of the expectation of the Mahalanobis distance $E[(\mathbf{x}-\boldsymbol{\mu})^T\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})] = n$

$$E\left[(\mathbf{x}-\boldsymbol{\mu})^T\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right] = E\left[\mathrm{tr}\!\left((\mathbf{x}-\boldsymbol{\mu})^T\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right)\right] = E\left[\mathrm{tr}\!\left(\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})(\mathbf{x}-\boldsymbol{\mu})^T\right)\right] = \mathrm{tr}\!\left(\boldsymbol{\Sigma}^{-1}E\left[(\mathbf{x}-\boldsymbol{\mu})(\mathbf{x}-\boldsymbol{\mu})^T\right]\right) = \mathrm{tr}(\boldsymbol{\Sigma}^{-1}\boldsymbol{\Sigma}) = \mathrm{tr}(\mathbf{I}_n) = n$$
A special case for n = 1
$$E\left[(x-\mu)^T\sigma^{-2}(x-\mu)\right] = E\left[\frac{(x-\mu)^2}{\sigma^2}\right] = \int_{-\infty}^{\infty} f_X(x)\frac{(x-\mu)^2}{\sigma^2}\,dx = \frac{1}{\sigma^2}\int_{-\infty}^{\infty}(x-\mu)^2 f_X(x)\,dx = 1$$

Appendix D. Derivation of the differential entropy in the transformed frame

$$\begin{aligned} h(\mathbf{Y}) &= -E\left[\log_2\prod_{i=1}^{n}\frac{1}{\sqrt{2\pi\lambda_i}}\, e^{-\frac{1}{2}\frac{y_i^2}{\lambda_i}}\right] = -\sum_{i=1}^{n}E\left[\log_2\!\left(\frac{1}{\sqrt{2\pi\lambda_i}}\, e^{-\frac{1}{2}\frac{y_i^2}{\lambda_i}}\right)\right] \\ &= -\sum_{i=1}^{n}\int_{-\infty}^{\infty} f_Y(y_i)\left[-\frac{1}{2}\log_2(2\pi\lambda_i) - \frac{1}{2}\frac{y_i^2}{\lambda_i}\log_2 e\right] dy_i = \sum_{i=1}^{n}\left[\frac{1}{2}\log_2(2\pi\lambda_i) + \frac{1}{2}\log_2 e\right] \\ &= \sum_{i=1}^{n}\left[\frac{1}{2}\log_2(2\pi) + \frac{1}{2}\log_2\lambda_i + \frac{1}{2}\log_2 e\right] = \frac{n}{2}\log_2(2\pi e) + \frac{1}{2}\sum_{i=1}^{n}\log_2\lambda_i \end{aligned}$$
The eigenvalues $\lambda_i$ are the diagonal elements of the covariance matrix (i.e., the variances) in the transformed frame. When $\rho = 0$, the eigenvalues equal $\lambda_i = \sigma_i^2$.

Appendix E. Derivation of the Kullback–Leibler divergence between two normal distributions

$$\begin{aligned} D_{KL}(f\,\|\,g) &= \int_{-\infty}^{\infty} f_X(x)\log_2\frac{f_X(x)}{g_X(x)}\,dx = \int_{-\infty}^{\infty} f_X(x)\log_2\frac{\frac{1}{\sqrt{2\pi}\,\sigma_1}e^{-\frac{1}{2}\left(\frac{x-\mu_1}{\sigma_1}\right)^2}}{\frac{1}{\sqrt{2\pi}\,\sigma_2}e^{-\frac{1}{2}\left(\frac{x-\mu_2}{\sigma_2}\right)^2}}\,dx \\ &= \int_{-\infty}^{\infty} f_X(x)\log_2\frac{\sigma_2}{\sigma_1}\,dx + \int_{-\infty}^{\infty} f_X(x)\log_2\exp\!\left[-\frac{1}{2}\left(\frac{x-\mu_1}{\sigma_1}\right)^2 + \frac{1}{2}\left(\frac{x-\mu_2}{\sigma_2}\right)^2\right] dx \\ &= \log_2\frac{\sigma_2}{\sigma_1} - \frac{\log_2 e}{2\sigma_1^2}\int_{-\infty}^{\infty} f_X(x)(x-\mu_1)^2\,dx + \frac{\log_2 e}{2\sigma_2^2}\int_{-\infty}^{\infty} f_X(x)(x-\mu_2)^2\,dx \\ &= \log_2\frac{\sigma_2}{\sigma_1} - \frac{\log_2 e}{2} + \frac{\log_2 e}{2\sigma_2^2}\int_{-\infty}^{\infty} f_X(x)\left[(x-\mu_1) + (\mu_1-\mu_2)\right]^2 dx \\ &= \log_2\frac{\sigma_2}{\sigma_1} - \frac{\log_2 e}{2} + \frac{\log_2 e}{2\sigma_2^2}\int_{-\infty}^{\infty} f_X(x)\left[(x-\mu_1)^2 + (\mu_1-\mu_2)^2 + 2(x-\mu_1)(\mu_1-\mu_2)\right] dx \\ &= \frac{1}{2}\log_2\frac{\sigma_2^2}{\sigma_1^2} - \frac{\log_2 e}{2} + \frac{\log_2 e}{2\sigma_2^2}\left[\sigma_1^2 + (\mu_1-\mu_2)^2\right] = \frac{1}{2}\left[\ln\frac{\sigma_2^2}{\sigma_1^2} + \frac{\sigma_1^2}{\sigma_2^2} + \left(\frac{\mu_1-\mu_2}{\sigma_2}\right)^2 - 1\right]\log_2 e \end{aligned}$$
where the identity $\log_2(\cdot) = \log_2 e\,\ln(\cdot)$ was used.

References

  1. Verdú, S. (1990). On channel capacity per unit cost, IEEE Trans. Inf. Theory, vol. 36, no. 5, pp. 1019–1030. [CrossRef]
  2. Lapidoth, A. and Shamai (Shitz), S. (2002). Fading channels: How perfect need perfect side information be? IEEE Trans. Inf. Theory, vol. 48, no. 5, pp. 1118–1134.
  3. Verdú, S. (2002). Spectral efficiency in the wideband regime, IEEE Trans. Inf. Theory, vol. 48, no. 6, pp. 1319–1343. [CrossRef]
  4. Prelov, V. and Verdú, S. (2004). Second-order asymptotics of mutual information, IEEE Trans. Inf. Theory, vol. 50, no. 8, pp. 1567–1580. [CrossRef]
  5. Kailath, T. (1968). A note on least squares estimates from likelihood ratios, Inf. Contr., vol. 13, pp. 534–540. [CrossRef]
  6. Kailath, T. (1969). A general likelihood-ratio formula for random signals in Gaussian noise, IEEE Trans. Inf. Theory, vol. IT-15, no. 2, pp. 350–361. [CrossRef]
  7. Kailath, T. (1970). A further note on a general likelihood formula for random signals in Gaussian noise, IEEE Trans. Inf. Theory, vol. IT-16, no. 4, pp. 393–396. [CrossRef]
  8. Jaffer A. G. and Gupta S. C. (1972). On relations between detection and estimation of discrete time processes, Inf. Contr., vol. 20, pp. 46–54. [CrossRef]
  9. Jwo, D. J., Biswal, A. (2023). Implementation and Performance Analysis of Kalman Filters with Consistency Validation. Mathematics, 11, 521. [CrossRef]
  10. Duncan, T. E. (1970). On the calculation of mutual information, SIAM J. Applied Mathematics, vol. 19, pp. 215–220. [CrossRef]
  11. Kadota, T. T., Zakai, M. and Ziv, J. (1971). Mutual information of the white Gaussian channel with and without feedback, IEEE Trans. Inf. Theory, vol. IT-17, no. 4, pp. 368–371. [CrossRef]
  12. Amari, S. I. (2016). Information geometry and its applications, Vol. 194. Springer.
  13. Schneidman, E., Still, S., Berry, M. J. and Bialek, W. (2003). Network information and connected correlations. Physical review letters, 91(23), p.238701. [CrossRef]
  14. Timme, N., Alford, W., Flecker, B. and Beggs, J. M. (2014). Synergy, redundancy, and multivariate information measures: an experimentalist’s perspective. Journal of computational neuroscience, 36, pp.119-140.
  15. Liang, K. C. and Wang, X. (2008). Gene regulatory network reconstruction using conditional mutual information. EURASIP Journal on Bioinformatics and Systems Biology, 2008, pp.1-14. [CrossRef]
  16. Panzeri, S., Magri, C. and Logothetis, N. K. (2008). On the use of information theory for the analysis of the relationship between neural and imaging signals. Magnetic resonance imaging, 26(7), pp.1015-1025. [CrossRef]
  17. Katz, Y., Tunstrøm, K., Ioannou, C. C., Huepe, C. and Couzin, I. D. (2011). Inferring the structure and dynamics of interactions in schooling fish. Proceedings of the National Academy of Sciences, 108(46), pp.18720-18725. [CrossRef]
  18. Cutsuridis, V., Hussain, A. and Taylor, J. G. eds. (2011). Perception-action cycle: Models, architectures, and hardware. Springer Science & Business Media.
  19. Ay, N., Bernigau, H., Der, R. and Prokopenko, M. (2012). Information-driven self-organization: the dynamical system approach to autonomous robot behavior. Theory in Biosciences, 131, pp.161-179. [CrossRef]
  20. Rosas, F., Ntranos, V., Ellison, C. J., Pollin, S. and Verhelst, M. (2016). Understanding interdependency through complex information sharing. Entropy, 18(2), p.38. [CrossRef]
  21. Ince, R. A. (2017). The Partial Entropy Decomposition: Decomposing multivariate entropy and mutual information via pointwise common surprisal. arXiv preprint arXiv:1702.01591.
  22. Perrone, P. and Ay, N. (2016). Hierarchical quantification of synergy in channels. Frontiers in Robotics and AI, 2, p.35. [CrossRef]
  23. Bertschinger, N., Rauh, J., Olbrich, E., Jost, J. and Ay, N. (2014). Quantifying unique information. Entropy, 16(4), pp.2161-2183. [CrossRef]
  24. Harder, M., Salge, C. and Polani, D. (2013). Bivariate measure of redundant information. Physical Review E, 87(1), p.012130. [CrossRef]
  25. Rauh, J., Banerjee, P. K., Olbrich, E., Jost, J. and Bertschinger, N. (2017). On extractable shared information. Entropy, 19(7), p.328. [CrossRef]
  26. Ince, R. A. (2017). Measuring multivariate redundant information with pointwise common change in surprisal. Entropy, 19(7), p.318. [CrossRef]
  27. Chicharro, D. and Panzeri, S. (2017). Synergy and redundancy in dual decompositions of mutual information gain and information loss. Entropy, 19(2), p.71. [CrossRef]
Figure 1. Standard parametric representation of an ellipse following de La Hire's point construction.
Figure 3. The position of the ellipse for various correlation coefficients given the angle of inclination (specify θ to obtain ρ = σ12/(σ1σ2)): (a) θ = 30°, ρ ≈ 0.55; (b) θ = 0°, ρ = 0; (c) θ = 150°, ρ ≈ −0.55.
Figure 4. The position of the ellipse for various values of the correlation coefficient given the angle of inclination (specify ρ to obtain θ): (a) ρ = 0.95, θ = 45°; (b) ρ = 0, θ = 0°; (c) ρ = −0.95, θ = 135°.
Figure 5. Equal variances σ1 = σ2 = σ for a fixed ρ = 0.5: (a) σ = 2; (b) σ = 3; (c) σ = 4; (d) σ = 5.
Figure 6. Ellipses for (a) ρ = 0.5 with varying variances σ1 = σ2 = σ = 2 ~ 5; (b) equal variances σ1 = σ2 = 2 with varying ρ = 0, 0.5, 0.9, 0.99.
Figure 7. σ1 > σ2, σ1/σ2 increases (σ1 = 2 ~ 5, σ2 = 2) for a fixed ρ = 0.5.
Figure 8. Ellipses when σ1 ≠ σ2 for a fixed ρ = 0.5: (a) σ1 > σ2, σ1/σ2 increases, where σ1 = 2 ~ 5 and σ2 = 2; (b) σ2 > σ1, σ2/σ1 increases, where σ2 = 2 ~ 5 and σ1 = 2.
Figure 9. Variation of the inclination angle as a function of σ1 and σ2 for (a) ρ = 0.5; (b) ρ = 0.
Figure 10. σ1 > σ2 (σ1 = 4, σ2 = 2) with (a) ρ = 0, 0.5, 0.9, 0.99; (b) ρ = 0, −0.5, −0.9, −0.99, as compared to σ2 > σ1 (σ1 = 2, σ2 = 4) with (c) ρ = 0, 0.5, 0.9, 0.99; (d) ρ = 0, −0.5, −0.9, −0.99.
Figure 11. Comparison of the ellipses for various σ1 and σ2: (i) σ1 = 2, σ2 = 4; (ii) σ1 = 4, σ2 = 2; (iii) σ1 = σ2 = 2; (iv) σ1 = σ2 = 4, with fixed ρ = 0.5.
Figure 12. The differential entropy as a function of σ² for a univariate Gaussian variable.
Figure 13. Differential entropy for the bivariate Gaussian distribution (a) as a function of ρ² and σ²; (b) as a function of ρ² when σ1 = σ2 = 1.
Figure 19. Mutual information versus ρ² between the correlated Gaussian variables.
Figure 20. Schematic illustration of the additive white Gaussian noise (AWGN) channel.
Figure 21. Mutual information for the additive white Gaussian noise (AWGN) channel: (a) the three-dimensional surface as a function of σx² and σn²; (b) in terms of the signal-to-noise ratio.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits the free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.