1.1. Aims and Scope
Entropy is a concept that appears in different areas of physics and mathematics with different meanings. Thus, entropy is a measure of: (i) disorder in Statistical Mechanics, (ii) uncertainty in Information and Probability Theories, (iii) (pseudo-)randomness in the Theory of Measure-preserving Dynamical Systems, and (iv) complexity in Topological Dynamics. This versatility explains why entropy has found extensive applications across various scientific disciplines since its inception in the 19th century.
Specifically, this paper aims to provide an up-to-date overview of the applications of entropy in data analysis and machine learning, where entropy stands not only for the traditional instances but also for more recent proposals inspired by them. In data analysis, entropy is a powerful tool for the detection of dynamical changes, segmentation, clustering, discrimination, etc. In machine learning, it is used for classification, feature extraction, optimization of algorithms, anomaly detection, and more. The ability of entropy to provide insights into data structure and algorithm performance has led to a widespread search for further applications and new proposals tailored to specific needs, both in data analysis and machine learning.
This being the case, the present review will be useful for researchers in the above two fields who are interested in the theoretical basics and/or the current applications of entropy. Along with established applications, the authors have also taken into account innovative proposals, so as to reflect the intense research activity on entropy currently underway.
At this point, the reader may be wondering what entropy is. A search for the word “entropy” on the Internet returns a large number of results, some of them also called entropy metrics, entropy-like measures or entropy-based indices in the literature. So, what is entropy actually?
1.2. Classical Entropies and Generalized Entropies
Historically, the word “entropy” was introduced by the German physicist Clausius in Thermodynamics in 1865 to designate the amount of internal energy in a system that cannot be transformed into work. In particular, entropy determines the equilibrium of a thermodynamical system, namely, the state of maximum entropy consistent with the macroscopic constraints. In the second half of the 19th century, entropy was given a microscopic interpretation in the foundational works of Boltzmann and Gibbs on Statistical Mechanics. In 1927, von Neumann generalized the Boltzmann-Gibbs entropy to the then-emerging theory of Quantum Mechanics [
1]. In 1948, the word entropy appeared in a completely different context: Information Theory. Whereas entropy is a measure of disorder in Statistical Mechanics, in the seminal paper of Shannon [2], the creator of Information Theory, entropy stands for the average uncertainty about the outcome of a random variable (or the information conveyed by knowing it). Albeit in different realms, the coincidence in names is explained because Shannon’s formula (see equation (
1) below) is formally the same as Gibbs’ for the entropy of a system in thermal equilibrium with a heat bath at constant temperature [
3].
This abridged history of entropy continues with Kolmogorov, who crafted Shannon’s entropy into a useful invariant in Ergodic Theory [
4], and Sinai, who adapted Kolmogorov’s proposal to the theory of measure-preserving dynamical systems [
5]. In turn, Adler, Konheim and McAndrew [
6] generalized the Kolmogorov-Sinai (KS) entropy from measure-preserving dynamics to topological dynamics under the name of topological entropy. According to the Variational Principle, topological entropy is a tight upper bound of the KS entropies defined over certain probability measures [
7].
To get down to the mathematical formulas, let $\Delta_W$ be the set of probability mass distributions $p=(p_1,\ldots,p_W)$, where $p_i\ge 0$ for all $1\le i\le W$ and $\sum_{i=1}^{W}p_i=1$. Then, the Shannon entropy of the probability distribution $p\in\Delta_W$ is defined as
$$H(p)=H(p_1,\ldots,p_W)=-\sum_{i=1}^{W}p_i\log p_i,\tag{1}$$
where the choice of the logarithm base fixes the unit of the entropy, the usual choices being 2 (bit), $e$ (nat) or 10 (dit). If $p_i=0$, then $p_i\log p_i:=0$. Mathematically, equation (1) is the expected value of the information function $-\log p(X)$, where $X$ is a random variable with probability distribution $p$.
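For concreteness, equation (1) can be evaluated in a few lines of code. The following Python sketch is only illustrative (the function name `shannon_entropy` and the example distribution are ours, not part of any standard library); it computes the entropy of a small distribution in bits, nats and dits.

```python
import numpy as np

def shannon_entropy(p, base=2):
    """Shannon entropy, equation (1), of a probability mass distribution p."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                              # convention: p_i log p_i = 0 when p_i = 0
    return -np.sum(p * np.log(p)) / np.log(base)

p = [0.5, 0.25, 0.125, 0.125]                 # illustrative distribution
print(shannon_entropy(p, base=2))             # 1.75 bits
print(shannon_entropy(p, base=np.e))          # ~1.213 nats
print(shannon_entropy(p, base=10))            # ~0.527 dits
```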
Since entropy is the cornerstone of Information Theory, Shannon also justified definition (1) by proving in his seminal paper [2] that it is unique (up to a positive factor) under a few general assumptions. In their modern (equivalent) formulation, these assumptions are called the Shannon-Khinchin axioms [8], which we state below.
A positive functional $H$ on $\Delta:=\bigcup_{W\ge 1}\Delta_W$, i.e., a map $H:\Delta\to\mathbb{R}_{\ge 0}$ ($\mathbb{R}_{\ge 0}$ being the non-negative real numbers), is an entropy if it satisfies the following properties:
- SK1 Continuity. $H(p_1,\ldots,p_W)$ depends continuously on all variables for each $W$.
- SK2 Maximality. For each $W$, $H(p_1,\ldots,p_W)$ attains its maximum at the uniform distribution: $H(p_1,\ldots,p_W)\le H(1/W,\ldots,1/W)$.
- SK3 Expansibility. For all $W$ and $(p_1,\ldots,p_W)\in\Delta_W$, $H(p_1,\ldots,p_W,0)=H(p_1,\ldots,p_W)$.
- SK4 Strong additivity (or separability). For all $(p_{11},\ldots,p_{WU})\in\Delta_{WU}$,
$$H(p_{11},\ldots,p_{WU})=H(P_1,\ldots,P_W)+\sum_{i=1}^{W}P_i\,H\!\Big(\frac{p_{i1}}{P_i},\ldots,\frac{p_{iU}}{P_i}\Big),$$
where $P_i=\sum_{j=1}^{U}p_{ij}$.
Axiom SK4 can be formulated in a more compact way as
$$H(X,Y)=H(X)+H(Y|X),\tag{2}$$
where $X$ and $Y$ are random variables with probability distributions $(P_1,\ldots,P_W)$ and $(Q_1,\ldots,Q_U)$, respectively, $H(X,Y)$ is the entropy of their joint distribution $(p_{ij})$, and $H(Y|X)$ is the entropy of $Y$ conditional on $X$, i.e., the expected value of the entropies of the conditional distributions of $Y$ given $X=x_i$, averaged over the conditioning variable $X$ [9]. In particular, if $X$ and $Y$ are independent (i.e., $p_{ij}=P_iQ_j$), then $H(Y|X)=H(Y)$ and
$$H(X,Y)=H(X)+H(Y).\tag{3}$$
If $H$ satisfies equation (3) for independent random variables $X$ and $Y$, then it is called additive.
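A short numerical check may help fix ideas. The following Python sketch (the helper `H` and the joint distribution `p_xy` are chosen here only for illustration) verifies strong additivity in its compact form and additivity for independent variables.

```python
import numpy as np

def H(p, base=2):
    """Shannon entropy, equation (1), of a probability mass distribution p."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]                                       # p_i log p_i = 0 for p_i = 0
    return -np.sum(p * np.log(p)) / np.log(base)

# Illustrative joint distribution p_ij of (X, Y); rows index X, columns index Y.
p_xy = np.array([[0.30, 0.10],
                 [0.20, 0.40]])
P = p_xy.sum(axis=1)                                   # marginal distribution of X

# Strong additivity in compact form: H(X, Y) = H(X) + H(Y|X),
# with H(Y|X) the average of the entropies of the conditional rows p_ij / P_i.
H_cond = sum(P[i] * H(p_xy[i] / P[i]) for i in range(len(P)))
print(np.isclose(H(p_xy), H(P) + H_cond))              # True

# Additivity (equation (3)) for independent X and Y, i.e., p_ij = P_i Q_j.
Q = np.array([0.7, 0.3])
print(np.isclose(H(np.outer(P, Q)), H(P) + H(Q)))      # True
```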
It was proved in [2] and [8] that a positive functional $H$ on $\Delta$ that fulfills Axioms SK1-SK4 is necessarily of the form
$$H_{BGS}(p_1,\ldots,p_W)=-k\sum_{i=1}^{W}p_i\log p_i\tag{4}$$
for every $(p_1,\ldots,p_W)\in\Delta_W$, where $k$ is a positive constant. For historical reasons, $H_{BGS}$ is usually called the Boltzmann-Gibbs-Shannon (BGS) entropy. In Physics, $k$ is the Boltzmann constant $k_B\simeq 1.38\times 10^{-23}$ J/K and log is the natural logarithm. In Information Theory, $k=1$ and log is the base 2 logarithm when dealing with digital communications. The particular case
$$H_B=k\log W\tag{5}$$
obtained for uniform distributions, $p_i=1/W$ for $1\le i\le W$, is sometimes referred to as the Boltzmann entropy, although the expression (5) is actually due to Planck [10]. According to Axiom SK2, the Boltzmann entropy is the maximum of $H_{BGS}$.
The same conclusion about the uniqueness of $H_{BGS}$ can be derived using other equivalent properties [11]. Since we are not interested in physical applications here, we set $k=1$ and generally refer to $H_{BGS}=H$ as Shannon’s entropy.
In 1961 Rényi proposed a generalization of Shannon’s entropy by using a different, more general definition of expectation value [12,13]: For any real $q>0$, $q\ne 1$, the Rényi entropy of order $q$ is defined as
$$H_q^{(R)}(p_1,\ldots,p_W)=\frac{1}{1-q}\log\sum_{i=1}^{W}p_i^{\,q}.\tag{6}$$
So, Rényi entropy is actually a family of entropies; in particular, $\lim_{q\to 1}H_q^{(R)}=H$, Shannon’s entropy. Other limiting cases are $\lim_{q\to 0}H_q^{(R)}$, called the Hartley or max-entropy, which coincides with the Boltzmann entropy (5) except for the value of the constant $k$, and $\lim_{q\to\infty}H_q^{(R)}=-\log\max_{1\le i\le W}p_i$, called the min-entropy. These names are due to the decreasing monotonicity of Rényi’s entropy with respect to the parameter: $H_q^{(R)}\ge H_{q'}^{(R)}$ for $q<q'$.
It is easy to show that Rényi’s entropy satisfies Axioms SK1-SK3 but not SK4. Instead of strong additivity, $H_q^{(R)}$ satisfies additivity:
$$H_q^{(R)}(X,Y)=H_q^{(R)}(X)+H_q^{(R)}(Y)$$
for independent random variables $X$ and $Y$; see equation (3).
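The limiting behaviour of the Rényi family is easy to explore numerically. In the Python sketch below (the distribution is illustrative and the finite values of $q$ merely stand in for the limits $q\to 1$, $q\to 0$ and $q\to\infty$), the computed values approach Shannon’s entropy, the max-entropy and the min-entropy, respectively, and the monotonicity in $q$ is also checked.

```python
import numpy as np

def renyi(p, q):
    """Rényi entropy of order q, equation (6), with q > 0, q != 1 (in nats)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return np.log(np.sum(p ** q)) / (1.0 - q)

def shannon(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

p = [0.5, 0.25, 0.125, 0.125]                 # illustrative distribution, W = 4

print(renyi(p, 1 + 1e-6), shannon(p))         # q -> 1:   recovers Shannon's entropy
print(renyi(p, 1e-6), np.log(len(p)))         # q -> 0:   Hartley / max-entropy = log W
print(renyi(p, 200.0), -np.log(max(p)))       # q -> inf: min-entropy = -log(max_i p_i)
print(renyi(p, 0.5) >= renyi(p, 2.0))         # True: non-increasing in q
```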
A final milestone in this short history of entropy is the introduction of non-additive entropies by Havrda and Charvát in Information Theory [14] and Tsallis in Statistical Mechanics [15], which are equivalent and usually called the Tsallis entropy:
$$H_q^{(T)}(p_1,\ldots,p_W)=\frac{1}{q-1}\Big(1-\sum_{i=1}^{W}p_i^{\,q}\Big)\tag{7}$$
for any real $q>0$, $q\ne 1$. Again, Tsallis entropy is a family of entropies that satisfies Axioms SK1-SK3 but not SK4. Instead, $H_q^{(T)}$ is “$q$-additive”, meaning that
$$H_q^{(T)}(X,Y)=H_q^{(T)}(X)+H_q^{(T)}(Y)+(1-q)\,H_q^{(T)}(X)\,H_q^{(T)}(Y)$$
for independent random variables $X$ and $Y$. As with Rényi’s entropy, the Tsallis entropy is a generalization of Shannon’s entropy in the sense that $\lim_{q\to 1}H_q^{(T)}=H$. Formally, $H_q^{(T)}$ can be obtained from $H$ by replacing the logarithm in equation (4) by the “$q$-logarithm” [13].
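Analogously, the $q$-additivity of the Tsallis family for independent variables can be verified numerically. The distributions and the order $q$ in the following Python sketch are chosen only for illustration.

```python
import numpy as np

def tsallis(p, q):
    """Tsallis entropy of order q, equation (7), with q != 1."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return (1.0 - np.sum(p ** q)) / (q - 1.0)

p = np.array([0.5, 0.25, 0.125, 0.125])       # distribution of X (illustrative)
r = np.array([0.6, 0.4])                      # distribution of Y, independent of X
q = 2.0

joint = np.outer(p, r).ravel()                # p_ij = p_i r_j for independent X, Y
lhs = tsallis(joint, q)
rhs = (tsallis(p, q) + tsallis(r, q)
       + (1.0 - q) * tsallis(p, q) * tsallis(r, q))
print(np.isclose(lhs, rhs))                   # True: q-additivity

# q -> 1 recovers Shannon's entropy (natural logarithm)
print(tsallis(p, 1 + 1e-8), -np.sum(p * np.log(p)))
```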
The appearance of generalizations of the Shannon entropy prompted the weaker concept of a generalized entropy: a positive functional on probability distributions that satisfies Axioms SK1-SK3. Therefore, the BGS entropy, together with the Rényi and Tsallis entropies, are examples of generalized entropies. Shannon’s uniqueness theorem can then be rephrased by saying that the only generalized entropy that is strongly additive is the BGS entropy. Axioms SK1-SK3 are arguably the minimal requirements for a positive functional on probability mass distributions to be called an entropy. Most “entropies” proposed since the formulation of the Rényi and Tsallis entropies are precisely generalized entropies in this axiomatic sense.
To wrap up this short account of the classical and generalized entropies, let us mention that Shannon’s, Rényi’s and Tsallis’ entropies (and other entropies for that matter) have counterparts for continuous-valued random variables and processes (i.e., defined on probability densities). These “differential” versions are formally obtained by replacing probability mass functions by probability densities and summations by integrations in equations (
1), (
6) and (
7), respectively. Although also useful in applications, differential entropies may lack important properties of their discrete counterparts; for example, the differential Shannon entropy can be negative, i.e., it lacks positivity [9].
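A standard example illustrates this last point. If $X$ is uniformly distributed on an interval $[0,a]$, so that its density is $f(x)=1/a$ for $0\le x\le a$, then its differential Shannon entropy is
$$h(X)=-\int_{0}^{a}\frac{1}{a}\log\frac{1}{a}\,dx=\log a,$$
which is negative whenever $a<1$ (e.g., $h(X)=-\log 2$ for $a=1/2$), in contrast to the discrete case.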