1. Introduction
The Simpson index of a population distributed among $k$ categories or classes is defined as
$$S = \sum_{i=1}^{k} p_i^2,$$
where $p_i$ denotes the probability (or proportion of occurrences) of class $i$. One has $p_i \ge 0$ and $\sum_{i=1}^{k} p_i = 1$, and therefore $S$ is defined on a simplex. This index equals the probability that two elements taken at random from the population of interest belong to the same category or class. The value of Simpson's index ranges from $1/k$ to 1, with 1 representing no diversity; thus, the larger the value of $S$,
the lower the diversity. The name “Simpson index” stems from the influential 1949 paper by Edward Hugh Simpson entitled “Measurement of Diversity” [1], wherein he introduced what he called a measure of concentration defined in terms of population constants, with the minimum concentration equaling the maximum diversity. The Simpson index became a widely used quantitative metric in ecological and biodiversity studies as a tool for assessing and quantifying the diversity and evenness of species within ecological communities. It also applies to other biological problems, including those in the biomedical sciences, such as measuring diversity concerning immunity in response to viral infections (e.g., [2]).
However, it is also acknowledged that the original mathematical formulation was used in cryptanalysis by the American cryptanalysts William Friedman and Solomon Kullback as far back as the 1920s and 1930s, under the name probability of monographic coincidence (e.g., [3]). It is relevant to note that the Italian statistician Corrado Gini had already applied the quantity $1 - \sum_{i=1}^{k} f_i^2$ as early as 1912. He defined the index with relative frequencies $f_i$ computed from large samples, referring to it as an index of mutability for disconnected (qualitative) variables [4]. This quantity later became known as the “Gini-Simpson index”, a name adopted in the 1980s by the eminent statistician C. R. Rao (e.g., [5,6]), who restated it with probabilities as $GS = 1 - \sum_{i=1}^{k} p_i^2$. For instance, Jian et al. [7] consider a “Livelihood Simpson Index” which in fact is a Gini-Simpson index. Obviously, given a probability distribution $(p_1, \ldots, p_k)$, the Simpson index $S$ and the Gini-Simpson index $GS$ correspond to complementary events, verifying $S + GS = 1$.
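The complementarity between the two indices can be checked numerically. The following is a minimal sketch; the function names `simpson_index` and `gini_simpson_index` are illustrative choices, not names taken from the cited works, and the input is assumed to be a vector of proportions that sums to 1.

```python
from typing import Sequence


def simpson_index(p: Sequence[float]) -> float:
    """Return the sum of squared class proportions."""
    # The index is only defined on the simplex: p_i >= 0 and sum(p) == 1.
    if any(x < 0 for x in p) or abs(sum(p) - 1.0) > 1e-9:
        raise ValueError("p must be non-negative and sum to 1")
    return sum(x * x for x in p)


def gini_simpson_index(p: Sequence[float]) -> float:
    """Return the complement of the Simpson index."""
    return 1.0 - simpson_index(p)


if __name__ == "__main__":
    p = [0.5, 0.3, 0.2]
    # The two values add up to 1, up to floating-point rounding.
    print(simpson_index(p), gini_simpson_index(p))
```

For a uniform distribution over $k$ classes the first function returns $1/k$, the lower end of the range mentioned above.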
The use of a weighted version of the Simpson index appears to have been first reported in 1992, when Nowak and May [8] formulated the effective immune response against a virus population composed of different strains in the context of HIV infections, a formulation that was revisited two years later [9].
Some authors refer to the weighted Simpson index when they are actually dealing with the weighted Gini-Simpson index (e.g., [10,11]). With positive weights $w_i$, the weighted Gini-Simpson index is defined as $GS_w = \sum_{i=1}^{k} w_i p_i (1 - p_i)$, a concave function, differentiable in the interior of the simplex, with an identifiable maximum value. A method to determine the optimal point (maximizer) was framed in [12], based on the fact that one is dealing with feasibility values associated with the constraints of the simplex (namely, that the optimal coordinates must verify $p_i^{*} \ge 0$), which was not taken into account in [11] and may lead to miscalculations.
Yet, Kasulo and Perrings used a price-weighted Simpson index [13] to assess scenarios relating the diversity of catch in a multi-species fishery to profit-maximizing regimes. An inverse form of the weighted Simpson index can also be found in [14], which the authors clarify could be interpreted as a weighted version of Hill's number $N_2$, following the unifying notation proposed in [15]. Also, J. Ma used a symmetric form of the weighted Simpson index, building what he named a “comprehensive weighted Gini-Simpson index” [16].
In recent years, there has been a panoply of diversity, phylogenetic, and dissimilarity indices with a focus on biology (e.g., [17,18]), most of them associated with developments related to Rényi measures of entropy [19], including general reviews on the subject (e.g., [20]) and a plethora of recent applications (e.g., [21]). There are also many publications within the scope of diversity in the social sciences, a line of work that has a cornerstone in the reference work of Patil and Taillie [22] addressing linguistic diversity, industrial concentration and income inequality, with applications either in economics (e.g., [23]) or demographics, such as the weighted Rényi's entropy for lifetime distributions [24].
However, with regard to the weighted Simpson index, it seems that an analytical study addressing the optimal point (minimizer) and the optimal value (minimum) has not yet been published.
2. The weighted Simpson index
Herein, we define the weighted Simpson index, concerning a population distributed among $k$ categories or classes, as:
$$S_w = \sum_{i=1}^{k} w_i p_i^2, \tag{1}$$
where $p_i$ denotes the probability (or proportion of occurrences) of class $i$ and $w_i$ is a weight assigned to that class, altogether defining a vector of positive real values $w = (w_1, \ldots, w_k)$. For now, we do not impose any extra conditions on the weights, leaving this matter to be discussed later. Our current focus is on understanding the broader context.
Weights make it possible to account for various features of the classes. In the context of biodiversity, these features may be related, for example, to the environmental benefits, conservation importance, vulnerability or economic value of species, a subject that has been emphasized at least since the beginning of the 1980s, exemplified with biomass and other importance values [25]. The environmental benefits may include the ecological roles and contributions of a species to its ecosystem. Conservation importance adds further depth to the evaluation, addressing the urgency and priority assigned to preserving certain species to maintain ecosystem stability. Vulnerability is another important factor that weights help account for: vulnerable species (those in danger of going extinct) need to be given particular attention during the assessment process. Economic value represents yet another dimension where weights come into play. The assessment of species in terms of their economic contributions, whether through ecosystem services, medicinal properties, or commercial value, demands careful weighting to reflect their overall importance. Yet, one should be aware of the complexities and entanglements associated with a community structure when dealing with several trophic levels, where, in addition to competition, symbiosis and predator-prey interactions may occur, conveying a hierarchical structure in a networked ecological community (e.g., [26]).
2.1. Optimizing the weighted Simpson index
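The weighted Simpson index $S_w$ in (1) is defined on the simplex $\Delta = \{(p_1, \ldots, p_k) : p_i \ge 0, \ \sum_{i=1}^{k} p_i = 1\}$ and is a convex function. So, $S_w$ attains its (global) maximum at the boundary of $\Delta$ and its (global) minimum in the interior of $\Delta$. Clearly, the maximum is attained when all except one of the $p_i$ are zero, and therefore one has $\max S_w = \max\{w_1, \ldots, w_k\}$.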
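The minimum can be assessed with the method of Lagrange multipliers for finding the minima of $S_w$ subject to the equality constraint $\sum_{i=1}^{k} p_i = 1$. As $S_w$ is differentiable in the interior of the simplex, one can build the Lagrangian function
$$\mathcal{L}(p_1, \ldots, p_k, \lambda) = \sum_{i=1}^{k} w_i p_i^2 - \lambda \left( \sum_{i=1}^{k} p_i - 1 \right)$$
and find its extrema. Equating partial derivatives to zero, we get:
$$\frac{\partial \mathcal{L}}{\partial p_i} = 2 w_i p_i - \lambda = 0, \quad i = 1, \ldots, k; \qquad \frac{\partial \mathcal{L}}{\partial \lambda} = 1 - \sum_{i=1}^{k} p_i = 0. \tag{2}$$
From (2) we conclude that for any specific $j$ the relationship $w_i p_i = w_j p_j$, that is, $p_i = \frac{w_j}{w_i} p_j$, holds for every $i$. Now, using the constraint $\sum_{i=1}^{k} p_i = 1$ one has the following equivalence
$$\sum_{i=1}^{k} \frac{w_j}{w_i}\, p_j = 1 \iff p_j = \frac{1/w_j}{\sum_{i=1}^{k} 1/w_i}$$
and we get the optimal coordinates of the minimum point given by:
$$p_i^{*} = \frac{1/w_i}{\sum_{j=1}^{k} 1/w_j}, \quad i = 1, \ldots, k. \tag{3}$$
Also, using (3) one can evaluate the minimum value of the weighted Simpson index (1) as follows:
$$S_w^{*} = \sum_{i=1}^{k} w_i \left( p_i^{*} \right)^2 = \frac{\sum_{i=1}^{k} 1/w_i}{\left( \sum_{j=1}^{k} 1/w_j \right)^2} = \frac{1}{\sum_{j=1}^{k} 1/w_j}. \tag{4}$$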
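The closed-form results can be cross-checked numerically. The sketch below assumes the weighted index is the weighted sum of squared proportions with strictly positive weights; the function names and the example weights are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def weighted_simpson(p: Sequence[float], w: Sequence[float]) -> float:
    """Weighted sum of squared proportions, sum_i w_i * p_i**2."""
    return sum(wi * pi * pi for pi, wi in zip(p, w))


def minimizer(w: Sequence[float]) -> List[float]:
    """Optimal point: proportions proportional to the reciprocal weights."""
    if any(wi <= 0 for wi in w):
        raise ValueError("weights must be strictly positive")
    inv = [1.0 / wi for wi in w]
    total = sum(inv)
    return [x / total for x in inv]


def minimum_value(w: Sequence[float]) -> float:
    """Minimum of the index: reciprocal of the sum of reciprocal weights."""
    return 1.0 / sum(1.0 / wi for wi in w)


if __name__ == "__main__":
    w = [1.0, 2.0, 4.0]          # illustrative weights
    p_star = minimizer(w)        # equals (4/7, 2/7, 1/7) up to rounding
    # Evaluating the index at the optimal point reproduces the closed form.
    print(p_star, weighted_simpson(p_star, w), minimum_value(w))
```

With these weights every product $w_i p_i^{*}$ equals $4/7$, which is also the minimum value, and any other point of the simplex, such as $(0.5, 0.3, 0.2)$, gives a larger index.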
2.2. Some further comments on the minimizer
Note that the minimum value of the weighted Simpson index (4), $S_w^{*}$, is related to the harmonic mean of the weights, $H = k \big/ \sum_{i=1}^{k} 1/w_i$, by $S_w^{*} = H/k$. The name “harmonic mean” is said to have been proposed by Archytas and Hippasus and adopted by Nicomachus in accordance with the view of the geometrical harmony of the cube, because it has 12 edges, 8 vertices, and 6 faces, and 8 is the mean of 12 and 6 according to the theory of harmonics [27] (pp. 85-86), meaning the harmonic mean of those quantities. In general, the harmonic mean of $n$ nonzero numbers is the reciprocal of the arithmetic mean of the reciprocals of those numbers, and is appropriate for averaging rates over constant numerator units [28]. As a typical example, if a set of investments is made at different interest rates and all of them yield the same income, then the unique rate at which all of the capital tied up in those investments must be invested to produce the same revenue is equal to the harmonic mean of the individual rates [29] (p. 240).
The special case of all weights being equal to 1 leads to $p_i^{*} = 1/k$ and to the minimum value $S_w^{*} = 1/k$, as can be seen from expressions (3) and (4), and also because in this case $S_w = S$, with $S$ being the (unweighted) Simpson index, whose minimum is $1/k$.
Rewriting the optimal coordinates in (3) as
$$p_i^{*} = \frac{1}{1 + w_i \sum_{j \ne i} 1/w_j}$$
it is straightforward to conclude that the weights are driving forces of the values of the coordinates of the minimizer, operating reciprocally: when the weight attached to a specific class increases and all the others remain unchanged, the corresponding optimal coordinate decreases; and when the weight attached to another class increases, all the others remaining unchanged, the optimal coordinate of the original class increases.
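The reciprocal effect of the weights can be illustrated with a small numerical experiment. This is only a sketch, assuming the minimizer is proportional to the reciprocal weights; the helper name `minimizer` and the weight vectors are hypothetical.

```python
from typing import List, Sequence


def minimizer(w: Sequence[float]) -> List[float]:
    """Optimal point with coordinates proportional to 1 / w_i."""
    inv = [1.0 / wi for wi in w]
    total = sum(inv)
    return [x / total for x in inv]


if __name__ == "__main__":
    base = minimizer([1.0, 2.0, 4.0])
    # Double the weight of the first class and keep the others unchanged.
    bumped = minimizer([2.0, 2.0, 4.0])
    for before, after in zip(base, bumped):
        print(f"{before:.4f} -> {after:.4f}")
```

In this example the first optimal coordinate decreases while the other two increase, in line with the behavior described above.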
If one considers the valuation of the classes of a distribution in the usual sense of importance assessment, with an ordering of positive real numbers $v_i$, expecting that $v_i > v_j$ would promote the result $p_i^{*} > p_j^{*}$, then one should be aware that the weights associated with the optimal point (3) would not be the values $v_i$ and $v_j$ but could possibly be conceived as $w_i = 1/v_i$ and $w_j = 1/v_j$ instead.
2.3. Normalization
The use of normalized indices in applications is important for several reasons. Normalized indices provide a standardized scale, usually ranging from 0 to 1, or 0% to 100%, irrespective of the specific scale or magnitude of the data. This standardization allows for direct comparisons between different datasets or populations and remains consistent across different contexts and scales.
In the case of the weighted Simpson index this can be done in the classic way, defining the normalized weighted Simpson index as
$$S_w^{N} = \frac{S_w - S_w^{*}}{\max\{w_1, \ldots, w_k\} - S_w^{*}},$$
where $S_w^{*}$ is the minimum value of the index and $\max\{w_1, \ldots, w_k\}$ is its maximum value. However, this normalization eliminates the effect of the number of classes in the distribution (e.g., species in a community). For example, in the case of all weights equal to 1, the normalized weighted Simpson index of a population with $k$ species uniformly distributed is always 0 and thus independent of the number of species. The fact that normalized indices of diversity can be misleading has already been mentioned by several authors (e.g., [11]).
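A minimal sketch of this normalization follows, assuming the usual min-max rescaling between the minimum of the index (the reciprocal of the sum of reciprocal weights) and its maximum (the largest weight). The function names are hypothetical, and at least two classes are assumed so that the denominator is nonzero.

```python
from typing import Sequence


def weighted_simpson(p: Sequence[float], w: Sequence[float]) -> float:
    """Weighted sum of squared proportions, sum_i w_i * p_i**2."""
    return sum(wi * pi * pi for pi, wi in zip(p, w))


def normalized_weighted_simpson(p: Sequence[float], w: Sequence[float]) -> float:
    """Min-max rescaling of the weighted index to the interval [0, 1]."""
    if len(w) < 2:
        raise ValueError("at least two classes are required")
    lowest = 1.0 / sum(1.0 / wi for wi in w)   # minimum of the index
    highest = max(w)                           # maximum, reached at a vertex
    return (weighted_simpson(p, w) - lowest) / (highest - lowest)


if __name__ == "__main__":
    # With unit weights, uniform communities of any size score about 0.
    for k in (2, 5, 10):
        print(k, normalized_weighted_simpson([1.0 / k] * k, [1.0] * k))
```

The loop shows the loss of information discussed above: uniform communities with 2, 5 or 10 species are indistinguishable after normalization.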
In the case of a weighted index, it may be relevant to normalize the weights so that the index becomes dimensionless and independent of the order of magnitude of the weights. For the weighted Simpson index, this normalization can be done, for example, by imposing the condition $\sum_{i=1}^{k} w_i = 1$. This corresponds to dividing each non-normalized weight by the sum of all the weights. As the normalized weights remain positive for $i = 1, \ldots, k$, this normalization procedure does not affect the results of the previous sections; namely, for the weighted Simpson index with normalized weights, the optimal coordinates of the minimum point are as in (3) and the value of the minimum is as in (4).
2.4. The inverse problem
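The inverse procedure relative to the weighted Gini-Simpson index was formulated recently [30]. Now, we consider the analogous problem concerning the weighted Simpson index, stated as: given a minimum point of $S_w$ denoted by $p^{*} = (p_1^{*}, \ldots, p_k^{*})$, verifying both $p_i^{*} > 0$ and $\sum_{i=1}^{k} p_i^{*} = 1$, what would be a set of weights able to generate that solution? The answer follows straightforwardly: recalling (3), the weights must be chosen to be inversely proportional to the optimal coordinates $p_i^{*}$, with the proportionality constant equal to the minimum $S_w^{*}$ in (4), as we can see by rewriting:
$$p_i^{*} = \frac{1/w_i}{\sum_{j=1}^{k} 1/w_j} = \frac{S_w^{*}}{w_i} \iff w_i = \frac{S_w^{*}}{p_i^{*}}.$$
For non-normalized weights, there are infinitely many solutions to the inverse problem, parameterized by the minimum $S_w^{*}$. For normalized weights, using the condition $\sum_{i=1}^{k} w_i = 1$, one gets
$$S_w^{*} = \frac{1}{\sum_{j=1}^{k} 1/p_j^{*}}$$
and so the weights must be chosen as:
$$w_i = \frac{1/p_i^{*}}{\sum_{j=1}^{k} 1/p_j^{*}}, \quad i = 1, \ldots, k. \tag{5}$$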
It should be noted that from the condition that the sum of the weights equals 1 it follows that .
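The inverse problem with normalized weights can be sketched as a round trip: recover weights from a target minimum point, then confirm that those weights reproduce it. The function names and the target point are hypothetical, and the target is assumed to have strictly positive coordinates summing to 1.

```python
from typing import List, Sequence


def weights_from_minimizer(p_star: Sequence[float]) -> List[float]:
    """Normalized weights, proportional to the reciprocals of p_star."""
    if any(pi <= 0 for pi in p_star):
        raise ValueError("all target coordinates must be strictly positive")
    inv = [1.0 / pi for pi in p_star]
    total = sum(inv)
    return [x / total for x in inv]


def minimizer(w: Sequence[float]) -> List[float]:
    """Forward map: optimal point proportional to the reciprocal weights."""
    inv = [1.0 / wi for wi in w]
    total = sum(inv)
    return [x / total for x in inv]


if __name__ == "__main__":
    target = [0.5, 0.3, 0.2]                 # illustrative minimum point
    w = weights_from_minimizer(target)       # weights sum to 1
    # Applying the forward map recovers the target up to rounding.
    print(w, minimizer(w))
```

The two maps have the same form, which reflects the symmetry between weights and optimal coordinates when the weights are normalized.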
3. Final remarks
We have presented a detailed analytical study of the optimization problem associated with the weighted Simpson index. The core result is that at the optimal point one has $w_i p_i^{*} = w_j p_j^{*}$ for $i, j = 1, \ldots, k$, and also that for all $i$ one gets $w_i p_i^{*} = S_w^{*}$, the minimum value of the index. So, there is a trade-off between the weights and the optimal probabilities (or proportions of occurrences) in what could be seen as an equilibrium condition. The fact that Nowak [8,9] used the weighted Simpson index as a Lyapunov function to assess an antigenic diversity threshold seems compatible with an equilibrium point perspective, which could also be used in a broad sense concerning different problems within several scientific fields.
Furthermore, for a random variable $W$ with values corresponding to the previous weights $w_i$, and probability function given by $P(W = w_i) = p_i^{*}$, with $p_i^{*}$ defined as in (3), computing the mean value of $W$ entails $E(W) = \sum_{i=1}^{k} w_i p_i^{*} = k\, S_w^{*}$, which equals the harmonic mean of the weights, meaning $E(W) = H$.
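This identity is easy to verify numerically. The sketch below compares the mean of the weights under the optimal probabilities with `statistics.harmonic_mean` from the Python standard library; the helper names and the example weights are hypothetical.

```python
from statistics import harmonic_mean
from typing import List, Sequence


def minimizer(w: Sequence[float]) -> List[float]:
    """Optimal point with coordinates proportional to 1 / w_i."""
    inv = [1.0 / wi for wi in w]
    total = sum(inv)
    return [x / total for x in inv]


def mean_weight_at_minimizer(w: Sequence[float]) -> float:
    """Mean of the weights when weight w_i has probability p_i*."""
    return sum(wi * pi for wi, pi in zip(w, minimizer(w)))


if __name__ == "__main__":
    w = [1.0, 2.0, 4.0]
    # Both values equal 12/7 up to floating-point rounding.
    print(mean_weight_at_minimizer(w), harmonic_mean(w))
```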