This section provides the required background and introduces the notation used throughout this work.
Section 2.1 discusses the Blackwell order and its special case for binary targets, the zonogon order, which will be used for operational interpretations and for the representation of
f-information in its decomposition.
Section 2.2 discusses the PID framework of Williams and Beer [
1] and the relation between a decomposition based on the redundancy lattice and one based on the synergy lattice. We also demonstrate the unintuitive behavior of the original decomposition measure which will be resolved by our proposal in
Section 3.
Section 2.3 provides the definitions of
f-information, Rényi-information, and Bhattacharyya-information used in the later demonstration of transforming decomposition results between measures.
2.1. Blackwell and Zonogon Order
Definition 1 (Channel). A channel from T to S represents a garbling of the input variable T that results in the variable S. Within this work, we represent an information channel μ as a (row-)stochastic matrix, where each element is non-negative and all rows sum to one.
For the context of this work, we consider a variable
S to be the observation of the output of an information channel from the target variable
T, such that the corresponding channel can be obtained from their conditional probability distribution, as shown in Equation
1.
Notation 2 (Binary input channels). Throughout this work, we reserve the symbol κ for binary input channels, meaning κ denotes a stochastic matrix with two rows (one per target state). We use subscripts to indicate the columns of this matrix.
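To make the matrix representation concrete, the following minimal Python sketch constructs the channel of Definition 1 from a joint distribution of T and S; the joint distribution is illustrative and not taken from this work.

```python
import numpy as np

# Minimal sketch: build the channel kappa = P(S | T) from a joint distribution P(T, S).
# The joint distribution below is illustrative only.
p_ts = np.array([[0.30, 0.10, 0.10],   # P(T=t1, S=s1..s3)
                 [0.05, 0.25, 0.20]])  # P(T=t2, S=s1..s3)

p_t = p_ts.sum(axis=1, keepdims=True)  # marginal P(T)
kappa = p_ts / p_t                     # row-stochastic matrix P(S=s | T=t)

assert np.all(kappa >= 0) and np.allclose(kappa.sum(axis=1), 1.0)
print(kappa)  # one row per target state, one column per state of S
```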
Definition 2 (More informative [
17,
21]).
An information channel is more informative than another channel if, for any decision problem involving a set of actions and a reward function that depends on the chosen action and the state of the variable T, an agent with access to the former channel can always achieve an expected reward at least as high as an agent with access to the latter.
Definition 3 (Blackwell order [
17,
21]).
The Blackwell order is a preorder of channels. A channel is Blackwell superior to another channel if we can pass its output through a second channel λ to obtain a channel equivalent to the other, as shown in Equation 2.
Blackwell [
21] showed that a channel is more informative if and only if it is Blackwell superior. Bertschinger and Rauh [
17] showed that the Blackwell order does not form a lattice for channels with a non-binary input, since the ordering does not provide unique meet and join elements. However, binary target variables
are a special case where the Blackwell order is equivalent to the zonogon order (discussed next) and does form a lattice [
17].
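As a hedged illustration of Definition 3, the following sketch garbles the output of a binary input channel by a second stochastic matrix; the resulting channel is Blackwell inferior by construction. The matrices are illustrative and not taken from this work.

```python
import numpy as np

# Sketch of Definition 3: kappa_1 is Blackwell superior to kappa_2 if
# kappa_2 = kappa_1 @ lam for some row-stochastic garbling lam (Equation 2).
kappa_1 = np.array([[0.8, 0.2],
                    [0.1, 0.9]])

# lam garbles the output of kappa_1: each output symbol is kept or flipped with some probability.
lam = np.array([[0.9, 0.1],
                [0.3, 0.7]])

kappa_2 = kappa_1 @ lam  # the garbled (Blackwell-inferior) channel
assert np.allclose(kappa_2.sum(axis=1), 1.0)
print(kappa_2)
```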
Definition 4 (Zonogon [
17]).
The zonogon of a binary input channel κ is defined using the Minkowski sum of the collection of vector segments given by its columns, as shown in Equation 3. The zonogon can similarly be defined as the image of the unit cube under the linear map κ.
The zonogon is a centrally symmetric convex polygon, and the set of column vectors spans its perimeter.
Figure 2 shows an example of a binary input channel and its corresponding zonogon.
Definition 5 (Zonogon sum).
The addition of two zonogons corresponds to their Minkowski sum as shown in Equation 4.
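The boundary of a zonogon can be traced directly from the channel matrix. The following sketch, assuming the column-vector reading of Definition 4 and using illustrative values, sorts the columns by their likelihood ratio and accumulates them:

```python
import numpy as np

# Hedged sketch of Definition 4: the zonogon of a binary input channel kappa is the
# Minkowski sum of the segments spanned by the columns of kappa. Sorting the columns by
# their likelihood ratio and accumulating them traces one half of the boundary; the other
# half follows by central symmetry. Example values are illustrative.
kappa = np.array([[0.6, 0.3, 0.1],   # P(S=s_i | T=t_1)
                  [0.1, 0.3, 0.6]])  # P(S=s_i | T=t_2)

cols = kappa.T                                      # one 2D vector per state of S
order = np.argsort(-(cols[:, 0] / cols[:, 1]))      # decreasing likelihood ratio
boundary = np.vstack([np.zeros(2), np.cumsum(cols[order], axis=0)])

print(boundary)  # boundary vertices running from (0, 0) to (1, 1)
```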
Definition 6 (Zonogon order [
17]).
A zonogon is zonogon superior to another if and only if it contains the other as a subset.
Bertschinger and Rauh [
17] showed that for binary input channels, the zonogon order is equivalent to the Blackwell order and forms a lattice (Equation
5). In the remainder of this work, we will only discuss binary input channels, such that the orderings of Definitions 2, 3, and 6 are equivalent and can be thought of in terms of zonogons and their subset relation.
For obtaining an interpretation of what a channel zonogon represents, we can consider a binary decision problem in which we aim to predict the state of a binary target variable T using the output of the channel. Any decision strategy for obtaining a binary prediction can be fully characterized by its resulting pair of True-Positive Rate (TPR) and False-Positive Rate (FPR), as shown in Equation 6.
Therefore, a channel zonogon provides the set of all achievable (TPR,FPR)-pairs for a given channel [20,22]. This can also be seen from Equation 3, where the unit cube represents all possible first columns of the decision strategy. The first column fully determines the second since each row has to sum to one. As a result, the weighted sum of the columns of κ with the first column of the decision strategy provides its (TPR,FPR)-pair, and the definition of Equation 3 yields all achievable (TPR,FPR)-pairs for predicting the state of a binary target variable. Since this will be helpful for operational interpretations, we label the axes of zonogon plots accordingly, as shown in Figure 2. The zonogon ([17] p. 2480) is the Neyman-Pearson region ([7] p. 231).
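The following sketch illustrates this reading of Equation 6 under stated assumptions (an illustrative channel and decision strategy): the first column of the strategy weights the columns of κ and yields one point of the zonogon.

```python
import numpy as np

# Hedged sketch of Equation 6: a decision strategy lam (row-stochastic, mapping the
# observed state of S to a binary prediction) yields one (TPR, FPR)-pair, obtained by
# weighting the columns of kappa with the first column of lam. Values are illustrative.
kappa = np.array([[0.6, 0.3, 0.1],
                  [0.1, 0.3, 0.6]])

lam = np.array([[1.0, 0.0],   # predict the first target state whenever S = s1
                [0.5, 0.5],   # randomize for S = s2
                [0.0, 1.0]])  # never predict it for S = s3

tpr, fpr = kappa @ lam[:, 0]
print(tpr, fpr)  # a point inside (or on the boundary of) the zonogon of kappa
```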
Definition 7 (Neyman-Pearson region [
7] & decision regions).
The Neyman-Pearson region for a binary decision problem is the set of achievable (TPR,FPR)-pairs and can be visualized as shown in Figure 2. The Neyman-Pearson regions underlie the zonogon order, and their boundary can be obtained from the likelihood-ratio test. We refer to subsets of the Neyman-Pearson region as reachable decision regions, or simply decision regions, and to its boundary as the zonogon perimeter.
Remark 2. Due to the zonogon symmetry, the diagram labels can be swapped (FPR on the x-axis/TPR on the y-axis), which changes the interpretation to aiming the prediction at the other state of the target variable.
Notation 3 (Channel lattice). We use the notation for the meet element of binary input channels under the Blackwell order and for their join element. We use the notation for the top element of binary input channels under the Blackwell order and for the bottom element.
For binary input channels, the meet element of the Blackwell order corresponds to the zonogon intersection, and the join element of the Blackwell order corresponds to the convex hull of their union. Equation
7 describes this for an arbitrary number of channels.
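A minimal sketch of the join operation of Equation 7 for two binary input channels, assuming SciPy is available for the convex hull; the channels are illustrative, and the meet (zonogon intersection) is not computed here.

```python
import numpy as np
from scipy.spatial import ConvexHull

def zonogon_vertices(kappa):
    """All vertices of the zonogon of a binary input channel (rows sum to one)."""
    cols = kappa.T
    order = np.argsort(-(cols[:, 0] / (cols[:, 1] + 1e-12)))
    chain = np.vstack([np.zeros(2), np.cumsum(cols[order], axis=0)])
    return np.vstack([chain, np.array([1.0, 1.0]) - chain])  # other half by symmetry

# Illustrative channels (not taken from the paper).
kappa_a = np.array([[0.9, 0.1], [0.3, 0.7]])
kappa_b = np.array([[0.7, 0.3], [0.05, 0.95]])

# Equation 7: the join corresponds to the convex hull of the union of the two zonogons.
points = np.vstack([zonogon_vertices(kappa_a), zonogon_vertices(kappa_b)])
hull = ConvexHull(points)
print(points[hull.vertices])  # vertices of the join zonogon
```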
Example 1. The remaining work only analyzes indicator variables, so we only need to consider the case where all presented ordering relations of this section are equivalent and form a lattice.
Figure 3a visualizes a channel with three output states. We can use the observations of S for making a prediction about T. For example, we predict that T is in its first state with a certain probability for each observed state of S. This randomized decision strategy can be written as the stochastic matrix λ shown in Figure 3a. The resulting TPR and FPR of this decision strategy are obtained from the weighted sum of these parameters with the column vectors of κ. Each decision strategy corresponds to a point within the zonogon, since the probabilities are constrained to the interval [0, 1] and the resulting zonogon is the Neyman-Pearson region.
Figure 3b visualizes an example for the discussed ordering relations, where all observable variables have two states. The zonogon/Neyman-Pearson region corresponding to one of the variables is fully contained within the other two. Therefore, we can say that this variable is Blackwell inferior (Definition 3) and less informative (Definition 2) than the other two about T. Practically, this means that we can construct an equivalent variable to it by garbling either of the other two, and that for any sequence of actions based on it and any reward function with dependence on T, we can achieve an expected reward at least as high by acting based on one of the other two instead. The remaining two variables are incomparable from the zonogon order, Blackwell order, and informativity order, since the Neyman-Pearson region of one is not fully contained in the other.
The zonogon shown in Figure 3a corresponds to the join, under the zonogon order, Blackwell order, and informativity order, of the two incomparable variables of Figure 3b about T. For binary targets, this join can be obtained directly from the convex hull of their Neyman-Pearson regions and corresponds to a valid joint distribution of the two variables. All other joint distributions are either equivalent or superior to it. When doing this on indicator variables of a non-binary target, the joint distributions obtained for each target state may not combine into a single valid overall joint distribution.
2.2. Partial Information Decomposition
The commonly used framework for PIDs was introduced by Williams and Beer [
1]. A PID is computed with respect to a particular random variable about which we would like to obtain information, called the target, and aims to identify from which of the variables we have access to, called the visible variables, this information can be obtained. Therefore, this section considers sets of variables, where each set represents the joint distribution of its members.
Notation 4. Throughout this work, we use the notation T for the target variable and for the set of visible variables. We use the notation for the power set of , and for its power set without the empty set.
The filter used for obtaining the set of atoms (Equation
8) removes sets that would be equivalent to other elements. This is required for obtaining a lattice from the following two ordering relations:
Definition 9 (Redundancy-/Gain-lattice [
1]).
The redundancy lattice is obtained by applying the ordering relation of Equation 9 to all atoms .
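For readers who prefer an algorithmic view, the following sketch enumerates the atoms produced by the filter of Equation 8 and implements the ordering relation of Equation 9; the encoding of sources as sets of variable indices is an assumption of this sketch.

```python
from itertools import chain, combinations

def nonempty_subsets(n):
    """All non-empty subsets of the visible-variable indices {1, ..., n} as frozensets."""
    idx = range(1, n + 1)
    return [frozenset(c) for c in
            chain.from_iterable(combinations(idx, r) for r in range(1, n + 1))]

def atoms(n):
    """Antichain filter (Equation 8): keep only collections of sources in which
    no source is a (proper) superset of another source."""
    sources = nonempty_subsets(n)
    collections = chain.from_iterable(combinations(sources, r)
                                      for r in range(1, len(sources) + 1))
    return [set(c) for c in collections
            if not any(a < b for a in c for b in c)]

def redundancy_leq(alpha, beta):
    """Ordering of the redundancy lattice (Equation 9): alpha precedes beta iff
    every source of beta contains some source of alpha."""
    return all(any(a <= b for a in alpha) for b in beta)

print(len(atoms(3)))  # 18 atoms for three visible variables
```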
The redundancy lattice for three visible variables is visualized in
Figure 4a. On this lattice, we can think of an atom as representing the information that can be obtained from all of its sources about the target
T (their redundancy or informational intersection). For example, an atom with two sources represents on the redundancy lattice the information that is contained in both of these sources about
T. If both sources of such an atom provide the information of another atom, then their redundancy contains at least this information, and that other atom is considered its predecessor. Therefore, the ordering indicates an informational subset relation for the redundancy of atoms, and the information that is represented by an atom increases as we move up. The up-set of an atom on the redundancy lattice indicates the information that is lost when losing all of its sources. Considering the example from above, if we lose access to both sources of the atom, then we lose access to all atoms in its up-set.
Definition 10 (Synergy-/Loss-lattice [
23]).
The synergy lattice is obtained by applying the ordering relation of Equation 10 to all atoms .
The synergy lattice for three visible variables is visualized in
Figure 4b. On this lattice, we can think of an atom as representing the information that is contained in none of its sources (information outside their union). For example, an atom with two sources represents on the synergy lattice the information about
T that is obtained from neither of these two sources. The ordering again indicates the expected subset relation: the information that is obtained from neither of two sources is fully contained in the information that cannot be obtained from one of them alone, and thus the atom containing both sources is a predecessor of the atom containing only the single source.
With an intuition for both ordering relations in mind, we can see how the filter in the construction of atoms (Equation
8) removes sets that would be equivalent to another atom: a set that contains both a source and a superset of this source is removed from the power set of sources, since it would be equivalent to the atom containing only the smaller source under the ordering of the redundancy lattice and equivalent to the atom containing only the larger source under the ordering of the synergy lattice.
Notation 5 (Redundancy/Synergy lattices). We use the notation for the join and meet operators on the redundancy lattice, and for the join and meet operators on the synergy lattice. We use the notation for the top and for the bottom atom on the redundancy lattice, and and for the top and bottom atom on the synergy lattice. For an atom α, we use the notation for its down-set, for its strict down-set, and for its cover set. These definitions will only appear in the Möbius inverse of a function that is directly associated with either the synergy or redundancy lattice such that there is no ambiguity about which ordering relation has to be considered.
The redundant, unique, or synergetic information (partial contributions) can be calculated based on either lattice. They are obtained by quantifying each atom of the redundancy or synergy lattice with a cumulative measure that increases as we move up in the lattice. The partial contributions are then obtained in a second step from a Möbius inverse.
Definition 11 ([Cumulative] redundancy measure [
1]).
A redundancy measure is a function that assigns a real value to each atom of the redundancy lattice. It is interpreted as a cumulative information measure that quantifies the redundancy between all sources of an atom about the target T.
Definition 12 ([Cumulative] loss measure [
23]).
A loss measure is a function that assigns a real value to each atom of the synergy lattice. It is interpreted as a cumulative measure that quantifies the information about T that is provided by none of the sources of an atom.
To ensure that a redundancy measure actually captures the desired concept of redundancy, Williams and Beer [
1] defined three axioms that a measure
should satisfy. For the synergy lattice, we consider the equivalent axioms discussed by Chicharro and Panzeri [
23]:
Axiom 1 (Commutativity [1,23])
.
Invariance in the order of sources (permuting the order of indices):
Axiom 2 (Monotonicity [1,23])
.
Additional sources can only decrease redundant information (redundancy lattice). Additional sources can only decrease the information that is contained in none of the sources (synergy lattice).
Axiom 3 (Self-redundancy [1,23])
.
For a single source, redundancy equals mutual information. For a single source, the information loss equals the difference between the total available mutual information and the mutual information of the considered source with the target.
The first axiom states that an atom’s redundancy and information loss should not depend on the order of its sources. The second axiom states that adding sources to an atom can only decrease the redundancy of all its sources (redundancy lattice) and the information obtained from none of its sources (synergy lattice). The third axiom ties the measures to mutual information and ensures that the bottom element of both lattices is quantified to zero.
Once a lattice with a corresponding cumulative measure is defined, we can use the Möbius inverse to compute the partial contribution of each atom. This partial information can be visualized as a partial area in a Venn diagram (see
Figure 1a) and corresponds to the desired redundant, unique, and synergetic contributions. However, the same atom represents different partial contributions on each lattice: as visualized for the case of two visible variables in
Figure 1, the unique information of a variable is represented by one atom on the redundancy lattice and by a different atom on the synergy lattice.
Definition 13 (Partial information [
1,
23]).
Partial information corresponds to the Möbius inverse of the corresponding cumulative measure on the respective lattice.
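A minimal sketch of Definition 13, assuming a generic ordering relation is supplied; the two-variable redundancy lattice and the cumulative values at the end are purely illustrative.

```python
def moebius_inverse(cumulative, lattice_leq, atoms):
    """Sketch of Definition 13: recover partial contributions pi(alpha) by subtracting
    the contributions of the strict down-set,
    pi(alpha) = cumulative(alpha) - sum of pi(beta) for all beta strictly below alpha."""
    partial = {}
    remaining = list(atoms)
    while remaining:
        # Process an atom once all of its strict predecessors have been processed.
        for alpha in remaining:
            below = [b for b in atoms if lattice_leq(b, alpha) and b != alpha]
            if all(b in partial for b in below):
                partial[alpha] = cumulative[alpha] - sum(partial[b] for b in below)
                remaining.remove(alpha)
                break
    return partial

# Illustrative redundancy lattice for two visible variables with made-up cumulative values.
atoms = ["{1}{2}", "{1}", "{2}", "{12}"]
leq_pairs = {("{1}{2}", "{1}"), ("{1}{2}", "{2}"), ("{1}{2}", "{12}"),
             ("{1}", "{12}"), ("{2}", "{12}")}
leq = lambda a, b: a == b or (a, b) in leq_pairs
cumulative = {"{1}{2}": 0.2, "{1}": 0.5, "{2}": 0.4, "{12}": 1.0}
print(moebius_inverse(cumulative, leq, atoms))
# redundancy 0.2, unique contributions 0.3 and 0.2, synergy 0.3
```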
Remark 3. Using the Möbius inverse for defining partial information enforces an inclusion-exclusion relation in that all partial information contributions have to sum to the corresponding cumulative measure. Kolchinsky [16] argues that an inclusion-exclusion relation should not be expected to hold for PIDs and proposes an alternative decomposition framework. In that case, the partial contributions (unique/redundant/synergetic information) are no longer expected to sum to the total amount of available information.
Property 1 (Local positivity, non-negativity [
1]).
A partial information decomposition satisfies non-negativity or local positivity if its partial information contributions are always non-negative, as shown in Equation 12.
The non-negativity property is important if we assume an inclusion-exclusion relation since it states that the unique, redundant, or synergetic information cannot be negative. If an atom
provides a negative partial contribution in the framework of Williams and Beer [
1], then this may indicate that we over-counted some information in its down-set.
Remark 4. Several additional axioms and properties have been suggested since the original proposal of Williams and Beer [1], such as target monotonicity and target chain rule [4]. However, this work will only consider the axioms and properties of Williams and Beer [1]. To the best of our knowledge, no other measure since the original proposal (discussed below) has been able to satisfy these properties for an arbitrary number of visible variables while ensuring an inclusion-exclusion relation for their partial contributions.
It is possible to convert between both representations due to a lattice duality:
Definition 14 (Lattice duality and dual decompositions [
23]).
Let a redundancy lattice with an associated measure and a synergy lattice with an associated measure be given; then, the two decompositions are said to be dual if and only if the down-set on one lattice corresponds to the up-set on the other, as shown in Equation 14.
Williams and Beer [
1] proposed the measure I_min to be used as a measure of redundancy and demonstrated that it satisfies the three required axioms and local positivity. They define redundancy as the expected value of the minimum specific information (Equation
14a).
Remark 5. Throughout this work, we use the term 'target pointwise information' or simply 'pointwise information' to refer to 'specific information'. This avoids confusion when naming the corresponding binary input channels in Section 3.
To the best of our knowledge, this measure is the only existing non-negative decomposition that satisfies all three axioms listed above for an arbitrary number of visible variables while providing an inclusion-exclusion relation of partial information.
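The following sketch illustrates, under stated assumptions, the structure of this measure: the specific (pointwise) information of each source for a fixed target state, followed by the expected minimum over sources. The joint distributions are illustrative, and the formula follows the standard form of specific information rather than the exact notation of Equation 14a.

```python
import numpy as np

def specific_information(p_ta, t):
    """Pointwise information I(T=t; A) = sum_a P(a|t) * log2( P(t|a) / P(t) ),
    where p_ta[t, a] = P(T=t, A=a) for a single source A."""
    p_t = p_ta.sum(axis=1)[t]
    p_a = p_ta.sum(axis=0)
    p_a_given_t = p_ta[t] / p_ta[t].sum()
    p_t_given_a = p_ta[t] / p_a
    mask = p_a_given_t > 0
    return np.sum(p_a_given_t[mask] * np.log2(p_t_given_a[mask] / p_t))

def i_min(p_t, sources):
    """I_min: for each target state, take the minimum specific information over the sources."""
    return sum(p_t[t] * min(specific_information(p_ta, t) for p_ta in sources)
               for t in range(len(p_t)))

# Two illustrative sources observing a uniform binary target.
p_ta1 = np.array([[0.4, 0.1], [0.1, 0.4]])      # joint P(T, A1)
p_ta2 = np.array([[0.25, 0.25], [0.05, 0.45]])  # joint P(T, A2)
print(i_min(np.array([0.5, 0.5]), [p_ta1, p_ta2]))
```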
However, the measure I_min could be criticized for not providing a notion of distinct information due to its use of a pointwise minimum (for each target state) over the sources. This leads to the question of distinguishing "the
same information and the
same amount of information" [
3,
4,
5,
6]. We can use the definition through a pointwise minimum to construct examples of unexpected behavior: consider, for example, a uniform binary target variable
T and two visible variables given as the outputs of the channels visualized in
Figure 5. The channels are constructed to be equivalent for both target states and to provide access to distinct decision regions while ensuring a constant pointwise information. Even though our ability to predict the target variable significantly depends on which of the two indicated channel outputs we observe (blue or green in
Figure 5, incomparable informativity based on Definition 2), the measure I_min concludes full redundancy between them. We consider this behavior undesirable and, as discussed in the literature, caused by an underlying failure to distinguish the
same information. To resolve this issue, we will present a representation of
f-information in
Section 3.1, which allows the use of all (TPR,FPR)-pairs for each state of the target variable to represent a distinct notion of uncertainty.
2.3. Information Measures
This section discusses two generalizations of mutual information for discrete random variables based on
f-divergences and Rényi divergences [
24,
25]. While mutual information has interpretational significance in channel coding and data compression, other
f-divergences are significant in parameter estimation, high-dimensional statistics, and hypothesis testing ([
7], p. 88), while Rényi divergences appear, among other areas, in privacy analysis [
8]. Finally, we introduce Bhattacharyya-information for demonstrating that it is possible to chain decomposition transformations in
Section 3.3. All definitions in this section only consider the case of discrete random variables (which is what we need for the context of this work).
Definition 15 (
f-divergence [
24]).
Let f be a function that satisfies the following three properties.
By convention, we adopt the standard continuity conventions for evaluating the expressions f(0) and 0 · f(0/0). For any such function f and two discrete probability distributions P and Q over the same event space, the f-divergence for discrete random variables is defined as shown in Equation 15.
Notation 6. Throughout this work, we reserve the name f for functions that satisfy the required properties for an f-divergence of Definition 15.
An
f-divergence quantifies a notion of dissimilarity between two probability distributions
P and
Q. Key properties of
f-divergences are their non-negativity, their invariance under bijective transformations, and the fact that they satisfy a data-processing inequality ([
7], p. 89). A list of commonly used
f-divergences is shown in
Table 1. Notably, the continuation for α → 1 of both the Hellinger- and the α-divergence results in the KL-divergence [
26].
The generator function of an
f-divergence is not unique, since f(x) and f(x) + c · (x − 1) generate the same divergence for a real constant c ([
7], p. 90f). As a result, the considered α-divergence is a linear scaling of the Hellinger divergence of the same order, as shown in Equation
16.
Definition 16 (
f-information [
7]).
An f-information is defined as the f-divergence between the joint distribution of two discrete random variables and the product of their marginals, as shown in Equation 17.
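As a hedged sketch of Definitions 15 and 16, the following code evaluates an f-information as the f-divergence between the joint distribution and the product of its marginals; the generators and the joint distribution are illustrative, and logarithms are taken in bits.

```python
import numpy as np

# Two standard generators: KL (in bits) and total variation. Joint distribution is illustrative.
f_kl = lambda x: x * np.log2(x) if x > 0 else 0.0
f_tv = lambda x: 0.5 * abs(x - 1)

def f_information(p_ts, f):
    """Equation 17 (sketch): D_f between the joint distribution and the product of marginals."""
    p_t = p_ts.sum(axis=1, keepdims=True)
    p_s = p_ts.sum(axis=0, keepdims=True)
    q = p_t * p_s  # product of marginals
    return sum(q_i * f(p_i / q_i) for p_i, q_i in zip(p_ts.ravel(), q.ravel()) if q_i > 0)

p_ts = np.array([[0.30, 0.10, 0.10],
                 [0.05, 0.25, 0.20]])
print(f_information(p_ts, f_kl))  # mutual information I(T; S)
print(f_information(p_ts, f_tv))  # total-variation information
```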
Definition 17 (f-entropy). A notion of f-entropy for a discrete random variable is obtained from its self-information, that is, the f-information of the variable with itself.
Notation 7. Using the KL-divergence results in the definitions of mutual information and Shannon entropy. Therefore, we use the standard notation for mutual information (KL-information) and for the Shannon entropy (KL-entropy).
The remaining part of this section will define Rényi- and Bhattacharyya-information to highlight that they can be represented as an invertible transformation of Hellinger-information. This will be used in
Section 3.3 to transform the decomposition of Hellinger-information to a decomposition of Rényi- and Bhattacharyya-information.
Remark 6. We could similarly choose to represent Rényi divergence as a transformation of the α-divergence. A linear scaling of the considered f-divergence will, however, not affect our later results (see Section 3.3).
Definition 18 (Rényi divergence [
25]).
Let P and Q be two discrete probability distributions over the same event space; then, Rényi divergence is defined as shown in Equation 18 for positive orders α ≠ 1 and is extended to the remaining orders by continuation.
Notably, the continuation of Rényi divergence for α → 1 also equals the KL-divergence ([
7], p. 116). Rényi divergence can be expressed as an invertible transformation of the Hellinger divergence of the same order (see Equation
18) [
26].
Definition 19 (Rényi-information [
7]).
Rényi-information is defined analogously to f-information, as shown in Equation 19, and corresponds to an invertible transformation of Hellinger-information of the same order.
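The stated relation can be checked numerically. The following sketch compares a directly computed Rényi divergence against the value obtained by transforming the Hellinger divergence of the same order; the distributions and the order are illustrative, and logarithms are in bits.

```python
import numpy as np

# Hedged numerical check of the relation between the Hellinger divergence of order alpha
# and the Rényi divergence of the same order. Distributions are illustrative.
alpha = 0.5
p = np.array([0.7, 0.2, 0.1])
q = np.array([0.4, 0.4, 0.2])

power_sum = np.sum(p**alpha * q**(1 - alpha))
hellinger = (power_sum - 1) / (alpha - 1)                        # Hellinger divergence
renyi_direct = np.log2(power_sum) / (alpha - 1)                  # Rényi divergence
renyi_from_hellinger = np.log2(1 + (alpha - 1) * hellinger) / (alpha - 1)

print(np.isclose(renyi_direct, renyi_from_hellinger))  # True
```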
Finally, we consider the Bhattacharyya distance (Definition 20), which is equivalent to a linear scaling of a special case of Rényi divergence (Equation
20) [
26]. It is applied, among other areas, in signal processing [
27] and coding theory [
28]. The corresponding information measure (Equation
21) is, like the distance itself, a scaling of a special case of Rényi-information.
Definition 20 (Bhattacharyya distance [
29]).
Let P and Q be two discrete probability distributions over the same event space; then, the Bhattacharyya distance is defined as shown in Equation 20.
Definition 21 (Bhattacharyya-information).
Bhattacharyya-information is defined analogously to f-information, as shown in Equation 21.
Example 2.
Consider a channel from a binary target variable T to a source variable S. While this will be discussed in more detail in Section 3.1, Equation 22 already indicates that f-information can be interpreted as the expected value of quantifying the boundary of the Neyman-Pearson region for an indicator variable of each target state t. Each state of a source variable corresponds to one side/edge of this boundary, as discussed in Section 2.1 and visualized in Figure 2. Therefore, the sum over the states of S corresponds to the sum of quantifying each edge of the zonogon by some function, which is parameterized only by the distribution of the indicator variable for t. This function satisfies a triangle inequality (Corollary A1), and the total boundary is non-negative (Theorem 2, discussed later). Therefore, to give an oversimplified intuition, we can vaguely think of pointwise f-information as quantifying the length of the boundary of the Neyman-Pearson region, or zonogon perimeter.
Below is a step-wise computation of an f-information measure on a small example from this interpretation for the setting introduced above.
Since the target is binary, we compute the pointwise information for two indicator variables as shown in Figure 6. Since each state of S corresponds to one edge of the zonogon, we quantify the edges individually. Notice that the quantification of each vector can be expressed as a function that is parameterized only by the distribution of the indicator variable. The total zonogon perimeter is quantified as the sum over its edges, which equals the pointwise information. In this particular case, we obtain one value for the total boundary on the indicator of the first target state and another for the indicator of the second. The expected information corresponds to the expected value of these pointwise quantifications and provides the final result (Equation 24).
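The interpretation above can also be sketched generically: the following code computes, for each target state, the pointwise f-information as a sum over the states of S (one term per zonogon edge) and takes the expectation over target states. The generator and the joint distribution are illustrative, and the code operates on the joint distribution rather than on the zonogon edges directly.

```python
import numpy as np

f_chi2 = lambda x: (x - 1) ** 2  # chi-squared generator (illustrative choice)

def pointwise_f_information(p_ts, f, t):
    """Pointwise term for target state t: sum_s P(s) * f( P(s|t) / P(s) ),
    where each state s of the source contributes one term (one zonogon edge)."""
    p_s = p_ts.sum(axis=0)
    p_s_given_t = p_ts[t] / p_ts[t].sum()
    return np.sum(p_s * np.array([f(a / b) for a, b in zip(p_s_given_t, p_s)]))

p_ts = np.array([[0.30, 0.10, 0.10],
                 [0.05, 0.25, 0.20]])
p_t = p_ts.sum(axis=1)

pointwise = [pointwise_f_information(p_ts, f_chi2, t) for t in range(len(p_t))]
total = float(np.dot(p_t, pointwise))  # expected value over target states
print(pointwise, total)                # the total equals the f-information of Equation 17
```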