This section provides the required background and introduces the notation used throughout this work.
Section 2.1 discusses the Blackwell order and its special case for binary targets, the zonogon order, which will be used for operational interpretations and for the representation of
f-information in its decomposition.
Section 2.2 discusses the PID framework of Williams and Beer [
1] and the relation between a decomposition based on the redundancy lattice and one based on the synergy lattice. We also demonstrate the unintuitive behavior of the original decomposition measure which will be resolved by our proposal in
Section 3.
Section 2.3 provides the definitions of
f-information, Rényi-information, and Bhattacharyya-information used in the later demonstration of transforming decomposition results between measures.
2.1. Blackwell and Zonogon Order
Definition 1 (Channel). A channel from T to S represents a garbling of the input variable T that results in the variable S. Within this work, we represent an information channel μ as a (row-)stochastic matrix, where each element is non-negative and all rows sum to one.
For the context of this work, we consider a variable
S to be the observation of the output of an information channel from the target variable
T, such that the corresponding channel can be obtained from their conditional probability distribution, as shown in Equation
1.
Notation 2 (Binary input channels). Throughout this work, we reserve the symbol κ for binary input channels, meaning κ denotes a stochastic matrix with two rows (one per target state). We use subscripts to indicate the columns of this matrix.
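To make the matrix representation concrete, the following minimal Python sketch constructs the channel of Definition 1 from a joint distribution of T and S; the joint distribution is illustrative and not taken from this work.

```python
import numpy as np

# Minimal sketch: build the channel kappa = P(S | T) from a joint distribution P(T, S).
# The joint distribution below is illustrative only.
p_ts = np.array([[0.30, 0.10, 0.10],   # P(T=t1, S=s1..s3)
                 [0.05, 0.25, 0.20]])  # P(T=t2, S=s1..s3)

p_t = p_ts.sum(axis=1, keepdims=True)  # marginal P(T)
kappa = p_ts / p_t                     # row-stochastic matrix P(S=s | T=t)

assert np.all(kappa >= 0) and np.allclose(kappa.sum(axis=1), 1.0)
print(kappa)  # one row per target state, one column per state of S
```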
Definition 2 (More informative [
17,
21]).
An information channel is more informative than another channel if, for any decision problem involving a set of actions and a reward function that depends on the chosen action and the state of the variable T, an agent with access to the former channel can always achieve an expected reward at least as high as an agent with access to the latter.
Definition 3 (Blackwell order [
17,
21]).
The Blackwell order is a preorder of channels. A channel is Blackwell superior to another channel if we can pass its output through a second channel λ to obtain a channel equivalent to the other, as shown in Equation 2.
Blackwell [
21] showed that a channel is more informative if and only if it is Blackwell superior. Bertschinger and Rauh [
17] showed that the Blackwell order does not form a lattice for channels with a non-binary input, since the ordering does not provide unique meet and join elements. However, binary target variables
are a special case where the Blackwell order is equivalent to the zonogon order (discussed next) and does form a lattice [
17].
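As a hedged illustration of Definition 3, the following sketch garbles the output of a binary input channel by a second stochastic matrix; the resulting channel is Blackwell inferior by construction. The matrices are illustrative and not taken from this work.

```python
import numpy as np

# Sketch of Definition 3: kappa_1 is Blackwell superior to kappa_2 if
# kappa_2 = kappa_1 @ lam for some row-stochastic garbling lam (Equation 2).
kappa_1 = np.array([[0.8, 0.2],
                    [0.1, 0.9]])

# lam garbles the output of kappa_1: each output symbol is kept or flipped with some probability.
lam = np.array([[0.9, 0.1],
                [0.3, 0.7]])

kappa_2 = kappa_1 @ lam  # the garbled (Blackwell-inferior) channel
assert np.allclose(kappa_2.sum(axis=1), 1.0)
print(kappa_2)
```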
Definition 4 (Zonogon [
17]).
The zonogon of a binary input channel κ is defined using the Minkowski sum of the collection of vector segments given by its columns, as shown in Equation 3. The zonogon can similarly be defined as the image of the unit cube under the linear map κ.
The zonogon is a centrally symmetric convex polygon, and the set of column vectors spans its perimeter.
Figure 2 shows an example of a binary input channel and its corresponding zonogon.
Definition 5 (Zonogon sum).
The addition of two zonogons corresponds to their Minkowski sum as shown in Equation 4.
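The boundary of a zonogon can be traced directly from the channel matrix. The following sketch, assuming the column-vector reading of Definition 4 and using illustrative values, sorts the columns by their likelihood ratio and accumulates them:

```python
import numpy as np

# Hedged sketch of Definition 4: the zonogon of a binary input channel kappa is the
# Minkowski sum of the segments spanned by the columns of kappa. Sorting the columns by
# their likelihood ratio and accumulating them traces one half of the boundary; the other
# half follows by central symmetry. Example values are illustrative.
kappa = np.array([[0.6, 0.3, 0.1],   # P(S=s_i | T=t_1)
                  [0.1, 0.3, 0.6]])  # P(S=s_i | T=t_2)

cols = kappa.T                                      # one 2D vector per state of S
order = np.argsort(-(cols[:, 0] / cols[:, 1]))      # decreasing likelihood ratio
boundary = np.vstack([np.zeros(2), np.cumsum(cols[order], axis=0)])

print(boundary)  # boundary vertices running from (0, 0) to (1, 1)
```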
Definition 6 (Zonogon order [
17]).
A zonogon is zonogon superior to another if and only if it contains the other as a subset.
Bertschinger and Rauh [
17] showed that for binary input channels, the zonogon order is equivalent to the Blackwell order and forms a lattice (Equation
5). In the remainder of this work, we will only discuss binary input channels, such that the orderings of Definitions 2, 3, and 6 are equivalent and can be thought of in terms of zonogons and their subset relation.
For obtaining an interpretation of what a channel zonogon represents, we can consider a binary decision problem in which we aim to predict the state of a binary target variable T using the output of the channel. Any decision strategy for obtaining a binary prediction can be fully characterized by its resulting pair of True-Positive Rate (TPR) and False-Positive Rate (FPR), as shown in Equation 6.
Therefore, a channel zonogon provides the set of all achievable (TPR,FPR)-pairs for a given channel [20,22]. This can also be seen from Equation 3, where the unit cube represents all possible first columns of the decision strategy. The first column fully determines the second since each row has to sum to one. As a result, the weighted sum of the columns of κ with the first column of the decision strategy provides its (TPR,FPR)-pair, and the definition of Equation 3 yields all achievable (TPR,FPR)-pairs for predicting the state of a binary target variable. Since this will be helpful for operational interpretations, we label the axes of zonogon plots accordingly, as shown in Figure 2. The zonogon ([17] p. 2480) is the Neyman-Pearson region ([7] p. 231).
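The following sketch illustrates this reading of Equation 6 under stated assumptions (an illustrative channel and decision strategy): the first column of the strategy weights the columns of κ and yields one point of the zonogon.

```python
import numpy as np

# Hedged sketch of Equation 6: a decision strategy lam (row-stochastic, mapping the
# observed state of S to a binary prediction) yields one (TPR, FPR)-pair, obtained by
# weighting the columns of kappa with the first column of lam. Values are illustrative.
kappa = np.array([[0.6, 0.3, 0.1],
                  [0.1, 0.3, 0.6]])

lam = np.array([[1.0, 0.0],   # predict the first target state whenever S = s1
                [0.5, 0.5],   # randomize for S = s2
                [0.0, 1.0]])  # never predict it for S = s3

tpr, fpr = kappa @ lam[:, 0]
print(tpr, fpr)  # a point inside (or on the boundary of) the zonogon of kappa
```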
Definition 7 (Neyman-Pearson region [
7] & decision regions).
The Neyman-Pearson region for a binary decision problem is the set of achievable (TPR,FPR)-pairs and can be visualized as shown in Figure 2. The Neyman-Pearson regions underlie the zonogon order, and their boundary can be obtained from the likelihood-ratio test. We refer to subsets of the Neyman-Pearson region as reachable decision regions, or simply decision regions, and to its boundary as the zonogon perimeter.
Remark 2. Due to the zonogon symmetry, the diagram labels can be swapped (FPR on the x-axis/TPR on the y-axis), which changes the interpretation to aiming the prediction at the other state of the target variable.
Notation 3 (Channel lattice). We use the notation for the meet element of binary input channels under the Blackwell order and for their join element. We use the notation for the top element of binary input channels under the Blackwell order and for the bottom element.
For binary input channels, the meet element of the Blackwell order corresponds to the zonogon intersection, and the join element of the Blackwell order corresponds to the convex hull of their union. Equation
7 describes this for an arbitrary number of channels.
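A minimal sketch of the join operation of Equation 7 for two binary input channels, assuming SciPy is available for the convex hull; the channels are illustrative, and the meet (zonogon intersection) is not computed here.

```python
import numpy as np
from scipy.spatial import ConvexHull

def zonogon_vertices(kappa):
    """All vertices of the zonogon of a binary input channel (rows sum to one)."""
    cols = kappa.T
    order = np.argsort(-(cols[:, 0] / (cols[:, 1] + 1e-12)))
    chain = np.vstack([np.zeros(2), np.cumsum(cols[order], axis=0)])
    return np.vstack([chain, np.array([1.0, 1.0]) - chain])  # other half by symmetry

# Illustrative channels (not taken from the paper).
kappa_a = np.array([[0.9, 0.1], [0.3, 0.7]])
kappa_b = np.array([[0.7, 0.3], [0.05, 0.95]])

# Equation 7: the join corresponds to the convex hull of the union of the two zonogons.
points = np.vstack([zonogon_vertices(kappa_a), zonogon_vertices(kappa_b)])
hull = ConvexHull(points)
print(points[hull.vertices])  # vertices of the join zonogon
```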
Example 1. The remaining work only analyzes indicator variables, so we only need to consider the case where all presented ordering relations of this section are equivalent and form a lattice.
Figure 3a visualizes a channel with three output states. We can use the observations of S for making a prediction about T. For example, we predict that T is in its first state with a certain probability for each observed state of S. This randomized decision strategy can be written as the stochastic matrix λ shown in Figure 3a. The resulting TPR and FPR of this decision strategy are obtained from the weighted sum of these parameters with the column vectors of κ. Each decision strategy corresponds to a point within the zonogon, since the probabilities are constrained to the interval [0, 1] and the resulting zonogon is the Neyman-Pearson region.
Figure 3b visualizes an example for the discussed ordering relations, where all observable variables have two states. The zonogon/Neyman-Pearson region corresponding to one of the variables is fully contained within the other two. Therefore, we can say that this variable is Blackwell inferior (Definition 3) and less informative (Definition 2) than the other two about T. Practically, this means that we can construct an equivalent variable to it by garbling either of the other two, and that for any sequence of actions based on it and any reward function with dependence on T, we can achieve an expected reward at least as high by acting based on one of the other two instead. The remaining two variables are incomparable from the zonogon order, Blackwell order, and informativity order, since the Neyman-Pearson region of one is not fully contained in the other.
The zonogon shown in Figure 3a corresponds to the join, under the zonogon order, Blackwell order, and informativity order, of the two incomparable variables of Figure 3b about T. For binary targets, this join can be obtained directly from the convex hull of their Neyman-Pearson regions and corresponds to a valid joint distribution of the two variables. All other joint distributions are either equivalent or superior to it. When doing this on indicator variables of a non-binary target, the joint distributions obtained for each target state may not combine into a single valid overall joint distribution.
2.2. Partial Information Decomposition
The commonly used framework for PIDs was introduced by Williams and Beer [
1]. A PID is computed with respect to a particular random variable about which we would like to obtain information, called the target, and aims to identify from which of the variables we have access to, called the visible variables, this information can be obtained. Therefore, this section considers sets of variables, where each set represents the joint distribution of its members.
Notation 4. Throughout this work, we use the notation T for the target variable and for the set of visible variables. We use the notation for the power set of , and for its power set without the empty set.
The filter used for obtaining the set of atoms (Equation
8) removes sets that would be equivalent to other elements. This is required for obtaining a lattice from the following two ordering relations:
Definition 9 (Redundancy-/Gain-lattice [
1]).
The redundancy lattice is obtained by applying the ordering relation of Equation 9 to all atoms .
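For readers who prefer an algorithmic view, the following sketch enumerates the atoms produced by the filter of Equation 8 and implements the ordering relation of Equation 9; the encoding of sources as sets of variable indices is an assumption of this sketch.

```python
from itertools import chain, combinations

def nonempty_subsets(n):
    """All non-empty subsets of the visible-variable indices {1, ..., n} as frozensets."""
    idx = range(1, n + 1)
    return [frozenset(c) for c in
            chain.from_iterable(combinations(idx, r) for r in range(1, n + 1))]

def atoms(n):
    """Antichain filter (Equation 8): keep only collections of sources in which
    no source is a (proper) superset of another source."""
    sources = nonempty_subsets(n)
    collections = chain.from_iterable(combinations(sources, r)
                                      for r in range(1, len(sources) + 1))
    return [set(c) for c in collections
            if not any(a < b for a in c for b in c)]

def redundancy_leq(alpha, beta):
    """Ordering of the redundancy lattice (Equation 9): alpha precedes beta iff
    every source of beta contains some source of alpha."""
    return all(any(a <= b for a in alpha) for b in beta)

print(len(atoms(3)))  # 18 atoms for three visible variables
```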
The redundancy lattice for three visible variables is visualized in
Figure 4a. On this lattice, we can think of an atom as representing the information that can be obtained from all of its sources about the target
T (their redundancy or informational intersection). For example, an atom with two sources represents on the redundancy lattice the information that is contained in both of these sources about
T. If both sources of such an atom provide the information of another atom, then their redundancy contains at least this information, and that other atom is considered its predecessor. Therefore, the ordering indicates an informational subset relation for the redundancy of atoms, and the information that is represented by an atom increases as we move up. The up-set of an atom on the redundancy lattice indicates the information that is lost when losing all of its sources. Considering the example from above, if we lose access to both sources of the atom, then we lose access to all atoms in its up-set.
Definition 10 (Synergy-/Loss-lattice [
23]).
The synergy lattice is obtained by applying the ordering relation of Equation 10 to all atoms .
The synergy lattice for three visible variables is visualized in
Figure 4b. On this lattice, we can think of an atom as representing the information that is contained in none of its sources (information outside their union). For example, an atom with two sources represents on the synergy lattice the information about
T that is obtained from neither of these two sources. The ordering again indicates the expected subset relation: the information that is obtained from neither of two sources is fully contained in the information that cannot be obtained from one of them alone, and thus the atom containing both sources is a predecessor of the atom containing only the single source.
With an intuition for both ordering relations in mind, we can see how the filter in the construction of atoms (Equation
8) removes sets that would be equivalent to another atom: a set that contains both a source and a superset of this source is removed from the power set of sources, since it would be equivalent to the atom containing only the smaller source under the ordering of the redundancy lattice and equivalent to the atom containing only the larger source under the ordering of the synergy lattice.
Notation 5 (Redundancy/Synergy lattices). We use the notation for the join and meet operators on the redundancy lattice, and for the join and meet operators on the synergy lattice. We use the notation for the top and for the bottom atom on the redundancy lattice, and and for the top and bottom atom on the synergy lattice. For an atom α, we use the notation for its down-set, for its strict down-set, and for its cover set. These definitions will only appear in the Möbius inverse of a function that is directly associated with either the synergy or redundancy lattice such that there is no ambiguity about which ordering relation has to be considered.
The redundant, unique, or synergetic information (partial contributions) can be calculated based on either lattice. They are obtained by quantifying each atom of the redundancy or synergy lattice with a cumulative measure that increases as we move up in the lattice. The partial contributions are then obtained in a second step from a Möbius inverse.
Definition 11 ([Cumulative] redundancy measure [
1]).
A redundancy measure is a function that assigns a real value to each atom of the redundancy lattice. It is interpreted as a cumulative information measure that quantifies the redundancy between all sources of an atom about the target T.
Definition 12 ([Cumulative] loss measure [
23]).
A loss measure is a function that assigns a real value to each atom of the synergy lattice. It is interpreted as a cumulative measure that quantifies the information about T that is provided by none of the sources of an atom.
To ensure that a redundancy measure actually captures the desired concept of redundancy, Williams and Beer [
1] defined three axioms that a measure
should satisfy. For the synergy lattice, we consider the equivalent axioms discussed by Chicharro and Panzeri [
23]:
Axiom 1 (Commutativity [1,23])
.
Invariance in the order of sources (permuting the order of indices):
Axiom 2 (Monotonicity [1,23])
.
Additional sources can only decrease redundant information (redundancy lattice). Additional sources can only decrease the information that is contained in none of the sources (synergy lattice).
Axiom 3 (Self-redundancy [1,23])
.
For a single source, redundancy equals mutual information. For a single source, the information loss equals the difference between the total available mutual information and the mutual information of the considered source with the target.
The first axiom states that an atom’s redundancy and information loss should not depend on the order of its sources. The second axiom states that adding sources to an atom can only decrease the redundancy of all its sources (redundancy lattice) and the information obtained from none of its sources (synergy lattice). The third axiom ties the measures to mutual information and ensures that the bottom element of both lattices is quantified to zero.
Once a lattice with a corresponding cumulative measure is defined, we can use the Möbius inverse to compute the partial contribution of each atom. This partial information can be visualized as a partial area in a Venn diagram (see
Figure 1a) and corresponds to the desired redundant, unique, and synergetic contributions. However, the same atom represents different partial contributions on each lattice: as visualized for the case of two visible variables in
Figure 1, the unique information of a variable is represented by one atom on the redundancy lattice and by a different atom on the synergy lattice.
Definition 13 (Partial information [
1,
23]).
Partial information corresponds to the Möbius inverse of the corresponding cumulative measure on the respective lattice.
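A minimal sketch of Definition 13, assuming a generic ordering relation is supplied; the two-variable redundancy lattice and the cumulative values at the end are purely illustrative.

```python
def moebius_inverse(cumulative, lattice_leq, atoms):
    """Sketch of Definition 13: recover partial contributions pi(alpha) by subtracting
    the contributions of the strict down-set,
    pi(alpha) = cumulative(alpha) - sum of pi(beta) for all beta strictly below alpha."""
    partial = {}
    remaining = list(atoms)
    while remaining:
        # Process an atom once all of its strict predecessors have been processed.
        for alpha in remaining:
            below = [b for b in atoms if lattice_leq(b, alpha) and b != alpha]
            if all(b in partial for b in below):
                partial[alpha] = cumulative[alpha] - sum(partial[b] for b in below)
                remaining.remove(alpha)
                break
    return partial

# Illustrative redundancy lattice for two visible variables with made-up cumulative values.
atoms = ["{1}{2}", "{1}", "{2}", "{12}"]
leq_pairs = {("{1}{2}", "{1}"), ("{1}{2}", "{2}"), ("{1}{2}", "{12}"),
             ("{1}", "{12}"), ("{2}", "{12}")}
leq = lambda a, b: a == b or (a, b) in leq_pairs
cumulative = {"{1}{2}": 0.2, "{1}": 0.5, "{2}": 0.4, "{12}": 1.0}
print(moebius_inverse(cumulative, leq, atoms))
# redundancy 0.2, unique contributions 0.3 and 0.2, synergy 0.3
```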
Remark 3. Using the Möbius inverse for defining partial information enforces an inclusion-exclusion relation in that all partial information contributions have to sum to the corresponding cumulative measure. Kolchinsky [16] argues that an inclusion-exclusion relation should not be expected to hold for PIDs and proposes an alternative decomposition framework. In that case, the partial contributions (unique/redundant/synergetic information) are no longer expected to sum to the total amount of available information.
Property 1 (Local positivity, non-negativity [
1]).
A partial information decomposition satisfies non-negativity or local positivity if its partial information contributions are always non-negative, as shown in Equation 12.
The non-negativity property is important if we assume an inclusion-exclusion relation since it states that the unique, redundant, or synergetic information cannot be negative. If an atom
provides a negative partial contribution in the framework of Williams and Beer [
1], then this may indicate that we over-counted some information in its down-set.
Remark 4. Several additional axioms and properties have been suggested since the original proposal of Williams and Beer [1], such as target monotonicity and target chain rule [4]. However, this work will only consider the axioms and properties of Williams and Beer [1]. To the best of our knowledge, no other measure since the original proposal (discussed below) has been able to satisfy these properties for an arbitrary number of visible variables while ensuring an inclusion-exclusion relation for their partial contributions.
It is possible to convert between both representations due to a lattice duality:
Definition 14 (Lattice duality and dual decompositions [
23]).
Let a redundancy lattice with an associated measure and a synergy lattice with an associated measure be given; then, the two decompositions are said to be dual if and only if the down-set on one lattice corresponds to the up-set on the other, as shown in Equation 14.
Williams and Beer [
1] proposed the measure I_min to be used as a measure of redundancy and demonstrated that it satisfies the three required axioms and local positivity. They define redundancy as the expected value of the minimum specific information (Equation
14a).
Remark 5. Throughout this work, we use the term 'target pointwise information' or simply 'pointwise information' to refer to 'specific information'. This avoids confusion when naming the corresponding binary input channels in Section 3.
To the best of our knowledge, this measure is the only existing non-negative decomposition that satisfies all three axioms listed above for an arbitrary number of visible variables while providing an inclusion-exclusion relation of partial information.
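The following sketch illustrates, under stated assumptions, the structure of this measure: the specific (pointwise) information of each source for a fixed target state, followed by the expected minimum over sources. The joint distributions are illustrative, and the formula follows the standard form of specific information rather than the exact notation of Equation 14a.

```python
import numpy as np

def specific_information(p_ta, t):
    """Pointwise information I(T=t; A) = sum_a P(a|t) * log2( P(t|a) / P(t) ),
    where p_ta[t, a] = P(T=t, A=a) for a single source A."""
    p_t = p_ta.sum(axis=1)[t]
    p_a = p_ta.sum(axis=0)
    p_a_given_t = p_ta[t] / p_ta[t].sum()
    p_t_given_a = p_ta[t] / p_a
    mask = p_a_given_t > 0
    return np.sum(p_a_given_t[mask] * np.log2(p_t_given_a[mask] / p_t))

def i_min(p_t, sources):
    """I_min: for each target state, take the minimum specific information over the sources."""
    return sum(p_t[t] * min(specific_information(p_ta, t) for p_ta in sources)
               for t in range(len(p_t)))

# Two illustrative sources observing a uniform binary target.
p_ta1 = np.array([[0.4, 0.1], [0.1, 0.4]])      # joint P(T, A1)
p_ta2 = np.array([[0.25, 0.25], [0.05, 0.45]])  # joint P(T, A2)
print(i_min(np.array([0.5, 0.5]), [p_ta1, p_ta2]))
```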
However, the measure I_min could be criticized for not providing a notion of distinct information due to its use of a pointwise minimum (for each target state) over the sources. This leads to the question of distinguishing "the
same information and the
same amount of information" [
3,
4,
5,
6]. We can use the definition through a pointwise minimum to construct examples of unexpected behavior: consider, for example, a uniform binary target variable
T and two visible variables given as the outputs of the channels visualized in
Figure 5. The channels are constructed to be equivalent for both target states and to provide access to distinct decision regions while ensuring a constant pointwise information. Even though our ability to predict the target variable significantly depends on which of the two indicated channel outputs we observe (blue or green in
Figure 5, incomparable informativity based on Definition 2), the measure I_min concludes full redundancy between them. We consider this behavior undesirable and, as discussed in the literature, caused by an underlying failure to distinguish the
same information. To resolve this issue, we will present a representation of
f-information in
Section 3.1, which allows the use of all (TPR,FPR)-pairs for each state of the target variable to represent a distinct notion of uncertainty.
2.3. Information Measures
This section discusses two generalizations of mutual information for discrete random variables based on
f-divergences and Rényi divergences [
24,
25]. While mutual information has interpretational significance in channel coding and data compression, other
f-divergences are significant in parameter estimation, high-dimensional statistics, and hypothesis testing ([
7], p. 88), while Rényi divergences appear, among other areas, in privacy analysis [
8]. Finally, we introduce Bhattacharyya-information for demonstrating that it is possible to chain decomposition transformations in
Section 3.3. All definitions in this section only consider the case of discrete random variables (which is what we need for the context of this work).
Definition 15 (
f-divergence [
24]).
Let f be a function that satisfies the following three properties.
By convention, we adopt the standard continuity conventions for evaluating the expressions f(0) and 0 · f(0/0). For any such function f and two discrete probability distributions P and Q over the same event space, the f-divergence for discrete random variables is defined as shown in Equation 15.
Notation 6. Throughout this work, we reserve the name f for functions that satisfy the required properties for an f-divergence of Definition 15.
An
f-divergence quantifies a notion of dissimilarity between two probability distributions
P and
Q. Key properties of
f-divergences are their non-negativity, their invariance under bijective transformations, and the fact that they satisfy a data-processing inequality ([
7], p. 89). A list of commonly used
f-divergences is shown in
Table 1. Notably, the continuation for α → 1 of both the Hellinger- and the α-divergence results in the KL-divergence [
26].
The generator function of an
f-divergence is not unique, since f(x) and f(x) + c · (x − 1) generate the same divergence for a real constant c ([
7], p. 90f). As a result, the considered α-divergence is a linear scaling of the Hellinger divergence of the same order, as shown in Equation
16.
Definition 16 (
f-information [
7]).
An f-information is defined as the f-divergence between the joint distribution of two discrete random variables and the product of their marginals, as shown in Equation 17.
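As a hedged sketch of Definitions 15 and 16, the following code evaluates an f-information as the f-divergence between the joint distribution and the product of its marginals; the generators and the joint distribution are illustrative, and logarithms are taken in bits.

```python
import numpy as np

# Two standard generators: KL (in bits) and total variation. Joint distribution is illustrative.
f_kl = lambda x: x * np.log2(x) if x > 0 else 0.0
f_tv = lambda x: 0.5 * abs(x - 1)

def f_information(p_ts, f):
    """Equation 17 (sketch): D_f between the joint distribution and the product of marginals."""
    p_t = p_ts.sum(axis=1, keepdims=True)
    p_s = p_ts.sum(axis=0, keepdims=True)
    q = p_t * p_s  # product of marginals
    return sum(q_i * f(p_i / q_i) for p_i, q_i in zip(p_ts.ravel(), q.ravel()) if q_i > 0)

p_ts = np.array([[0.30, 0.10, 0.10],
                 [0.05, 0.25, 0.20]])
print(f_information(p_ts, f_kl))  # mutual information I(T; S)
print(f_information(p_ts, f_tv))  # total-variation information
```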
Definition 17 (f-entropy). A notion of f-entropy for a discrete random variable is obtained from its self-information, that is, the f-information of the variable with itself.
Notation 7. Using the KL-divergence results in the definitions of mutual information and Shannon entropy. Therefore, we use the standard notation for mutual information (KL-information) and for the Shannon entropy (KL-entropy).
The remaining part of this section will define Rényi- and Bhattacharyya-information to highlight that they can be represented as an invertible transformation of Hellinger-information. This will be used in
Section 3.3 to transform the decomposition of Hellinger-information to a decomposition of Rényi- and Bhattacharyya-information.
Remark 6. We could similarly choose to represent Rényi divergence as a transformation of the α-divergence. A linear scaling of the considered f-divergence will, however, not affect our later results (see Section 3.3).
Definition 18 (Rényi divergence [
25]).
Let P and Q be two discrete probability distributions over the same event space; then, Rényi divergence is defined as shown in Equation 18 for positive orders α ≠ 1 and is extended to the remaining orders by continuation.
Notably, the continuation of Rényi divergence for α → 1 also equals the KL-divergence ([
7], p. 116). Rényi divergence can be expressed as an invertible transformation of the Hellinger divergence of the same order (see Equation
18) [
26].
Definition 19 (Rényi-information [
7]).
Rényi-information is defined analogously to f-information, as shown in Equation 19, and corresponds to an invertible transformation of Hellinger-information of the same order.
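The stated relation can be checked numerically. The following sketch compares a directly computed Rényi divergence against the value obtained by transforming the Hellinger divergence of the same order; the distributions and the order are illustrative, and logarithms are in bits.

```python
import numpy as np

# Hedged numerical check of the relation between the Hellinger divergence of order alpha
# and the Rényi divergence of the same order. Distributions are illustrative.
alpha = 0.5
p = np.array([0.7, 0.2, 0.1])
q = np.array([0.4, 0.4, 0.2])

power_sum = np.sum(p**alpha * q**(1 - alpha))
hellinger = (power_sum - 1) / (alpha - 1)                        # Hellinger divergence
renyi_direct = np.log2(power_sum) / (alpha - 1)                  # Rényi divergence
renyi_from_hellinger = np.log2(1 + (alpha - 1) * hellinger) / (alpha - 1)

print(np.isclose(renyi_direct, renyi_from_hellinger))  # True
```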
Finally, we consider the Bhattacharyya distance (Definition 20), which is equivalent to a linear scaling of a special case of Rényi divergence (Equation
20) [
26]. It is applied, among other areas, in signal processing [
27] and coding theory [
28]. The corresponding information measure (Equation
21) is, like the distance itself, a scaling of a special case of Rényi-information.
Definition 20 (Bhattacharyya distance [
29]).
Let P and Q be two discrete probability distributions over the same event space; then, the Bhattacharyya distance is defined as shown in Equation 20.
Definition 21 (Bhattacharyya-information).
Bhattacharyya-information is defined analogously to f-information, as shown in Equation 21.
Example 2.
Consider a channel from a binary target variable T to a source variable S. While this will be discussed in more detail in Section 3.1, Equation 22 already indicates that f-information can be interpreted as the expected value of quantifying the boundary of the Neyman-Pearson region for an indicator variable of each target state t. Each state of a source variable corresponds to one side/edge of this boundary, as discussed in Section 2.1 and visualized in Figure 2. Therefore, the sum over the states of S corresponds to the sum of quantifying each edge of the zonogon by some function, which is parameterized only by the distribution of the indicator variable for t. This function satisfies a triangle inequality (Corollary A1), and the total boundary is non-negative (Theorem 2, discussed later). Therefore, to give an oversimplified intuition, we can vaguely think of pointwise f-information as quantifying the length of the boundary of the Neyman-Pearson region, or zonogon perimeter.
Below is a step-wise computation of an f-information measure on a small example from this interpretation for the setting introduced above.
Since the target is binary, we compute the pointwise information for two indicator variables as shown in Figure 6. Since each state of S corresponds to one edge of the zonogon, we quantify the edges individually. Notice that the quantification of each vector can be expressed as a function that is parameterized only by the distribution of the indicator variable. The total zonogon perimeter is quantified as the sum over its edges, which equals the pointwise information. In this particular case, we obtain one value for the total boundary on the indicator of the first target state and another for the indicator of the second. The expected information corresponds to the expected value of these pointwise quantifications and provides the final result (Equation 24).
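The interpretation above can also be sketched generically: the following code computes, for each target state, the pointwise f-information as a sum over the states of S (one term per zonogon edge) and takes the expectation over target states. The generator and the joint distribution are illustrative, and the code operates on the joint distribution rather than on the zonogon edges directly.

```python
import numpy as np

f_chi2 = lambda x: (x - 1) ** 2  # chi-squared generator (illustrative choice)

def pointwise_f_information(p_ts, f, t):
    """Pointwise term for target state t: sum_s P(s) * f( P(s|t) / P(s) ),
    where each state s of the source contributes one term (one zonogon edge)."""
    p_s = p_ts.sum(axis=0)
    p_s_given_t = p_ts[t] / p_ts[t].sum()
    return np.sum(p_s * np.array([f(a / b) for a, b in zip(p_s_given_t, p_s)]))

p_ts = np.array([[0.30, 0.10, 0.10],
                 [0.05, 0.25, 0.20]])
p_t = p_ts.sum(axis=1)

pointwise = [pointwise_f_information(p_ts, f_chi2, t) for t in range(len(p_t))]
total = float(np.dot(p_t, pointwise))  # expected value over target states
print(pointwise, total)                # the total equals the f-information of Equation 17
```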