3.1. Motivation and Bivariate Definition
Consider a distribution $p(t, y_1, y_2)$ and suppose there are two agents, agent 1 and agent 2, whose goal is to reduce their uncertainty about $T$ by observing $Y_1$ and $Y_2$, respectively. Suppose also that the agents know $p(t)$, and that agent $i$ has access to its channel distribution $p(y_i|t)$. Many PID measures make this same assumption. When agent $i$ works alone to reduce the uncertainty about $T$, since it has access to $p(t)$ and $p(y_i|t)$, it also knows $p(y_i)$ and $p(t|y_i)$, which allows it to compute $I(T; Y_i)$: the amount of uncertainty reduction about $T$ achieved by observing $Y_i$.
Now, if the agents can work together, that is, if they have access to $p(y_1, y_2|t)$, then they can compute $I(T; Y_1, Y_2)$, because they have access to $p(t)$ and $p(y_1, y_2|t)$. On the other hand, if the agents are not able to work together (in the sense that they are not able to observe $Y = (Y_1, Y_2)$ together, but only $Y_1$ and $Y_2$, separately) yet can communicate, then they can construct a different distribution $q$ given by $q(t, y_1, y_2) = p(t)\, p(y_1|t)\, p(y_2|t)$, i.e., a distribution under which $Y_1$ and $Y_2$ are conditionally independent given $T$, but have the same marginal $p(t)$ and the same individual conditionals $p(y_1|t)$ and $p(y_2|t)$.
The form of $q$ in the previous paragraph should be contrasted with the following factorization of $p$, which entails no conditional independence assumption: $p(t, y_1, y_2) = p(t)\, p(y_1|t)\, p(y_2|t, y_1)$. In this sense, we would propose to define union information, for the bivariate case, as follows:

$$ I_\cup(T; Y_1, Y_2) = I_q(T; Y_1, Y_2), \qquad (3) $$

where the subscript $q$ refers to the distribution under which the mutual information is computed. From this point forward, the absence of a subscript means that the computation is done under the true distribution $p$. As we will see, this is not yet the final definition, for reasons to be addressed below.
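For concreteness, the following sketch shows this construction for discrete variables represented as a joint probability array (the function names and array layout are illustrative choices, not standard PID software): it builds $q(t, y_1, y_2) = p(t)\,p(y_1|t)\,p(y_2|t)$ from $p$ and evaluates the tentative union information (3) as $I_q(T; Y_1, Y_2)$.

```python
import numpy as np

def mutual_information(joint):
    """I(T; Y1, Y2) in bits, for a joint probability array indexed [t, y1, y2]."""
    joint = joint / joint.sum()
    p_t = joint.sum(axis=(1, 2), keepdims=True)
    p_y = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float((joint[mask] * np.log2(joint[mask] / (p_t * p_y)[mask])).sum())

def conditionally_independent_q(p):
    """The agents' distribution q(t, y1, y2) = p(t) p(y1|t) p(y2|t)."""
    p = p / p.sum()
    p_t = p.sum(axis=(1, 2))                  # assumes p(t) > 0 for every t
    p_y1_t = p.sum(axis=2) / p_t[:, None]     # p(y1 | t)
    p_y2_t = p.sum(axis=1) / p_t[:, None]     # p(y2 | t)
    return p_t[:, None, None] * p_y1_t[:, :, None] * p_y2_t[:, None, :]

def tentative_union_information(p):
    """The preliminary definition (3): I_union(T; Y1, Y2) = I_q(T; Y1, Y2)."""
    return mutual_information(conditionally_independent_q(p))

# Example: two conditionally independent noisy copies of a uniform bit T; here q
# coincides with p, so the tentative union information equals I(T; Y1, Y2).
eps = 0.1
p = np.zeros((2, 2, 2))
for t in (0, 1):
    for y1 in (0, 1):
        for y2 in (0, 1):
            p[t, y1, y2] = (0.5 * ((1 - eps) if y1 == t else eps)
                                * ((1 - eps) if y2 == t else eps))

print(np.allclose(p, conditionally_independent_q(p)))          # True
print(tentative_union_information(p), mutual_information(p))   # equal values
```

If $Y_1$ and $Y_2$ are already conditionally independent given $T$, the construction simply returns $q = p$, and the tentative union information coincides with $I(T; Y_1, Y_2)$.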
Using the definition of synergy derived from a measure of union information [21], for the bivariate case we have

$$ S(T; Y_1, Y_2) = I(T; Y_1, Y_2) - I_\cup(T; Y_1, Y_2) = I(T; Y_1, Y_2) - I_q(T; Y_1, Y_2). \qquad (4) $$

Synergy is often posited as the difference between the whole and the union of the parts. For our measure of union information, the ‘union of the parts’ corresponds to the reduction of uncertainty about $T$ - under $q$ - that agents 1 and 2 can obtain by sharing their conditional distributions. Interestingly, there are cases where the union of the parts is better than the whole, in the sense that $I_q(T; Y_1, Y_2) > I(T; Y_1, Y_2)$. An example of this is given by the Adapted ReducedOR distribution, originally introduced by Ince [20] and adapted by James et al. [16], which is shown in the left side of Table 3, where $r$ is the parameter of the distribution. This distribution is such that $q$ does not depend on $r$ (recall that $q(t, y_1, y_2) = p(t)\, p(y_1|t)\, p(y_2|t)$), since neither $p(t)$ nor $p(y_1|t)$ and $p(y_2|t)$ depend on $r$; consequently, $I_q(T; Y_1, Y_2)$ also does not depend on $r$, as shown in the right side of Table 3.
It can be easily shown that there are values of $r$ for which $I_q(T; Y_1, Y_2) > I(T; Y_1, Y_2)$, which implies that synergy, if defined as in (4), could be negative. How do we interpret the fact that there exist distributions such that $I_q(T; Y_1, Y_2) > I(T; Y_1, Y_2)$? This means that under distribution $q$, which assumes $Y_1$ and $Y_2$ are conditionally independent given $T$, $Y_1$ and $Y_2$ reduce the uncertainty about $T$ more than in the original distribution. Arguably, the parts working independently and achieving better results than the whole should mean there is no synergy, as opposed to negative synergy.
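This behaviour can be reproduced with a much simpler distribution than the one in Table 3; the sketch below uses an illustrative choice (a uniform bit $T$, a noisy observation $Y_1$ of $T$ with flip probability 0.1, and $Y_2$ an exact copy of $Y_1$), under which $q$ treats the two copies as independent looks at $T$ and therefore $I_q(T; Y_1, Y_2) > I(T; Y_1, Y_2)$.

```python
import numpy as np

def mutual_information(joint):
    """I(T; Y1, Y2) in bits, for a joint probability array indexed [t, y1, y2]."""
    joint = joint / joint.sum()
    p_t = joint.sum(axis=(1, 2), keepdims=True)
    p_y = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float((joint[mask] * np.log2(joint[mask] / (p_t * p_y)[mask])).sum())

def conditionally_independent_q(p):
    """q(t, y1, y2) = p(t) p(y1|t) p(y2|t), assuming p(t) > 0 for every t."""
    p = p / p.sum()
    p_t = p.sum(axis=(1, 2))
    p_y1_t = p.sum(axis=2) / p_t[:, None]
    p_y2_t = p.sum(axis=1) / p_t[:, None]
    return p_t[:, None, None] * p_y1_t[:, :, None] * p_y2_t[:, None, :]

# Y1 is a noisy observation of T (flip probability 0.1) and Y2 is a copy of Y1.
eps = 0.1
p = np.zeros((2, 2, 2))
for t in (0, 1):
    for y in (0, 1):
        p[t, y, y] = 0.5 * ((1 - eps) if y == t else eps)

print(mutual_information(p))                                # I(T;Y1,Y2)   ~ 0.53 bits
print(mutual_information(conditionally_independent_q(p)))   # I_q(T;Y1,Y2) ~ 0.74 bits
```

With (4), this distribution would be assigned roughly -0.21 bits of synergy, which motivates the correction introduced next.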
The observations in the previous paragraphs motivate our definition of a new measure of union information as

$$ I_\cup^{CI}(T; Y_1, Y_2) = \min \big\{ I(T; Y_1, Y_2),\ I_q(T; Y_1, Y_2) \big\}, \qquad (5) $$

with the superscript CI standing for conditional independence, yielding a non-negative synergy:

$$ S^{CI}(T; Y_1, Y_2) = I(T; Y_1, Y_2) - I_\cup^{CI}(T; Y_1, Y_2) \geq 0. \qquad (6) $$

Note that, for the bivariate case, we have 0 synergy if $p$ is such that $p(t, y_1, y_2) = p(t)\, p(y_1|t)\, p(y_2|t)$ (i.e., $p = q$), that is, if the outputs are indeed conditionally independent given $T$. Moreover, $I_\cup^{CI}$ satisfies the monotonicity axiom from the extension of the Williams-Beer axioms to measures of union information (to be mentioned in Section 4.1), which further supports this definition.
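A minimal sketch of the bivariate definitions (5) and (6), in the same array representation as above (function names are illustrative):

```python
import numpy as np

def mutual_information(joint):
    """I(T; Y1, Y2) in bits, for a joint probability array indexed [t, y1, y2]."""
    joint = joint / joint.sum()
    p_t = joint.sum(axis=(1, 2), keepdims=True)
    p_y = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float((joint[mask] * np.log2(joint[mask] / (p_t * p_y)[mask])).sum())

def union_information_ci(p):
    """Bivariate (5): min{ I(T;Y1,Y2), I_q(T;Y1,Y2) }, with q = p(t) p(y1|t) p(y2|t)."""
    p = p / p.sum()
    p_t = p.sum(axis=(1, 2))                    # assumes p(t) > 0 for every t
    q = (p_t[:, None, None]
         * (p.sum(axis=2) / p_t[:, None])[:, :, None]
         * (p.sum(axis=1) / p_t[:, None])[:, None, :])
    return min(mutual_information(p), mutual_information(q))

def synergy_ci(p):
    """Bivariate (6): S^CI(T;Y1,Y2) = I(T;Y1,Y2) - I_union^CI(T;Y1,Y2) >= 0."""
    return mutual_information(p) - union_information_ci(p)

# For the copied noisy observation used above, I_q > I, so the min in (5) clamps the
# union information at I(T;Y1,Y2) and the synergy is 0 rather than negative.
eps = 0.1
p_copy = np.zeros((2, 2, 2))
for t in (0, 1):
    for y in (0, 1):
        p_copy[t, y, y] = 0.5 * ((1 - eps) if y == t else eps)
print(union_information_ci(p_copy), synergy_ci(p_copy))   # ~0.53 and 0.0
```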
3.2. Operational Interpretation
For the bivariate case, if $Y_1$ and $Y_2$ are conditionally independent given $T$ (Figure 1 (b)), then $p(y_1|t)$ and $p(y_2|t)$ (and $p(t)$) suffice to reconstruct the original joint distribution $p(t, y_1, y_2)$, which means the union of the parts is enough to reconstruct the whole, i.e., there is no synergy between $Y_1$ and $Y_2$. Conversely, a distribution generated by the DAG in Figure 1 (c) does not satisfy conditional independence (given $T$), hence we expect positive synergy, as is the case for the XOR distribution, and indeed our measure yields 1 bit of synergy for this distribution. These two cases motivate the operational interpretation of our measure of synergy: it is the amount of information that is not captured by assuming conditional independence of the sources (given the target).
Recall, however, that some distributions are such that $I_q(T; Y_1, Y_2) > I(T; Y_1, Y_2)$, i.e., such that the union of the parts ‘outperforms’ the whole. What does this mean? It means that under $q$, $Y_1$ and $Y_2$ have more information about $T$ than under $p$: the constructed distribution $q$, which drops the conditional dependence of $Y_1$ and $Y_2$ given $T$, reduces the uncertainty that $Y$ has about $T$ more than the original distribution $p$. In some cases, this may happen because the support of $q$ is larger than that of $p$, which may lead to a reduction of uncertainty under $q$ that cannot be achieved under $p$. In these cases, since we are decomposing $I(T; Y_1, Y_2)$, we revert to saying that the union information that a set of variables has about $T$ is equal to $I(T; Y_1, Y_2)$, so that our measure satisfies the monotonicity axiom (later introduced in Definition 2). We will comment on this compromise between satisfying the monotonicity axiom and ignoring dependencies later.
3.3. General (Multivariate) Definition
To extend the proposed measure to an arbitrary number of sources, we briefly recall the synergy lattice [17,34] and the union information semi-lattice [34]. For three variables, these two lattices are shown in Figure 2. For the sake of brevity, we will not address the construction of the lattices or the different orders between sources. We refer the reader to the work of Gutknecht et al. [34] for an excellent overview of the different lattices, the orders between sources, and the construction of different PID measures.
In the following, we use the term source to mean a subset of the variables $\{Y_1, \ldots, Y_n\}$, or a set of such subsets; we drop the curly brackets for clarity and refer to the different variables by their indices, as is common in most works on PID. The decomposition resulting from a measure of union information is not as direct to obtain as one derived from a measure of redundant information, as the solution for the information atoms is not a Möbius inversion [13]. One must first construct the measure of synergy for a source $\alpha$ by writing

$$ S^{CI}(T; \alpha) = I(T; Y_1, \ldots, Y_n) - I_\cup^{CI}(T; \alpha), \qquad (7) $$

which is the generalization of (6) for an arbitrary source $\alpha$. In the remainder of this paper, we will often omit “$T;$” from the notation (unless it is explicitly needed), with the understanding that the target variable is always referred to as $T$. Also for simplicity, in the following, we identify the different agents with the distributions they have access to.
It is fairly simple to extend the proposed measure to an arbitrary number of sources, as illustrated in the following two examples.
Example 1: to compute $I_\cup^{CI}(\{1,2\},\{3\})$, agent $\{1,2\}$ knows $p(y_1, y_2|t)$, thus it can also compute, by marginalization, $p(y_1|t)$ and $p(y_2|t)$. On the other hand, agent $\{3\}$ only knows $p(y_3|t)$. Recall that both agents also have access to $p(t)$. By sharing their conditionals, the agents can compute $q_1(t, y_1, y_2, y_3) = p(t)\, p(y_1, y_2|t)\, p(y_3|t)$, and also $q_2(t, y_1, y_2, y_3) = p(t)\, p(y_1|t)\, p(y_2|t)\, p(y_3|t)$. After this, they may choose whichever distribution has the highest information about $T$, while still holding the view that any information gain larger than $I(T; Y_1, Y_2, Y_3)$ must be disregarded. Consequently, we write

$$ I_\cup^{CI}(\{1,2\},\{3\}) = \min \big\{ I(T; Y_1, Y_2, Y_3),\ \max \{ I_{q_1}(T; Y_1, Y_2, Y_3),\ I_{q_2}(T; Y_1, Y_2, Y_3) \} \big\}. $$
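As a worked numerical instance of Example 1, the sketch below uses an illustrative distribution (three i.i.d. uniform bits with $T$ equal to their parity, which is purely synergistic) and evaluates the expression above.

```python
import numpy as np
from itertools import product

def mutual_information(joint):
    """I(T; remaining axes) in bits; axis 0 of `joint` indexes T."""
    joint = joint / joint.sum()
    p_t = joint.sum(axis=tuple(range(1, joint.ndim)), keepdims=True)
    p_y = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float((joint[mask] * np.log2(joint[mask] / (p_t * p_y)[mask])).sum())

# Illustrative distribution: Y1, Y2, Y3 i.i.d. uniform bits and T = Y1 xor Y2 xor Y3.
p = np.zeros((2, 2, 2, 2))                           # axes: t, y1, y2, y3
for y1, y2, y3 in product((0, 1), repeat=3):
    p[y1 ^ y2 ^ y3, y1, y2, y3] = 1 / 8

p_t = p.sum(axis=(1, 2, 3))
p_y12_t = p.sum(axis=3) / p_t[:, None, None]         # agent {1,2}: p(y1, y2 | t)
p_y1_t = p.sum(axis=(2, 3)) / p_t[:, None]           # its marginalizations
p_y2_t = p.sum(axis=(1, 3)) / p_t[:, None]
p_y3_t = p.sum(axis=(1, 2)) / p_t[:, None]           # agent {3}: p(y3 | t)

q1 = p_t[:, None, None, None] * p_y12_t[:, :, :, None] * p_y3_t[:, None, None, :]
q2 = (p_t[:, None, None, None] * p_y1_t[:, :, None, None]
      * p_y2_t[:, None, :, None] * p_y3_t[:, None, None, :])

union_12_3 = min(mutual_information(p),
                 max(mutual_information(q1), mutual_information(q2)))
print(union_12_3)    # 0.0 for this distribution: the parity bit is purely synergistic
```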
Example 2: slightly more complicated is the computation of $I_\cup^{CI}(\{1,2\},\{1,3\},\{2,3\})$. In this case, the three agents may compute four different distributions: two of them are $q_1$ and $q_2$, defined in the previous paragraph, and the other two are $q_3(t, y_1, y_2, y_3) = p(t)\, p(y_1, y_3|t)\, p(y_2|t)$ and $q_4(t, y_1, y_2, y_3) = p(t)\, p(y_2, y_3|t)\, p(y_1|t)$.
Given these insights, we propose the following measure of union information.
Definition 1.
Let $A = \{A_1, \ldots, A_m\}$ be an arbitrary collection of sources (recall that sources may be subsets of variables). Without loss of generality, assume that no source is a subset of another source and no source is a deterministic function of the other sources. We define

$$ I_\cup^{CI}(A_1, \ldots, A_m) = \min \Big\{ I(T; Y),\ \max_{q \in Q} I_q(T; Y) \Big\}, \qquad (8) $$

where $Y = (Y_1, \ldots, Y_n)$ and $Q$ is the set of all different distributions that the $m$ agents can construct by combining their conditional distributions and marginalizations thereof.
For instance, in Example 1 above, $m = 2$; in Example 2, $m = 3$. In Example 1, $Q = \{q_1, q_2\}$, whereas in Example 2, $Q = \{q_1, q_2, q_3, q_4\}$.
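The sketch below is one possible reading of Definition 1 in code, under two simplifying assumptions: each $q \in Q$ corresponds to a partition of the variable indices into blocks, each contained in some source, and the sources jointly cover all variables. The function names and the enumeration strategy are illustrative.

```python
import numpy as np
from itertools import product

def mutual_information(joint):
    """I(T; Y1, ..., Yn) in bits; axis 0 of `joint` indexes T."""
    joint = joint / joint.sum()
    p_t = joint.sum(axis=tuple(range(1, joint.ndim)), keepdims=True)
    p_y = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float((joint[mask] * np.log2(joint[mask] / (p_t * p_y)[mask])).sum())

def partitions(items):
    """Yield all set partitions of a list of items."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        yield [[first]] + part                       # `first` in its own block
        for i in range(len(part)):                   # or merged into an existing block
            yield part[:i] + [[first] + part[i]] + part[i + 1:]

def union_information_ci(p, sources):
    """Sketch of Definition 1: min{ I(T;Y), max_{q in Q} I_q(T;Y) }.  Each q in Q is
    built from a partition of the variable indices whose blocks are each contained in
    some source; the sources are assumed to jointly cover all variables, p(t) > 0."""
    n = p.ndim - 1
    p = p / p.sum()
    p_t = p.sum(axis=tuple(range(1, p.ndim))).reshape((-1,) + (1,) * n)
    best = -np.inf
    for part in partitions(list(range(n))):
        if not all(any(set(block) <= set(src) for src in sources) for block in part):
            continue
        q = p_t.copy()
        for block in part:                           # multiply in p(y_block | t)
            other = tuple(i + 1 for i in range(n) if i not in block)
            q = q * (p.sum(axis=other, keepdims=True) / p_t)
        best = max(best, mutual_information(q))
    return min(mutual_information(p), best)

# Usage with the parity distribution of the previous sketch (0-based variable indices):
p = np.zeros((2, 2, 2, 2))
for y1, y2, y3 in product((0, 1), repeat=3):
    p[y1 ^ y2 ^ y3, y1, y2, y3] = 1 / 8
print(union_information_ci(p, [{0, 1}, {2}]))             # Example 1
print(union_information_ci(p, [{0, 1}, {0, 2}, {1, 2}]))  # Example 2
```

For the parity distribution, both source sets yield zero union information, consistent with the purely synergistic nature of the parity bit.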
We now justify the conditions in Definition 1 and the fact that they do not entail any loss of generality.
The condition that no source is a subset of another source (which also excludes the case where two sources are the same) implies no loss of generality: if one source is a subset of another, say $A_i \subseteq A_j$, then $A_i$ may be removed from $A$ without affecting either $I(T; Y)$ or $Q$, thus yielding the same value for $I_\cup^{CI}$. The removal of source $A_i$ is also done for measures of intersection information, but under the opposite condition: whenever $A_j \subseteq A_i$.
The condition that no source is a deterministic function of the other sources is slightly more nuanced. In our perspective, an intuitive and desired property of measures of both union and synergistic information is that their value should not change whenever one adds a source that is a deterministic function of sources that are already considered. We provide arguments in favor of this property in Section 4.2.1. This property may not be satisfied by computing $I_\cup^{CI}$ without previously excluding such sources. For instance, consider $Y = (Y_1, Y_2, Y_3)$, where $Y_1$ and $Y_2$ are two i.i.d. random variables following a Bernoulli distribution with parameter 0.5, $Y_3 = Y_1 \oplus Y_2$ (that is, $Y_3$ is a deterministic function of $(Y_1, Y_2)$), and $T = Y_1 \oplus Y_2$. Computing $I_\cup^{CI}$ without excluding $Y_3$ (or, equivalently, $Y_1$ or $Y_2$, since each variable is a deterministic function of the other two) yields $I_\cup^{CI}(\{1\},\{2\},\{3\}) = 1$ bit and $S^{CI}(\{1\},\{2\},\{3\}) = 0$, whereas $I_\cup^{CI}(\{1\},\{2\}) = 0$ and $S^{CI}(\{1\},\{2\}) = 1$ bit. This issue is resolved by removing deterministic sources before computing $I_\cup^{CI}$.
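A quick numerical check of this example (a sketch in the same illustrative array representation used above):

```python
import numpy as np
from itertools import product

def mutual_information(joint):
    """I(T; remaining axes) in bits; axis 0 indexes T."""
    joint = joint / joint.sum()
    p_t = joint.sum(axis=tuple(range(1, joint.ndim)), keepdims=True)
    p_y = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float((joint[mask] * np.log2(joint[mask] / (p_t * p_y)[mask])).sum())

# Y1, Y2 i.i.d. uniform bits, Y3 = Y1 xor Y2, and T = Y1 xor Y2.
p = np.zeros((2, 2, 2, 2))                    # axes: t, y1, y2, y3
for y1, y2 in product((0, 1), repeat=2):
    p[y1 ^ y2, y1, y2, y1 ^ y2] = 0.25

p_t = p.sum(axis=(1, 2, 3))
p_y1_t = p.sum(axis=(2, 3)) / p_t[:, None]
p_y2_t = p.sum(axis=(1, 3)) / p_t[:, None]
p_y3_t = p.sum(axis=(1, 2)) / p_t[:, None]

# Keeping the deterministic source {3}: q = p(t) p(y1|t) p(y2|t) p(y3|t).
q_123 = (p_t[:, None, None, None] * p_y1_t[:, :, None, None]
         * p_y2_t[:, None, :, None] * p_y3_t[:, None, None, :])
# Removing it first: q = p(t) p(y1|t) p(y2|t) over (t, y1, y2) only.
q_12 = p_t[:, None, None] * p_y1_t[:, :, None] * p_y2_t[:, None, :]

I_full = mutual_information(p)                            # 1 bit
print(min(I_full, mutual_information(q_123)))             # 1.0 -> synergy 0
print(min(I_full, mutual_information(q_12)))              # 0.0 -> synergy 1 bit
```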
We conclude this section by commenting on the monotonicity of our measure. Suppose we wish to compute the union information of sources $\{1,2\}$ and $\{3\}$. PID theory demands that $I_\cup^{CI}(\{1\},\{2\},\{3\}) \leq I_\cup^{CI}(\{1,2\},\{3\})$ (monotonicity of union information). Recall our motivation for $I_\cup^{CI}(\{1,2\},\{3\})$: there are two agents, the first has access to $p(y_1, y_2|t)$ and the second to $p(y_3|t)$. The two agents assume conditional independence of their variables and construct $q_1(t, y_1, y_2, y_3) = p(t)\, p(y_1, y_2|t)\, p(y_3|t)$. The story is similar for the computation of $I_\cup^{CI}(\{1\},\{2\},\{3\})$, in which case we have three agents that construct $q_2(t, y_1, y_2, y_3) = p(t)\, p(y_1|t)\, p(y_2|t)\, p(y_3|t)$. Now, it may be the case that $I_{q_1}(T; Y) < I_{q_2}(T; Y)$; considering only these two distributions would yield $I_\cup^{CI}(\{1,2\},\{3\}) < I_\cup^{CI}(\{1\},\{2\},\{3\})$, contradicting monotonicity for measures of union information. To overcome this issue, for the computation of $I_\cup^{CI}(\{1,2\},\{3\})$ - and for other sources in general - the agent that has access to $p(y_1, y_2|t)$ must be allowed to disregard the conditional dependence of $Y_1$ and $Y_2$ given $T$, even if it holds in the original distribution $p$.
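The premise $I_{q_1}(T; Y) < I_{q_2}(T; Y)$ can indeed occur. The sketch below uses an illustrative construction (not taken from the examples above): $T = (T_1, T_2)$ with $T_1, T_2$ independent uniform bits, $Y_1 = Y_2 = (A, C_1)$, where $A$ is $T_1$ flipped with probability 0.1 and $C_1$ is an independent uniform bit, and $Y_3 = C_1 \oplus T_2$. Duplicating $Y_1$ makes the conditionally independent product $q_2$ more informative about $T$ than $q_1$, while the XOR coupling between $C_1$ and $Y_3$ keeps both below $I(T; Y)$, so considering only $q_1$ for the sources $\{1,2\},\{3\}$ would indeed break monotonicity.

```python
import numpy as np
from itertools import product

def mutual_information(joint):
    """I(T; remaining axes) in bits; axis 0 of `joint` indexes T."""
    joint = joint / joint.sum()
    p_t = joint.sum(axis=tuple(range(1, joint.ndim)), keepdims=True)
    p_y = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float((joint[mask] * np.log2(joint[mask] / (p_t * p_y)[mask])).sum())

eps = 0.1                                # illustrative noise level
p = np.zeros((4, 4, 4, 2))               # axes: T=(T1,T2), Y1, Y2, Y3
for t1, t2, a, c1 in product((0, 1), repeat=4):
    y = 2 * a + c1                       # Y1 = Y2 = (A, C1)
    y3 = c1 ^ t2                         # Y3 = C1 xor T2
    p[2 * t1 + t2, y, y, y3] += 0.25 * ((1 - eps) if a == t1 else eps) * 0.5

p_t = p.sum(axis=(1, 2, 3))
p_y12_t = p.sum(axis=3) / p_t[:, None, None]      # p(y1, y2 | t)
p_y1_t = p.sum(axis=(2, 3)) / p_t[:, None]        # p(y1 | t)
p_y2_t = p.sum(axis=(1, 3)) / p_t[:, None]        # p(y2 | t)
p_y3_t = p.sum(axis=(1, 2)) / p_t[:, None]        # p(y3 | t)

q1 = p_t[:, None, None, None] * p_y12_t[:, :, :, None] * p_y3_t[:, None, None, :]
q2 = (p_t[:, None, None, None] * p_y1_t[:, :, None, None]
      * p_y2_t[:, None, :, None] * p_y3_t[:, None, None, :])

I_p, I_q1, I_q2 = (mutual_information(x) for x in (p, q1, q2))
print(I_p, I_q1, I_q2)                   # ~1.53, ~0.53, ~0.74 bits
print(min(I_p, I_q1) < min(I_p, I_q2))   # True: q1 alone would break monotonicity
print(min(I_p, max(I_q1, I_q2)) >= min(I_p, I_q2))   # True: including q2 restores it
```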