Preprint
Article

Probability via Expectation Measures

This version is not peer-reviewed

Submitted: 23 October 2024
Posted: 24 October 2024

Abstract
Since the seminal work of Kolmogorov, probability theory has been based on measure theory, where the central components are so-called probability measures, defined as measures with total mass equal to 1. In Kolmogorov’s theory, a probability measure is used to model an experiment with a single outcome that will belong to exactly one out of several disjoint sets. In this paper, we present a different basic model where an experiment results in a multiset, i.e. for each of the disjoint sets we get the number of observations in the set. This new framework is consistent with Kolmogorov’s theory, but the theory focuses on expected values rather than probabilities. We present examples from testing Goodness-of-Fit, Bayesian statistics, and quantum theory, where the shifted focus gives new insight or better performance. We also provide several new theorems that address some problems related to the change in focus.
Keywords: 
Subject: Computer Science and Mathematics - Probability and Statistics

MSC:  60A05, 60G55

1. Introduction

Throughout the history of probability theory, some mathematicians have focused on probabilities, and others have focused on expectations. In his seminal paper from 1933, Kolmogorov focused on probabilities [1], but in the present paper, we will develop expectation theory as an alternative to probability theory, focusing on expectations. Expectation theory has been developed to handle technical problems in several applications in information theory, statistics (both frequentist and Bayesian), quantum information theory, and probability theory itself. We will present the basic definitions of our theory and provide interpretations of the core concepts. Standard probability theory, as developed by Kolmogorov, can be embedded in the present theory. Similarly, our theory can be embedded into standard probability theory. The focus is on discrete measures to keep this paper at a moderate length. Since there is no inconsistency between the two theories, there are many cases where expectation theory, as developed here for discrete measures, can also be used for more general measures. Some readers may be more interested in theory, and others may be more interested in applications. The paper is written hoping that the different sections can be read quite independently.

1.1. Organization of the Paper

In Section 2 we point out some significant ideas related to the work of Kolmogorov and we will review some later developments that are important for our topic. The notion of a probability monad will be presented in Section 2.3. This approach allows us to focus on which basic operations are needed to define a well-behaved class of models.
The theoretical framework we will develop may be viewed as a part of the theory of point processes. Usually, the topic of point processes is considered an advanced part of probability theory. Still, we will consider some aspects of the theory of point processes as quite fundamental for modeling randomness. In Section 2.4 we will give a short overview of the relevant concepts of the theory of point processes. For readers familiar with the theory of point processes, it puts our contribution in a well-known context. It will also provide a framework to ensure the rest of the paper provides a consistent theory. In the subsequent sections, our discussions will often focus on finite samples, but the conclusions will also hold for more general samples, which is easy to see with some general knowledge of point processes.
In Section 3 we develop the theory of finite empirical measures. Such finite empirical measures are formally equivalent to multisets. Some basic properties of empirical measures are established. It is pointed out that many problems in information theory can be formulated for empirical measures without reference to randomness.
In Section 4 we introduce finite expectation measures. In Section 4.3 we introduce the Poisson interpretation that allows us to translate between results for expectation measures and results for Poisson point processes with probability measures. The cost of this translation is that the outcome space of the process is infinite, even if the expectation measure is finite. The Poisson interpretation enables us to give probabilistic interpretations of conditioning and independence for general measures, as discussed in Section 4.4 and Section 4.5. In [2] it was demonstrated that the reverse information projection of a probability distribution into a convex set of probability distributions may not be a probability distribution. This has been an important motivation for studying information divergence for general measures, as done in Section 4.6.
In Section 5, we will provide examples of how the present theory gives alternative interpretations and improved results to some familiar problems in decision theory, Bayesian statistics, Testing Goodness-of-Fit, and information theory.
We end the paper with a short conclusion, including a list of concepts in probability theory and the corresponding concepts in expectation theory.

1.2. Terminology

A measure with a total mass of 1 is usually called a probability measure or a normalized measure. We will deviate from this terminology and use the term unital measure for a measure with total mass 1. The term normalized measure will only be used when a unital measure is the result of dividing a finite measure by the total mass of the measure. We will reserve the word probability measure to situations where the weights of a unital measure are used to quantify uncertainty, and it is known that precisely one observation will be made and one can decide which event the observation belongs to in a system of mutually exclusive events that cover the whole outcome space. Similarly, we will talk about an expectation measure if our interpretation of its values are given in terms of expected values of some random variables. Other classes of measures are coding measures that are used in information theory and mixing measures that are unital measures used for barycentric decompositions of points in convex sets.
In standard probability theory, probability measures live on a space often called a sample space, but we will use the alternative term, an outcome space. The word sample will be used informally about the result of a sampling process. The result of a point process will be called an instance of the process.

2. Probability Theory Since Kolmogorov

2.1. Kolmogorov’s Contribution

The modern foundation of probability theory is due to Kolmogorov. He contributed in many ways, and here we shall only focus on the aspects that are most relevant for the present paper. His 1933 paper [1] was in line with two ideas earlier mathematicians developed.
The first idea is symbolic logic, as developed by G. Boole. In this approach to logic, the propositions form a Boolean algebra. A truth assignment function is a binary function that assigns one of the values 0 (false) and 1 (true) to any proposition consistently. In particular, if A is a proposition, either A is assigned the value 1 or ¬A is assigned the value 1. Kolmogorov’s work may be seen as an extension where statements are replaced by events, i.e., measurable sets, and the events $A$ and $A^{\complement}$ are assigned probabilities in $[0,1]$ in such a way that $P(A) + P(A^{\complement}) = 1$. Thus, probability theory can be described as an extension of logic where the functions can take values in $[0,1]$ rather than values in $\{0,1\}$. Such extensions have been formalized as monads to be discussed in Section 2.3, but the theory of monads was only developed much later as part of category theory. See [3,4] for a general discussion of probability monads.
Lebesgue’s theory of measures inspired the second main idea in Kolmogorov’s 1933 paper. Measure theory was used to extend the previous definitions of integrals. The basic idea is that a measure is defined on a set of measurable sets. Such a measure should be countably additive, like the notion of an area. Measure theory has been beneficial for the theory of integration, and in particular, it leads to nice convergence theorems like Lebesgue’s theorem on dominated convergence. Kolmogorov used measure theory to allow similar general convergence results in probability theory.
Basic probability theory is defined on measurable spaces, but several essential theorems do not hold for all measurable spaces. Therefore, it is often assumed that the outcome space is a standard Borel space. Such a space emerges if the measurable sets are the Borel sets of a topology defined by a complete separable metric space. This assumption will cover most applications. A standard Borel space has a one-to-one measurable mapping from the outcome space to the unit interval equipped with the Borel σ -algebra.
The primary objects in Kolmogorov’s probability theory are the outcome space and a probability measure on the outcome space. All probabilities are with respect to this outcome space and this probability measure. This assumption leads to a consistent theory, but it is just assumed that an outcome space and a probability measure exist. Theorems like the Daniell-Kolmogorov consistency theorem (also called Kolmogorov’s extension theorem [[5] page 246, Theorem 1]) make it quite explicit that the existence of an outcome space is a consistency assumption. For a random process $X = (\xi_t)_{t \in T}$ where $T \subseteq \mathbb{R}$ we define the finite-dimensional distribution functions by
$$F_{t_1, \dots, t_n}(x_1, \dots, x_n) = P\{\omega : \xi_{t_1} \le x_1, \dots, \xi_{t_n} \le x_n\}$$
defined for all sets with $t_1 < t_2 < \dots < t_n$. For the random process $X = (\xi_t)_{t \in T}$ the finite-dimensional distribution functions satisfy the Chapman-Kolmogorov equations stated below:
$$\lim_{x_k \to \infty} F_{t_1, \dots, t_n}(x_1, \dots, x_n) = F_{t_1, \dots, \hat{t}_k, \dots, t_n}(x_1, \dots, \hat{x}_k, \dots, x_n),$$
where $\hat{\ }$ indicates an omitted coordinate.
 Theorem 1
(Kolmogorov’s Theorem on the existence of a process). Let $\{F_{t_1, \dots, t_n}(x_1, \dots, x_n)\}$, with $t_i \in T \subseteq \mathbb{R}$, $t_1 < t_2 < \dots < t_n$, $n \ge 1$, be a given family of finite-dimensional distribution functions satisfying the Chapman-Kolmogorov equations (2). Then there exists a probability space $(\Omega, \mathcal{F}, P)$ and a random process $X = (\xi_t)_{t \in T}$ such that
$$P\{\omega : \xi_{t_1} \le x_1, \dots, \xi_{t_n} \le x_n\} = F_{t_1, \dots, t_n}(x_1, \dots, x_n).$$
Later, category theory was developed, and commutative diagrams in category theory are exactly a language for expressing this type of consistency.

2.2. Probabilities or Expectations?

To a large extent, the present paper may be viewed as an extension of the point of view presented by Whittle [6]. By identifying an event with its indicator function, his exposition is formally equivalent to Kolmogorov’s probability theory.
If $X = (\Omega, \mathcal{F})$ is a measurable space, we may define $F(\Omega, \mathcal{F})$ as the set of bounded $\mathcal{F}$-measurable functions $\Omega \to \mathbb{R}$. For any unital measure $\mu$ on $(\Omega, \mathcal{F})$ we may define the functional $E_\mu : F(\Omega, \mathcal{F}) \to \mathbb{R}$ by
$$E_\mu(f) = \int_\Omega f \, d\mu.$$
The functional satisfies that
$$E_\mu(f) \ge 0, \text{ when } f \ge 0;$$
$$E_\mu(1) = 1.$$
Any functional that satisfies these two conditions may be identified with a unital measure.
To describe weak convergence, we are interested in the outcome space as a topological space rather than as a measurable space. A second countable space is also a Lindelöf space, i.e., any open covering has a countable sub-covering. If the measure μ is locally finite, then the whole space has a covering by open sets of finite measure. In particular, the measure μ is σ-finite.
If the space is a locally compact Hausdorff space, then a locally finite measure is the same as a measure that is finite on compact sets. For such spaces, the integral
$$\int_\Omega f(\omega) \, d\mu(\omega)$$
is well-defined for any function f that is continuous with compact support. Radon measures can be identified with positive functionals on $C_c(\Omega)$. With these conditions, the duality between Radon measures and continuous functions with compact support works perfectly without any pathological problems.
On a locally compact Hausdorff space, the finite measures can be identified with positive functionals on the continuous functions on the one-point compactification of the space.

2.3. Probability Theory and Category Theory

For the categorical treatment of probability theory, we first have to recall the definition of a transition kernel [[7] Chapter 1].
 Definition 1.
Let $(\Omega, \mathcal{F})$ and $(S, \mathcal{S})$ be two measurable spaces. A transition kernel $\omega \to \mu_\omega$ from Ω to S is a function
$$\mu : \Omega \times \mathcal{S} \to [0, \infty]$$
such that:
  • For any fixed $B \in \mathcal{S}$ the mapping
    $$\omega \mapsto \mu_\omega(B)$$
    is measurable.
  • For every fixed $\omega \in \Omega$, the mapping
    $$B \mapsto \mu_\omega(B)$$
    is a measure on $(S, \mathcal{S})$.
Let $M_+(\Omega, \mathcal{F})$ denote the measures on $(\Omega, \mathcal{F})$. If $P \in M_+(\Omega, \mathcal{F})$, then a measure on $(S, \mathcal{S})$ is given by
$$B \mapsto \int_\Omega \mu_\omega(B) \, dP(\omega).$$
Thus, a transition kernel may be identified with a mapping $M_+(\Omega, \mathcal{F}) \to M_+(S, \mathcal{S})$ that we will call a transition operator.
If the measure $\mu_\omega$ is a unital measure for every $\omega \in \Omega$, then the transition kernel is called a Markov kernel. The key observation for the categorical treatment of probability theory is that Markov kernels can be composed.
The first to put probability theory into the framework of category theory seems to have been Lawvere [8]. He defined a category Pro that has measurable spaces as objects and Markov kernels as morphisms. In the category Pro the singleton sets are terminal objects, and a probability measure on $(\Omega, \mathcal{F})$ can be identified with a morphism from a singleton set to $(\Omega, \mathcal{F})$.
Later, the theory of monads was introduced in category theory, and monads have had a significant impact on functional programming [9]. The first to describe the category Pro in terms of monads was Giry [10]. First, we consider the category of measurable spaces Maes. It has measurable spaces as objects and measurable maps between measurable spaces as morphisms.
To the measurable space $X = (\Omega, \mathcal{F})$ we associate the set $M_+^1(X)$ of probability measures on X. The set $M_+^1(X)$ is equipped with the smallest σ-algebra such that the maps $\mu \mapsto E_\mu(f)$ are all measurable, where $E_\mu(f) = \int f \, d\mu$. If g is a measurable map from $X_1$ to $X_2$ then a measurable map $M_+^1(g)$ from $M_+^1(X_1)$ to $M_+^1(X_2)$ is defined by
$$M_+^1(g)(E_\mu)(f) = E_\mu(f \circ g),$$
which can also be written as $M_+^1(g)(\phi) = \phi \circ F(g)$, where $\phi = E_\mu$ and $F(g)$ denotes the map $f \mapsto f \circ g$. If f is the indicator function of the measurable set B and the functional $\phi$ is given by the probability measure μ, then we get
$$M_+^1(g)(\mu)(B) = \mu(g^{-1}(B)),$$
which is the usual definition of an induced probability measure. In this way $M_+^1$ is a functor from the category Maes to itself, i.e., an endofunctor.
The morphisms in the category Pro are Markov operators $M_+^1(X_1) \to M_+^1(X_2)$, but Markov operators are given by Markov kernels. It is useful to describe in detail how one can switch between Markov operators and Markov kernels. For this purpose, we introduce a natural transformation δ that translates Markov operators into Markov kernels, and we introduce a natural transformation π that translates Markov kernels into Markov operators.
A measurable space $X = (\Omega, \mathcal{F})$ can be embedded into $M_+^1(X)$ by mapping $\omega \in \Omega$ into the Dirac measure $\delta_\omega$. In this way $\delta_\omega(f) = f(\omega)$. Now δ may be considered as a measurable map, i.e., a morphism in the category Maes, and the following diagram commutes.
(Commutative diagram: naturality square for δ.)
Thus, δ is a natural transformation from the identity functor in the category Maes to the functor $M_+^1$. If $\Psi : M_+^1(X_1) \to M_+^1(X_2)$ is a Markov operator, then the corresponding Markov kernel $\psi : X_1 \to M_+^1(X_2)$ is given by $\psi(x) = \Psi(\delta_x)$.
We will use $(M_+^1)^2$ to denote the functor $M_+^1 \circ M_+^1$. We have a measurable map π from $(M_+^1)^2(X)$ to $M_+^1(X)$ that maps $\mu \in (M_+^1)^2(X)$ to the measure $\pi(\mu)$ determined by
$$E_{\pi(\mu)}(f) = \int_{M_+^1(X)} E_\nu(f) \, d\mu(\nu).$$
Then we have the following commutative diagram
(Commutative diagram: naturality square for π.)
so that π is a natural transformation from $(M_+^1)^2$ to $M_+^1$. If $\psi : X_1 \to M_+^1(X_2)$ is a Markov kernel, then the corresponding Markov operator Ψ is given by $\Psi = \pi \circ M_+^1(\psi)$. Thus, the following diagram commutes.
(Commutative diagram: the Markov operator Ψ as the composition $\pi \circ M_+^1(\psi)$.)
A Markov kernel from $X_1$ to $X_2$ may now be described as a measurable map ψ from $X_1$ to $M_+^1(X_2)$. Composition of Markov kernels is given by
$$\psi_2 \circ \psi_1 = \pi \circ M_+^1(\psi_2) \circ \psi_1$$
and we have the identities
$$\delta \circ \psi = \psi,$$
$$\psi \circ \delta = \psi,$$
$$(\psi_1 \circ \psi_2) \circ \psi_3 = \psi_1 \circ (\psi_2 \circ \psi_3).$$
The first two identities can be translated into the following commutative diagram
(Commutative diagram: the unit laws for δ.)
Whenever this diagram commutes, we say that δ  acts as an identity. Associativity means that the following diagram commutes.
(Commutative diagram: the associativity law for π.)
and we say that the functor $M_+^1$ is associative.
When an endofunctor $M_+^1$ together with two natural transformations δ and π satisfies associativity and δ acts as an identity, we say that $(M_+^1, \delta, \pi)$ forms a monad. For any monad, morphisms from X to $M_+^1(X)$ can be composed by Equation (18), leading to the Kleisli composition of morphisms. For a category with a monad, the Kleisli category of the monad has the same objects as the original category, and as morphisms, it has the Kleisli morphisms. In this way, the category of Markov kernels can be identified with the Kleisli morphisms associated with the monad $(M_+^1, \delta, \pi)$. The Kleisli category is equivalent to the category introduced by Lawvere. The equivalence is established by a functor that maps the object X into $M_+^1(X)$ and maps Kleisli morphisms into their extensions.
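To make the Kleisli composition concrete, the following is a minimal sketch (not taken from the paper) of the same monad structure for finitely supported probability measures, written in Python. The dictionary representation and the names delta, pi and kleisli are our own choices.

```python
# A minimal sketch (not from the paper) of the monad structure described above,
# specialised to finitely supported probability measures on finite sets.
# A measure is a dict {point: weight}; a Markov kernel maps a point to such a dict.

def delta(x):
    """Unit of the monad: the Dirac measure at x."""
    return {x: 1.0}

def pi(meta):
    """Multiplication of the monad: flatten a measure over measures,
    given as a list of (inner_measure, weight) pairs."""
    out = {}
    for inner, w in meta:
        for x, p in inner.items():
            out[x] = out.get(x, 0.0) + w * p
    return out

def kleisli(psi2, psi1):
    """Kleisli composition: psi2 after psi1, i.e. pi applied to M(psi2) of psi1."""
    def composed(x):
        mu = psi1(x)  # a measure on the intermediate space
        return pi([(psi2(y), p) for y, p in mu.items()])
    return composed

# Two toy Markov kernels on {0, 1}.
flip = lambda x: {x: 0.9, 1 - x: 0.1}                  # stay with probability 0.9
noisy = lambda x: {0: 0.5, 1: 0.5} if x else delta(0)  # 1 is randomised, 0 is kept

kernel = kleisli(noisy, flip)
print(kernel(0))                            # approximately {0: 0.95, 1: 0.05}
print(kleisli(flip, delta)(1) == flip(1))   # delta acts as an identity
```

On small examples one can also check that kleisli is associative, which is exactly the content of the commutative diagrams above.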

2.4. Preliminaries on Point Processes

First, we will define a point process with points in S. Typically, S will be a d-dimensional Euclidean space, but S could, in principle, denote any complete separable metric space. Let $\mathcal{S}$ denote the Borel σ-algebra on S. Let $(\Omega, \mathcal{F}, P)$ denote a probability space. A transition kernel $\omega \to \mu_\omega$ from $(\Omega, \mathcal{F})$ to $(S, \mathcal{S})$ is called a point process if
  • For all $\omega \in \Omega$ the measure $\mu_\omega(\cdot) : \mathcal{S} \to [0, \infty]$ is locally finite.
  • For all bounded sets $B \in \mathcal{S}$ the random variable $\omega \mapsto \mu_\omega(B)$ is a count variable.
For further details about point processes, see [11] or [[7] Chapter 3].
The interpretation is that if the outcome is ω then $\mu_\omega$ is a measure that counts how many points there are in various subsets of S, i.e., $\mu_\omega(B)$ is the number of points in the set $B \in \mathcal{S}$. Each measure $\mu_\omega$ will be called an instance of the point process. In the literature on point processes, one is often interested in simple point processes, where $\mu_\omega(B) \le 1$ when B is a singleton. However, point processes that are not simple are also crucial for the problems that will be discussed in this paper.
The definition of a point process follows the general structure of probability theory, where everything is based on a single underlying probability space. This will ensure consistency, but often this probability space has to be quite large if several point processes or many random variables are considered simultaneously.
The measure μ is called the expectation measure of the process $\omega \to \mu_\omega$ if for any $B \in \mathcal{S}$ we have
$$\mu(B) = \int_\Omega \mu_\omega(B) \, dP(\omega).$$
The expectation measure gives the mean value of the number of points in the set B. Different point processes may have the same expectation measure.
A one-point process is a process that outputs precisely one point with probability 1. For a one-point process the expectation measure of the process is simply a probability measure on S. Thus, probability measures can be identified with one-point processes. It is possible to define a monad for point processes [12]. The monad defined in [12] is also related to the observation that the Giry monad is distributive over the multiset monad as discussed in [13]. These results are closely related to the results we will present in the following sections.
A point process can be thinned, meaning each point in an instance $\mu_\omega$ is kept or deleted according to some random procedure. We can do α-thinning for $\alpha \in [0, 1]$ by keeping a point in the sample with probability α and deleting it from the sample with probability $1 - \alpha$. This is done independently for each point in the sample. Thus $\omega \to \nu_\omega$ is an α-thinning of $\omega \to \mu_\omega$ if
$$P\big(\nu_\omega(B) = k \mid \mu_\omega\big) = \binom{\mu_\omega(B)}{k} \alpha^{k} (1 - \alpha)^{\mu_\omega(B) - k}$$
for any measurable set B and any $k \in \mathbb{N}_0$. If $\omega \to \nu_\omega$ is an α-thinning of $\omega \to \mu_\omega$ we write $\nu_\omega = T_\alpha(\mu_\omega)$.
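As a small illustration (ours, not from the paper), α-thinning of a finite instance can be simulated by keeping each point independently; Counter is used as the representation of a multiset.

```python
# Illustration (our own) of alpha-thinning of a finite empirical measure:
# each of the mu(x) points at a location x survives independently with
# probability alpha, so the surviving count is binomially distributed.
import random
from collections import Counter

def thin(mu, alpha, rng=random):
    """One realisation of T_alpha applied to the multiset mu (a Counter)."""
    nu = Counter()
    for point, count in mu.items():
        kept = sum(rng.random() < alpha for _ in range(count))
        if kept:
            nu[point] = kept
    return nu

mu = Counter({"a": 4, "b": 2})   # the empirical measure 4*delta_a + 2*delta_b
print(thin(mu, 0.5))             # e.g. Counter({'a': 2, 'b': 1}); varies by run
```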

2.5. Poisson Distributions and Poisson Point Processes

For $\lambda \ge 0$, the Poisson distribution $Po(\lambda)$ is the probability distribution on $\mathbb{N}_0$ with point probabilities
$$Po(j, \lambda) = \frac{\lambda^j}{j!} \exp(-\lambda).$$
The Poisson distribution may be viewed as a point process on a singleton set. The set of Poisson distributions is closed under addition and thinning:
$$Po(\lambda_1) * Po(\lambda_2) = Po(\lambda_1 + \lambda_2),$$
$$T_\alpha(Po(\lambda)) = Po(\alpha \cdot \lambda).$$
For any locally finite measure μ on $(S, \mathcal{S})$ that has density Λ with respect to the Lebesgue measure, one can define a Poisson point process with intensity Λ as a point process $\omega \to \mu_\omega$ with expectation measure μ [[7] Chapter 3]. The following two properties characterize the Poisson point process.
  • For all $B \in \mathcal{S}$ the random variable $\omega \mapsto \mu_\omega(B)$ is Poisson distributed with mean value $\mu(B)$.
  • If $B_1, B_2 \in \mathcal{S}$ are disjoint, then the random variables $\omega \mapsto \mu_\omega(B_1)$ and $\omega \mapsto \mu_\omega(B_2)$ are independent.
The Equations (27) and (28) also hold if the numbers λ are replaced by measures. The following result was essentially proved by Rényi [14] and Kallenberg [15].
 Theorem 2.
Let P be a unital measure and let $\omega \to \mu_\omega^{(i)}$, $i = 1, 2, \dots$, be independent point processes, each with distribution P and expectation measure $\pi(P) = \mu$. Then $T_{1/n}\big(\sum_{i=1}^{n} \mu_\omega^{(i)}\big)$ converges to $Po(\mu)$ in total variation as $n \to \infty$.
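The statement of Theorem 2 can be explored numerically. The sketch below (our own construction, with an arbitrary one-point process on two letters) superposes n independent copies and thins by 1/n; the resulting counts have mean and variance close to each other, as they should for a Poisson distribution.

```python
# Simulation sketch (our own) of Theorem 2: superposition of n independent
# copies of a one-point process followed by 1/n-thinning is approximately
# a Poisson point process with the same expectation measure.
import random
from collections import Counter

def one_point_instance(rng=random):
    """A one-point process on {a, b} with P(a) = 0.7 and P(b) = 0.3."""
    return Counter({"a": 1}) if rng.random() < 0.7 else Counter({"b": 1})

def thin(mu, alpha, rng=random):
    return Counter({x: sum(rng.random() < alpha for _ in range(c))
                    for x, c in mu.items()})

n, trials = 100, 5000
counts_a = []
for _ in range(trials):
    superposed = sum((one_point_instance() for _ in range(n)), Counter())
    counts_a.append(thin(superposed, 1.0 / n)["a"])

mean = sum(counts_a) / trials
var = sum((c - mean) ** 2 for c in counts_a) / trials
print(mean, var)   # both close to 0.7, as expected for Po(0.7)
```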

2.6. Valuations

Until now, we have presented the results in terms of measure theory. For a more general theory, it is helpful to switch from measures to valuations [16]. Any measure μ satisfies the following properties.
 Strictness
  $\mu(\emptyset) = 0$.
 Monotonicity
 For all subsets $A \subseteq B$, we have $\mu(A) \le \mu(B)$.
 Modularity
 For all subsets A, B,
$$\mu(A) + \mu(B) = \mu(A \cup B) + \mu(A \cap B).$$
For any lattice with a bottom element, a valuation is defined as a function that satisfies strictness, monotonicity, and modularity.
We are mainly interested in valuations that are continuous in the following sense. The possible values of a valuation are lower reals in $[0, \infty]$, i.e., elements of the set $[0, \infty]$ equipped with the topology whose open sets are the lower-bounded intervals $(t, \infty]$. Thus, a function into $[0, \infty]$ is continuous in this sense if it is lower semicontinuous when $[0, \infty]$ carries the usual topology.
 Continuity
  $\mu\big(\sup_\lambda A_\lambda\big) = \sup_\lambda \mu(A_\lambda)$ for any directed net $(A_\lambda)$.
This notion of continuity captures both the inner regularity of a measure and σ-additivity.
A Borel measure restricted to the open sets of a topological space is a valuation. On any complete separable metric space, any continuous valuation on the open sets is the restriction of a Borel measure. It will not make any difference whether we speak of measures or valuations for the applications that we will discuss in Section 5.
If X is a topological space, then V ( X ) denotes the set of continuous valuations on X. The set of valuations has a structure as a topological space, so that V is an endofunctor in the category Top of topological spaces. The functor V defines a monad that is called the extended probabilistic power domain [17]. It is defined in much the same way as the monads defined by Giry.

3. Observations

The outcome space plays a central role in Kolmogorov’s approach to probability. The basic experiment in his theory will result in a single point in the outcome space. The measurable subsets of the outcome space play the same role as the propositions play in a Boolean algebra in logic. The principle of excluded middle in logic states that any proposition is either true or false. Similarly, the basic experiment in Kolmogorov’s theory results in a single point, and these points exclude each other. In this paper, the result of a basic experiment will be a multiset rather than a single point.

3.1. Observations as Multiset Classifications

In computer science, we operate with different data types. A set is a collection of objects, where repetition and order are not relevant. A list is a collection of objects, where order and repetition are relevant. Multisets are unordered, but repetitions are relevant, so objects are counted with multiplicity. We shall discuss the relation between these data types in this subsection.
A review of the theory of multisets can be found in [18]. Monro [19] discusses two ways of defining a multiset, and the distinction between these definitions is important for the present work.
The following example is similar to what can be found in basic textbooks on statistics.
 Example 1.
Consider a list of observations of five animals like (cow, horse, bee, sheep, fly). The list may be represented by a table with a number as a unique key that indicates the order in which we have made the observations.
Key Animal
1 cow
2 horse
3 bee
4 sheep
5 fly
In the present example all the animals are different, and if we are not interested in the order in which we have made the observations, we may represent the observations by the set Ω = {bee, cow, fly, horse, sheep}, where the animals have been displayed in alphabetical order, but in a set the order does not matter.
These animals can be classified as either mammals or insects, leading to the list of observations (mammal, mammal, insect, mammal, insect) or, equivalently, to the following table.
Key Classification
1 mammal
2 mammal
3 insect
4 mammal
5 insect
Since the list contains repetitions, we cannot represent it by the set A = {insect, mammal}. Instead, we may represent it by the multiset ⟨insect, insect, mammal, mammal, mammal⟩.
According to the first definition of a multiset by Monro [19], a multiset is a set Ω with an equivalence relation ≃. If A denotes the set of equivalence classes, then we get a mapping $g : \Omega \to A$. Alternatively, any mapping $g : \Omega \to A$ leads to equivalence classes on Ω. Dedekind was the first to identify a multiset with a function [18]. To each equivalence class we assign the number of elements in the equivalence class. This is called the multiplicity of the equivalence class.
A category Mul with multisets as objects was defined by Monro [19]. An object in the category Mul is a set Ω with an equivalence relation ≃. If $(\Omega_1, \simeq_1)$ and $(\Omega_2, \simeq_2)$ are objects in the category Mul, then a morphism from $(\Omega_1, \simeq_1)$ to $(\Omega_2, \simeq_2)$ is a mapping $f : \Omega_1 \to \Omega_2$ that respects the equivalence relations, i.e., if $x \simeq_1 y$ in $(\Omega_1, \simeq_1)$ then $f(x) \simeq_2 f(y)$ in $(\Omega_2, \simeq_2)$. This category has been studied in more detail in [20].
An equivalence is often based on a partial preorder. Let ⪯ denote a partial preordering on Ω. Then, an equivalence relation ≃ is defined by $a \simeq b$ if and only if $a \preceq b$ and $b \preceq a$. If $A = \Omega/\!\simeq$ then ⪯ induces a partial ordering on A, which we will also denote ⪯.
The downsets (hereditary sets) in $(\Omega, \preceq)$ form a distributive lattice with ∩ and ∪ as lattice operations. If the set A is finite, then the lattice is finite, and if $(A, \preceq)$ satisfies the descending chain condition (DCC) then so does the lattice of downsets. Any finite distributive lattice can be represented by a finite poset [[21] Thm. 9], and a distributive lattice that satisfies DCC can be represented by a poset that satisfies DCC. There is a one-to-one correspondence between the elements of A and the irreducible elements of the lattice. This construction can be viewed as a concept lattice [22] based on the partial ordering ⪯.
The partial preordering ⪯ is an equivalence relation if and only if the lattice generated is a Boolean lattice. Therefore, we get a lattice where one cannot form complements except if ⪯ is an equivalence relation. The shift from Boolean lattices to more general lattices corresponds to shifting from logic with a law of excluded middle to more general classification systems.
The set of downsets in $(\Omega, \preceq)$ is closed under finite intersections and under arbitrary unions. Thus, the downsets in $(\Omega, \preceq)$ form a topology. The continuous functions for this topology are the monotone functions for the ordering. Topological spaces and continuous functions form a category. If some of the equivalence classes in $(\Omega, \preceq)$ have more than one element, then the topology does not satisfy the separation condition $T_0$, but the topology on the equivalence classes always satisfies $T_0$.
Multisets can also be described using σ-algebras, as is done in probability theory. A classification on Ω given by an equivalence relation ≃ or by a partial preordering ⪯ leads to a topology on Ω. This topology generates a Borel σ-algebra on Ω. For any two outcomes $\omega_1, \omega_2 \in \Omega$ we have $\omega_1 \simeq \omega_2$ if and only if $\omega_1 \in B \Leftrightarrow \omega_2 \in B$ for all Borel measurable sets B.

3.2. Observations as Empirical Measures

According to the second definition of a multiset discussed by Monro [19], a multiset is a mapping of a set A into $\mathbb{N}_0$ that gives the multiplicity of each of the elements. This corresponds to the data type in statistics called count data. Here, we will represent such multisets by finite empirical measures.
 Example 2.
The list (mammal, mammal, insect, mammal, insect) can be written as a multiset ⟨insect, insect, mammal, mammal, mammal⟩ or, equivalently, we may represent the multiset by the measure $2 \cdot \delta_{insect} + 3 \cdot \delta_{mammal}$. Alternatively, the multiset can be represented by a table of frequencies.
Classification Frequency
insect 2
mammal 3
The mapping from lists to an empirical measure is called the accumulation map, and it is denoted Acc.
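In code, the accumulation map is just a counting operation; the sketch below (our own, using Python's collections.Counter as the representation of a multiset) reproduces Example 2.

```python
# The accumulation map Acc from Example 2, with Counter as the multiset type.
from collections import Counter

observations = ["mammal", "mammal", "insect", "mammal", "insect"]
acc = Counter(observations)   # Acc(list) = 2*delta_insect + 3*delta_mammal
print(acc)                    # Counter({'mammal': 3, 'insect': 2})

# Concatenation of lists corresponds to addition of empirical measures.
assert Counter(["insect", "cow"]) + Counter(["cow"]) == Counter(["insect", "cow", "cow"])
```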
 Definition 2.
Let $(A, \tau)$ be a topological space. By a finite empirical measure, we understand a finite sum of Dirac measures on the Borel σ-algebra of $(A, \tau)$. The set of finite empirical measures on $(A, \tau)$ will be denoted $M(A, \tau)$ or $M(A)$ for short.
With these definitions, M is an endofunctor on the category Maes of measurable spaces.
When the notion of observation is based on an equivalence, there is an implicit assumption that the elements each have their own identity but, at the same time, are equivalent according to a classification. In quantum physics, particles may be indistinguishable in a way that one would not see in classical physics. The following example shows that there are data sets that the first definition of a multiset cannot handle.
 Example 3.
Young’s experiment, also called the double slit experiment, was first used by Young as a strong argument for the wave-like nature of light. Nowadays, it is often taken as an illustration of how quantum physics fundamentally differs from classical physics. The observations are often described as a paradox, but at least part of the paradox is related to a wrong presentation of the observations.
In its modern form, the experiment uses monochromatic light from a laser. The laser beam is first sent through a slit in a screen. The electromagnetic wave spreads like concentric circles after passing through the first slit, as illustrated in Figure 1. The wave then hits a second screen with two slits. After passing the two slits we get two waves that spread like concentric circles until they hit a photographic film that will display an interference pattern created by the two waves.
Now follows a "paradox" as it is described in many textbooks. The electromagnetic wave is quantized into photons, so if the intensity of the light is low we will only get separate points on the photographic film, but we get the same interference pattern as before. This may be explained as interference between photons passing through the left and the right slit respectively. Now we lower the intensity so much that the photons arrive at the photographic film one by one. After a long time of exposure, we get the same interference pattern as before. This apparently gives a paradox because it is hard to understand how a single photon should pass through both slits and have interference with itself.
A problem with the above description is that the number of photons emitted from a laser is described by a Poisson point process. Even if the intensity is low, one cannot emit a single photon for sure. If the mean number of photons emitted is say 1 then there is still a probability that 0 or 2 (or more) photons are emitted and this will hold even if the intensity is very low. What we observe is a number of dots on the photographic film, which can be described as by a multiset if the photographic film is divided into regions. If we really want to send single photons, it could be done using a single-photon source. For a single-photon source there is no uncertainty in the number of photons emitted, but according to the time-energy uncertainty relation there will still be uncertainty in the energy. The energy E of a photon is given by
$$E = h f = \frac{h c}{\lambda}$$
where h is Planck’s constant, f is the frequency, c is the speed of light, and λ is the wavelength. Thus, uncertainty in the energy means uncertainty in frequency and wavelength. Since the interference pattern observed on the photographic film depends on the wavelength, one would not get the same interference pattern if a single-photon emitter was used instead of coherent light from a laser.
We see that the idea that the observation in terms of a multiset comes from observation of a large number of individual photons is simply wrong and leads to an inconsistent description of the experiment.
We can do the following operations with empirical measures.
  • Addition.
  • Restriction.
  • Inducing.
The first operation we will look at is addition. Let $\ell_1$ and $\ell_2$ denote two lists of observations from the same alphabet A, and let $\ell_1 * \ell_2$ denote the concatenation of the lists. Then $Acc(\ell_1 * \ell_2) = Acc(\ell_1) + Acc(\ell_2)$. Thus, the sum $\mu_1 + \mu_2$ of two empirical measures has an interpretation via merging two datasets together. The corresponding operation for point processes is called superposition. Addition of empirical measures is a way of obtaining an empirical measure without using the accumulation map on a single list.
The next operation we will look at is restriction. If data is described by the empirical measure μ on A and B is a subset of A, then the restriction of μ to subsets of B is an empirical measure on B that we will denote by $\mu|_B$. In probability theory all measures should be unital, so we need the notion of conditional probability, but multisets cannot be normalized, and the concept simplifies to the notion of restriction.
Assume that $g : A \to B$ is a continuous (or measurable) mapping. Then the induced measure $g(\mu)$ is defined by
$$g(\mu)(C) = \mu(g^{-1}(C)),$$
where μ is an empirical measure on A and C is a measurable subset of B. The measure $M(g)(\mu)$ is called the induced measure and is often denoted $g(\mu)$ rather than $M(g)(\mu)$.
One can easily prove that inducing is additive in the measure, and one can prove similar basic results regarding the interaction between addition, restriction, and the creation of induced measures.
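A short sketch (our own, again with Counter as the representation of an empirical measure) shows the three operations side by side; the classification map g is a hypothetical example.

```python
# Addition, restriction and inducing for empirical measures (our own sketch).
from collections import Counter

mu = Counter({"bee": 1, "cow": 2, "sheep": 1})
nu = Counter({"cow": 1, "fly": 3})

# Addition: merging two data sets (superposition for point processes).
print(mu + nu)                   # counts: cow 3, fly 3, bee 1, sheep 1

# Restriction to a subset B of the outcome space.
B = {"cow", "sheep"}
print(Counter({a: n for a, n in mu.items() if a in B}))   # cow 2, sheep 1

# Induced measure under a classification g.
g = {"bee": "insect", "fly": "insect", "cow": "mammal", "sheep": "mammal"}
induced = Counter()
for a, n in mu.items():
    induced[g[a]] += n
print(induced)                   # mammal 3, insect 1
```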

3.3. Categorical Properties of the Empirical Measures and Some Generalizations

Addition, restriction, and inducing can be described in the language of category theory. The existence of addition means that the category of multisets is commutative. Restriction can be characterized as a retraction that is additive. The existence of induced measures means that M is a functor from topological spaces to measure spaces.
If $(A, \tau)$ is a topological space, then we equip $M(A, \tau)$ with the smallest topology such that the maps
$$\mu \mapsto \int f \, d\mu$$
are continuous for all continuous functions f. Here, the integral $\int f \, d\mu$ simplifies to the sum $\sum_{i=1}^{n} f(a_i)$ when $\mu = \sum_{i=1}^{n} \delta_{a_i}$. If we define $M(g)(\mu) = g(\mu)$, then M becomes an endofunctor in the category of topological spaces, and a monad can be defined in exactly the same way as for the Giry monad. Kleisli morphisms map points in a topological space into empirical measures on the topological space. Empirical measures will often be denoted $\mu_\omega$ in order to emphasize that an empirical measure typically is the result of some sampling process.
One can generalize finite empirical measures on a topological space to continuous integer valued valuations on a topological space. These integer valued valuations form a sub-monad of the monad of all continuous valuations.
As we have seen in Section 3.1, classifications may lead to other lattices than Boolean lattices, so it is relevant to discuss integer-valued valuations on more general distributive lattices. We shall work out the theory for finite lattices in full detail below. If $(\Omega, \tau)$ is a finite topological space, then we can define an integer-valued valuation v on the topology τ by
$$v(B) = \#B,$$
where $\#B$ denotes the number of elements in the open set $B \in \tau$.
 Lemma 1.
Let v be an integer-valued valuation on a topological space $(\Omega, \tau)$ and let L denote the lattice of open sets. Let $c \in L$ be some element. Then the restriction $v_c$ given by $v_c(a) = v(a \wedge c)$ is an integer-valued valuation. If $v(c) < \infty$ then $v^c$ given by $v^c(a) = v(a) - v(a \wedge c)$ is also an integer-valued valuation, and $v = v_c + v^c$. If v is continuous, then $v_c$ and $v^c$ are continuous.
 Proof. 
The strictness of $v_c$ and $v^c$ is obvious. The monotonicity of $v_c$ is obvious. To see that $v^c$ is monotone, let $a \le b$ be some elements in the lattice. Then
$$v(a \vee c) \le v(b \vee c),$$
$$v(a) + v(c) - v(a \wedge c) \le v(b) + v(c) - v(b \wedge c),$$
$$v(a) - v(a \wedge c) \le v(b) - v(b \wedge c),$$
$$v^c(a) \le v^c(b).$$
Modularity of $v_c$ is a simple calculation. Modularity of $v^c$ follows because $v^c = v - v_c$. Continuity is obvious. □
Let v denote a valuation on a topological space $(\Omega, \tau)$ with lattice of open sets L. Then $a \in L$ is called an atom of the valuation if $b \le a$ implies that $v(b) = 0$ or $v(b) = v(a)$. An atomic valuation is a valuation that is a sum of valuations for which the whole space Ω is an atom.
 Proposition 1.
Any integer-valued valuation on a topological space $(\Omega, \tau)$ is atomic.
 Proof. 
The proof is by induction on $n = v(\Omega)$. First, the result holds for the trivial valuation with $v(\Omega) = 0$. Assume that the result holds for any valuation with $v(\Omega) \le n$. Let v be a valuation with $v(\Omega) = n + 1$. If Ω is an atom, then v is atomic. If Ω is not an atom, then there exists $c \in L$ such that $0 < v(c) < v(\Omega) = n + 1$. Then $v_c$ and $v^c$ are atomic valuations by the induction hypothesis, and $v = v_c + v^c$, implying that v is atomic. □
 Theorem 3.
Let $(A, \tau)$ denote a finite topological space. If the topology satisfies the separation condition $T_0$, then for any integer-valued valuation v on the lattice of open sets there exists a uniquely determined empirical measure μ on A such that for any open set B we have
$$v(B) = \mu(B).$$
 Proof. 
We have to prove that if the whole space is an atom for the valuation v, then v is given by a uniquely determined multiple of a Dirac valuation. Let $A_0$ denote a minimal atom, i.e., a minimal open set with $v(A_0) = v(A)$. Then
$$v(B) = v(B \cup A_0) + v(B \cap A_0) - v(A_0) = v(A_0) + v(B \cap A_0) - v(A_0) = v(B \cap A_0) = \begin{cases} v(A_0), & \text{if } A_0 \subseteq B; \\ 0, & \text{else.} \end{cases}$$
For $a \in A$ let $\langle a \rangle$ denote the smallest open set that contains a. Then there must exist $a \in A_0$ such that $v(\langle a \rangle) = v(A_0)$, since otherwise one would have $v(A_0) = 0$. Hence we have $A_0 \subseteq \langle a \rangle$, which implies that $A_0 = \langle a \rangle$. Hence,
$$v(B) = \begin{cases} v(A_0), & \text{if } a \in B; \\ 0, & \text{else} \end{cases} = v(A_0) \cdot \delta_a(B).$$
Assume that $\delta_{a_1} = \delta_{a_2}$. Then $\delta_{a_1}(B) = \delta_{a_2}(B)$ for all open sets B. Hence, $a_1 \in B$ if and only if $a_2 \in B$. Since the topology satisfies $T_0$, we must have $a_1 = a_2$. □
As a consequence of the theorem, any integer-valued valuation v on a finite distributive lattice can be represented by a finite set Ω with a topology such that the multiset obtained by mapping Ω to equivalence classes A satisfies (33).
For applications in statistics, the main example of a lattice is the topology of a complete separable metric space.
 Theorem 4.
Let μ denote an integer-valued valuation on the topology of a complete separable metric space. Then μ is a simple valuation, i.e., there exist integers $s_1, s_2, \dots, s_n \in \mathbb{N}_0$ and letters $a_1, a_2, \dots, a_n \in A$ such that
$$\mu = \sum_{i=1}^{n} s_i \cdot \delta_{a_i}.$$
 Proof. 
Based on a result of Topsøe [[23] Thm. 3], Manilla has proved that any valuation on the topology of a metric space can be extended to a measure μ on the σ-algebra of Borel sets of the metric space [[24] Cor. 4.5]. When the metric space is separable and complete, we may, without loss of generality, assume that the metric space is $[0, 1]$ and let X denote the identity on $[0, 1]$. Then $F(x) = \mu(X \le x)$ is increasing and integer-valued. Therefore, F is a staircase function with a finite number of steps. The measure μ will have a point mass at each step and no mass in between. Hence, μ is simple. □

3.4. Lossless Compression of Data

Bayesian statistics focuses on single outcomes of experiments, and the frequentist interpretation focuses on infinite i.i.d. sequences. Information theory takes a position in between. In information theory, the focus is on extendable sequences rather than on finite or infinite sequences [25]. In lossless source coding, we consider a sequence of observations in the source alphabet A, i.e., an observation is an element in $A^n$ where n is some natural number. We want to encode the letters in the source alphabet by sequences of letters in an output alphabet B of size β. In lossless coding, the encoding should be uniquely decodable. Further, the encoding should be such that a concatenation of source letters is encoded as the corresponding concatenation of output letters. We require that not only $A^n$ is uniquely decodable, but any (finite) string in $A^*$ should be encoded into a string in $B^*$ in a unique way. If the code $\kappa : A \to B^*$ is uniquely decodable, then it satisfies Kraft’s inequality
$$\sum_{a \in A} \beta^{-\|\kappa(a)\|} \le 1$$
where $\|\kappa(a)\|$ denotes the length of the code word $\kappa(a)$ [[26] Thm. 5.2.1]. Instead of encoding single letters in A into $B^*$ we may do it as block coding, where a block in $A^n$ is mapped to a string in $B^*$ via a mapping κ. The following theorem is a kind of converse to Kraft’s inequality for block coding.
 Theorem 5
([25]). Let $\ell : A \to \mathbb{R}$ denote a function. Then the function ℓ satisfies Kraft’s inequality (42) if and only if for all $\epsilon > 0$ there exist an integer n and a uniquely decodable code $\kappa : A^n \to B^*$ such that
$$\bar{\ell}(\kappa(a_1^n)) \le \frac{1}{n} \sum_{i=1}^{n} \ell(a_i) + \epsilon$$
where $\bar{\ell}(\kappa(a_1^n))$ denotes the length $\|\kappa(a_1^n)\|$ divided by n.
Using this theorem, we can identify uniquely decodable codes with code length functions that satisfy Kraft’s inequality. There is a correspondence between codes and sub-unital measures given by
$$\mu(a) = \beta^{-\|\kappa(a)\|}.$$
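As a quick numerical illustration (our own toy code, not the paper's), a binary prefix code satisfies Kraft's inequality, and the associated sub-unital measure is obtained directly from the code word lengths.

```python
# Kraft's inequality and the correspondence mu(a) = beta^(-|kappa(a)|)
# for a hypothetical binary prefix code (our own example).
kappa = {"a": "0", "b": "10", "c": "110", "d": "111"}   # a prefix code, beta = 2
beta = 2

mu = {a: beta ** (-len(word)) for a, word in kappa.items()}
print(mu)                       # {'a': 0.5, 'b': 0.25, 'c': 0.125, 'd': 0.125}
print(sum(mu.values()) <= 1)    # True: Kraft's inequality (here with equality)
```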
Now, the goal in lossless source coding is to code with a code length that is as short as possible. If a source string has empirical measure μ and the code length function ℓ is used, then the total code length is
$$\sum_{a} \ell(a)\, \mu(a).$$
Our goal is to minimize the total code length, and this is achieved by the code length function
$$\ell(a) = -\log_\beta \frac{\mu(a)}{n},$$
where $n = \mu(A)$ denotes the total number of observations.
If a code with this code length function is used, then the total code length will be $n \cdot H(\mu/n)$, where H denotes the Shannon entropy of a probability measure. We can define the entropy of any finite discrete measure by
$$H(\mu) = \mu(A) \cdot H\!\left(\frac{\mu}{\mu(A)}\right).$$
With this definition one can easily prove that if $g : A \to B$ is a measurable mapping and μ is a measure on A, then the following chain rule holds:
$$H(\mu) = H(g(\mu)) + \sum_{b \in B} H\!\left(\mu|_{g^{-1}(b)}\right).$$
The chain rule reflects that coding the result in A can be done by first coding the result in B and then coding the result in A restricted to the subset of letters in A that map to the observed letter in B.
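The definition of the entropy of a finite measure and the chain rule can be checked numerically; the following sketch (our own code, with a hypothetical classification map g) verifies the identity on a small example.

```python
# Numerical check (our own) of H(mu) = mu(A) * H(mu / mu(A)) and the chain rule.
from math import log

def entropy(mu):
    """Entropy of a finite discrete measure given as a dict {point: mass}."""
    total = sum(mu.values())
    return -sum(m * log(m / total) for m in mu.values() if m > 0)

mu = {"bee": 1, "fly": 1, "cow": 2, "sheep": 1}
g = {"bee": "insect", "fly": "insect", "cow": "mammal", "sheep": "mammal"}

g_mu = {}
for a, m in mu.items():
    g_mu[g[a]] = g_mu.get(g[a], 0) + m            # the induced measure g(mu)

chain = entropy(g_mu) + sum(
    entropy({a: m for a, m in mu.items() if g[a] == b}) for b in g_mu
)
print(entropy(mu), chain)   # the two values agree up to rounding
```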
One could proceed to explore the correspondence between measures and codes, as is done in the minimum description length (MDL) approach to statistics, and much of the content of Section 4 could be treated using MDL. For instance, the number of source strings of length n is $\alpha^n$, where α denotes the size of the source alphabet, so it grows exponentially, but the number of different multisets of size n is upper bounded by $(n+1)^\alpha$, which grows like a polynomial in n. This fact has important consequences related to the maximum entropy principle, and in the information theory literature [27] it is called the method of types, where type is another word for multiset.
In order to emphasize the distinction between probability measures and expectation measures, we will not elaborate on this approach in the present exposition. Here we will just emphasize that Kraft’s inequality and Equation (46) are some of the few examples where a theorem states that unital measures play a special role beyond the fact that all finite measures can be normalized.

3.5. Lossy Compression of Data

If an information source is compressed to a rate below the entropy of the source, the source letters cannot be reconstructed from the output letters. In this case one will experience a loss in the description of the source, and methods for minimizing the loss are handled by rate distortion theory. In rate distortion theory, we introduce a distortion function $d : A \times \hat{A} \to \mathbb{R}$ that quantifies how much is lost if $a \in A$ was sent and it was reconstructed as $\hat{a} \in \hat{A}$. If $\hat{A} \subseteq A$ then d may be a metric or a function of a metric. As demonstrated in the papers [28,29,30], many aspects of statistical analysis can be handled by rate distortion theory. This involves cluster analysis, outlier detection, testing Goodness-of-Fit, estimation, and regression. Important aspects of statistics can be treated using rate distortion theory, but in order to keep the focus on the distinction between probability measures and expectation measures, we will not go into further details regarding rate distortion theory.

4. Expectations

Multisets, empirical measures, and integer-valued valuations are excellent for descriptive statistics, but these concepts describe neither randomness, nor sampling, nor expectations. In this section, we will discuss more general categories where these concepts are incorporated.

4.1. Simple Expectation Measures

Let Ω denote a large outcome space and let $\mu_\omega$ denote the empirical measure on the set A if the outcome is ω. The empirical measure $\mu_\omega$ can be described as a list of frequencies or as a multiset. Assume that the size of the multiset is $N = \mu_\omega(A)$. The measure $\mu_\omega$ may describe a sample from a population, but it may also describe the whole population, in which case the subscript ω is not needed. In modern statistics various resampling techniques play an important role, and for this reason we will keep the subscript in order to describe both sampling and resampling. Now we take a sample of size n from the population.
The simplest situation is when $n = 1$. If $B \subseteq A$ then $\frac{1}{N} \cdot \mu_\omega(B)$ is the probability that a randomly selected point from the multiset described by $\mu_\omega$ belongs to B. The unital measure $\frac{1}{\mu_\omega(A)} \cdot \mu_\omega$ is the empirical distribution. The empirical measure $\mu_\omega$ gives a table of frequencies and the empirical distribution gives a table of relative frequencies.
Next, consider the situation when we take a sample of size $n > 1$ from the population. There are different ways of taking a sample of size n. One may sample with replacement or without replacement. These two basic sampling schemes are described by the multinomial distribution and by the multivariate hypergeometric distribution, respectively. For both sampling methods, the mean number of observations in a set B is given by $\frac{n}{N} \cdot \mu_\omega(B)$. Thus, the expected values are described by the measure $\alpha \cdot \mu_\omega$ where $\alpha = n/N$. Here we have scaled the measure $\mu_\omega$ by a factor $\alpha \in [0, 1]$, and this leads to a measure that is not given by a multiset.
Consider a sample described by an empirical measure $\mu_\omega$ with sample size $N = \mu_\omega(A)$. For cross validation, we may randomly partition the sample into a number of subsamples. One may then check if a conclusion based on a statistical analysis of one of the subsamples is the same as if another subsample had been taken. If there are k random subsamples, then the expected number of observations in B is $\frac{1}{k} \cdot \mu_\omega(B)$ and a random subsample may be described by the measure $\alpha \cdot \mu_\omega$ where $\alpha = 1/k$.
In bootstrap resampling one selects n objects from a sample of size n, but this is done with replacement. If the sample is described by the measure $\mu_\omega$, then the mean number of observations in B is $\mu_\omega(B)$. Thus, bootstrap resampling corresponds to the measure $\alpha \cdot \mu_\omega$ where $\alpha = 1$. We see that although multiplying by 1 does not change the measure, it may reflect a non-trivial resampling procedure.
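The claims about expected counts under the different sampling schemes are easy to check by simulation; the sketch below (our own, with an arbitrary small population) compares sampling without replacement, sampling with replacement, and bootstrap resampling.

```python
# Simulation sketch (our own): expected counts in a set B under sampling
# without replacement, with replacement, and bootstrap resampling.
import random

population = ["a"] * 6 + ["b"] * 4       # mu = 6*delta_a + 4*delta_b, N = 10
B, n, N, trials = {"a"}, 5, 10, 20000

def count_in_B(sample):
    return sum(1 for x in sample if x in B)

without = sum(count_in_B(random.sample(population, n)) for _ in range(trials)) / trials
with_repl = sum(count_in_B(random.choices(population, k=n)) for _ in range(trials)) / trials
bootstrap = sum(count_in_B(random.choices(population, k=N)) for _ in range(trials)) / trials

print(without, with_repl)   # both approximately (n / N) * mu(B) = 0.5 * 6 = 3.0
print(bootstrap)            # approximately mu(B) = 6.0
```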
In general, we may perform α-thinning of a multiset. This is done by keeping each point with probability α and deleting it with probability $1 - \alpha$. The preservation/deletion of observations is done independently for each observation. For values of α in $[0, 1] \cap \mathbb{Q}$ we can implement α-thinning using a random number generator that gives a uniform distribution on a finite set. Such random numbers can be created by rolling a die, drawing a card from a deck, or a similar physical procedure. The set $[0, 1] \cap \mathbb{Q}$ is not complete, which is inconvenient for formulating various theorems. Therefore, we will also allow irrational values of α so that α can assume any value in $[0, 1]$.
We discussed addition of measures in the previous section. In particular, we may add n copies of the measure $\mu_\omega$ together to obtain the measure $n \cdot \mu_\omega$. Then we may sample from $n \cdot \mu_\omega$ by thinning by some factor $\alpha \in [0, 1]$ so that we obtain the measure $\alpha \cdot (n \cdot \mu_\omega) = (\alpha \cdot n) \cdot \mu_\omega$. In this way, we may multiply a measure by any positive number. In general, there will be many different ways of implementing a multiplication by the positive number t:
  • There are many ways of writing t as a product $\alpha \cdot n$ where $\alpha \in [0, 1]$ and n is an integer.
  • There are many different sampling schemes that will lead to a multiplication by α.
  • There are many ways of generating the randomness that is needed to perform the sampling.
Although there are many ways of implementing a multiplication of the measure μ ω by the number t 0 , it is often sufficient to know the resulting measure t · μ ω rather than details about how the multiplication is implemented. In Section 4.3 we will introduce a kind of default way of implementing the multiplication.
By allowing multiplication by positive numbers we can obtain any finite measure concentrated on a finite set, i.e., measures of the form
$$\mu = \sum_{a} s_a \cdot \delta_a.$$
The set $M_+^{fin}(A)$ is defined as the set of finitely supported finite measures on A. The finitely supported finite measures are related to the empirical measures in exactly the same way as probability measures are related to truth assignment functions in logic.

4.2. Categorical Properties of the Expectation Measures and Some Generalizations

We want to study a monad that allows us to work with both empirical measures and unital measures, as is done in probability theory. The set $M_+^{fin}(A)$ is defined as the set of finitely supported finite measures on A. If $P \in M_+^{fin}(M(A))$ is a probability measure and the outcome space is $\Omega = M(A)$, then $\omega \to \mu_\omega$ is formally a point process with points in A. The expectation measure $\pi(P)$ of the point process $\omega \to \mu_\omega$ is given by the map from $M_+^{fin}(M(A))$ to $M_+^{fin}(A)$ defined by
$$\pi(P)(f) = \int_{M(A)} \sum_{a \in A} f(a)\, \mu_\omega(a) \, dP(\omega).$$
By linearity, the transformation π defined by Equation (50) extends to a natural transformation from $M_+^{fin}(M_+^{fin}(A))$ to $M_+^{fin}(A)$.
As before, we let δ denote the natural transformation that maps $a \in A$ into $\delta_a$, i.e., the Dirac measure concentrated in a. It is straightforward to check that $(M_+^{fin}, \delta, \pi)$ forms a monad. The Kleisli morphisms generated by this monad will generate a category that we will call the finite expectation category, and we will denote it by Fin.
Finite lattices can be represented by finite topological spaces, and for these spaces the theory is simple.
 Theorem 6.
Let $(A, \tau)$ denote a finite topological space. If the topology satisfies the separation condition $T_0$, then for any finite valuation v on τ there exists a finite expectation measure μ on the Borel σ-algebra of A such that for any open set B we have
$$v(B) = \mu(B).$$
 Proof. 
The proof is almost a word-by-word repetition of the proof of Theorem 3. □
All finite expectation measures are continuous valuations on a topological space. The monad of continuous valuations on topological spaces is important because it includes all probability measures on complete separable metric spaces, which form the most used setting in probability theory.

4.3. The Poisson Interpretation

Let μ denote a discrete measure such that
$$\mu = \sum_{a \in A} s_a \cdot \delta_a$$
where $A = \{a \mid \mu(a) > 0\}$. Then the Poisson point process $Po(\mu)$ given by the product
$$\bigotimes_{a \in A} Po(s_a)$$
is a point process with expectation measure μ. Thus, any discrete measure has an interpretation as the expectation measure of a discrete Poisson point process (see Section 2.5). This interpretation will be called the Poisson interpretation, and it can be used to translate properties and results for (non-unital) expectation measures into properties and results for probability measures.
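An instance of such a discrete Poisson point process is easy to generate: draw an independent Poisson count for every point in the support of μ. The sketch below (our own code; the inversion sampler is a standard construction, not taken from the paper) does exactly this.

```python
# Sampling sketch (our own) for the Poisson interpretation: an instance of
# Po(mu) for a finite discrete measure mu is a tuple of independent Poisson
# counts, one for each point in the support of mu.
import math
import random
from collections import Counter

def poisson(lam, rng=random):
    """Draw from Po(lam) by inverting the cumulative distribution function."""
    u, k = rng.random(), 0
    p = math.exp(-lam)        # P(Po(lam) = 0)
    cdf = p
    while u > cdf:
        k += 1
        p *= lam / k
        cdf += p
    return k

def sample_po(mu, rng=random):
    """One instance of the Poisson point process with expectation measure mu."""
    return Counter({a: poisson(s, rng) for a, s in mu.items()})

mu = {"a": 2.5, "b": 0.4}     # the expectation measure 2.5*delta_a + 0.4*delta_b
print(sample_po(mu))          # e.g. Counter({'a': 3, 'b': 0}); varies by run
```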
A. Rényi was the first to point out that a Poisson process has extreme properties related to information theory [31], and entropy for point processes was later studied by McFadden [32]. Their results were formulated for simple point processes. Here we will look at some results for processes supported on a finite number of points. If we thin a one-point process, we will get a process given by the expectation measure $P = \sum_i p_i \cdot \delta_i$ where $\sum_i p_i \le 1$ and $1 - \sum_i p_i$ is the void probability, i.e., the probability of getting no point. Here we shall just present two results that are generalizations of similar results in [33,34,35]. We say that a point process is a multinomial sum if it is a sum of independent thinned one-point processes. Let $Be(\mu)$ denote the set of sums of independent thinned one-point processes with expectation measure μ and let $Be^*(\mu)$ denote the total variation closure of $Be(\mu)$. As a consequence of Theorem 2, the distribution $Po(\mu)$ lies in $Be^*(\mu)$. The following results can be proved in the same way as a corresponding result in [33] was proved.
 Theorem 7.
The maximum entropy process in $Be^*(\mu)$ is the Poisson point process $Po(\mu)$.
The following result states that a homogeneous process has greater entropy than an inhomogeneous process with the same mean number of points. This is a point process version of the result that the uniform distribution is the distribution with maximal entropy on a finite set.
 Theorem 8.
Let A be a set with m elements. Then the Poisson point process $Po(\mu)$ that has maximum entropy under the condition that $\mu(A) = \lambda$ is the process where $\mu(a) = \lambda/m$ for all $a \in A$.
 Proof. 
We note that
$$H(Po(\mu)) = H\!\left(\bigotimes_{a} Po(\mu(a))\right) = \sum_{a} H(Po(\mu(a)))$$
so it is sufficient to prove that the sum is Schur concave under the condition that $\sum_a \mu(a) = \lambda$.
Let $\mu_1, \mu_2 > 0$ and let $X_i \sim Po(\mu_i)$; then
$$H(X_1) + H(X_2) = H(X_1, X_2) = H(X_1 + X_2) + H(X_1, X_2 \mid X_1 + X_2) = H(Po(\mu_1 + \mu_2)) + \sum_{n=0}^{\infty} H\!\left(bin\!\left(n, \tfrac{\mu_1}{\mu_1 + \mu_2}\right)\right) \cdot \frac{(\mu_1 + \mu_2)^n}{n!} \exp(-(\mu_1 + \mu_2)),$$
where $bin(n, p)$ denotes the binomial distribution with number parameter n and success probability p. For a fixed value of $\mu_1 + \mu_2$ we have to maximize $H\!\left(bin\!\left(n, \tfrac{\mu_1}{\mu_1 + \mu_2}\right)\right)$. The maximum is achieved when $\tfrac{\mu_1}{\mu_1 + \mu_2} = \tfrac{1}{2}$, i.e., $\mu_1 = \mu_2$. To see that $H(bin(n, p))$ is maximal for $p = 1/2$ it is sufficient to note that $p \mapsto H(bin(n, p))$ is a concave function, which follows from [36]. □
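The Schur concavity argument can be illustrated numerically; the sketch below (our own code, with a truncated series for the Poisson entropy) shows that for a fixed total intensity the sum of the two Poisson entropies is largest when the intensities are equal.

```python
# Numerical illustration (our own) of the proof of Theorem 8: for a fixed total
# intensity, H(Po(mu1)) + H(Po(mu2)) is maximal when mu1 = mu2.
import math

def poisson_entropy(lam, terms=200):
    """Entropy of Po(lam), computed from a truncated series."""
    h, p = 0.0, math.exp(-lam)        # p = P(Po(lam) = k), starting at k = 0
    for k in range(terms):
        if p > 0:
            h -= p * math.log(p)
        p *= lam / (k + 1)
    return h

total = 4.0
for mu1 in (0.5, 1.0, 2.0, 3.0):
    mu2 = total - mu1
    print(mu1, mu2, poisson_entropy(mu1) + poisson_entropy(mu2))
# The printed sum is largest for mu1 = mu2 = 2.0.
```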
The Poisson interpretation does not only work for finite expectation measures. The following result is relevant for valuations on topological spaces.
 Theorem 9.
Let μ denote a σ-finite measure on a complete separable metric space. Then μ is an expectation measure of a Poisson point process.
 Proof. 
Since the measure μ is σ-finite, it can be written as $\mu = \sum_{i=1}^{\infty} \mu_i$, where $\mu_i$ is a finite measure for all i. Each $\mu_i$ can be written as a sum of a discrete measure and a continuous measure. The discrete measure is the expectation measure of a discrete product of Poisson distributions. The continuous measure is the expectation measure of a simple Poisson point process. Each of these is a Poisson point process. The result follows because a countable sum of independent Poisson point processes is a Poisson point process. □
It is also possible to construct Poisson point processes on finite topological spaces.
 Theorem 10.
Let v denote a valuation on the topology of a finite set A. Assume that the topology satisfies the separation condition $T_0$. Then there exists an outcome space Ω with a probability measure P and a transition kernel $\omega \to \mu_\omega$ from Ω to A such that
  • $\mu_\omega(B)$ is Poisson distributed for any open set B.
  • For any open sets A, B, C the random variable $\mu_\omega(A)$ is independent of the random variable $\mu_\omega(B)$ given the random variable $\mu_\omega(C)$ if and only if $A \cap B \subseteq C$.
Further, for any open set B we have
$$v(B) = \int_\Omega \mu_\omega(B) \, dP(\omega).$$
 Proof. 
First we determine the measure μ on A such that $v(B) = \mu(B)$ for any open set B. Then we construct independent Poisson distributed random variables $X_a$ such that $X_a \sim Po(\mu(a))$. Then, for any open set B, we define a random variable
$$Y(B) = \sum_{a \in B} X_a.$$
With these definitions
$$E[Y(B)] = E\!\left[\sum_{a \in B} X_a\right] = \sum_{a \in B} E[X_a] = \sum_{a \in B} \mu(a) = \mu(B) = v(B).$$
The conditional independence follows from the construction. □
It is worth noting that, for any lattice, the relation $A \wedge B \le C$ defines a separoid relation (an abstract notion of independence) if and only if the lattice is distributive. For a distributive lattice the relation $A \wedge B \le C$ can also be written as $(A \vee C) \wedge (B \vee C) = C$, and this relation is a separoid relation if and only if the lattice is modular (see [37] and [[38] Cor. 2]).

4.4. Normalization, Conditioning and Other Operations on Expectation Measures

Empirical measures can be added, restricted to subsets, and mapped forward to induced measures. The same formulas define these operations on expectation measures, but we are interested not only in the formulas but also in their probabilistic interpretations.
First we note that any $\sigma$-finite measure $\mu$ is the expectation measure of some point process. Assume that $\mu = \sum_{i=1}^{\infty} \mu_i$ for some finite measures $\mu_i$. Then
$$\mu = \sum_{i=1}^{\infty} \frac{1}{2^i} \cdot \nu_i,$$
where $\nu_i = 2^i \cdot \mu_i$ are finite measures. Thus, $\mu$ is a probabilistic mixture of finite measures. Therefore, we just have to prove that any finite measure $\nu$ is the expectation measure of a point process. The normalized measure $\nu / \nu(\mathcal{A})$ has an interpretation as a probability measure, which is the same as a one-point process. We may add $n$ independent copies of this process to obtain a process with expectation measure $n \cdot \nu / \nu(\mathcal{A})$. If $n \ge \nu(\mathcal{A})$, then this process can be thinned to get a process with expectation measure $\nu$. The following proposition gives probabilistic interpretations of addition, restriction, and inducing for expectation measures via the same operations applied to empirical measures. The equations are proved by simple calculations.
 Proposition 2.
Let $(\Omega, \mathcal{F}, P)$ be a probability space. Let $\omega \mapsto \mu_\omega$ and $\omega \mapsto \nu_\omega$ denote point processes with expectation measures $\mu$ and $\nu$ and with points in $\mathcal{A}$. Let $A$ be a subset of $\mathcal{A}$ and let $f: \mathcal{A} \to \mathcal{B}$ be some mapping. Then
$$\mu + \nu = \int (\mu_\omega + \nu_\omega) \, dP(\omega),$$
$$\mu|_A = \int \mu_\omega|_A \, dP(\omega),$$
$$f(\mu) = \int f(\mu_\omega) \, dP(\omega),$$
respectively.
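For finite multisets these operations amount to elementary bookkeeping. The following Python sketch (the multisets and the map $f$ are made up) represents empirical measures as collections.Counter objects and performs addition, restriction, and the induced measure.

```python
from collections import Counter

# Two empirical measures, i.e. multisets of observed points.
mu_emp = Counter({"a": 2, "b": 1})
nu_emp = Counter({"b": 3, "c": 1})

# Addition of measures.
total = mu_emp + nu_emp                      # Counter({'b': 4, 'a': 2, 'c': 1})

# Restriction to a subset A of the base set.
A = {"a", "b"}
restricted = Counter({x: n for x, n in mu_emp.items() if x in A})

# Induced (pushforward) measure under a mapping f.
f = {"a": "x", "b": "x", "c": "y"}
induced = Counter()
for x, n in mu_emp.items():
    induced[f[x]] += n                       # Counter({'x': 3})

print(total, restricted, induced)
```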
If $\mu$ is an expectation measure on the set $\mathcal{A}$, then $\frac{1}{\mu(\mathcal{A})} \cdot \mu$ is a unital measure that we will call the normalized measure. Unital measures are normally called probability measures, and our aim is to give a probabilistic interpretation of the normalized measure $\frac{1}{\mu(\mathcal{A})} \cdot \mu$ by specifying an event that has probability equal to $\frac{\mu(B)}{\mu(\mathcal{A})}$.
 Theorem 11.
Let $B$ be a subset of $\mathcal{A}$. Let $P$ denote a probability measure on $\Omega$ and assume that $\omega \mapsto \mu_\omega$ is a Poisson point process with expectation measure $\mu$. Then
$$\frac{\mu(B)}{\mu(\mathcal{A})} = \int \frac{\mu_\omega(B)}{\mu_\omega(\mathcal{A})} \, dP(\omega \mid \mu_\omega(\mathcal{A}) > 0).$$
 Proof. 
We take the mean of the empirical distribution with respect to P.
$$\int \frac{\mu_\omega(B)}{\mu_\omega(\mathcal{A})} \, dP(\omega \mid \mu_\omega(\mathcal{A}) > 0) = \sum_{n=1}^{\infty} \int \frac{\mu_\omega(B)}{\mu_\omega(\mathcal{A})} \, dP(\omega \mid \mu_\omega(\mathcal{A}) = n) \cdot P(\mu_\omega(\mathcal{A}) = n \mid \mu_\omega(\mathcal{A}) > 0) = \sum_{n=1}^{\infty} \int \frac{\mu_\omega(B)}{n} \, dP(\omega \mid \mu_\omega(\mathcal{A}) = n) \cdot P(\mu_\omega(\mathcal{A}) = n \mid \mu_\omega(\mathcal{A}) > 0) = \sum_{n=1}^{\infty} \frac{\int \mu_\omega(B) \, dP(\omega \mid \mu_\omega(\mathcal{A}) = n)}{n} \cdot P(\mu_\omega(\mathcal{A}) = n \mid \mu_\omega(\mathcal{A}) > 0).$$
Now the random variable $\mu_\omega(B)$ is Poisson distributed with mean $\mu(B)$, the random variable $\mu_\omega(\mathcal{A} \setminus B)$ is Poisson distributed with mean $\mu(\mathcal{A} \setminus B)$, and these two random variables are independent. When we condition on $\mu_\omega(B) + \mu_\omega(\mathcal{A} \setminus B) = n$, the distribution of $\mu_\omega(B)$ is binomial with mean $n \cdot \frac{\mu(B)}{\mu(\mathcal{A})}$. Hence
$$\int \frac{\mu_\omega(B)}{\mu_\omega(\mathcal{A})} \, dP(\omega \mid \mu_\omega(\mathcal{A}) > 0) = \sum_{n=1}^{\infty} \frac{n \cdot \frac{\mu(B)}{\mu(\mathcal{A})}}{n} \cdot P(\mu_\omega(\mathcal{A}) = n \mid \mu_\omega(\mathcal{A}) > 0) = \sum_{n=1}^{\infty} \frac{\mu(B)}{\mu(\mathcal{A})} \cdot P(\mu_\omega(\mathcal{A}) = n \mid \mu_\omega(\mathcal{A}) > 0) = \frac{\mu(B)}{\mu(\mathcal{A})}.$$
The theorem states that $\frac{\mu(B)}{\mu(\mathcal{A})}$ is the probability of observing a point in $B$ if one first observes a multiset of points as an instance of a point process and then randomly selects one of the points from the multiset. This is not very different from random matrix theory, where one first calculates all eigenvalues of a random matrix and then randomly selects one of the eigenvalues. Wigner's semicircular law states that such a randomly selected eigenvalue from a large random matrix has a distribution that approximately follows a semicircular law.
Proposition 2 holds for all point processes, but Theorem 11 requires that the point process is a Poisson point process. The following example shows that there are point processes for which Equation (63) does not hold.
 Example 4.
Let $\Omega = \{\omega_1, \omega_2\}$ and assume that $P(\omega_1) = P(\omega_2) = 1/2$. Let $\mathcal{A} = \{a, b\}$, and let $\omega \mapsto \mu_\omega$ denote a process where $\mu_{\omega_1} = 1 \cdot \delta_b$ and $\mu_{\omega_2} = 2 \cdot \delta_a + 1 \cdot \delta_b$. The expectation measure of this process is $\mu = 1 \cdot \delta_a + 1 \cdot \delta_b$. Let $B = \{b\}$. Then the left-hand side of Equation (63) evaluates to
$$\frac{\mu(B)}{\mu(\mathcal{A})} = \frac{1}{2}.$$
The right-hand side of Equation (63) evaluates to
$$\int \frac{\mu_\omega(B)}{\mu_\omega(\mathcal{A})} \, dP(\omega \mid \mu_\omega(\mathcal{A}) > 0) = 1 \cdot \frac{1}{2} + \frac{1}{3} \cdot \frac{1}{2} = \frac{2}{3}.$$
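The contrast between Theorem 11 and Example 4 is easy to check by simulation. In the sketch below the intensities and the number of repetitions are arbitrary choices; the Poisson point process yields an average ratio of about $1/2$, while the two-point process of Example 4 yields about $2/3$.

```python
import numpy as np

rng = np.random.default_rng(1)
n_sim = 200_000

# Poisson point process on {a, b} with expectation measure mu = delta_a + delta_b:
# the conditional mean of mu_omega(B)/mu_omega(A) should be mu(B)/mu(A) = 1/2.
counts = rng.poisson(lam=[1.0, 1.0], size=(n_sim, 2))
total = counts.sum(axis=1)
nonzero = total > 0
print(np.mean(counts[nonzero, 1] / total[nonzero]))    # approximately 0.5

# The two-point process of Example 4: the same conditional mean is 2/3, not 1/2.
omega = rng.integers(2, size=n_sim)                     # omega_1 or omega_2 with probability 1/2
counts = np.where(omega[:, None] == 0, [0, 1], [2, 1])  # mu_{omega_1} = delta_b, mu_{omega_2} = 2 delta_a + delta_b
total = counts.sum(axis=1)
print(np.mean(counts[:, 1] / total))                    # approximately 2/3
```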
The Poisson interpretation of normalized expectation measures carries over to conditional measures.
 Corollary 1.
Let $P$ denote a probability measure on $\Omega$ and assume that $\omega \mapsto \mu_\omega$ is a Poisson point process with expectation measure $\mu$. Let $A$ and $B$ be subsets of $\mathcal{A}$ with $\mu(A) > 0$. Then
$$\mu(B \mid A) = \int \mu_\omega(B \mid A) \, dP(\omega \mid \mu_\omega(A) > 0).$$
 Proof. 
A conditional measure is the normalization of an expectation measure restricted to a subset:
$$\mu(B \mid A) = \frac{\mu(B \cap A)}{\mu(A)} = \frac{\mu|_A(B)}{\mu|_A(A)}.$$
The corollary is proved by applying Theorem 11 to the measure $\mu|_A$. The condition $\mu(A) > 0$ ensures that $P(\mu_\omega(A) > 0) > 0$. □
Let $\mu$ be an expectation measure on $\mathcal{A}_1$ and let $g$ be a map from $\mathcal{A}_1$ to $\mathcal{A}_2$. Then the induced measure $g(\mu)$ is also an expectation measure. If $g(\mu)(a_2) > 0$ then a Markov kernel $\mu(\cdot \mid \cdot)$ from $\mathcal{A}_2$ to $\mathcal{A}_1$ is given by
$$\mu(a_1 \mid g(a) = a_2) = \frac{\mu|_{g^{-1}(a_2)}(a_1)}{g(\mu)(a_2)}.$$
With this Markov kernel the measure $\mu$ can be factored as
$$\mu(a_1) = \mu(a_1 \mid g(a) = a_2) \cdot g(\mu)(a_2), \quad \text{where } a_2 = g(a_1).$$
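As a small sanity check, the factorization can be verified on a finite example; the weights of $\mu$ and the map $g$ in the sketch below are made up.

```python
from collections import Counter

mu = Counter({"a1": 2.0, "a2": 1.0, "a3": 3.0})   # expectation measure on A_1 (made-up weights)
g = {"a1": "x", "a2": "x", "a3": "y"}             # map from A_1 to A_2

# Induced measure g(mu) on A_2.
g_mu = Counter()
for a, w in mu.items():
    g_mu[g[a]] += w

# Markov kernel mu(a_1 | g(a) = a_2) and the factorization mu(a_1) = kernel * g(mu)(g(a_1)).
def kernel(a1, a2):
    return (mu[a1] if g[a1] == a2 else 0.0) / g_mu[a2]

for a1 in mu:
    assert abs(mu[a1] - kernel(a1, g[a1]) * g_mu[g[a1]]) < 1e-12
print("factorization holds on this example")
```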

4.5. Independence

The notion of independence plays an important role in the theory of randomness, so we need to define this notion in the context of expectation measures.
 Definition 3.
Assume that $\mu$ is a measure on $\mathcal{A}$. For $i = 1, 2$ let $g_i$ denote a mapping $\mathcal{A} \to \mathcal{A}_i$. Then we say that $g_1$ is independent of $g_2$ if the induced measure $g_1(\mu(\cdot \mid g_2(a) = a_2))$ does not depend on $a_2 \in \mathcal{A}_2$.
 Theorem 12.
Let $\mu$ be a measure on $\mathcal{A} = \mathcal{A}_1 \times \mathcal{A}_2$ with projections $g_1$ and $g_2$. Then $g_1$ is independent of $g_2$ if and only if
$$\mu(\omega_1, \omega_2) = \frac{\mu_1(\omega_1) \cdot \mu_2(\omega_2)}{\mu(\mathcal{A})},$$
where $\mu_1$ and $\mu_2$ are the marginal measures on $\mathcal{A}_1$ and $\mathcal{A}_2$ respectively.
 Proof. 
We have
$$\mu(\omega_1, \omega_2) = \mu(\omega_1 \mid g_2(\omega) = \omega_2) \cdot \mu(g_2(\omega) = \omega_2) = \mu(\omega_1 \mid g_2(\omega) = \omega_2) \cdot \mu_2(\omega_2).$$
Let $\tilde{\mu}$ denote the unital measure on $\mathcal{A}_1$ given by $\tilde{\mu}(\omega_1) = \mu(\omega_1 \mid g_2(\omega) = \omega_2)$. Then
$$\mu(\omega_1, \omega_2) = \tilde{\mu}(\omega_1) \cdot \mu_2(\omega_2),$$
$$\mu_1(\omega_1) = \sum_{\omega_2 \in \mathcal{A}_2} \mu(\omega_1, \omega_2) = \sum_{\omega_2 \in \mathcal{A}_2} \tilde{\mu}(\omega_1) \cdot \mu_2(\omega_2) = \tilde{\mu}(\omega_1) \cdot \mu_2(\mathcal{A}_2) = \tilde{\mu}(\omega_1) \cdot \mu(\mathcal{A}),$$
$$\frac{\mu_1(\omega_1)}{\mu(\mathcal{A})} = \tilde{\mu}(\omega_1).$$
Note that Equation (72) is the standard way of calculating expected counts in a contingency table under the assumption of independence. Note also that Equation (72) can be rewritten as
$$\frac{\mu(\omega_1, \omega_2)}{\mu(\mathcal{A})} = \frac{\mu_1(\omega_1)}{\mu_1(\mathcal{A}_1)} \cdot \frac{\mu_2(\omega_2)}{\mu_2(\mathcal{A}_2)},$$
which is the well-known equation stating that for independent variables the joint probability is the product of the marginal probabilities.
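As a concrete illustration, the sketch below computes the expected counts of Equation (72) for a made-up 2×3 contingency table.

```python
import numpy as np

# A made-up 2x3 contingency table of observed counts.
counts = np.array([[10,  4,  6],
                   [20, 16, 24]])

row = counts.sum(axis=1, keepdims=True)   # marginal measure mu_1
col = counts.sum(axis=0, keepdims=True)   # marginal measure mu_2
total = counts.sum()                      # mu(A)

# Expected counts under independence: mu_1(omega_1) * mu_2(omega_2) / mu(A).
expected = row * col / total
print(expected)
```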

4.6. Information Divergence for Expectation Measures

Let $P$ and $Q$ be discrete probability measures. Then the Kullback–Leibler divergence is defined by
$$D(P \| Q) = \begin{cases} \sum_i P(i) \ln \frac{P(i)}{Q(i)}, & \text{if } P \ll Q; \\ \infty, & \text{else.} \end{cases}$$
For arbitrary discrete measures $\mu$ and $\nu$ we define information divergence by extending Equation (78) via the following formula:
$$D(\mu \| \nu) = \sum_i \left( \mu(i) \ln \frac{\mu(i)}{\nu(i)} - \mu(i) + \nu(i) \right).$$
With this definition information divergence becomes a Csiszár $f$-divergence, and it gives a continuous function from the cones of discrete measures to the lower reals $[0, \infty]$.
 Proposition 3.
Let μ and ν denote two finite measures. Then
$$D(Po(\mu) \| Po(\nu)) = D(\mu \| \nu).$$
Note that the left-hand side is a KL-divergence between probability measures, while the right-hand side is an information divergence between expectation measures.
 Proof. 
First assume that the outcome space is a singleton so that $\mu$ and $\nu$ are elements of $(0, \infty)$. Then
$$D(Po(\mu) \| Po(\nu)) = \sum_{i=0}^{\infty} Po(i, \mu) \ln \frac{Po(i, \mu)}{Po(i, \nu)} = \sum_{i=0}^{\infty} Po(i, \mu) \ln \frac{\frac{\mu^i}{i!} \exp(-\mu)}{\frac{\nu^i}{i!} \exp(-\nu)} = \sum_{i=0}^{\infty} Po(i, \mu) \left( i \ln \frac{\mu}{\nu} - \mu + \nu \right) = \mu \ln \frac{\mu}{\nu} - \mu + \nu.$$
The result follows because KL-divergence is additive on product measures. □
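Proposition 3 is easy to verify numerically in the one-point case. In the sketch below the parameter values are arbitrary and the KL-sum is truncated at a large index; the truncated KL-divergence between the two Poisson distributions agrees with the information divergence of the parameters.

```python
import numpy as np
from scipy.stats import poisson

def info_div(mu, nu):
    """Information divergence D(mu || nu) for positive reals (one-point alphabet)."""
    return mu * np.log(mu / nu) - mu + nu

def kl_poisson(mu, nu, n_max=200):
    """KL-divergence between Po(mu) and Po(nu), truncated after n_max terms."""
    k = np.arange(n_max)
    logp = poisson.logpmf(k, mu)
    logq = poisson.logpmf(k, nu)
    return np.sum(np.exp(logp) * (logp - logq))

mu, nu = 2.3, 1.1
print(kl_poisson(mu, nu), info_div(mu, nu))   # the two numbers agree to high precision
```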
Information divergence is a Csiszár $f$-divergence, but it is also a Bregman divergence defined on the cone of discrete measures, and except for a constant factor it is the only Bregman divergence that is also a Csiszár $f$-divergence. On an alphabet with at least three letters, KL-divergence may (except for a constant factor) also be characterized as the only Bregman divergence that satisfies a data processing inequality for Markov kernels of unital measures, and there are a number of equivalent characterizations [39]. Here we focus on the convex cone of measures rather than the simplex of probability measures. Therefore it is interesting to note that information divergence has a characterization on a one-letter alphabet as a Bregman divergence that satisfies the following property, called 1-homogeneity:
$$D(\alpha \cdot \mu \| \alpha \cdot \nu) = \alpha \cdot D(\mu \| \nu).$$
 Theorem 13.
Assume that $d: \mathbb{R}_+^2 \to [0, +\infty]$ is a function that satisfies the following conditions.
  • $d(x, y) \ge 0$ with equality when $x = y$.
  • $\sum_{i=1}^{n} d(x_i, y)$ is minimal when $y = \bar{x}$.
  • $d(\alpha \cdot x, \alpha \cdot y) = \alpha \cdot d(x, y)$ for all $\alpha > 0$.
Then there exists a positive constant $c$ such that $d(x, y) = c \cdot \left( x \ln \frac{x}{y} - x + y \right)$.
 Proof. 
The first two conditions imply that $d$ is a Bregman divergence [39, Proposition 4], so there exists a strictly convex function $g$ such that $d(x, y) = g(x) - g(y) - (x - y) \cdot g'(y)$. According to property 3 we have
$$d(\alpha \cdot x, \alpha \cdot y) = \alpha \cdot d(x, y),$$
$$g(\alpha x) - g(\alpha y) - (\alpha x - \alpha y) \cdot g'(\alpha y) = \alpha \cdot \left( g(x) - g(y) - (x - y) \cdot g'(y) \right).$$
We differentiate with respect to $y$ and get
$$-\alpha \cdot g'(\alpha y) + \alpha \cdot g'(\alpha y) - \alpha^2 \cdot (x - y) \cdot g''(\alpha y) = \alpha \cdot \left( -g'(y) + g'(y) - (x - y) \cdot g''(y) \right),$$
$$\alpha^2 \cdot (x - y) \cdot g''(\alpha y) = \alpha \cdot (x - y) \cdot g''(y),$$
$$\alpha \cdot g''(\alpha y) = g''(y).$$
This holds for all $\alpha, y > 0$, so $g''(y) \cdot y = c$ for some constant $c > 0$. Hence
$$g''(y) = \frac{c}{y},$$
$$g'(y) = c \cdot \ln y,$$
$$g(y) = c \cdot (y \ln y - y) + k,$$
for some constant $k$. Hence
$$d(x, y) = \left( c (x \ln x - x) + k \right) - \left( c (y \ln y - y) + k \right) - (x - y) \cdot c \ln y = c \cdot \left( x \ln x - x - y \ln y + y - (x - y) \ln y \right) = c \cdot \left( x \ln \frac{x}{y} - x + y \right).$$
The following theorem can be proved in the same way as similar theorems in [34,35].
 Theorem 14.
Let $P$ be a Bernoulli sum on $M(\mathcal{A})$ with expectation measure $\pi(P) = \mu$. Then
$$D\!\left( \left( T_{1/n} P \right)^{*n} \,\middle\|\, Po(\mu) \right) \to 0$$
as $n \to \infty$.
Let $C$ denote a convex set of measures. Then $D(C \| \nu)$ is defined as $\inf_{\mu \in C} D(\mu \| \nu)$. If $D(C \| \nu) < \infty$ and the measure $\mu^* \in C$ satisfies $D(\mu^* \| \nu) = D(C \| \nu)$, then $\mu^*$ is called the information projection of $\nu$ on $C$ [40,41,42].
 Proposition 4.
Let $\nu$ be a measure on a finite alphabet $\mathcal{A}$, and let $C$ be a convex set of measures on $\mathcal{A}$. If $\mu^*$ is the information projection of $\nu$ on $C$, then $Po(\mu^*)$ is the information projection of $Po(\nu)$ on the convex hull of the probability measures of the form $Po(\mu)$ where $\mu \in C$.
 Proof. 
The measure $\mu^*$ is the information projection of $\nu$ if and only if the Pythagorean inequality
$$D(\mu \| \nu) \ge D(\mu \| \mu^*) + D(\mu^* \| \nu)$$
is satisfied for all $\mu \in C$ [42, Theorem 8]. Now we have
$$D(Po(\mu) \| Po(\nu)) = D(\mu \| \nu) \ge D(\mu \| \mu^*) + D(\mu^* \| \nu) = D(Po(\mu) \| Po(\mu^*)) + D(Po(\mu^*) \| Po(\nu)).$$
Since a Pythagorean inequality is satisfied for $Po(\mu^*)$, it must be the information projection of $Po(\nu)$ on the convex hull of the distributions $Po(\mu)$ where $\mu \in C$. □
The reversed information projection is defined similarly to the information projection [2,43,44,45]. If $C$ is a convex set, then $D(\mu \| C)$ is defined as $\inf_{\nu \in C} D(\mu \| \nu)$. If $D(\mu \| C) < \infty$, then $\hat{\nu} \in C$ is said to be the reversed information projection of $\mu$ on $C$ if $D(\mu \| \hat{\nu}) = D(\mu \| C)$.
 Proposition 5.
Let $\mu$ be a measure on a finite alphabet $\mathcal{A}$, and let $C$ be a convex set of measures on $\mathcal{A}$. Assume that the reversed information projection $\hat{\nu} \in C$ of $\mu$ on $C$ exists and that $D(\mu \| C) < \infty$. Then the probability measure $Po(\hat{\nu})$ is the reverse information projection of $Po(\mu)$ on the convex hull of the set of probability measures $\{ Po(\nu) \mid \nu \in C \}$.
 Proof. 
According to the so-called four point property of Csiszár and Tusnády [43], the measure $\hat{\nu}$ is the reverse information projection of $\mu$ on $C$ if and only if
$$\sum_{\omega \in \Omega} \left( \frac{\mu(\omega)}{\hat{\nu}(\omega)} \cdot \nu(\omega) - \mu(\omega) + \hat{\nu}(\omega) - \nu(\omega) \right) \le 0$$
for all $\nu$ in $C$. The probability measure $Po(\mu)$ has outcome space $\mathbb{N}_0^\Omega$. For $j \in \mathbb{N}_0^\Omega$ we will write $Po(\mu, j)$ as short for $\prod_{\omega \in \Omega} Po(\mu(\omega), j(\omega))$. We calculate
$$\sum_{j \in \mathbb{N}_0^\Omega} \frac{Po(\mu, j)}{Po(\hat{\nu}, j)} \cdot Po(\nu, j) = \sum_{j \in \mathbb{N}_0^\Omega} \prod_{\omega \in \Omega} \frac{Po(\mu(\omega), j(\omega))}{Po(\hat{\nu}(\omega), j(\omega))} \cdot Po(\nu(\omega), j(\omega)) = \prod_{\omega \in \Omega} \sum_{j \in \mathbb{N}_0} \frac{Po(\mu(\omega), j)}{Po(\hat{\nu}(\omega), j)} \cdot Po(\nu(\omega), j) = \prod_{\omega \in \Omega} \sum_{j \in \mathbb{N}_0} \frac{\frac{\mu(\omega)^j}{j!} \exp(-\mu(\omega))}{\frac{\hat{\nu}(\omega)^j}{j!} \exp(-\hat{\nu}(\omega))} \cdot \frac{\nu(\omega)^j}{j!} \exp(-\nu(\omega)) = \prod_{\omega \in \Omega} \sum_{j \in \mathbb{N}_0} \frac{\left( \frac{\mu(\omega)}{\hat{\nu}(\omega)} \cdot \nu(\omega) \right)^j}{j!} \exp(-\mu(\omega) + \hat{\nu}(\omega) - \nu(\omega)) = \prod_{\omega \in \Omega} \exp\!\left( \frac{\mu(\omega)}{\hat{\nu}(\omega)} \cdot \nu(\omega) \right) \cdot \exp(-\mu(\omega) + \hat{\nu}(\omega) - \nu(\omega)) = \exp\!\left( \sum_{\omega \in \Omega} \left( \frac{\mu(\omega)}{\hat{\nu}(\omega)} \cdot \nu(\omega) - \mu(\omega) + \hat{\nu}(\omega) - \nu(\omega) \right) \right) \le \exp(0) = 1.$$
Therefore
$$\sum_{j \in \mathbb{N}_0^\Omega} \left( \frac{Po(\mu, j)}{Po(\hat{\nu}, j)} \cdot Po(\nu, j) - Po(\mu, j) + Po(\hat{\nu}, j) - Po(\nu, j) \right) \le 1 - 1 + 1 - 1 = 0$$
for all $\nu \in C$. □

5. Applications

In this section we will present some examples of how expectation measures can be used to give a new way to handle some problems from statistics and probability theory.

5.1. Goodness-of-Fit Tests

Here we will take a new look at testing Goodness-of-Fit in one of the simplest possible setups. We will test if a coin is fair, and we perform an experiment where we count the number of heads and tails after tossing the coin a number of times. Let $X$ denote the number of heads and let $Y$ denote the number of tails. Our null hypothesis is that there is symmetry between heads and tails. Here we will analyze the case where we have observed $X = \ell$ and $Y = m$.
Typically, one will fix the number of tosses so that X + Y = n , and assume that X has a binomial distribution with success probability p . First we will look at an example where the null hypothesis states that p = 1 / 2 and n = 20 .
The maximum likelihood estimate of $p$ is $\ell/n$. The divergence is
$$D(bin(n, \ell/n) \| bin(n, 1/2)) = n \cdot \left( \frac{\ell}{n} \ln \frac{\ell/n}{1/2} + \frac{m}{n} \ln \frac{m/n}{1/2} \right).$$
We introduce the signed log-likelihood as
$$G_n(x) = \begin{cases} -\left( 2 \cdot D(bin(n, x/n) \| bin(n, 1/2)) \right)^{1/2}, & \text{if } x < n/2; \\ +\left( 2 \cdot D(bin(n, x/n) \| bin(n, 1/2)) \right)^{1/2}, & \text{if } x \ge n/2, \end{cases}$$
so that
$$D(bin(n, \ell/n) \| bin(n, 1/2)) = \frac{1}{2} G_n(\ell)^2.$$
In [46, Cor. 7.2] it is proved that
$$\Pr(X < k) \le \Phi(G_n(k)) \le \Pr(X \le k),$$
where $\Phi$ denotes the distribution function of a standard Gaussian distribution. In a QQ-plot with a Gaussian distribution on the first axis and the distribution of $G_n(X)$ on the second axis, one gets a staircase function with horizontal steps, each intersecting the line $x = y$ that would correspond to a perfect match between the distribution of $G_n(X)$ and a standard Gaussian distribution, as illustrated in Figure 2.
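The sandwich bound can be checked directly for $n = 20$. The helper function below is a straightforward implementation of $G_n$ under the null hypothesis $p = 1/2$; it is a sketch and not the code behind Figure 2.

```python
import numpy as np
from scipy.stats import binom, norm
from scipy.special import xlogy

def G(n, x):
    """Signed log-likelihood G_n(x) for testing p = 1/2 in bin(n, p)."""
    x = np.asarray(x, dtype=float)
    # D(bin(n, x/n) || bin(n, 1/2)) = x ln(2x/n) + (n - x) ln(2(n - x)/n); xlogy handles 0 ln 0.
    d = xlogy(x, 2 * x / n) + xlogy(n - x, 2 * (n - x) / n)
    return np.where(x < n / 2, -1.0, 1.0) * np.sqrt(2 * d)

n = 20
k = np.arange(n + 1)
lower = binom.cdf(k - 1, n, 0.5)      # Pr(X < k)
upper = binom.cdf(k, n, 0.5)          # Pr(X <= k)
phi = norm.cdf(G(n, k))
print(np.all(lower <= phi) and np.all(phi <= upper))   # expected to print True by [46, Cor. 7.2]
```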
Instead of using the Gaussian approximation one could calculate tail probabilities exactly (Fisher’s exact test), but as we shall see below one can even do better.
In expectation theory, it is more natural to assume that X and Y are independent Poisson distributed random variables with mean values λ and μ respectively. In our analysis, the null hypothesis states that λ = μ .
Since $X + Y \sim Po(\lambda + \mu)$, the maximum likelihood estimate of $\lambda + \mu$ is $\ell + m$. Hence the estimates of $\lambda$ and $\mu$ are both $(\ell + m)/2$. Here we define the random variable $N = X + Y$ and $n = \ell + m$. We calculate the divergence
$$D\!\left( Po(\ell) \otimes Po(m) \,\middle\|\, Po\!\left( \tfrac{n}{2} \right) \otimes Po\!\left( \tfrac{n}{2} \right) \right) = \ell \ln \frac{\ell}{n/2} + m \ln \frac{m}{n/2},$$
i.e., the same expression as in the classical analysis. Since $X$ is binomially distributed given that $X + Y = n$, we have
$$\Pr(X < k \mid N = n) \le \Phi(G_n(k)) \le \Pr(X \le k \mid N = n).$$
Since the distribution of $G_N(X)$ is close to a standard Gaussian distribution under the condition $N = n$, the same is true for $G_N(X)$ when we take the mean value over the Poisson distributed variable $N$. Since each of the steps intersects the straight line near the mid-point of the step, the effect of taking the mean value with respect to $N$ is that the steps to a large extent cancel out, as illustrated in Figure 3.
The only step that partly survives the randomization is the step at $G(X) = 0$. For the binomial distribution, the length of this step is determined by $P(X = n/2) = 0.1762$. If the sample size is Poisson distributed, then the length of the step is determined as
$$E[P(X = N/2)] = \sum_{n \text{ even}} \frac{20^n}{n!} \exp(-20) \cdot \frac{n!}{\left( \left( \frac{n}{2} \right)! \right)^2} \cdot 2^{-n} = 0.0898,$$
which is about half of the value for the binomial distribution. The reason for this is that the probability that a Poisson distributed random variable is even is approximately 1 / 2 . If we tested p = 1 / 3 we would get a step of length about one third of the corresponding step for the binomial distribution. In a sense, testing p = 1 / 2 gives the most significant deviation from a Gaussian distribution.
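Both step lengths can be reproduced in a few lines; the truncation of the Poisson sum at $n = 200$ is an arbitrary choice that is more than sufficient for a mean of 20.

```python
import numpy as np
from scipy.stats import binom, poisson

# Length of the central step for fixed sample size n = 20.
print(binom.pmf(10, 20, 0.5))                                    # approximately 0.1762

# Length of the central step when the sample size N is Poisson(20):
# E[P(X = N/2)] = sum over even n of Po(n; 20) * C(n, n/2) * 2^(-n).
n = np.arange(0, 200, 2)
print(np.sum(poisson.pmf(n, 20) * binom.pmf(n // 2, n, 0.5)))    # approximately 0.0898
```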
If we square $G(X)$, we get two times the divergence, which is often called the $G^2$-statistic. Due to the symmetry between heads and tails, the intersection property is also satisfied when the distribution of the $G^2$-statistic is compared with a $\chi^2$-distribution [47]. This is illustrated in Figure 4.
For statistical analysis, one should not fix the sample size before sampling. A better procedure is to sample for a specific time so that the sample size becomes a random variable. In practice, this is often how sampling takes place, and if the sample size is really random, it may even be misleading to analyze the data as if $n$ were fixed.

5.2. Improper Prior Distributions

Prior distributions play a major role in Bayesian statistics. A detailed discussion of how prior distributions can be determined in various cases is beyond the scope of this article. We refer to [49] for a review of the subject, including a long list of references, and to [50] for a more information-theoretic treatment of prior distributions.
In Bayesian statistics, the justification of how posterior distributions are calculated is normally based on a probabilistic interpretation of the prior distribution. It is well known that proportional prior distributions lead to the same posterior distribution. For this reason the total weight of the prior distribution is not important as long as it is finite: the prior can then be normalized to obtain a unital measure, and unital measures allow a probabilistic interpretation within Bayesian statistics. A significant problem in Bayesian statistics is the use of improper prior distributions, i.e., prior distributions described by measures with infinite total mass. Such a prior is problematic in that it does not allow a literal interpretation within Bayesian statistics. If we replace probability measures by expectation measures, this problem disappears. We will give a simple example of how improper prior distributions can be given a probabilistic interpretation using the Poisson interpretation of expectation measures.
Consider a statistical model where a random variable X has distribution given by the Markov kernel
$$P(X = \mu - 1) = P(X = \mu + 1) = 1/2.$$
We assume that the mean value parameter $\mu$ is integer valued. We choose $\lambda$ times the counting measure on $\mathbb{Z}$ as an improper prior distribution for the mean value parameter. There are a number of ways of arguing in favor of this prior distribution. For instance, the counting measure is invariant under integer translations. Except for a multiplicative constant, translation invariance uniquely determines the prior measure as a Haar measure. Combining $\lambda$ times the counting measure on $\mathbb{Z}$ with the Markov kernel (107), we get a joint measure on $\mathbb{Z}^2$ supported on the points illustrated in Figure 6.
Each point supporting the joint measure will have weight λ / 2 . The marginal measure on X is λ times the counting measure on Z as discussed in Section 4.4. Now the joint measure can be factored into λ times the counting measure on Z and a Markov kernel
$$P(M = x - 1) = P(M = x + 1) = 1/2.$$
The Markov kernel (108) can be considered as the posterior distribution of M given that X = x .
These calculations can be justified within a probabilistic setting via point processes. First, the prior is represented by a Poisson point process with $\lambda$ times the counting measure on $\mathbb{Z}$ as expectation measure. If the Markov kernel (107) is applied to this process, then we obtain a Poisson point process on the points in $\mathbb{Z}^2$ illustrated in Figure 6 with $\lambda/2$ times the counting measure as expectation measure. For an instance of this joint point process let $N_x$ denote the number of points with $x$ as second coordinate. We will condition on $X = x$, so we will assume that $N_x = n > 0$. Among these $n$ points, which all have second coordinate equal to $x$, we choose a random point according to a uniform distribution, i.e., each point is selected with probability $1/n$. The selected point has coordinates $(M, x)$ where $M = x \pm 1$. We want to calculate the distribution of $M$ for a given value of $x$. According to Corollary 1 we get $P(M = x \pm 1 \mid X = x) = 1/2$, and this is our posterior distribution.
This derivation works for any value of $\lambda > 0$. In particular, we will get the same result if we thin the process by a factor $\alpha > 0$, i.e., if we replace $\lambda$ by $\alpha \cdot \lambda$. The derivation involves selecting one out of $N_x$ points under the condition that $N_x \ge 1$. If $N_x \sim Po(\alpha \cdot \lambda)$ we have
$$P(N_x = 1 \mid N_x \ge 1) = \frac{\alpha \lambda \exp(-\alpha \lambda)}{1 - \exp(-\alpha \lambda)} = \frac{\alpha \lambda}{\exp(\alpha \lambda) - 1} \to 1$$
as $\alpha \to 0$. Therefore, with high probability there will only be one point with second coordinate $x$ if the point process has been thinned with a small value of $\alpha$. Hence, the step where we randomly select one of the points becomes almost obsolete if we thin the process sufficiently.
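The posterior in this example can also be checked by simulation. The sketch below uses the observation that only the parameter values $m = x - 1$ and $m = x + 1$ can generate the observation $X = x$, so after thinning by the kernel the two candidate counts are independent $Po(\lambda/2)$ variables; the value of $\lambda$ and the number of repetitions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
lam, n_sim = 0.8, 200_000

# Points at m = x - 1 and m = x + 1 that are mapped to the observation X = x.
left = rng.poisson(lam / 2, size=n_sim)    # points with first coordinate M = x - 1
right = rng.poisson(lam / 2, size=n_sim)   # points with first coordinate M = x + 1
total = left + right

keep = total > 0                           # condition on N_x > 0
posterior_left = np.mean(left[keep] / total[keep])
print(posterior_left)                      # approximately 1/2, independently of lam
```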
If the Poisson point process is a process in time, then there will be a first point for which X = x and the probability of M = m will be the probability that this first point satisfies M = m . With a process in time, one can introduce a stopping time that stops the process when X = x has been observed for the first time. This often simplifies the interpretation, but exactly this kind of reasoning goes wrong in many presentations of the double-slit experiment (Example 3). In this experiment, the point process is not a process in time because no arrival times for the photons are recorded. If one had precise records of time, the uncertainty for energy and frequency would destroy the interference pattern.

5.3. Markov Chains

The idea of randomizing the sample size has various consequences, and sometimes this leads to great simplifications. Let Φ denote some Markov operator that generates a Markov chain. The time average is usually defined as
$$\Phi_n = \frac{1}{n} \sum_{i=0}^{n-1} \Phi^i.$$
Many properties of such time averages are known, but what complicates the matter is that the composition of $\Phi_m$ and $\Phi_n$ is not a time average. If we instead define
$$\Psi_t = \sum_{i=0}^{\infty} \frac{t^i}{i!} \exp(-t) \cdot \Phi^i,$$
then $\Psi_t$ is a Markov operator and $\Psi_s \Psi_t = \Psi_{s+t}$. The Markov operators $\Psi_t$ generate a Markov process in continuous time, and this Markov process tends to have better properties than the original Markov chain. For instance, all states that are recurrent under $\Phi$ get positive transition probabilities under $\Psi_t$. The same idea was also used in [51] to prove that if $\Phi$ is an affine map of a convex body into itself, then $\Psi_t$ converges to a retraction of the convex body onto the set of fixed points of the affine map $\Phi$. Many theorems in probability involving averages should be revisited to see what the consequences are if the usual average is replaced by a weighted average with Poisson distributed weights.
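Since $\Psi_t = \exp(t(\Phi - I))$, the semigroup property can be verified numerically in a few lines. The transition matrix below is a made-up periodic chain; note that $\Psi_t$ has strictly positive entries even though $\Phi$ does not.

```python
import numpy as np
from scipy.linalg import expm

# A made-up 3-state transition matrix Phi (rows sum to 1, period 2).
Phi = np.array([[0.0, 1.0, 0.0],
                [0.5, 0.0, 0.5],
                [0.0, 1.0, 0.0]])

def Psi(t):
    """Poisson-weighted average of the powers of Phi: Psi_t = exp(t (Phi - I))."""
    return expm(t * (Phi - np.eye(3)))

s, t = 0.7, 1.3
print(np.allclose(Psi(s) @ Psi(t), Psi(s + t)))   # True: the semigroup property
print(Psi(1.0))                                    # a stochastic matrix with all entries positive
```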

5.4. Inequalities for Information Projections

Let $Q = (q_1, q_2, \dots, q_n)$ be a unital measure on a finite set and let $C$ denote the convex set of unital measures such that the mean value of $f(i)$ is $\nu$. Then the information projection $P$ of $Q$ on $C$ can be determined by using Lagrange multipliers:
$$\ln \frac{P(i)}{Q(i)} + P(i) \cdot \frac{1}{P(i)} - 1 = \beta \cdot f(i) + \gamma \cdot 1,$$
$$\frac{P(i)}{Q(i)} = \exp(\beta \cdot f(i) + \gamma),$$
$$P(i) = \exp(\beta \cdot f(i)) \cdot \exp(\gamma) \cdot Q(i).$$
The moment generating function is defined by $Z(\beta) = \sum_i \exp(\beta \cdot f(i)) \cdot Q(i)$, and in order to get a unital measure we must choose $\gamma = -\ln Z(\beta)$. Thus, $P$ is an element of the exponential family
$$P(i) = \frac{\exp(\beta \cdot f(i))}{Z(\beta)} \cdot Q(i).$$
The mean value is $\frac{Z'(\beta)}{Z(\beta)}$, and $\beta$ should be chosen so that this equals $\nu$.
If we drop the condition that the projection should be unital, we get simplified expressions. The Lagrange equation becomes
$$\ln \frac{P_\beta(i)}{Q(i)} + P_\beta(i) \cdot \frac{1}{P_\beta(i)} - 1 = \beta \cdot f(i),$$
$$P_\beta(i) = \exp(\beta \cdot f(i)) \cdot Q(i),$$
and the mean value of this measure is $Z'(\beta)$.
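The simplification can be illustrated on a small alphabet. In the sketch below $Q$ and $f$ are made up; the mean value of the non-unital projection $P_\beta$ coincides with $Z'(\beta)$, here approximated by a numerical derivative.

```python
import numpy as np

# Made-up unital base measure Q and statistic f on a 4-letter alphabet.
Q = np.array([0.1, 0.4, 0.3, 0.2])
f = np.array([0.0, 1.0, 2.0, 3.0])

def Z(beta):
    return np.sum(np.exp(beta * f) * Q)

beta = 0.7
P_beta = np.exp(beta * f) * Q                      # non-unital information projection
print(np.sum(f * P_beta))                          # mean value of the measure P_beta
print((Z(beta + 1e-6) - Z(beta - 1e-6)) / 2e-6)    # numerical Z'(beta); the two agree

# The unital projection divides by Z(beta); its mean value is Z'(beta)/Z(beta).
print(np.sum(f * P_beta) / Z(beta))
```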
 Theorem 15.
Let $Q$ be a unital measure on a finite set and let $X$ be a random variable such that
$$\int X \, dQ = 0,$$
$$\int X^2 \, dQ = 1,$$
$$\int X^3 \, dQ < 0.$$
Then there exists $\delta > 0$ such that
$$D(P \| Q) \ge \frac{1}{2} \left( \int X \, dP \right)^2$$
for all measures $P$ satisfying $\int X \, dP \in [0, \delta]$.
 Proof. 
For a fixed value of $\int X \, dP$, the left-hand side is minimized by the measure $P_\beta$ determined by (117), so it is sufficient to prove the inequality for $P = P_\beta$. Thus, we have to prove that
$$\beta \cdot Z'(\beta) - Z(\beta) + 1 \ge \frac{1}{2} \left( Z'(\beta) \right)^2$$
for $0 \le \beta \le \delta$ for some $\delta > 0$. We differentiate both sides, and it is sufficient to prove that
$$\beta \cdot Z''(\beta) \ge Z'(\beta) \cdot Z''(\beta),$$
$$\beta \ge Z'(\beta),$$
because $Z''(\beta) > 0$. We differentiate once more, and we have to prove that
$$1 \ge Z''(\beta).$$
Now we differentiate one more time, and we have to prove that
$$0 \ge Z'''(\beta)$$
for $0 \le \beta \le \delta$, but this holds by continuity because $Z'''(0) = \int X^3 \, dQ < 0$. □
Similar results hold for measures on infinite sets, but in these cases one should be careful to choose the parameters so that the information projection exists. In a number of important cases, the inequality above may be extended to all positive (or negative) values of $\int X \, dP$ rather than just values in a neighborhood of zero. Some of these cases are mentioned below.
Hypergeometric distributions can be approximated by binomial distributions. A lower bound is given by the following inequality [52]:
$$D(P \| bin(n, p)) \ge \frac{\left( \sum_{x=0}^{n} \tilde{K}_2(x; n, p) \, P(x) \right)^2}{2},$$
where $\tilde{K}_2(x; n, p)$ is the second normalized Kravchuk polynomial. For the hypergeometric distribution $hyp(N, K, n)$ with $\frac{K}{N} = p$ this lower bound can be written as
$$D(hyp(N, K, n) \| bin(n, p)) \ge \frac{n(n-1)}{4(N-1)^2}.$$
As demonstrated in [52], this lower bound is tight for fixed values of $n$ and $p$ and large values of the parameters $N$ and $K$.
Binomial distributions can be approximated by Poisson distributions. For any distribution $P$ we have the following lower bound on information divergence:
$$D(P \| Po(\lambda)) \ge \frac{\left( \sum_{x=0}^{\infty} \tilde{C}_2^{\lambda}(x) \, P(x) \right)^2}{2},$$
where $\tilde{C}_2^{\lambda}(x)$ denotes the second normalized Poisson–Charlier polynomial [53]. If $P$ is binomial $bin(n, p)$ and $\lambda = np$, then we get the inequality
$$D(bin(n, p) \| Po(\lambda)) \ge \frac{p^2}{4},$$
and this inequality is tight for fixed $\lambda$ and $n$ tending to infinity [35,54,55,56].
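The behaviour for fixed $\lambda$ and growing $n$ can be observed numerically. The sketch below computes $D(bin(n, p) \| Po(np))$ by direct summation over the support; the value of $\lambda$ and the list of sample sizes are arbitrary choices.

```python
import numpy as np
from scipy.stats import binom, poisson

def div_bin_po(n, p):
    """D(bin(n, p) || Po(np)) computed by direct summation over {0, ..., n}."""
    k = np.arange(n + 1)
    pb = binom.pmf(k, n, p)
    return np.sum(pb * (binom.logpmf(k, n, p) - poisson.logpmf(k, n * p)))

lam = 2.0
for n in [10, 100, 1000, 10000]:
    p = lam / n
    # The divergence stays above p^2/4 and approaches it as n grows, consistent with the stated tightness.
    print(n, div_bin_po(n, p), p**2 / 4)
```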
In the central limit theorem the rate of convergence is primarily determined by the skewness, and if the skewness is zero it is primarily determined by the excess kurtosis. If the skewness is non-zero, then lower bounds on divergence involve both skewness and kurtosis, so for simplicity we will only present the case where the skewness is zero and the excess kurtosis is negative. For all measures $P$ for which $\int \tilde{H}_4(x) \, dP \le 0$, where $\tilde{H}_4$ denotes the fourth normalized Hermite polynomial, we have
$$D(P \| N(0, 1)) \ge \frac{\left( \int \tilde{H}_4(x) \, dP(x) \right)^2}{2}.$$
If $P$ is a unital measure with mean 0, variance 1, and excess kurtosis $\kappa$, then according to [57, Theorem 7] this inequality reduces to
$$D(P \| N(0, 1)) \ge \frac{\kappa^2}{48}.$$
Lower bounds for the rate of convergence have been discussed in [57,58,59], where this and similar inequalities are treated in great detail.
For all these inequalities, it simplifies matters if we do not require that the measures are unital. The hard part is to prove that the inequalities hold not only in a neighborhood of zero but for all values of interest. This part of the proofs remains hard without the assumption that the measures are unital, but it should be noted that it is not relevant if we are only interested in lower bounds on the rate of convergence.

6. Discussion and Conclusions

Expectation theory may be considered an alternative to Kolmogorov’s theory of probability. Still, it may be better to view the two theories as two ways of describing the same situations using measure theory. The language of probability theory focuses on experiments where a single outcome is classified according to mutually exclusive classes. The language of expectation theory focuses on experiments where the results are given as tables of frequencies. There will be no inconsistency if the two languages are mixed. Expectation measures can be understood within probability theory, as is already the case in the theory of point processes. If our understanding of randomness is based on expectation measures, then an expectation measure can be interpreted as the expectation measure of a process in a larger outcome space.
In this paper we have only worked out the basic framework and interpretation of expectation theory. This opens up a lot of new research questions. For instance, it should be possible to justify the use of Haar measures as prior distributions rigorously in the same way as Haar probability measures on compact groups can be justified as being probability measures that maximize the rate-distortion function [60]. In expectation theory, it is natural to consider sampling with a randomized sample size. Since many results in probability theory are formulated for fixed sample sizes there is a lot of work to be done to generalize the results to cases with random sample sizes. Our present results suggest that simpler or stronger results can be obtained, but in many cases new techniques may be needed.
In this paper, we discussed finite expectation measures and valuations on topological spaces and downsets of posets that satisfy the DCC. The valuation approach is more general, but it is an open research question which category is most useful for further development of expectation theory.
Quantum information theory can be based on generalized probabilistic theories. In these theories a measurement is defined as something that maps a preparation into a probability measure. A state is defined as an equivalence class of preparations that cannot be distinguished by any measurement. With a shift from probability measures to expectation measures, one should similarly change the focus in quantum theory from states that are represented by density operators (positive operators with trace 1) to positive trace class operators. Thus, the focus should shift from the convex state space to the state cone. For applications in quantum information theory it is still too early to say what the impact will be, but shifting the focus away from mutually exclusive events may circumvent some of the paradoxes that have haunted the foundation of quantum theory for more than a century.
Below there is a list of concepts from probability theory and standard quantum information theory and how they relate to the concepts that have been introduced in the present paper.
Probability theory | Expectation theory
Probability | Expected value
Outcome | Instance
Outcome space | Multiset monad
P-value | E-value
Probability measure | Expectation measure
Binomial distribution | Poisson distribution
Density | Intensity
Bernoulli random variable | Count variable
Empirical distribution | Empirical measure
KL-divergence | Information divergence
Uniform distribution | Poisson point process
State space | State cone

Funding

This research received no external funding.

Acknowledgments

I want to thank Peter Grünwald and Tyron Lardy for stimulating discussions related to this topic.

Conflicts of Interest

The author declares no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
bin binomial distribution
DCC descending chain condition
E-statistic Evidence statistic
E-value Observed value of an E-statistic
hyp Hypergeometric distribution
IID Independent identically distributed
KL-divergence Information divergence restricted to probability measures
MDL Minimum description length
Mset Multiset
N Gaussian distribution
PM Probability measure
Po Poisson distribution
mset Multiset
Poset Partially ordered set
Pr Probability

References

  1. Kolmogorov, A.N. Grundbegriffe der Wahrscheinlichkeitsrechnung; Springer: Berlin, 1933. [Google Scholar]
  2. Lardy, T.; Grünwald, P.; Harremoës, P. Reverse Information Projections and Optimal E-statistics. IEEE Transactions on Information Theory 2024. [Google Scholar] [CrossRef]
  3. Perrone, P. Categorical Probability and Stochastic Dominance in Metric Spaces. PhD Thesis, Max Planck Institute for Mathematics in the Sciences, 2018. [Google Scholar]
  4. nLab authors. monads of probability, measures, and valuations. Revision 45. 2024. Available online: https://ncatlab.org/nlab/show/monads+of+probability%2C+measures%2C+and+valuations.
  5. Shiryaev, A.N. Probability; Springer: New York, 1996. [Google Scholar]
  6. Whittle, P. Probability via Expectation, 3 ed.; Springer texts in statistics; Springer Verlag: New York, 1992. [Google Scholar]
  7. Kallenberg, O. Random Measures; Springer: Schwitzerland, 2017. [Google Scholar] [CrossRef]
  8. Lawvere. The Category of Probabilistic Mappings. Lecture notes.
  9. Scibior, A.; Ghahramani, Z.; Gordon, A.D. Practical probabilistic programming with monads. In Proceedings of the 2015 ACM SIGPLAN Symposium on Haskell; Association for Computing Machinery: New York, NY, USA, 2015; Haskell ’15; pp. 165–176. [Google Scholar] [CrossRef]
  10. Giry, M. A categorical approach to probability theory. Categorical Aspects of Topology and Analysis; Banaschewski, B., Ed.; Springer Berlin Heidelberg: Berlin, Heidelberg, 1982; pp. 68–85. [Google Scholar]
  11. Lieshout, M.V. Spatial Point Process Theory. In Handbook of Spatial Statistics; Handbooks of Modern Statistical Methods; Chapman and Hall, 2010; chapter 16. [Google Scholar]
  12. Dash, S.; Staton, S. A Monad for Probabilistic Point Processes; ACT, 2021. [Google Scholar]
  13. Jacobs, B. From Multisets over Distributions to Distributions over Multisets. Proceedings of the 36th Annual ACM/IEEE Symposium on Logic in Computer Science; Association for Computing Machinery: New York, NY, USA, 2021. [Google Scholar] [CrossRef]
  14. Rényi, A. A characterization of Poisson processes. Magyar Tud. Akad. Mat. Kutaló Int. Közl. 1956, 1, 519–527. [Google Scholar]
  15. Kallenberg, O. Limits of Compound and Thinned Point Processes. Journal of Applied Probability 12, 269–278. [CrossRef]
  16. nLab authors. valuation (measure theory). 2024. Available online: https://ncatlab.org/nlab/show/valuation+%28measure+theory%29.
  17. Heckmann, R. Spaces of valuations. In Papers on General Topology and Applications; New York Academy of Sciences, 1996. [Google Scholar] [CrossRef]
  18. Blizard, W.D. The development of multiset theory. Modern Logic 1991, 1, 319–352. [Google Scholar]
  19. Monro, G.P. The Concept of Multiset. Mathematical Logic Quarterly 1987, 33, 171–178. [Google Scholar] [CrossRef]
  20. Isah, A.; Teella, Y. The Concept of Multiset Category. British Journal of Mathematics and Computer Science 2015, 9, 427–437. [Google Scholar] [CrossRef]
  21. Grätzer, G. Lattice Theory; Dover, 1971. [Google Scholar]
  22. Wille, R. Formal Concept Analysis as Mathematical Theory. In Formal Concept Analysis; Ganter, B., Stumme, G., Wille, R., Eds.; Number 3626 in Lecture Notes in Artificial Intelligence; Springer, 2005; pp. 1–33. [Google Scholar]
  23. Topsøe, F. Compactness in Space of Measures. Studia Mathematica 1970, 36, 195–212. [Google Scholar] [CrossRef]
  24. Alvarez-Manilla, M. Extension of valuations on locally compact sober spaces. Topology and its Applications 2002, 124, 397–433. [Google Scholar] [CrossRef]
  25. Harremoës, P. Extendable MDL. Proceedings ISIT 2013; IEEE Information Theory Society, 2013; pp. 1516–1520. [CrossRef]
  26. Cover, T.M.; Thomas, J.A. Elements of Information Theory; Wiley, 1991. [Google Scholar]
  27. Csiszár, I. The Method of Types. IEEE Trans. Inform. Theory 1998, 44, 2505–2523. [Google Scholar] [CrossRef]
  28. Harremoës, P. Testing Goodness-of-Fit via Rate Distortion. Information Theory Workshop, Volos, Greece, 2009. IEEE Information Theory Society, 2009, pp. 17–21. [CrossRef]
  29. Harremoës, P. The Rate Distortion Test of Normality. Proceedings ISIT 2019. IEEE Information Theory Society, 2019, pp. 241–245. [CrossRef]
  30. Harremoës, P. Rate Distortion Theory for Descriptive Statistics. Entropy 25, 456. [CrossRef]
  31. Rényi, A. On an Extremal Property of the Poisson Process. Annals Inst. Stat. Math. 1964, 16, 129–133. [Google Scholar] [CrossRef]
  32. McFadden, J.A. The Entropy of a Point Process. J. Soc. Indst. Appl. Math. 1965, 13, 988–994. [Google Scholar] [CrossRef]
  33. Harremoës, P. Binomial and Poisson Distributions as Maximum Entropy Distributions. IEEE Trans. Inform. Theory 2001, 47, 2039–2041. [Google Scholar] [CrossRef]
  34. Harremoës, P.; Johnson, O.; Kontoyiannis, I. Thinning and Information Projection. 2008 IEEE International Symposium on Information Theory. IEEE, 2008, pp. 2644–2648.
  35. Harremoës, P.; Johnson, O.; Kontoyiannis, I. Thinning, Entropy and the Law of Thin Numbers. IEEE Trans. Inform Theory 2010, 56, 4228–4244. [Google Scholar] [CrossRef]
  36. Hillion, E.; Johnson, O. A proof of the Shepp–Olkin entropy concavity conjecture. Bernoulli 2017, 23, 3638–3649. [Google Scholar] [CrossRef]
  37. Dawid, A.P. Separoids: A mathematical framework for conditional independence and irrelevance. Ann. Math. Artif. Intell. 2001, 32, 335–372. [Google Scholar] [CrossRef]
  38. Harremoës, P. Entropy inequalities for Lattices. Entropy 2018, 20, 748. [Google Scholar] [CrossRef]
  39. Harremoës, P. Divergence and Sufficiency for Convex Optimization. Entropy 2017, 19, 206. [Google Scholar] [CrossRef]
  40. Csiszár, I. I-Divergence Geometry of Probability Distributions and Minimization Problems. Ann. Probab. 1975, 3, 146–158. [Google Scholar] [CrossRef]
  41. Pfaffelhuber, E. Minimax Information Gain and Minimum Discrimination Principle. Topics in Information Theory; Csiszár, I.; Elias, P., Eds. János Bolyai Mathematical Society and North-Holland, 1977, Vol. 16, Colloquia Mathematica Societatis János Bolyai, pp. 493–519.
  42. Topsøe, F. Information Theoretical Optimization Techniques. Kybernetika 1979, 15, 8–27. [Google Scholar]
  43. Csiszár, I.; Tusnády, G. Information geometry and alternating minimization procedures. Statistics and Decisions 1984, Supplementary Issue 1, 205–237. [Google Scholar]
  44. Li, J.Q. Estimation of Mixture Models. Ph.D. dissertation, Department of Statistics, Yale University, 1999. [Google Scholar]
  45. Li, J.Q.; Barron, A.R. Mixture Density Estimation. Proceedings Conference on Neural Information Processing Systems: Natural and Synthetic;, 1999.
  46. Harremoës, P. Bounds on tail probabilities for negative binomial distributions. Kybernetika 2016, 52, 943–966. [Google Scholar] [CrossRef]
  47. Harremoës, P.; Tusnády, G. Information Divergence is more χ2-distributed than the χ2-statistic. 2012 IEEE International Symposium on Information Theory; IEEE, 2012; pp. 538–543. [CrossRef]
  48. Györfi, L.; Harremoës, P.; Tusnády, T. Some Refinements of Large Deviation Tail Probabilities. arXiv:1205.1005.
  49. Kass, R.E.; Wasserman, L.A. The Selection of Prior Distributions by Formal Rules. Journal of the American Statistical Association 1996, 91, 1343–1370. [Google Scholar] [CrossRef]
  50. Grünwald, P. The Minimum Description Length principle; MIT Press, 2007. [Google Scholar]
  51. Harremoës, P. Entropy on Spin Factors. In Information Geometry and Its Applications; Ay, N., Gibilisco, P., Matúš, F., Eds.; Springer, 2018; Vol. 252, Springer Proceedings in Mathematics & Statistics; pp. 247–278. [arXiv:1707.03222]
  52. Harremoës, P.; Matúš, F. Bounds on the Information Divergence for Hypergeometric Distributions. Kybernetika 2020, 56, 1111–1132. [Google Scholar] [CrossRef]
  53. Harremoës, P.; Johnson, O.; Kontoyiannis, I. Thinning and Information Projections. arXiv:1601.04255.
  54. Harremoës, P. Convergence to the Poisson Distribution in Information Divergence. Preprint 2, Mathematical department, University of Copenhagen, 2003.
  55. Harremoës, P.; Ruzankin, P. Rate of Convergence to Poisson Law in Terms of Information Divergence. IEEE Trans. Inform Theory 2004, 50, 2145–2149. [Google Scholar] [CrossRef]
  56. Kontoyiannis, I.; Harremoës, P.; Johnson, O. Entropy and the Law of Small Numbers. IEEE Trans. Inform. Theory 2005, 51, 466–472. [Google Scholar] [CrossRef]
  57. Harremoës, P. Lower Bounds for Divergence in the Central Limit Theorem. In General Theory of Information Transfer and Combinatorics; Springer Berlin Heidelberg: Berlin, Heidelberg, 2006; pp. 578–594. [Google Scholar] [CrossRef]
  58. Harremoës, P. Lower bound on rate of convergence in information theoretic Central Limit Theorem. Book of Abstracts for the Seventh International Symposium on Orthogonal Polynomials, Special functions and Applications;, 2003; pp. 53–54.
  59. Harremoës, P. Lower Bounds on Divergence in Central Limit Theorem. Electronic Notes in Discrete Mathematics 2005, 21, 309–313. [Google Scholar] [CrossRef]
  60. Harremoës, P. Maximum Entropy on Compact groups. Entropy 2009, 11, 222–237. [Google Scholar] [CrossRef]
Figure 1. When coherent light is sent through first a single slit in screen S1 and then through the two slits b and c on screen S2, then an interference pattern emerges at the photographic film F.
Figure 2. QQ-plot of a standard Gaussian distribution against the distribution of $G_{20}(X)$, where $X$ has a binomial distribution with $n = 20$ and $p = 1/2$.
Figure 3. QQ-plot of a standard Gaussian distribution against the distribution of $G_N(X)$, where $X$ is Poisson distributed with mean $\lambda = 10$ and $N = X + Y$, where $Y$ is also Poisson distributed with mean 10 and independent of $X$.
Figure 4. QQ-plot of the $\chi^2$-distribution with $df = 1$ against the distribution of the $G^2$-statistic for testing $p = 1/2$ in $bin(20, p)$.
Figure 5. QQ-plot of the $\chi^2$-distribution with $df = 1$ against the distribution of the $G^2$-statistic for testing $\lambda = \mu$ based on $\hat{\lambda} + \hat{\mu} = 20$. We see that there is a tiny but systematic deviation from the straight line in that the values of $G^2$ are a little larger than predicted by the $\chi^2$-distribution. A continuity correction as described in [48] may handle this minor issue.
Figure 6. Support of the joint measure.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.