Topological Information Data Analysis: Poincare-Shannon Machine and Statistical Physic of Finite Heterogeneous Systems

Pierre Baudot; Monica Tapia; Jean-Marc Goaillard

doi:10.20944/preprints201804.0157.v1

Submitted: 09 April 2018

Posted: 12 April 2018


Abstract

This paper establishes methods that quantify the structure of statistical interactions within a given data set, using the characterization of information theory in cohomology by finite methods, and provides their expression in terms of statistical physics and machine learning. Following [1–3], we show directly that k multivariate mutual-informations (I_k) are k-coboundaries. The k-cocycles are given by I_k = 0, which generalizes statistical independence to arbitrary dimension k. The topological approach allows Shannon’s information to be investigated in the multivariate case without assuming independent identically distributed variables. We develop the computationally tractable subcase of simplicial information cohomology, represented by entropy H_k and information I_k landscapes. The I₁ component defines a self-internal energy functional U_k, and the (−1)^k I_k, k ≥ 2, components define the contribution of the k-body interactions to a free energy functional G_k. The set of information paths in simplicial structures is in bijection with the symmetric group and random processes, provides a topological expression of the 2nd law, and points toward a discrete Noether theorem (1st law). The local minima of free energy, related to conditional information negativity and the non-Shannonian cone of Yeung [4], characterize a minimum free energy complex. This complex formalizes the minimum free energy principle in topology, provides a definition of a complex system, and characterizes a multiplicity of local minima that quantifies the diversity observed in biology. Finite data size effects and estimation bias severely constrain the effective computation of the information topology on data, and we provide simple statistical tests for the undersampling bias and for the k-dependences following [5]. We give an example application of these methods to genetic expression and cell-type classification. The maximal positive I_k identifies the variables that co-vary the most in the population, whereas the minimal negative I_k identifies clusters and the variables that differentiate and segregate the most. The methods uncover biologically relevant I₁₀ with a sample size of 41. This work establishes generic methods to quantify epigenetic information storage and a unified epigenetic unsupervised learning formalism.
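As an illustration of the quantities named in the abstract, the sketch below computes a multivariate mutual information I_k as the standard alternating sum of joint entropies over all non-empty subsets of k variables. It is not the authors' implementation: the function names `joint_entropy` and `multivariate_mutual_information` are hypothetical, and the naive plug-in (empirical frequency) entropy estimator is an assumption that ignores the undersampling bias the abstract warns about.

```python
from collections import Counter
from itertools import combinations
from math import log2


def joint_entropy(samples, idx):
    """Plug-in Shannon entropy (in bits) of the variables at positions idx."""
    counts = Counter(tuple(row[i] for i in idx) for row in samples)
    n = len(samples)
    return -sum(c / n * log2(c / n) for c in counts.values())


def multivariate_mutual_information(samples, idx):
    """I_k as the alternating sum of joint entropies over non-empty subsets of idx."""
    total = 0.0
    for size in range(1, len(idx) + 1):
        sign = (-1) ** (size - 1)  # + for single variables, - for pairs, ...
        for subset in combinations(idx, size):
            total += sign * joint_entropy(samples, subset)
    return total


# Toy data (assumed for illustration): Z = X xor Y with X, Y uniform bits.
xor_samples = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]
# Toy data: three copies of the same uniform bit.
copy_samples = [(0, 0, 0), (1, 1, 1)]

i2_xor = multivariate_mutual_information(xor_samples, (0, 1))
i3_xor = multivariate_mutual_information(xor_samples, (0, 1, 2))
i3_copy = multivariate_mutual_information(copy_samples, (0, 1, 2))
```

In the XOR toy data the variables are pairwise independent (I_2 = 0) yet I_3 is negative, while three copies of one bit give a positive I_3, which mirrors the abstract's contrast between negative I_k (segregating variables) and positive I_k (co-varying variables). The number of subsets grows as 2^k, so this brute-force sum is only practical for small k.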

Keywords: information theory; cohomology; algebraic topology; topological data analysis; genetic expression; epigenetics; machine learning; statistical physics; multivariate mutual-information; complex systems; biodiversity

Subject: Computer Science and Mathematics - Probability and Statistics

Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.
