
PUBLICATIONS

Year: 2019
Topological Information Data Analysis
Authors: Baudot P, Goaillard JM, Bennequin D, Tapia M.
Journal: Entropy

This paper presents methods that quantify the structure of statistical interactions within a given data set and that were applied in a previous article. It establishes new results on the k-multivariate mutual information (I_k), inspired by the topological formulation of information introduced in a series of studies. In particular, we show that the vanishing of all I_k for 2≤k≤n of n random variables is equivalent to their statistical independence. Pursuing the work of Hu Kuo Ting and Te Sun Han, we show that information functions provide coordinates for binary variables, and that they are analytically independent from the probability simplex for any set of finite variables. The maximal positive I_k identifies the variables that covary the most in the population, whereas the minimal negative I_k identifies synergistic clusters and the variables that differentiate–segregate the most in the population. Finite data size effects and estimation biases severely constrain the effective computation of the information topology on data, and we provide simple statistical tests for the undersampling bias and the k-dependences. We give an example of application of these methods to genetic expression and unsupervised cell-type classification. The methods unravel biologically relevant subtypes, with a sample size of 41 genes and with few errors. This work establishes generic basic methods to quantify epigenetic information storage and a unified formalism for epigenetic unsupervised learning. We propose that higher-order statistical interactions and non-identically distributed variables are constitutive characteristics of biological systems that should be estimated in order to unravel their significant statistical structure and diversity. The topological information data analysis presented here allows this higher-order structure, characteristic of biological systems, to be estimated precisely.
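To make the role of the sign of I_k concrete, the following is a minimal sketch, not the authors' implementation. It assumes the standard alternating-sum expression of the k-multivariate mutual information in terms of joint entropies, I_k = Σ_{i=1..k} (-1)^(i-1) Σ_{|S|=i} H(X_S), and a naive plug-in entropy estimate from sample counts. The function names `entropy` and `mutual_information_k` and the toy data sets are hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import combinations
from math import log2


def entropy(samples, idx):
    """Plug-in Shannon entropy (in bits) of the columns `idx` of `samples`."""
    counts = Counter(tuple(row[i] for i in idx) for row in samples)
    n = len(samples)
    return -sum(c / n * log2(c / n) for c in counts.values())


def mutual_information_k(samples, idx):
    """I_k as the alternating sum of joint entropies over non-empty subsets of `idx`.

    Assumption: the sign convention is (-1)**(size - 1), so I_1 is the entropy
    and I_2 is the usual pairwise mutual information.
    """
    total = 0.0
    for size in range(1, len(idx) + 1):
        sign = (-1) ** (size - 1)
        for subset in combinations(idx, size):
            total += sign * entropy(samples, subset)
    return total


# Toy data sets (hypothetical), each listing equally likely outcomes.
# Synergy: Z = X xor Y with X, Y independent fair bits.
xor_rows = [(x, y, x ^ y) for x in (0, 1) for y in (0, 1)]
# Redundancy: three copies of one fair bit.
copy_rows = [(x, x, x) for x in (0, 1)]
# Independence: three independent fair bits.
indep_rows = [(x, y, z) for x in (0, 1) for y in (0, 1) for z in (0, 1)]

i3_xor = mutual_information_k(xor_rows, (0, 1, 2))      # negative: synergy
i3_copy = mutual_information_k(copy_rows, (0, 1, 2))    # positive: covariation
i3_indep = mutual_information_k(indep_rows, (0, 1, 2))  # zero: independence
i2_xor = mutual_information_k(xor_rows, (0, 1))         # pairs look independent

print(i3_xor, i3_copy, i3_indep, i2_xor)
```

In this sketch the XOR triple has vanishing pairwise I_2 but a negative I_3, which is the kind of synergistic structure the abstract associates with minimal negative I_k. The plug-in estimate used here ignores the undersampling bias the abstract warns about, so it is only suitable for small illustrative examples.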

