Paper title
Higher-order interactions in statistical physics and machine learning: A model-independent solution to the inverse problem at equilibrium
Paper authors
Paper abstract
The problem of inferring pairwise and higher-order interactions from observational data in complex systems with large numbers of interacting variables is fundamental to many fields. Known to the statistical physics community as the inverse problem, it has become accessible in recent years owing to the generation of real and simulated 'big' data. Current approaches to the inverse problem rely on parametric assumptions, physical approximations such as mean-field theory, and the neglect of higher-order interactions, which may lead to biased or incorrect estimates. We bypass these shortcomings using a cross-disciplinary approach and demonstrate that none of these assumptions and approximations are necessary: we introduce a universal, model-independent, and fundamentally unbiased estimator of all-order symmetric interactions, via the non-parametric framework of Targeted Learning, a subfield of mathematical statistics. Due to its universality, our definition is readily applicable to any system at equilibrium with binary and categorical variables, be it magnetic spins, nodes in a neural network, or protein networks in biology. Our approach is targeted: it does not require fitting unnecessary parameters. Instead, it expends all data on estimating interactions, hence substantially increasing accuracy. We demonstrate the generality of our technique both analytically and numerically on (i) the 2-dimensional Ising model, (ii) an Ising-like model with 4-point interactions, (iii) the Restricted Boltzmann Machine, and (iv) simulated individual-level human DNA variants and representative traits. The last of these demonstrates the applicability of this approach to discovering epistatic interactions that are causal for disease in population biomedicine.