Paper title
Towards IID representation learning and its application on biomedical data
Paper authors
Abstract
Due to the heterogeneity of real-world data, the widely accepted independent and identically distributed (IID) assumption has been criticized in recent studies on causality. In this paper, we argue that instead of being a questionable assumption, IID is a fundamental task-relevant property that needs to be learned. Considering $k$ independent random vectors $\mathsf{X}^{i = 1, \ldots, k}$, we elaborate on how a variety of different causal questions can be reformulated as learning a task-relevant function $\phi$ that induces IID among $\mathsf{Z}^i := \phi \circ \mathsf{X}^i$, which we term IID representation learning. As a proof of concept, we examine IID representation learning on Out-of-Distribution (OOD) generalization tasks. Concretely, using the representation obtained via the learned function that induces IID, we predict molecular characteristics (molecular prediction) on two biomedical datasets with real-world distribution shifts introduced by a) preanalytical variation and b) sampling protocol. To enable reproducibility and comparison with state-of-the-art (SOTA) methods, we follow the OOD benchmarking guidelines recommended by WILDS. Compared to the SOTA baselines supported in WILDS, the results confirm the superior performance of IID representation learning on OOD tasks. The code is publicly accessible via https://github.com/CTPLab/IID_representation_learning.
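As a reading aid, the IID condition stated in the abstract can be written out explicitly. This is a sketch under assumptions rather than the paper's own formulation: $P_{\mathsf{Z}^i}$ is notation introduced here for the distribution of $\mathsf{Z}^i$, and the map is assumed to be measurable and applied to each $\mathsf{X}^i$ separately.

$$
\mathsf{Z}^i := \phi \circ \mathsf{X}^i \quad (i = 1, \ldots, k), \qquad P_{\mathsf{Z}^1} = P_{\mathsf{Z}^2} = \cdots = P_{\mathsf{Z}^k}.
$$

Under these assumptions, the $\mathsf{Z}^i$ inherit mutual independence from the $\mathsf{X}^i$, since functions of independent random vectors are themselves independent. The part of the IID property that has to be learned is therefore the equality of distributions.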