Causal Discovery using Compression-Complexity Measures

Paper Title

Causal Discovery using Compression-Complexity Measures

Authors

Pranay SY, Nithin Nagaraj

Abstract

Causal inference is one of the most fundamental problems across all domains of science. We address the problem of inferring a causal direction from two observed discrete symbolic sequences $X$ and $Y$. We present a framework that relies on lossless compressors to infer context-free grammars (CFGs) from sequence pairs and quantifies the extent to which the grammar inferred from one sequence compresses the other sequence. We infer that $X$ causes $Y$ if the grammar inferred from $X$ compresses $Y$ better than the grammar inferred from $Y$ compresses $X$. To put this notion into practice, we propose three models that use the Compression-Complexity Measures (CCMs), namely Lempel-Ziv (LZ) complexity and Effort-To-Compress (ETC), to infer CFGs and discover causal directions without requiring temporal structure. We evaluate these models on synthetic and real-world benchmarks and empirically observe performance competitive with current state-of-the-art methods. Lastly, we present two unique applications of the proposed models for causal inference directly from pairs of genome sequences belonging to the SARS-CoV-2 virus. Using a large number of sequences, we show that our models capture directed causal information exchange between sequence pairs, presenting novel opportunities for addressing key issues such as contact tracing, motif discovery, and the evolution of virulence and pathogenicity in future applications.
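
The decision rule in the abstract (prefer the direction in which one sequence's grammar compresses the other better) can be illustrated with a toy sketch. This is not any of the three models proposed in the paper: it assumes an LZ78-style phrase dictionary as a crude stand-in for an inferred grammar, and the function names `lz78_phrases`, `cross_parse_cost`, and `infer_direction` are hypothetical. It also assumes both sequences have the same length, so that raw phrase counts are comparable.

```python
def lz78_phrases(seq):
    """Collect the phrases from an LZ78-style incremental parse of seq.

    Assumption: this phrase set stands in for a grammar inferred from seq.
    """
    phrases = set()
    current = ""
    for symbol in seq:
        current += symbol
        if current not in phrases:
            phrases.add(current)
            current = ""
    return phrases


def cross_parse_cost(source, target):
    """Count the phrases needed to cover target greedily with source's phrases.

    At each position the longest matching phrase of length >= 2 is used;
    otherwise a single symbol is consumed as a fallback. Lower cost means
    the phrases learned from source compress target better.
    """
    phrases = lz78_phrases(source)
    max_len = max((len(p) for p in phrases), default=1)
    cost = 0
    i = 0
    while i < len(target):
        step = 1  # fallback: consume one symbol
        for length in range(min(max_len, len(target) - i), 1, -1):
            if target[i:i + length] in phrases:
                step = length
                break
        i += step
        cost += 1
    return cost


def infer_direction(x, y):
    """Toy decision rule: pick the direction with the lower cross-parse cost."""
    cost_x_to_y = cross_parse_cost(x, y)
    cost_y_to_x = cross_parse_cost(y, x)
    if cost_x_to_y < cost_y_to_x:
        return "X -> Y"
    if cost_y_to_x < cost_x_to_y:
        return "Y -> X"
    return "undecided"


if __name__ == "__main__":
    print(infer_direction("abcabc", "ababab"))
```

The greedy cross-parse is only one of many ways to measure how well one sequence's structure describes another; a faithful implementation would replace `lz78_phrases` with the LZ or ETC based grammar inference described in the paper.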
