Conceptualizing Treatment Leakage in Text-based Causal Inference

Paper title

Conceptualizing Treatment Leakage in Text-based Causal Inference

Authors

Adel Daoud, Connor T. Jerzak, Richard Johansson

Abstract

Causal inference methods that control for text-based confounders are becoming increasingly important in the social sciences and other disciplines where text is readily available. However, these methods rely on a critical assumption that there is no treatment leakage: that is, the text only contains information about the confounder and no information about treatment assignment. When this assumption does not hold, methods that control for text to adjust for confounders face the problem of post-treatment (collider) bias. However, the assumption that there is no treatment leakage may be unrealistic in real-world situations involving text, as human language is rich and flexible. Language appearing in a public policy document or health records may refer to the future and the past simultaneously, and thereby reveal information about the treatment assignment. In this article, we define the treatment-leakage problem, and discuss the identification as well as the estimation challenges it raises. Second, we delineate the conditions under which leakage can be addressed by removing the treatment-related signal from the text in a pre-processing step we define as text distillation. Lastly, using simulation, we show how treatment leakage introduces a bias in estimates of the average treatment effect (ATE) and how text distillation can mitigate this bias.
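
The bias described in the abstract can be illustrated with a toy linear simulation. This is only a minimal sketch under stated assumptions, not the authors' simulation design: the text is reduced to a single numeric feature, the leakage strength `gamma` and the helper names `ols_treatment_coef` and `simulate` are hypothetical, and distillation is modelled as an oracle that removes the treatment-dependent component exactly.

```python
import numpy as np


def ols_treatment_coef(y, t, control=None):
    """Return the OLS coefficient on t when regressing y on [1, t, control]."""
    cols = [np.ones_like(y), t]
    if control is not None:
        cols.append(control)
    X = np.column_stack(cols)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[1]


def simulate(n=50_000, tau=1.0, beta=2.0, gamma=1.0, seed=0):
    """Toy data: u confounds t and y; the text feature leaks t with strength gamma."""
    rng = np.random.default_rng(seed)
    u = rng.normal(size=n)                    # latent confounder
    p = 1.0 / (1.0 + np.exp(-u))              # treatment probability depends on u
    t = rng.binomial(1, p).astype(float)      # binary treatment
    y = tau * t + beta * u + rng.normal(size=n)

    w_leaky = u + gamma * t                   # assumed text feature with treatment leakage
    w_distilled = w_leaky - gamma * t         # oracle distillation (assumption): leak removed

    return {
        "unadjusted": ols_treatment_coef(y, t),
        "leaky": ols_treatment_coef(y, t, w_leaky),
        "distilled": ols_treatment_coef(y, t, w_distilled),
    }
```

Calling `simulate()` returns three estimates of the treatment coefficient. In this linear setup, adjusting for the leaky feature targets `tau - beta * gamma` rather than `tau`, while adjusting for the distilled feature targets `tau`. Real text is high-dimensional and the treatment signal is rarely separable this cleanly, so the sketch only shows the direction of the argument.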
