Paper Title
Is Multihop QA in DiRe Condition? Measuring and Reducing Disconnected Reasoning
Paper Authors
Paper Abstract
Has there been real progress in multi-hop question-answering? Models often exploit dataset artifacts to produce correct answers, without connecting information across multiple supporting facts. This limits our ability to measure true progress and defeats the purpose of building multi-hop QA datasets. We make three contributions towards addressing this. First, we formalize such undesirable behavior as disconnected reasoning across subsets of supporting facts. This allows developing a model-agnostic probe for measuring how much any model can cheat via disconnected reasoning. Second, using a notion of "contrastive support sufficiency", we introduce an automatic transformation of existing datasets that reduces the amount of disconnected reasoning. Third, our experiments suggest that there hasn't been much progress in multi-hop QA in the reading comprehension setting. For a recent large-scale model (XLNet), we show that only 18 points out of its answer F1 score of 72 on HotpotQA are obtained through multifact reasoning, roughly the same as that of a simpler RNN baseline. Our transformation substantially reduces disconnected reasoning (19 points in answer F1). It is complementary to adversarial approaches, yielding further reductions in conjunction.
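To make the probe idea concrete, here is a minimal sketch (not the authors' released code) of measuring disconnected reasoning: a model "cheats" if it can produce the correct answer from contexts in which the supporting facts are split across a partition, so that no single context contains the complete reasoning chain. The function name dire_probe, the model-callable signature, and the exact-match comparison below are illustrative assumptions; the abstract reports answer F1 rather than exact match.

```python
from itertools import combinations
from typing import Callable, List, Sequence

def dire_probe(
    model: Callable[[str, Sequence[str]], str],  # (question, paragraphs) -> answer
    question: str,
    paragraphs: List[str],
    supporting_idx: List[int],  # indices of the gold supporting paragraphs
    gold_answer: str,
) -> bool:
    """Return True if the model can answer via disconnected reasoning:
    for some 2-way split of the supporting facts, it answers correctly
    on both probe contexts, each of which is missing one side of the
    split and hence lacks the full reasoning chain."""
    support = set(supporting_idx)
    for r in range(1, len(support)):
        for group in combinations(sorted(support), r):
            left, right = set(group), support - set(group)
            # Each probe context removes one side of the split, keeping
            # the other side plus all distractor paragraphs.
            ctx_without_right = [p for i, p in enumerate(paragraphs) if i not in right]
            ctx_without_left = [p for i, p in enumerate(paragraphs) if i not in left]
            if (model(question, ctx_without_right) == gold_answer
                    and model(question, ctx_without_left) == gold_answer):
                return True  # never needed all supporting facts in one context
    return False
```

For a typical HotpotQA question with two supporting paragraphs, this reduces to two probe contexts, each containing exactly one of the two gold paragraphs; a model that answers both correctly never had to connect them.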