Paper Title

Finding Dataset Shortcuts with Grammar Induction

Paper Authors

Dan Friedman, Alexander Wettig, Danqi Chen

Paper Abstract

Many NLP datasets have been found to contain shortcuts: simple decision rules that achieve surprisingly high accuracy. However, it is difficult to discover shortcuts automatically. Prior work on automatic shortcut detection has focused on enumerating features like unigrams or bigrams, which can find only low-level shortcuts, or relied on post-hoc model interpretability methods like saliency maps, which reveal qualitative patterns without a clear statistical interpretation. In this work, we propose to use probabilistic grammars to characterize and discover shortcuts in NLP datasets. Specifically, we use a context-free grammar to model patterns in sentence classification datasets and use a synchronous context-free grammar to model datasets involving sentence pairs. The resulting grammars reveal interesting shortcut features in a number of datasets, including both simple and high-level features, and automatically identify groups of test examples on which conventional classifiers fail. Finally, we show that the features we discover can be used to generate diagnostic contrast examples and incorporated into standard robust optimization methods to improve worst-group accuracy.
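As a rough illustration of the shortcut-detection setting described in the abstract (not the paper's grammar-based method), the sketch below enumerates unigram features on a hypothetical toy dataset and keeps those whose decision rule "predict label y whenever the word is present" has high precision; this is the kind of low-level enumeration that prior work relies on and that the paper generalizes with induced (synchronous) context-free grammars. The example data, thresholds, and function name are assumptions for illustration only.

```python
from collections import defaultdict

# Hypothetical sentence-classification data: (sentence, label).
data = [
    ("the movie was absolutely wonderful", 1),
    ("a wonderful and moving film", 1),
    ("wonderful acting but a dull plot", 0),
    ("the plot was dull and predictable", 0),
    ("dull, lifeless, and far too long", 0),
    ("a moving story with great acting", 1),
]

def unigram_shortcuts(examples, min_count=2, min_precision=0.9):
    """Enumerate unigram features and keep those that predict one label
    with high precision -- a simple decision rule of the kind the
    abstract calls a dataset shortcut."""
    counts = defaultdict(lambda: defaultdict(int))  # word -> label -> count
    for sentence, label in examples:
        for word in set(sentence.split()):
            counts[word][label] += 1
    shortcuts = []
    for word, label_counts in counts.items():
        total = sum(label_counts.values())
        best_label, best_count = max(label_counts.items(), key=lambda kv: kv[1])
        precision = best_count / total
        if total >= min_count and precision >= min_precision:
            shortcuts.append((word, best_label, precision, total))
    # Sort by support, then precision, so the most pervasive shortcuts come first.
    return sorted(shortcuts, key=lambda s: (-s[3], -s[2]))

for word, label, precision, support in unigram_shortcuts(data):
    print(f"'{word}' -> label {label} (precision {precision:.2f}, support {support})")
```

The paper's approach replaces these flat unigram/bigram features with productions from probabilistic grammars, which can capture higher-level patterns and, per the abstract, identify groups of test examples on which conventional classifiers fail.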
