Title
Spurious Correlations in Reference-Free Evaluation of Text Generation
Authors
Abstract
Model-based, reference-free evaluation metrics have been proposed as a fast and cost-effective approach to evaluate Natural Language Generation (NLG) systems. Despite promising recent results, we find evidence that reference-free evaluation metrics of summarization and dialog generation may be relying on spurious correlations with measures such as word overlap, perplexity, and length. We further observe that for text summarization, these metrics have high error rates when ranking current state-of-the-art abstractive summarization systems. We demonstrate that these errors can be mitigated by explicitly designing evaluation metrics to avoid spurious features in reference-free evaluation.
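The kind of analysis the abstract describes can be illustrated with a minimal probe: if a learned, reference-free metric's scores are largely predictable from a shallow feature such as word overlap with the source (or output length), the metric may be exploiting that feature rather than judging quality. The sketch below is illustrative only; the outputs and metric scores are invented, not taken from the paper, and `word_overlap` is one simple choice of shallow feature among those the authors name (overlap, perplexity, length).

```python
def word_overlap(source: str, output: str) -> float:
    """Fraction of output tokens that also appear in the source text."""
    src_tokens = set(source.lower().split())
    out_tokens = output.lower().split()
    if not out_tokens:
        return 0.0
    return sum(t in src_tokens for t in out_tokens) / len(out_tokens)

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Toy source document and candidate summaries (hypothetical data).
source = "the quick brown fox jumped over the lazy dog near the river bank"
outputs = [
    "the fox jumped over the dog",        # high overlap with the source
    "the lazy dog sat near the bank",     # high overlap
    "a canine leapt across the stream",   # low overlap, abstractive phrasing
    "completely unrelated words here",    # no overlap
]
# Hypothetical scores from some learned reference-free metric.
metric_scores = [0.92, 0.85, 0.40, 0.15]

overlaps = [word_overlap(source, o) for o in outputs]
r = pearson(metric_scores, overlaps)
# A correlation near 1.0 would suggest the metric tracks surface overlap
# rather than summary quality -- the warning sign the paper investigates.
```

With these invented numbers the correlation comes out close to 1.0, which on real data would prompt exactly the follow-up the abstract proposes: redesigning the metric so that such shallow features cannot dominate its scores.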