Paper Title

Active Evaluation: Efficient NLG Evaluation with Few Pairwise Comparisons

Paper Authors

Akash Kumar Mohankumar, Mitesh M. Khapra

Paper Abstract

Recent studies have shown the advantages of evaluating NLG systems using pairwise comparisons as opposed to direct assessment. Given $k$ systems, a naive approach for identifying the top-ranked system would be to uniformly obtain pairwise comparisons from all ${k \choose 2}$ pairs of systems. However, this can be very expensive as the number of human annotations required would grow quadratically with $k$. In this work, we introduce Active Evaluation, a framework to efficiently identify the top-ranked system by actively choosing system pairs for comparison using dueling bandit algorithms. We perform extensive experiments with 13 dueling bandits algorithms on 13 NLG evaluation datasets spanning 5 tasks and show that the number of human annotations can be reduced by 80%. To further reduce the number of human annotations, we propose model-based dueling bandit algorithms which combine automatic evaluation metrics with human evaluations. Specifically, we eliminate sub-optimal systems even before the human annotation process and perform human evaluations only on test examples where the automatic metric is highly uncertain. This reduces the number of human annotations required further by 89%. In effect, we show that identifying the top-ranked system requires only a few hundred human annotations, which grow linearly with $k$. Lastly, we provide practical recommendations and best practices to identify the top-ranked system efficiently. Our code has been made publicly available at https://github.com/akashkm99/duelnlg
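To make the active evaluation loop concrete, below is a minimal sketch of a dueling-bandit selection strategy in the spirit of Relative Upper Confidence Bound (RUCB), one family of algorithms studied in the paper. It is an illustrative assumption rather than the paper's implementation (see the duelnlg repository for that): the function name `active_top_system`, the `compare(i, j)` oracle standing in for a single human pairwise annotation, and the `budget` and `alpha` parameters are all hypothetical.

```python
import math
import random

def active_top_system(k, compare, budget, alpha=0.51):
    """Return the index of the estimated top-ranked system among k systems.

    A minimal RUCB-style dueling-bandit sketch (illustrative, not the
    paper's exact implementation). `compare(i, j)` is assumed to return
    True when an annotator prefers system i's output over system j's.
    """
    wins = [[0.0] * k for _ in range(k)]  # wins[i][j]: times i beat j

    for t in range(1, budget + 1):
        def ucb(i, j):
            # Optimistic (upper-confidence) estimate of P(i beats j).
            n = wins[i][j] + wins[j][i]
            if n == 0:
                return 1.0
            return wins[i][j] / n + math.sqrt(alpha * math.log(t) / n)

        # Candidate champions: systems not yet beaten with confidence.
        champions = [i for i in range(k)
                     if all(ucb(i, j) >= 0.5 for j in range(k) if j != i)]
        i = random.choice(champions) if champions else random.randrange(k)

        # Strongest challenger: rival with the highest optimistic chance
        # of beating the current champion candidate.
        j = max((c for c in range(k) if c != i), key=lambda c: ucb(c, i))

        # Spend one human pairwise annotation, then update the win counts.
        if compare(i, j):
            wins[i][j] += 1
        else:
            wins[j][i] += 1

    # Report the system with the best empirical Copeland score.
    return max(range(k),
               key=lambda i: sum(wins[i][j] > wins[j][i]
                                 for j in range(k) if j != i))
```

The point of the sketch is the selection rule: instead of spreading annotations uniformly over all ${k \choose 2}$ pairs, each round pits a plausible champion against its strongest remaining challenger, so annotation effort concentrates on the comparisons that are still informative. For example, a call such as `active_top_system(k=5, compare=ask_annotator, budget=500)` (with `ask_annotator` standing in for whatever interface collects one human preference) would spend 500 pairwise judgements and return the index of the estimated best system.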
