Paper Title
Relevance in Dialogue: Is Less More? An Empirical Comparison of Existing Metrics, and a Novel Simple Metric
Authors
Abstract
In this work, we evaluate various existing dialogue relevance metrics, find a strong dependency on the dataset, often with poor correlation with human relevance scores, and propose modifications to reduce data requirements and domain sensitivity while improving correlation. Our proposed metric achieves state-of-the-art performance on the HUMOD dataset while reducing measured sensitivity to dataset by 37%-66%. We achieve this without fine-tuning a pretrained language model, using only 3,750 unannotated human dialogues and a single negative example. Despite these limitations, we demonstrate competitive performance on four datasets from different domains. Our code, including our metric and experiments, is open-sourced.