Paper Title

HaRiM$^+$: Evaluating Summary Quality with Hallucination Risk

Paper Authors

Seonil Son, Junsoo Park, Jeong-in Hwang, Junghwa Lee, Hyungjong Noh, Yeonsoo Lee

Paper Abstract

One of the challenges of developing a summarization model arises from the difficulty in measuring the factual inconsistency of the generated text. In this study, we reinterpret the decoder overconfidence-regularizing objective suggested in (Miao et al., 2021) as a hallucination risk measurement to better estimate the quality of generated summaries. We propose a reference-free metric, HaRiM+, which only requires an off-the-shelf summarization model to compute the hallucination risk based on token likelihoods. Deploying it requires no additional training of models or ad-hoc modules, which usually need alignment to human judgments. For summary-quality estimation, HaRiM+ records state-of-the-art correlation to human judgment on three summary-quality annotation sets: FRANK, QAGS, and SummEval. We hope that our work, which merits the use of summarization models, facilitates the progress of both automated evaluation and generation of summary.
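As the abstract describes, HaRiM+ scores a summary using only token likelihoods from an off-the-shelf summarization model, so the core ingredient is comparing the probability the model assigns each summary token with the source attended versus without it. Below is a minimal sketch of that ingredient, not the authors' released implementation: the model choice `facebook/bart-large-cnn`, the use of an empty source to approximate the decoder-as-language-model term, and the combined risk proxy in the last lines are illustrative assumptions; the paper's exact HaRiM+ formulation differs in detail.

```python
# Minimal sketch: per-token summary likelihoods from an off-the-shelf
# summarization model, with and without the source document.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "facebook/bart-large-cnn"  # assumed off-the-shelf summarizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).eval()

def token_probs(source: str, summary: str) -> torch.Tensor:
    """Per-token probabilities of `summary` conditioned on `source`."""
    enc = tokenizer(source, return_tensors="pt", truncation=True)
    labels = tokenizer(summary, return_tensors="pt", truncation=True).input_ids
    with torch.no_grad():
        # Passing labels makes the model score the summary teacher-forced.
        logits = model(**enc, labels=labels).logits  # (1, T, vocab)
    probs = logits.softmax(dim=-1)
    # Pick out the probability assigned to each actual summary token.
    return probs.gather(-1, labels.unsqueeze(-1)).squeeze(-1).squeeze(0)  # (T,)

source = "The city council approved the new budget on Monday after a long debate."
summary = "The council approved the budget on Monday."

p_s2s = token_probs(source, summary)  # source-conditioned likelihoods
p_lm = token_probs("", summary)       # empty source ~ language-model-only term (assumption)

# Illustrative risk proxy (not the paper's exact formula): tokens that stay
# probable without the source but lose probability once the source is
# attended are the suspicious, hallucination-prone ones.
risk = ((1.0 - p_s2s) * (p_lm - p_s2s).clamp(min=0)).mean()
print(f"hallucination-risk proxy: {risk.item():.4f}")
```

The intuition carried over from Miao et al. (2021) is visible in the last lines: a well-grounded token gains probability when the encoder sees the source, so tokens whose likelihood does not improve over the source-free decoder signal hallucination risk.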
