SummScore: A Comprehensive Evaluation Metric for Summary Quality Based on Cross-Encoder

Paper Title

SummScore: A Comprehensive Evaluation Metric for Summary Quality Based on Cross-Encoder

Authors

Lin, Wuhang, Li, Shasha, Zhang, Chen, Ji, Bin, Yu, Jie, Ma, Jun, Yi, Zibo

Abstract

Text summarization models are often trained to produce summaries that meet human quality requirements. However, existing evaluation metrics for summary text are only rough proxies for summary quality: they correlate poorly with human scoring and inhibit summary diversity. To solve these problems, we propose SummScore, a comprehensive metric for summary quality evaluation based on a Cross-Encoder. Firstly, by adopting an original-summary measurement mode, in which a summary is compared with the semantics of the original text rather than with a reference summary, SummScore no longer inhibits summary diversity. With the help of a Cross-Encoder pre-trained for text matching, SummScore can effectively capture subtle semantic differences between summaries. Secondly, to improve comprehensiveness and interpretability, SummScore consists of four fine-grained submodels, which separately measure Coherence, Consistency, Fluency, and Relevance. We use semi-supervised, multi-round training to improve the model's performance on extremely limited annotated data. Extensive experiments show that SummScore significantly outperforms existing evaluation metrics in correlation with human scoring across all four dimensions. We also provide SummScore quality evaluation results for 16 mainstream summarization models to support future research.
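
The abstract describes SummScore's design at a high level: each summary is paired with its original text, four dimension-specific submodels score that pair, and the metric is judged by how well its scores correlate with human ratings. The Python sketch below illustrates that pipeline under stated assumptions; it is not the authors' implementation. The `token_overlap` scorer is a hypothetical placeholder for a fine-tuned Cross-Encoder, the names `summ_score`, `PairScorer`, and `DIMENSIONS` are illustrative, and Spearman's rank correlation is used as one common meta-evaluation statistic (the abstract does not say which correlation measure the paper reports). The semi-supervised multi-round training is not modeled, and the human ratings in the demo are made-up toy values.

```python
from typing import Callable, Dict, List, Sequence

# A pair scorer maps (original text, summary) to a quality score.
# ASSUMPTION: in SummScore each dimension is a fine-tuned Cross-Encoder;
# here it is any callable so the sketch runs without model weights.
PairScorer = Callable[[str, str], float]

DIMENSIONS = ("coherence", "consistency", "fluency", "relevance")


def summ_score(source: str, summary: str,
               scorers: Dict[str, PairScorer]) -> Dict[str, float]:
    """Score a summary against its original text, one score per dimension.

    No reference summary is used: each submodel sees the
    (original text, summary) pair directly.
    """
    return {dim: scorers[dim](source, summary) for dim in DIMENSIONS}


def token_overlap(source: str, summary: str) -> float:
    """HYPOTHETICAL stand-in for a Cross-Encoder: share of summary tokens
    that also occur in the original text."""
    source_tokens = set(source.lower().split())
    summary_tokens = summary.lower().split()
    if not summary_tokens:
        return 0.0
    hits = sum(tok in source_tokens for tok in summary_tokens)
    return hits / len(summary_tokens)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation computed on average ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences of length >= 2")
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one side is constant
    return cov / (var_x * var_y) ** 0.5


if __name__ == "__main__":
    source = "the cat sat on the mat"
    summaries = ["the cat sat", "a dog ran far", "cat on mat"]
    human_consistency = [4.5, 1.0, 4.0]  # made-up toy ratings
    scorers = {dim: token_overlap for dim in DIMENSIONS}
    metric = [summ_score(source, s, scorers)["consistency"] for s in summaries]
    print(metric)  # -> [1.0, 0.0, 1.0]
    print(round(spearman(metric, human_consistency), 3))  # -> 0.866
```

In a real setup each entry of `scorers` would wrap a separately fine-tuned model; for example, the sentence-transformers `CrossEncoder` class exposes a `predict` method that scores lists of text pairs. Keeping the four scores separate, rather than collapsing them into one number, is what provides the per-dimension interpretability the abstract describes.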
