Paper Title
Perplexity from PLM Is Unreliable for Evaluating Text Quality
Paper Authors
Paper Abstract
Recently, a number of works have utilized perplexity~(PPL) to evaluate the quality of generated text. They assume that the smaller the PPL, the better the quality (i.e., fluency) of the text under evaluation. However, we find that PPL is an unqualified referee: it cannot evaluate generated text fairly, for the following reasons: (i) the PPL of a short text is larger than that of a long text, which goes against common sense; (ii) repeated text spans can damage the reliability of PPL; and (iii) punctuation marks can heavily affect PPL. Experiments show that PPL is unreliable for evaluating the quality of a given text. Finally, we discuss the key problems with evaluating text quality using language models.
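To make the evaluation setup concrete, here is a minimal sketch of how PPL is typically computed with an off-the-shelf PLM, using GPT-2 via the Hugging Face transformers library as an assumed example (the paper's exact model and settings are not specified here). PPL is the exponential of the mean token-level negative log-likelihood, which is why comparisons across texts of different lengths can behave counterintuitively, as in claim (i).

```python
# A minimal sketch of PPL computation with GPT-2 (an assumed model
# choice, not necessarily the paper's setup).
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """PPL = exp(mean negative log-likelihood of the tokens)."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=input_ids, the model returns the mean
        # cross-entropy loss over the (shifted) target tokens.
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()

# Illustration of claim (i): a fluent short sentence may still
# receive a higher PPL than a longer one.
print(perplexity("The cat sat."))
print(perplexity("The cat sat on the mat in the warm afternoon sun."))
```

Because the per-token loss is averaged before exponentiation, a few high-surprisal tokens dominate short texts, while long texts amortize them; this is one mechanism behind the length effect the abstract describes.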