Paper Title


On the Discrepancy between Density Estimation and Sequence Generation

Paper Authors

Jason Lee, Dustin Tran, Orhan Firat, Kyunghyun Cho

Paper Abstract


Many sequence-to-sequence generation tasks, including machine translation and text-to-speech, can be posed as estimating the density of the output y given the input x: p(y|x). Given this interpretation, it is natural to evaluate sequence-to-sequence models using conditional log-likelihood on a test set. However, the goal of sequence-to-sequence generation (or structured prediction) is to find the best output y^ given an input x, and each task has its own downstream metric R that scores a model output by comparing against a set of references y*: R(y^, y* | x). While we hope that a model that excels in density estimation also performs well on the downstream metric, the exact correlation has not been studied for sequence generation tasks. In this paper, by comparing several density estimators on five machine translation tasks, we find that the correlation between rankings of models based on log-likelihood and BLEU varies significantly depending on the range of the model families being compared. First, log-likelihood is highly correlated with BLEU when we consider models within the same family (e.g. autoregressive models, or latent variable models with the same parameterization of the prior). However, we observe no correlation between rankings of models across different families: (1) among non-autoregressive latent variable models, a flexible prior distribution is better at density estimation but gives worse generation quality than a simple prior, and (2) autoregressive models offer the best translation performance overall, while latent variable models with a normalizing flow prior give the highest held-out log-likelihood across all datasets. Therefore, we recommend using a simple prior for the latent variable non-autoregressive model when fast generation speed is desired.
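The abstract's central measurement is a rank correlation between two ways of ordering models: held-out log-likelihood and BLEU. As an illustration only, the sketch below computes Spearman's rho for hypothetical per-model scores; the model scores and the no-ties rank formula are assumptions for demonstration, not numbers or methodology from the paper.

```python
# Hedged sketch: Spearman rank correlation between hypothetical per-model
# held-out log-likelihoods and BLEU scores. All numbers are illustrative.

def ranks(values):
    """Rank values in descending order (rank 1 = best); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(xs, ys):
    """Spearman's rho via the no-ties formula: 1 - 6*sum(d^2) / (n*(n^2-1))."""
    n = len(xs)
    rx, ry = ranks(xs), ranks(ys)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical scores for four models on one dataset (higher is better).
log_likelihood = [-1.9, -2.1, -2.4, -2.6]
bleu = [27.5, 28.3, 25.1, 24.0]

rho = spearman(log_likelihood, bleu)
print(f"Spearman rho = {rho:.2f}")  # → Spearman rho = 0.80
```

A rho near 1 would mean the two metrics rank models identically (the within-family case the abstract describes), while a rho near 0 matches the cross-family finding of no correlation.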
