A Simple Model for Subject Behavior in Subjective Experiments

Paper title

A Simple Model for Subject Behavior in Subjective Experiments

Authors

Zhi Li, Christos G. Bampis, Lukáš Krasula, Lucjan Janowski, Ioannis Katsavounidis

Abstract

In a subjective experiment to evaluate the perceptual audiovisual quality of multimedia and television services, raw opinion scores collected from test subjects are often noisy and unreliable. To produce the final mean opinion scores (MOS), recommendations such as ITU-R BT.500, ITU-T P.910 and ITU-T P.913 standardize post-test screening procedures to clean up the raw opinion scores, using techniques such as subject outlier rejection and bias removal. In this paper, we analyze the prior standardized techniques to demonstrate their weaknesses. As an alternative, we propose a simple model to account for two of the most dominant behaviors of subject inaccuracy: bias and inconsistency. We further show that this model can also effectively deal with inattentive subjects that give random scores. We propose to use maximum likelihood estimation to jointly solve the model parameters, and present two numerical solvers: the first based on the Newton-Raphson method, and the second based on an alternating projection (AP). We show that the AP solver generalizes the ITU-T P.913 post-test screening procedure by weighting a subject's contribution to the true quality score by her consistency (thus, the quality scores estimated can be interpreted as bias-subtracted consistency-weighted MOS). We compare the proposed methods with the standardized techniques using real datasets and synthetic simulations, and demonstrate that the proposed methods are most valuable when the test conditions are challenging (for example, crowdsourcing and cross-lab studies), offering advantages such as better model-data fit, tighter confidence intervals, better robustness against subject outliers, the absence of hard-coded parameters and thresholds, and auxiliary information on test subjects. The code for this work is open-sourced at https://github.com/Netflix/sureal.
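The abstract's "bias-subtracted consistency-weighted MOS" can be illustrated with a short NumPy sketch. It is not the sureal implementation. It assumes, as an illustration only, that each raw score is the stimulus quality plus a per-subject bias plus Gaussian noise whose standard deviation is that subject's inconsistency, and that every subject rated every stimulus exactly once. The function name `fit_bias_inconsistency` and its parameters are hypothetical.

```python
import numpy as np


def fit_bias_inconsistency(scores, n_iter=100, tol=1e-8):
    """Alternating estimate of quality, subject bias and subject inconsistency.

    Assumed model (illustrative): scores[i, j] = quality[j] + bias[i] + noise,
    with noise ~ N(0, inconsistency[i]**2).
    scores: array of shape (n_subjects, n_stimuli) with no missing entries.
    """
    scores = np.asarray(scores, dtype=float)
    quality = scores.mean(axis=0)  # start from the plain MOS per stimulus
    for _ in range(n_iter):
        residual = scores - quality  # broadcast quality over subjects
        # Bias: a subject's average deviation from the current quality estimate.
        bias = residual.mean(axis=1)
        # Inconsistency: spread of what remains once the bias is removed.
        inconsistency = (residual - bias[:, None]).std(axis=1)
        inconsistency = np.maximum(inconsistency, 1e-6)  # avoid division by zero
        # Quality: bias-subtracted scores, weighted by inverse noise variance.
        weights = inconsistency ** -2
        debiased = scores - bias[:, None]
        new_quality = (weights[:, None] * debiased).sum(axis=0) / weights.sum()
        converged = np.max(np.abs(new_quality - quality)) < tol
        quality = new_quality
        if converged:
            break
    return quality, bias, inconsistency
```

Calling `fit_bias_inconsistency(scores)` on a subjects-by-stimuli matrix returns the three estimates. A subject who gives near-random scores ends up with a large inconsistency and therefore a small weight, which matches the abstract's claim that inattentive subjects are handled without a hard rejection threshold. Quality and bias are only determined up to a common offset in this sketch.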
