Paper Title
Evaluating Multimodal Interactive Agents
Paper Authors
Paper Abstract
Creating agents that can interact naturally with humans is a common goal in artificial intelligence (AI) research. However, evaluating these interactions is challenging: collecting online human-agent interactions is slow and expensive, yet faster proxy metrics often do not correlate well with interactive evaluation. In this paper, we assess the merits of these existing evaluation metrics and present a novel approach to evaluation called the Standardised Test Suite (STS). The STS uses behavioural scenarios mined from real human interaction data. Agents see replayed scenario context, receive an instruction, and are then given control to complete the interaction offline. These agent continuations are recorded and sent to human annotators to mark as success or failure, and agents are ranked according to the proportion of continuations in which they succeed. The resulting STS is fast, controlled, interpretable, and representative of naturalistic interactions. Altogether, the STS consolidates much of what is desirable across many of our standard evaluation metrics, allowing us to accelerate research progress towards producing agents that can interact naturally with humans. A video may be found at https://youtu.be/YR1TngGORGQ.
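The abstract describes ranking agents by the proportion of their scenario continuations that human annotators mark as successful. The sketch below is a minimal illustration of that scoring step only; the function name, data layout, and example labels are hypothetical and are not taken from the paper's implementation.

```python
from collections import defaultdict


def rank_agents_by_sts_success(annotations):
    """Rank agents by the proportion of scenario continuations that
    human annotators labelled as successful.

    `annotations` is an iterable of (agent_name, scenario_id, success)
    tuples, where `success` is the annotator's boolean verdict.
    """
    totals = defaultdict(int)
    successes = defaultdict(int)
    for agent, _scenario, success in annotations:
        totals[agent] += 1
        successes[agent] += int(success)

    scores = {agent: successes[agent] / totals[agent] for agent in totals}
    # Agents with a higher success proportion rank first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    # Illustrative (made-up) annotator labels for two agents over three scenarios.
    example = [
        ("agent_a", "scenario_1", True),
        ("agent_a", "scenario_2", False),
        ("agent_a", "scenario_3", True),
        ("agent_b", "scenario_1", True),
        ("agent_b", "scenario_2", True),
        ("agent_b", "scenario_3", True),
    ]
    print(rank_agents_by_sts_success(example))
```

In this toy example, agent_b succeeds in 3/3 continuations and agent_a in 2/3, so agent_b ranks first; the actual STS pipeline additionally involves replaying scenario context to the agent and collecting the offline continuations before annotation.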