Paper Title
Exploring and Analyzing Machine Commonsense Benchmarks
Paper Authors
Paper Abstract
Commonsense question-answering (QA) tasks, in the form of benchmarks, are constantly being introduced to challenge and compare commonsense QA systems. These benchmarks provide question sets that system developers can use to train and test new models before submitting their implementations to official leaderboards. Although these tasks are created to evaluate systems along identified dimensions (e.g., topic, reasoning type), this metadata is limited and is largely presented in an unstructured format or not present at all. Because machine common sense (MCS) is a fast-paced field, the problem of fully assessing current benchmarks and systems with regard to these evaluation dimensions is aggravated. We argue that the lack of a common vocabulary for aligning these approaches' metadata limits researchers both in their efforts to understand systems' deficiencies and in their ability to make effective choices for future tasks. In this paper, we first discuss the MCS ecosystem in terms of its elements and their metadata. Then, we present how we are supporting the assessment of approaches by initially focusing on commonsense benchmarks. We describe our initial MCS Benchmark Ontology, an extensible common vocabulary that formalizes benchmark metadata, and showcase how it supports the development of a benchmark tool that enables benchmark exploration and analysis.
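To illustrate the general idea of formalizing benchmark metadata in an ontology and then exploring it programmatically, the sketch below builds a tiny RDF graph in Python with rdflib and runs a SPARQL query over it. The namespace IRI and the class and property names (Benchmark, hasTaskFormat, hasTopic, requiresReasoningType) are illustrative assumptions for this example only, not the actual terms of the MCS Benchmark Ontology or the authors' tool.

```python
from rdflib import Graph, Literal, Namespace, RDF, RDFS

# Hypothetical namespace and vocabulary terms; the real MCS Benchmark
# Ontology may define different IRIs, classes, and properties.
MCSB = Namespace("http://example.org/mcs-benchmark#")

g = Graph()
g.bind("mcsb", MCSB)

# Describe one commonsense QA benchmark with structured metadata
# (task format, topic, and the reasoning type it exercises).
bench = MCSB.ExampleBenchmark
g.add((bench, RDF.type, MCSB.Benchmark))
g.add((bench, RDFS.label, Literal("Example commonsense QA benchmark")))
g.add((bench, MCSB.hasTaskFormat, MCSB.MultipleChoiceQA))
g.add((bench, MCSB.hasTopic, MCSB.SocialInteractions))
g.add((bench, MCSB.requiresReasoningType, MCSB.CausalReasoning))

# With metadata in this form, a benchmark-exploration tool could answer
# questions such as "which benchmarks exercise causal reasoning?".
query = """
SELECT ?benchmark WHERE {
    ?benchmark a mcsb:Benchmark ;
               mcsb:requiresReasoningType mcsb:CausalReasoning .
}
"""
for row in g.query(query, initNs={"mcsb": MCSB}):
    print(row.benchmark)
```

The point of the sketch is only that, once evaluation dimensions such as topic and reasoning type are expressed as shared, machine-readable terms rather than free text, benchmarks can be filtered, compared, and analyzed with ordinary queries.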