Paper title
MLM: A Benchmark Dataset for Multitask Learning with Multiple Languages and Modalities
Paper authors
Paper abstract
In this paper, we introduce the MLM (Multiple Languages and Modalities) dataset, a new resource to train and evaluate multitask systems on samples in multiple modalities and three languages. The generation process and the inclusion of semantic data provide a resource that further tests the ability of multitask systems to learn relationships between entities. The dataset is designed for researchers and developers who build applications that perform multiple tasks on data encountered on the web and in digital archives. A second version of MLM provides a geo-representative subset of the data with weighted samples for countries of the European Union. We demonstrate the value of the resource in developing novel applications in the digital humanities with a motivating use case, and we specify a benchmark set of tasks to retrieve modalities and locate entities in the dataset. Evaluation of baseline multitask and single-task systems on the full and geo-representative versions of MLM demonstrates the challenges of generalising on diverse data. In addition to the digital humanities, we expect the resource to contribute to research in multimodal representation learning, location estimation, and scene understanding.