Paper Title
X-Learner: Learning Cross Sources and Tasks for Universal Visual Representation
Paper Authors
Paper Abstract
In computer vision, pre-trained models based on large-scale supervised learning have proven effective over the past few years. However, existing works mostly focus on learning from an individual task with a single data source (e.g., ImageNet for classification or COCO for detection). This restricted form limits their generalizability and usability because it lacks the rich semantic information available from diverse tasks and data sources. Here, we demonstrate that jointly learning from heterogeneous tasks and multiple data sources yields a universal visual representation that transfers better to various downstream tasks. Thus, learning how to bridge the gaps among different tasks and data sources is the key, yet it remains an open question. In this work, we propose a representation learning framework called X-Learner, which learns universal features across multiple vision tasks supervised by various sources, through an expansion stage and a squeeze stage: 1) Expansion Stage: X-Learner learns task-specific features to alleviate task interference and enriches the representation via reconciliation layers. 2) Squeeze Stage: X-Learner condenses the model to a reasonable size and learns a universal and generalizable representation for transferring to various tasks. Extensive experiments demonstrate that X-Learner achieves strong performance on different tasks without extra annotations, modalities or computational costs compared to existing representation learning methods. Notably, a single X-Learner model shows remarkable gains of 3.0%, 3.3% and 1.8% over current pre-trained models on 12 downstream datasets for classification, object detection and semantic segmentation.
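The abstract only outlines the two stages at a high level. As a reading aid, below is a minimal PyTorch sketch of how an expansion stage (a shared backbone with per-task reconciliation branches) and a squeeze stage (distilling the expanded model into one compact network) could be wired together. All names here (ReconciliationLayer, ExpansionModel, squeeze_step) and the MSE distillation objective are illustrative assumptions, not the paper's actual implementation.

# Minimal sketch of the two-stage X-Learner idea described in the abstract.
# All module names and the distillation loss are assumptions for illustration;
# the paper's actual architecture is not specified in the abstract.
import torch
import torch.nn as nn

class ReconciliationLayer(nn.Module):
    """Hypothetical task-specific adapter that reconciles the shared
    backbone feature with one task's objective (Expansion Stage)."""
    def __init__(self, dim):
        super().__init__()
        self.adapt = nn.Sequential(
            nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim)
        )

    def forward(self, x):
        return x + self.adapt(x)  # residual task-specific refinement

class ExpansionModel(nn.Module):
    """Expansion Stage: one shared backbone plus one reconciliation
    branch per task/source, so tasks interfere less with each other."""
    def __init__(self, backbone, dim, tasks):
        super().__init__()
        self.backbone = backbone
        self.branches = nn.ModuleDict(
            {t: ReconciliationLayer(dim) for t in tasks}
        )

    def forward(self, x, task):
        return self.branches[task](self.backbone(x))

def squeeze_step(student, expanded, x, task, optimizer):
    """Squeeze Stage (sketch): distill the expanded multi-branch model
    into a single compact student holding a universal representation."""
    with torch.no_grad():
        target = expanded(x, task)  # teacher feature for this task
    loss = nn.functional.mse_loss(student(x), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

The design intuition, as the abstract states it: separate per-task branches during expansion keep heterogeneous supervision signals from interfering with one another, and the squeeze step then condenses what the branches learned into one model of reasonable size for downstream transfer.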