Paper Title
Effective Adaptation in Multi-Task Co-Training for Unified Autonomous Driving
Paper Authors
Paper Abstract
Aiming at a holistic understanding of multiple downstream tasks simultaneously, features with better transferability need to be extracted. Though many recent self-supervised pre-training methods have achieved impressive performance on various vision tasks under the prevailing pretrain-finetune paradigm, their generalization capacity to multi-task learning scenarios is yet to be explored. In this paper, we extensively investigate the transfer performance of various types of self-supervised methods, e.g., MoCo and SimCLR, on three downstream tasks, including semantic segmentation, drivable area segmentation, and traffic object detection, on the large-scale driving dataset BDD100K. We surprisingly find that their performance is sub-optimal or even lags far behind the single-task baseline, which may be due to the discrepancies in training objectives and architectural design inherent in the pretrain-finetune paradigm. To overcome this dilemma as well as avoid redesigning the resource-intensive pre-training stage, we propose a simple yet effective pretrain-adapt-finetune paradigm for general multi-task training, where off-the-shelf pretrained models can be effectively adapted without increasing the training overhead. During the adapt stage, we utilize learnable multi-scale adapters to dynamically adjust the pretrained model weights, supervised by multi-task objectives, while leaving the pretrained knowledge untouched. Furthermore, we regard the vision-language pre-training model CLIP as a strong complement to the pretrain-adapt-finetune paradigm and propose a novel adapter named LV-Adapter, which incorporates language priors into the multi-task model via task-specific prompting and alignment between visual and textual features.
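To make the adapt stage more concrete, below is a minimal PyTorch-style sketch of the idea described in the abstract: a frozen pretrained backbone whose multi-scale features are refined by lightweight, learnable adapters trained with the multi-task objectives. All module names, channel sizes, and the bottleneck design are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class MultiScaleAdapter(nn.Module):
    """Conceptual sketch: learnable adapters on top of a frozen backbone.

    One lightweight bottleneck per feature scale refines the pretrained
    features, while the pretrained encoder weights stay untouched.
    The bottleneck ratio and layout are assumptions for illustration.
    """

    def __init__(self, channels, bottleneck_ratio=4):
        super().__init__()
        self.blocks = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(c, c // bottleneck_ratio, kernel_size=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(c // bottleneck_ratio, c, kernel_size=1),
            )
            for c in channels
        ])

    def forward(self, features):
        # Residual adaptation: pretrained features plus a learned correction.
        return [f + block(f) for f, block in zip(features, self.blocks)]


def build_multitask_model(backbone, channels):
    # Freeze the pretrained backbone; only adapters (and task heads) are trained
    # with the multi-task objectives, keeping pretrained knowledge untouched.
    for p in backbone.parameters():
        p.requires_grad = False
    adapter = MultiScaleAdapter(channels)
    return backbone, adapter
```

In the same spirit, the sketch below illustrates one plausible way to inject language priors as LV-Adapter is described: task-specific prompts are encoded (e.g., by a frozen CLIP text encoder), and projected visual features are aligned with the prompt embeddings via cosine similarity. The interface, dimensions, and projection design here are assumptions for illustration, not the paper's LV-Adapter definition.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LVAdapterSketch(nn.Module):
    """Illustrative sketch of vision-language alignment with task-specific prompts."""

    def __init__(self, visual_dim, text_dim):
        super().__init__()
        # Project visual features into the text embedding space.
        self.visual_proj = nn.Conv2d(visual_dim, text_dim, kernel_size=1)

    def forward(self, visual_feat, prompt_embeddings):
        # visual_feat: (B, C, H, W); prompt_embeddings: (K, text_dim),
        # one embedding per task-specific prompt (e.g., "a photo of a drivable area").
        v = F.normalize(self.visual_proj(visual_feat), dim=1)   # (B, D, H, W)
        t = F.normalize(prompt_embeddings, dim=-1)               # (K, D)
        # Per-pixel cosine similarity to each prompt serves as a language-guided prior.
        return torch.einsum("bdhw,kd->bkhw", v, t)
```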