Paper Title

Sparse Mixture-of-Experts are Domain Generalizable Learners

Paper Authors

Bo Li, Yifei Shen, Jingkang Yang, Yezhen Wang, Jiawei Ren, Tong Che, Jun Zhang, Ziwei Liu

Paper Abstract

Human visual perception can easily generalize to out-of-distribution visual data, which is far beyond the capability of modern machine learning models. Domain generalization (DG) aims to close this gap, with existing DG methods mainly focusing on the loss function design. In this paper, we propose to explore an orthogonal direction, i.e., the design of the backbone architecture. It is motivated by an empirical finding that transformer-based models trained with empirical risk minimization (ERM) outperform CNN-based models employing state-of-the-art (SOTA) DG algorithms on multiple DG datasets. We develop a formal framework to characterize a network's robustness to distribution shifts by studying its architecture's alignment with the correlations in the dataset. This analysis guides us to propose a novel DG model built upon vision transformers, namely Generalizable Mixture-of-Experts (GMoE). Extensive experiments on DomainBed demonstrate that GMoE trained with ERM outperforms SOTA DG baselines by a large margin. Moreover, GMoE is complementary to existing DG methods and its performance is substantially improved when trained with DG algorithms.
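
The abstract describes GMoE as a vision transformer whose feed-forward blocks are replaced by sparse Mixture-of-Experts layers with token routing. The snippet below is a minimal sketch of that general idea, assuming a standard top-k token-routing MoE in PyTorch; the class name `SparseMoE`, the expert count, and the routing details are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch (not the authors' code) of a sparse Mixture-of-Experts
# layer with top-k token routing, the kind of block a GMoE-style model can
# use in place of the MLP inside a vision transformer layer.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseMoE(nn.Module):
    def __init__(self, dim, num_experts=8, top_k=2, hidden_mult=4):
        super().__init__()
        self.top_k = top_k
        # Linear router that scores each token against every expert.
        self.router = nn.Linear(dim, num_experts)
        # Each expert is a small feed-forward network (like a ViT MLP).
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(dim, dim * hidden_mult),
                nn.GELU(),
                nn.Linear(dim * hidden_mult, dim),
            )
            for _ in range(num_experts)
        ])

    def forward(self, x):
        # x: (batch, tokens, dim) -> flatten tokens for per-token routing.
        b, t, d = x.shape
        tokens = x.reshape(b * t, d)
        logits = self.router(tokens)                        # (b*t, num_experts)
        weights, indices = logits.topk(self.top_k, dim=-1)  # each token picks k experts
        weights = F.softmax(weights, dim=-1)

        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            # Find tokens whose top-k choices include expert e.
            mask = (indices == e)
            token_idx, slot_idx = mask.nonzero(as_tuple=True)
            if token_idx.numel() == 0:
                continue
            expert_out = expert(tokens[token_idx])
            out[token_idx] += weights[token_idx, slot_idx].unsqueeze(-1) * expert_out
        return out.reshape(b, t, d)


# Usage: a drop-in replacement for the MLP inside a transformer block.
layer = SparseMoE(dim=384)
x = torch.randn(2, 196, 384)   # e.g. 14x14 ViT patch tokens
print(layer(x).shape)          # torch.Size([2, 196, 384])
```

Because only the top-k experts run per token, the layer adds capacity without a proportional increase in per-token compute, which is the property the paper's architecture analysis builds on.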
