Paper Title
On Compositions of Transformations in Contrastive Self-Supervised Learning
Paper Authors
Paper Abstract
In the image domain, excellent representations can be learned by inducing invariance to content-preserving transformations via noise contrastive learning. In this paper, we generalize contrastive learning to a wider set of transformations, and their compositions, for which either invariance or distinctiveness is sought. We show that it is not immediately obvious how existing methods such as SimCLR can be extended to do so. Instead, we introduce a number of formal requirements that all contrastive formulations must satisfy, and propose a practical construction which satisfies these requirements. In order to maximise the reach of this analysis, we express all components of noise contrastive formulations as the choice of certain generalized transformations of the data (GDTs), including data sampling. We then consider videos as an example of data in which a large variety of transformations are applicable, accounting for the extra modalities -- for which we analyze audio and text -- and the dimension of time. We find that being invariant to certain transformations and distinctive to others is critical to learning effective video representations, improving the state-of-the-art for multiple benchmarks by a large margin, and even surpassing supervised pretraining.
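To make the core idea concrete, the following is a minimal sketch of the noise-contrastive (InfoNCE-style) objective that this family of methods builds on, where one transformation is treated as an invariance (producing the positive pair) and other views as distinctive (producing negatives). This is an illustrative assumption, not the authors' GDT implementation; the names `info_nce`, `encoder`, and the toy tensors are hypothetical.

```python
# A minimal sketch, assuming PyTorch; not the paper's actual code.
import torch
import torch.nn.functional as F

def info_nce(z_anchor, z_positive, z_negatives, temperature=0.1):
    """InfoNCE: score each anchor against one positive and K negatives."""
    z_anchor = F.normalize(z_anchor, dim=-1)        # (B, D)
    z_positive = F.normalize(z_positive, dim=-1)    # (B, D)
    z_negatives = F.normalize(z_negatives, dim=-1)  # (K, D)
    pos = (z_anchor * z_positive).sum(dim=-1, keepdim=True)  # (B, 1)
    neg = z_anchor @ z_negatives.t()                         # (B, K)
    logits = torch.cat([pos, neg], dim=1) / temperature
    # The positive always sits at index 0 of the logits.
    labels = torch.zeros(logits.size(0), dtype=torch.long)
    return F.cross_entropy(logits, labels)

# Toy usage: a transformation the model should be *invariant* to (a small
# perturbation, standing in for e.g. cropping) produces the positive view,
# while views the model should be *distinctive* to (e.g. other clips, or a
# time-reversed clip in the video setting) serve as negatives.
B, K, D_in, D = 4, 8, 256, 128
encoder = torch.nn.Linear(D_in, D)  # stand-in for a video encoder
x = torch.randn(B, D_in)
z_a = encoder(x)                               # anchor embeddings
z_p = encoder(x + 0.01 * torch.randn_like(x))  # "invariant" view -> positive
z_n = encoder(torch.randn(K, D_in))            # "distinctive" views -> negatives
loss = info_nce(z_a, z_p, z_n)
loss.backward()
```

The paper's contribution is to make the choice of which transformations fall on the positive (invariance) side and which on the negative (distinctiveness) side a first-class, composable design decision, rather than fixing it as SimCLR-style augmentation does.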