Paper title
Learning strides in convolutional neural networks
Paper authors
Paper abstract
Convolutional neural networks typically contain several downsampling operators, such as strided convolutions or pooling layers, that progressively reduce the resolution of intermediate representations. This provides some shift-invariance while reducing the computational complexity of the whole architecture. A critical hyperparameter of such layers is their stride: the integer factor of downsampling. As strides are not differentiable, finding the best configuration either requires cross-validation or discrete optimization (e.g., architecture search), which rapidly become prohibitive as the search space grows exponentially with the number of downsampling layers. Hence, exploring this search space by gradient descent would allow finding better configurations at a lower computational cost. This work introduces DiffStride, the first downsampling layer with learnable strides. Our layer learns the size of a cropping mask in the Fourier domain that effectively performs resizing in a differentiable way. Experiments on audio and image classification show the generality and effectiveness of our solution: we use DiffStride as a drop-in replacement for standard downsampling layers and outperform them. In particular, we show that introducing our layer into a ResNet-18 architecture maintains consistently high performance on CIFAR10, CIFAR100 and ImageNet, even when training starts from poor random stride configurations. Moreover, formulating strides as learnable variables allows us to introduce a regularization term that controls the computational complexity of the architecture. We show how this regularization allows trading off accuracy for efficiency on ImageNet.
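Below is a minimal sketch (in JAX) of the idea the abstract describes: downsampling by cropping a feature map's spectrum, made differentiable with respect to the stride through a smooth mask whose extent depends on that stride. This is an illustration under stated assumptions, not the authors' implementation; the names `smooth_mask` and `spectral_downsample` and the sigmoid-edged mask are hypothetical choices for the sketch.

```python
import jax
import jax.numpy as jnp


def smooth_mask(length, keep, sharpness=4.0):
    """1D low-pass mask over `length` centered frequency bins.

    `keep` is the fraction of frequencies to retain (roughly 1 / stride); the
    sigmoid edge keeps the mask, and hence the loss, differentiable in `keep`.
    """
    freqs = jnp.abs(jnp.arange(length) - length // 2)  # distance from the DC bin
    cutoff = keep * length / 2.0
    return jax.nn.sigmoid(sharpness * (cutoff - freqs))


def spectral_downsample(x, stride, out_h, out_w):
    """Downsample an (H, W) map by masking and cropping its centered spectrum.

    The smooth mask is differentiable in `stride`; the crop itself only fixes
    the output shape (out_h, out_w), which follows the current stride value.
    """
    h, w = x.shape
    keep = 1.0 / stride
    spectrum = jnp.fft.fftshift(jnp.fft.fft2(x))
    mask = smooth_mask(h, keep)[:, None] * smooth_mask(w, keep)[None, :]
    top, left = (h - out_h) // 2, (w - out_w) // 2
    cropped = (spectrum * mask)[top:top + out_h, left:left + out_w]
    # Rescale so amplitudes stay comparable after the size change.
    return jnp.real(jnp.fft.ifft2(jnp.fft.ifftshift(cropped))) * (out_h * out_w) / (h * w)


# Gradients with respect to the stride flow through the smooth mask, so the
# stride can be updated by gradient descent alongside the network weights.
x = jax.random.normal(jax.random.PRNGKey(0), (32, 32))
stride = 2.0
out_size = int(32 // stride)
loss = lambda s: jnp.sum(spectral_downsample(x, s, out_size, out_size) ** 2)
print(jax.grad(loss)(stride))
```

The design choice this sketch illustrates is that the output shape comes from the current (rounded) stride and is not itself differentiated; the gradient signal on the stride comes entirely from the smooth spectral mask, which is what makes learning strides by backpropagation possible.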