Paper Title
Boosting Binary Neural Networks via Dynamic Thresholds Learning
Paper Authors
Paper Abstract
Developing lightweight Deep Convolutional Neural Networks (DCNNs) and Vision Transformers (ViTs) has become one of the focuses of vision research, since low computational cost is essential for deploying vision models on edge devices. Recently, researchers have explored highly computation-efficient Binary Neural Networks (BNNs) by binarizing the weights and activations of full-precision neural networks. However, the binarization process leads to an enormous accuracy gap between a BNN and its full-precision counterpart. One of the primary reasons is that the Sign function with predefined or learned static thresholds limits the representation capacity of binarized architectures, since single-threshold binarization fails to utilize activation distributions. To overcome this issue, we introduce the statistics of channel information into explicit threshold learning for the Sign function, dubbed DySign, to generate various thresholds based on the input distribution. DySign is a straightforward method to reduce information loss and boost the representation capacity of BNNs, and it can be flexibly applied to both DCNNs and ViTs (i.e., DyBCNN and DyBinaryCCT) to achieve promising performance improvements, as shown in our extensive experiments. For DCNNs, DyBCNN based on two backbones (MobileNetV1 and ResNet18) achieves 71.2% and 67.4% top-1 accuracy on the ImageNet dataset, outperforming the baselines by large margins (1.8% and 1.5%, respectively). For ViTs, DyBinaryCCT demonstrates the superiority of the convolutional embedding layer in fully binarized ViTs and achieves 56.1% on the ImageNet dataset, which is nearly 9% higher than the baseline.
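To make the idea of dynamic thresholds concrete, below is a minimal PyTorch sketch of a DySign-style binarization layer: per-channel statistics of the input are fed to a small learnable mapping that predicts per-channel thresholds, which are subtracted from the activations before the Sign function. The two-layer bottleneck, the reduction ratio, and the use of global average pooling as the channel statistic are illustrative assumptions for this sketch, not the paper's exact design.

```python
import torch
import torch.nn as nn


class DySignSketch(nn.Module):
    """Illustrative dynamic-threshold binarization (not the authors' exact module).

    Channel statistics of the input are mapped to per-sample, per-channel
    thresholds; activations are shifted by these thresholds and then binarized
    with a straight-through estimator so gradients can flow during training.
    """

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        # Small MLP that maps channel-wise statistics to per-channel thresholds
        # (assumed bottleneck design for illustration).
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, C, H, W) full-precision activations.
        stats = x.mean(dim=(2, 3))                      # channel statistics, shape (N, C)
        thresholds = self.fc(stats)[:, :, None, None]   # dynamic thresholds, shape (N, C, 1, 1)
        shifted = x - thresholds
        binary = torch.sign(shifted)                    # values in {-1, 0, +1}
        # Straight-through estimator: forward pass uses the binary values,
        # backward pass passes gradients through the shifted activations.
        return shifted + (binary - shifted).detach()
```

In a binarized block, such a module would replace the plain Sign activation in front of a binary convolution or binary linear layer, so each channel is binarized against an input-dependent threshold rather than a single static one.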