Paper Title

When CNNs Meet Random RNNs: Towards Multi-Level Analysis for RGB-D Object and Scene Recognition

Authors

Ali Caglayan, Nevrez Imamoglu, Ahmet Burak Can, Ryosuke Nakamura

Abstract

Object and scene recognition are two challenging but essential tasks in image understanding. In particular, the use of RGB-D sensors in handling these tasks has emerged as an important focus area for better visual understanding. Meanwhile, deep neural networks, specifically convolutional neural networks (CNNs), have become widespread and have been applied to many visual tasks by replacing hand-crafted features with effective deep features. However, how to effectively exploit the deep features of a multi-layer CNN model remains an open problem. In this paper, we propose a novel two-stage framework that extracts discriminative feature representations from multi-modal RGB-D images for object and scene recognition tasks. In the first stage, a pretrained CNN model is employed as a backbone to extract visual features at multiple levels. The second stage efficiently maps these features into high-level representations with a fully randomized structure of recursive neural networks (RNNs). To cope with the high dimensionality of CNN activations, a random weighted pooling scheme is proposed by extending the idea of randomness in RNNs. Multi-modal fusion is performed through a soft voting approach, with weights computed from the individual recognition confidences (i.e., SVM scores) of the RGB and depth streams. This yields consistent class label estimates in the final RGB-D classification. Extensive experiments verify that the fully randomized structure in the RNN stage successfully encodes CNN activations into discriminative solid features. Comparative experimental results on the popular Washington RGB-D Object and SUN RGB-D Scene datasets show that the proposed approach achieves performance superior or comparable to state-of-the-art methods on both object and scene recognition tasks. Code is available at https://github.com/acaglayan/CNN_randRNN.
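The abstract compresses several moving parts: multi-level CNN features, untrained (fully random) recursive networks, random weighted pooling, and confidence-weighted soft voting. The sketch below illustrates these ideas in plain NumPy. It is not the authors' implementation (that lives at https://github.com/acaglayan/CNN_randRNN); the function names, array shapes, the four-children merge, and the zero-padding of leftover spatial positions are all assumptions made here for illustration.

```python
# Illustrative sketch of the pipeline described in the abstract.
# NOT the authors' code; shapes, names, and padding choices are assumed.
import numpy as np

rng = np.random.default_rng(0)


def random_weighted_pool(features, out_channels):
    """Reduce the channel dimension of a (C, H, W) activation map with a
    fixed random convex combination of channels (assumed form of the
    paper's random weighted pooling idea)."""
    c, _, _ = features.shape
    W = rng.random((out_channels, c))
    W /= W.sum(axis=1, keepdims=True)          # rows sum to 1
    return np.einsum('oc,chw->ohw', W, features)


def random_rnn_encode(features, num_rnn=4, num_children=4):
    """Encode a (C, H, W) activation map with `num_rnn` untrained recursive
    nets: each repeatedly merges groups of `num_children` column vectors
    through a fixed random weight matrix and tanh until a single C-dim
    vector remains; the outputs of all nets are concatenated."""
    c = features.shape[0]
    cols0 = features.reshape(c, -1)            # columns = spatial positions
    outputs = []
    for _ in range(num_rnn):
        # Random, never-trained merge weights for this RNN.
        W = rng.standard_normal((c, num_children * c)) / np.sqrt(num_children * c)
        cols = cols0
        while cols.shape[1] > 1:
            n = cols.shape[1]
            if n % num_children:               # zero-pad to a full group
                pad = np.zeros((c, num_children - n % num_children))
                cols = np.concatenate([cols, pad], axis=1)
            groups = cols.reshape(c, -1, num_children)           # (C, G, k)
            stacked = groups.transpose(1, 2, 0).reshape(-1, num_children * c)
            cols = np.tanh(stacked @ W.T).T                      # (C, G)
        outputs.append(cols[:, 0])
    return np.concatenate(outputs)             # num_rnn * C features


def soft_vote(rgb_scores, depth_scores):
    """Fuse per-class SVM scores from the RGB and depth streams, weighting
    each stream by its own top confidence so the surer modality counts more."""
    fused = rgb_scores.max() * rgb_scores + depth_scores.max() * depth_scores
    return int(np.argmax(fused))


# Toy end-to-end run with fake CNN activations and fake SVM scores.
feat = rng.standard_normal((64, 7, 7))         # e.g. one backbone level
feat = random_weighted_pool(feat, 32)          # 64 -> 32 channels
code = random_rnn_encode(feat)                 # fixed-length descriptor
print(code.shape)                              # (128,) = 4 RNNs * 32 channels
print(soft_vote(np.array([0.2, 0.9, 0.1]), np.array([0.4, 0.3, 0.5])))
```

In the paper's pipeline, such fixed-length descriptors would be computed per backbone level for the RGB and depth streams separately, each stream would feed a linear SVM, and the resulting per-class scores would be fused by the soft vote; the key design point the sketch tries to convey is that the recursive weights are drawn once at random and never trained.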
