Paper Title
HALSIE: Hybrid Approach to Learning Segmentation by Simultaneously Exploiting Image and Event Modalities
Paper Authors
Paper Abstract
Event cameras detect changes in per-pixel intensity to generate asynchronous `event streams'. They offer great potential for accurate semantic map retrieval in real-time autonomous systems owing to their much higher temporal resolution and high dynamic range (HDR) compared to conventional cameras. However, existing implementations for event-based segmentation suffer from sub-optimal performance since these temporally dense events only measure the varying component of a visual signal, limiting their ability to encode dense spatial context compared to frames. To address this issue, we propose a hybrid end-to-end learning framework HALSIE, utilizing three key concepts to reduce inference cost by up to $20\times$ versus prior art while retaining similar performance: First, a simple and efficient cross-domain learning scheme to extract complementary spatio-temporal embeddings from both frames and events. Second, a specially designed dual-encoder scheme with Spiking Neural Network (SNN) and Artificial Neural Network (ANN) branches to minimize latency while retaining cross-domain feature aggregation. Third, a multi-scale cue mixer to model rich representations of the fused embeddings. These qualities of HALSIE allow for a very lightweight architecture achieving state-of-the-art segmentation performance on DDD-17, MVSEC, and DSEC-Semantic datasets with up to $33\times$ higher parameter efficiency and favorable inference cost (17.9mJ per cycle). Our ablation study also brings new insights into effective design choices that can prove beneficial for research across other vision tasks.
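As background to the abstract above, an asynchronous event stream is typically a list of `(x, y, timestamp, polarity)` tuples, and a common way to let a conventional (ANN) encoder consume it is to bin the events into a dense, frame-like tensor. The sketch below illustrates this preprocessing idea only; it is not HALSIE's actual pipeline, and the event format, grid size, and function name are assumptions made for the example.

```python
# Illustrative sketch (not the paper's implementation): accumulate an
# asynchronous event stream into a dense 2-channel count histogram so a
# frame-based encoder can process it alongside conventional images.

def events_to_histogram(events, height, width):
    """Bin events (x, y, t, polarity) into per-pixel ON/OFF counts.

    Returns a [2][height][width] nested list: channel 0 counts
    positive-polarity (ON) events, channel 1 counts negative (OFF) ones.
    """
    hist = [[[0] * width for _ in range(height)] for _ in range(2)]
    for x, y, t, p in events:
        ch = 0 if p > 0 else 1  # route by polarity
        hist[ch][y][x] += 1
    return hist

# Toy stream of three events: (x, y, timestamp_s, polarity)
events = [(1, 0, 0.001, +1), (1, 0, 0.002, +1), (0, 1, 0.003, -1)]
hist = events_to_histogram(events, height=2, width=2)
print(hist[0][0][1])  # -> 2 (two ON events at pixel x=1, y=0)
print(hist[1][1][0])  # -> 1 (one OFF event at pixel x=0, y=1)
```

Representations like this trade away the microsecond temporal resolution that the abstract credits to event cameras, which is why hybrid designs pair such dense views with a spiking (SNN) branch that operates on the raw, temporally dense events.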