Paper Title
Revisiting Multi-Scale Feature Fusion for Semantic Segmentation
Paper Authors
Paper Abstract
It is commonly believed that high internal resolution combined with expensive operations (e.g. atrous convolutions) are necessary for accurate semantic segmentation, resulting in slow speed and large memory usage. In this paper, we question this belief and demonstrate that neither high internal resolution nor atrous convolutions are necessary. Our intuition is that although segmentation is a dense per-pixel prediction task, the semantics of each pixel often depend on both nearby neighbors and far-away context; therefore, a more powerful multi-scale feature fusion network plays a critical role. Following this intuition, we revisit the conventional multi-scale feature space (typically capped at P5) and extend it to a much richer space, up to P9, where the smallest features are only 1/512 of the input size and thus have very large receptive fields. To process such a rich feature space, we leverage the recent BiFPN to fuse the multi-scale features. Based on these insights, we develop a simplified segmentation model, named ESeg, which has neither high internal resolution nor expensive atrous convolutions. Perhaps surprisingly, our simple method can achieve better accuracy with faster speed than prior art across multiple datasets. In real-time settings, ESeg-Lite-S achieves 76.0% mIoU on CityScapes [12] at 189 FPS, outperforming FasterSeg [9] (73.1% mIoU at 170 FPS). Our ESeg-Lite-L runs at 79 FPS and achieves 80.1% mIoU, largely closing the gap between real-time and high-performance segmentation models.
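To make the abstract's two core ideas concrete, here is a minimal PyTorch-style sketch, not ESeg's released code: layer choices, channel sizes, and class names (`ExtendedPyramid`, `FastFusion`) are illustrative assumptions. It shows (1) extending a backbone pyramid from P5 up to P9 by repeated 2x downsampling, so that level Pi has stride 2^i and P9 features are 1/512 of the input size, and (2) BiFPN's fast normalized weighted fusion of two pyramid levels.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExtendedPyramid(nn.Module):
    """Extend backbone features P3-P5 to P6-P9 by repeated 2x downsampling.

    Level Pi has stride 2**i relative to the input, so P9 is 1/512 of the
    input size and carries a very large receptive field. The ops used here
    (max-pool + 1x1 conv) are illustrative, not ESeg's exact layers.
    """

    def __init__(self, channels: int = 64):
        super().__init__()
        self.proj = nn.ModuleList(
            [nn.Conv2d(channels, channels, kernel_size=1) for _ in range(4)]
        )

    def forward(self, p3, p4, p5):
        feats = [p3, p4, p5]
        x = p5
        for proj in self.proj:                       # build P6, P7, P8, P9
            x = proj(F.max_pool2d(x, kernel_size=2))
            feats.append(x)
        return feats                                 # [P3, P4, ..., P9]

class FastFusion(nn.Module):
    """BiFPN-style fast normalized fusion of two pyramid levels."""

    def __init__(self, channels: int = 64):
        super().__init__()
        self.w = nn.Parameter(torch.ones(2))         # learned per-input weights
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, a, b):
        # Resize b to a's resolution, then combine with normalized
        # non-negative weights: (w0*a + w1*b) / (w0 + w1 + eps).
        b = F.interpolate(b, size=a.shape[-2:], mode="nearest")
        w = F.relu(self.w)
        fused = (w[0] * a + w[1] * b) / (w.sum() + 1e-4)
        return self.conv(fused)

# For a 512x512 input, P3 (stride 8) is 64x64 and P9 (stride 512) is 1x1.
p3, p4, p5 = (torch.randn(1, 64, s, s) for s in (64, 32, 16))
feats = ExtendedPyramid()(p3, p4, p5)
print([f.shape[-1] for f in feats])                  # [64, 32, 16, 8, 4, 2, 1]
print(FastFusion()(feats[0], feats[6]).shape)        # fuse P3 with upsampled P9
```

Because each extra pyramid level shrinks the feature map by 4x in area, levels P6-P9 add almost no compute, which is how the model can trade expensive high-resolution and atrous processing for richer cross-scale fusion.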