Paper title
A Priority Map for Vision-and-Language Navigation with Trajectory Plans and Feature-Location Cues
Paper authors
Paper abstract
In a busy city street, a pedestrian surrounded by distractions can pick out a single sign if it is relevant to their route. Artificial agents in outdoor Vision-and-Language Navigation (VLN) are also confronted with detecting supervisory signal on environment features and location in inputs. To boost the prominence of relevant features in transformer-based architectures without costly preprocessing and pretraining, we take inspiration from priority maps - a mechanism described in neuropsychological studies. We implement a novel priority map module and pretrain on auxiliary tasks using low-sample datasets with high-level representations of routes and environment-related references to urban features. A hierarchical process of trajectory planning - with subsequent parameterised visual boost filtering on visual inputs and prediction of corresponding textual spans - addresses the core challenges of cross-modal alignment and feature-level localisation. The priority map module is integrated into a feature-location framework that doubles the task completion rates of standalone transformers and attains state-of-the-art performance on the Touchdown benchmark for VLN. Code and data are referenced in Appendix C.