Paper Title

RLIP: Relational Language-Image Pre-training for Human-Object Interaction Detection

Authors

Hangjie Yuan, Jianwen Jiang, Samuel Albanie, Tao Feng, Ziyuan Huang, Dong Ni, Mingqian Tang

Abstract

The task of Human-Object Interaction (HOI) detection targets fine-grained visual parsing of humans interacting with their environment, enabling a broad range of applications. Prior work has demonstrated the benefits of effective architecture design and integration of relevant cues for more accurate HOI detection. However, the design of an appropriate pre-training strategy for this task remains underexplored by existing approaches. To address this gap, we propose Relational Language-Image Pre-training (RLIP), a strategy for contrastive pre-training that leverages both entity and relation descriptions. To make effective use of such pre-training, we make three technical contributions: (1) a new Parallel entity detection and Sequential relation inference (ParSe) architecture that enables the use of both entity and relation descriptions during holistically optimized pre-training; (2) a synthetic data generation framework, Label Sequence Extension, that expands the scale of language data available within each minibatch; (3) mechanisms to account for ambiguity, Relation Quality Labels and Relation Pseudo-Labels, to mitigate the influence of ambiguous/noisy samples in the pre-training data. Through extensive experiments, we demonstrate the benefits of these contributions, collectively termed RLIP-ParSe, for improved zero-shot, few-shot and fine-tuning HOI detection performance as well as increased robustness to learning from noisy annotations. Code will be available at https://github.com/JacobYuan7/RLIP.
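The core of RLIP is contrastive pre-training that aligns visual representations with free-form entity and relation descriptions. As a rough illustration of this general idea (a minimal CLIP-style symmetric InfoNCE sketch, not the paper's actual ParSe implementation; all names and the temperature value are illustrative), one can align a batch of region features with their matching text embeddings:

```python
import numpy as np

def contrastive_alignment_loss(region_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE loss aligning N visual features with their
    N matching text (entity/relation) embeddings.
    Illustrative sketch only, not the RLIP implementation."""
    # L2-normalize both sides so the dot product is cosine similarity.
    r = region_feats / np.linalg.norm(region_feats, axis=1, keepdims=True)
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    logits = r @ t.T / temperature          # (N, N) similarity matrix
    idx = np.arange(len(r))                 # matched pairs lie on the diagonal

    def xent(lg):
        # cross-entropy of each row's softmax against the diagonal target
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[idx, idx].mean()

    # average the vision-to-text and text-to-vision directions
    return 0.5 * (xent(logits) + xent(logits.T))
```

In this framing, extending the pool of text embeddings within a minibatch (as the paper's Label Sequence Extension does with synthetic label sequences) supplies more negatives per positive pair, which is what makes the contrastive objective informative.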
